数据集:
Fraser/wiki_sentences
A dataset of all english sentences in Wikipedia.
Taken from the OPTIMUS project. https://github.com/ChunyuanLI/Optimus/blob/master/download_datasets.md
The dataset is 11.8GB so best to load it using streaming:
from datasets import load_dataset dataset = load_dataset("Fraser/wiki_sentences", split='train', streaming=True)