Dataset: machelreid/m2d2
Preprint: arxiv:2210.07370
License: cc-by-nc-4.0

From the paper "M2D2: A Massively Multi-domain Language Modeling Dataset" (Reid et al., EMNLP 2022).
Load the dataset as follows:
    import datasets

    # Replace "cs.CL" with the domain of your choice.
    dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL")
    print(dataset['train'][0]['text'])
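If you want more than one domain at once, the same call can be wrapped in a small helper. The following is a minimal sketch, not part of the original release: `load_domains` and `first_text` are hypothetical names, and the optional `loader` argument is an assumption added so the helper can be exercised without network access. It relies only on what the snippet above shows, namely that each domain is a configuration name, that a `train` split exists, and that each record has a `text` field.

```python
from typing import Any, Callable, Dict, Iterable, Optional

REPO_ID = "machelreid/m2d2"


def load_domains(
    domains: Iterable[str],
    split: str = "train",
    loader: Optional[Callable[[str, str], Any]] = None,
) -> Dict[str, Any]:
    """Load one split for each requested domain, keyed by domain name.

    `loader` is a hypothetical hook; it defaults to `datasets.load_dataset`,
    imported lazily so this module can be imported without the library.
    """
    if loader is None:
        import datasets

        loader = datasets.load_dataset
    # One load call per domain, mirroring the single-domain example.
    return {domain: loader(REPO_ID, domain)[split] for domain in domains}


def first_text(split_data: Any) -> str:
    """Return the `text` field of the first record in a loaded split."""
    return split_data[0]["text"]
```

For example, `first_text(load_domains(["cs.CL"])["cs.CL"])` should return the same text that the single-domain snippet prints. Any other domain names you pass in need to match the configuration names listed on the dataset page.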
Please cite this work if you find this data useful.
@article{reid2022m2d2,
  title   = {M2D2: A Massively Multi-domain Language Modeling Dataset},
  author  = {Machel Reid and Victor Zhong and Suchin Gururangan and Luke Zettlemoyer},
  year    = {2022},
  journal = {arXiv preprint arXiv:2210.07370}
}