数据集:

machelreid/m2d2

中文

M2D2: A Massively Multi-domain Language Modeling Dataset

From the paper " M2D2: A Massively Multi-domain Language Modeling Dataset ", (Reid et al., EMNLP 2022)

Load the dataset as follows:

import datasets

dataset = datasets.load_dataset("machelreid/m2d2", "cs.CL") # replace cs.CL with the domain of your choice

print(dataset['train'][0]['text'])

Domains

  • Culture_and_the_arts
  • Culture_and_the_arts__Culture_and_Humanities
  • Culture_and_the_arts__Games_and_Toys
  • Culture_and_the_arts__Mass_media
  • Culture_and_the_arts__Performing_arts
  • Culture_and_the_arts__Sports_and_Recreation
  • Culture_and_the_arts__The_arts_and_Entertainment
  • Culture_and_the_arts__Visual_arts
  • General_referece
  • General_referece__Further_research_tools_and_topics
  • General_referece__Reference_works
  • Health_and_fitness
  • Health_and_fitness__Exercise
  • Health_and_fitness__Health_science
  • Health_and_fitness__Human_medicine
  • Health_and_fitness__Nutrition
  • Health_and_fitness__Public_health
  • Health_and_fitness__Self_care
  • History_and_events
  • History_and_events__By_continent
  • History_and_events__By_period
  • History_and_events__By_region
  • Human_activites
  • Human_activites__Human_activities
  • Human_activites__Impact_of_human_activity
  • Mathematics_and_logic
  • Mathematics_and_logic__Fields_of_mathematics
  • Mathematics_and_logic__Logic
  • Mathematics_and_logic__Mathematics
  • Natural_and_physical_sciences
  • Natural_and_physical_sciences__Biology
  • Natural_and_physical_sciences__Earth_sciences
  • Natural_and_physical_sciences__Nature
  • Natural_and_physical_sciences__Physical_sciences
  • Philosophy
  • Philosophy_and_thinking
  • Philosophy_and_thinking__Philosophy
  • Philosophy_and_thinking__Thinking
  • Religion_and_belief_systems
  • Religion_and_belief_systems__Allah
  • Religion_and_belief_systems__Belief_systems
  • Religion_and_belief_systems__Major_beliefs_of_the_world
  • Society_and_social_sciences
  • Society_and_social_sciences__Social_sciences
  • Society_and_social_sciences__Society
  • Technology_and_applied_sciences
  • Technology_and_applied_sciences__Agriculture
  • Technology_and_applied_sciences__Computing
  • Technology_and_applied_sciences__Engineering
  • Technology_and_applied_sciences__Transport
  • alg-geom
  • ao-sci
  • astro-ph
  • astro-ph.CO
  • astro-ph.EP
  • astro-ph.GA
  • astro-ph.HE
  • astro-ph.IM
  • astro-ph.SR
  • astro-ph_l1
  • atom-ph
  • bayes-an
  • chao-dyn
  • chem-ph
  • cmp-lg
  • comp-gas
  • cond-mat
  • cond-mat.dis-nn
  • cond-mat.mes-hall
  • cond-mat.mtrl-sci
  • cond-mat.other
  • cond-mat.quant-gas
  • cond-mat.soft
  • cond-mat.stat-mech
  • cond-mat.str-el
  • cond-mat.supr-con
  • cond-mat_l1
  • cs.AI
  • cs.AR
  • cs.CC
  • cs.CE
  • cs.CG
  • cs.CL
  • cs.CR
  • cs.CV
  • cs.CY
  • cs.DB
  • cs.DC
  • cs.DL
  • cs.DM
  • cs.DS
  • cs.ET
  • cs.FL
  • cs.GL
  • cs.GR
  • cs.GT
  • cs.HC
  • cs.IR
  • cs.IT
  • cs.LG
  • cs.LO
  • cs.MA
  • cs.MM
  • cs.MS
  • cs.NA
  • cs.NE
  • cs.NI
  • cs.OH
  • cs.OS
  • cs.PF
  • cs.PL
  • cs.RO
  • cs.SC
  • cs.SD
  • cs.SE
  • cs.SI
  • cs.SY
  • cs_l1
  • dg-ga
  • econ.EM
  • econ.GN
  • econ.TH
  • econ_l1
  • eess.AS
  • eess.IV
  • eess.SP
  • eess.SY
  • eess_l1
  • eval_sets
  • funct-an
  • gr-qc
  • hep-ex
  • hep-lat
  • hep-ph
  • hep-th
  • math-ph
  • math.AC
  • math.AG
  • math.AP
  • math.AT
  • math.CA
  • math.CO
  • math.CT
  • math.CV
  • math.DG
  • math.DS
  • math.FA
  • math.GM
  • math.GN
  • math.GR
  • math.GT
  • math.HO
  • math.IT
  • math.KT
  • math.LO
  • math.MG
  • math.MP
  • math.NA
  • math.NT
  • math.OA
  • math.OC
  • math.PR
  • math.QA
  • math.RA
  • math.RT
  • math.SG
  • math.SP
  • math.ST
  • math_l1
  • mtrl-th
  • nlin.AO
  • nlin.CD
  • nlin.CG
  • nlin.PS
  • nlin.SI
  • nlin_l1
  • nucl-ex
  • nucl-th
  • patt-sol
  • physics.acc-ph
  • physics.ao-ph
  • physics.app-ph
  • physics.atm-clus
  • physics.atom-ph
  • physics.bio-ph
  • physics.chem-ph
  • physics.class-ph
  • physics.comp-ph
  • physics.data-an
  • physics.ed-ph
  • physics.flu-dyn
  • physics.gen-ph
  • physics.geo-ph
  • physics.hist-ph
  • physics.ins-det
  • physics.med-ph
  • physics.optics
  • physics.plasm-ph
  • physics.pop-ph
  • physics.soc-ph
  • physics.space-ph
  • physics_l1
  • plasm-ph
  • q-alg
  • q-bio
  • q-bio.BM
  • q-bio.CB
  • q-bio.GN
  • q-bio.MN
  • q-bio.NC
  • q-bio.OT
  • q-bio.PE
  • q-bio.QM
  • q-bio.SC
  • q-bio.TO
  • q-bio_l1
  • q-fin.CP
  • q-fin.EC
  • q-fin.GN
  • q-fin.MF
  • q-fin.PM
  • q-fin.PR
  • q-fin.RM
  • q-fin.ST
  • q-fin.TR
  • q-fin_l1
  • quant-ph
  • solv-int
  • stat.AP
  • stat.CO
  • stat.ME
  • stat.ML
  • stat.OT
  • stat.TH
  • stat_l1
  • supr-con supr-con

Citation

Please cite this work if you found this data useful.

@article{reid2022m2d2,
  title   = {M2D2: A Massively Multi-domain Language Modeling Dataset},
  author  = {Machel Reid and Victor Zhong and Suchin Gururangan and Luke Zettlemoyer},
  year    = {2022},
  journal = {arXiv preprint arXiv: Arxiv-2210.07370}
}