Dataset:

donfu/oa-stackexchange

Stackexchange Instructions for OpenAssistant

This dataset is taken from https://archive.org/details/stackexchange.

There's a single parquet file combining all stackexchange sites. The threads have been filtered as follows: only threads with an accepted answer, in which both the question and the response are shorter than 1000 characters, have been chosen. Other answers, questions without an accepted answer, and long entries have been dropped.
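The filter rule above can be sketched as follows. This is a minimal illustration, not the project's actual process.py: the `Thread` class and `keep_thread` function are hypothetical names, and it assumes that length is measured with `len()` on the text as stored (whether the real code counts raw HTML or stripped text is not stated here).

```python
from dataclasses import dataclass
from typing import Optional

MAX_CHARS = 1000  # length limit stated in the description above


@dataclass
class Thread:
    """Hypothetical in-memory view of one thread, not the project's real schema."""
    question: str
    accepted_answer: Optional[str]  # None when no answer was accepted


def keep_thread(thread: Thread, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the thread has an accepted answer and both texts are short."""
    if thread.accepted_answer is None:
        return False
    return len(thread.question) < max_chars and len(thread.accepted_answer) < max_chars


threads = [
    Thread("How do I level a print bed?", "Use a sheet of paper as a feeler gauge."),
    Thread("A question nobody answered", None),   # dropped: no accepted answer
    Thread("A short question", "x" * 1500),       # dropped: answer too long
]
kept = [t for t in threads if keep_thread(t)]
print(len(kept))
```

Only the first of the three sample threads survives the filter.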

Each row consists of

  • INSTRUCTION
  • RESPONSE
  • SOURCE («stackexchange-ai»)
  • METADATA (tags, question_score, answer_score).
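To make the row layout concrete, here is a small sketch of one row as a Python dictionary, plus a check that the fields listed above are present. The values are invented for illustration, `missing_fields` is a hypothetical helper, and representing METADATA as a nested dictionary is an assumption (the parquet file may store it differently, for example as a JSON string).

```python
REQUIRED_FIELDS = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")
METADATA_FIELDS = ("tags", "question_score", "answer_score")

# Hypothetical example row; the values are made up, only the field names
# come from the list above.
row = {
    "INSTRUCTION": "What does the acronym AI stand for?",
    "RESPONSE": "Artificial intelligence.",
    "SOURCE": "stackexchange-ai",
    "METADATA": {
        "tags": ["terminology"],
        "question_score": 3,
        "answer_score": 5,
    },
}


def missing_fields(row: dict) -> list:
    """Return the names of expected fields that are absent from the row."""
    missing = [name for name in REQUIRED_FIELDS if name not in row]
    metadata = row.get("METADATA") or {}
    missing += [f"METADATA.{name}" for name in METADATA_FIELDS if name not in metadata]
    return missing


print(missing_fields(row))
```

An empty list means the row carries every field named in the description.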

Original extraction code by https://github.com/b-mc2

How to Reproduce this Dataset

  • Download all XML files from the stackexchange archive into the xml/ folder:
      ./download.py

  • Process the XML, filter conversations, and convert them to OA format in the parquet/ folder:
      ./process.py

  • Run stats on all files in the parquet/ folder:
      ./stats.py

  • Combine all parquet files into one large stackexchange.parquet file:
      ./combine.py

  • Upload to the Hugging Face hub (you'll first need to run huggingface-cli login):
      ./upload.py
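The steps above could be chained with a small driver script. This is a sketch only: `run_pipeline` and `STEPS` are hypothetical names, the script paths are taken from the list above, and the function defaults to a dry run so that nothing is downloaded or uploaded by accident.

```python
import subprocess

# Script names and their order come from the steps above.
STEPS = [
    ("download XML files into xml/", "./download.py"),
    ("filter and convert into parquet/", "./process.py"),
    ("run stats on parquet/", "./stats.py"),
    ("combine into stackexchange.parquet", "./combine.py"),
    ("upload to the Hugging Face hub", "./upload.py"),
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Run each script in order and return the list of scripts handled.

    With dry_run=True the commands are only printed, never executed.
    """
    handled = []
    for description, script in steps:
        print(f"{script}  # {description}")
        if not dry_run:
            # check=True raises CalledProcessError and stops at the first failure.
            subprocess.run([script], check=True)
        handled.append(script)
    return handled


run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the scripts for real; the upload step still requires a prior `huggingface-cli login`.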
    

Statistics

    • 3dprinting: 1,006
    • academia: 6,956
    • ai: 1,169
    • android: 11,591
    • anime: 3,688
    • apple: 32,603
    • arduino: 3,725
    • askubuntu: 78,472
    • astronomy: 2,425
    • aviation: 4,945
    • avp: 1,949
    • beer: 387
    • bicycles: 4,835
    • bioacoustics: 70
    • bioinformatics: 903
    • biology: 5,344
    • bitcoin: 7,456
    • blender: 25,527
    • boardgames: 4,538
    • bricks: 1,457
    • buddhism: 911
    • cardano: 670
    • chemistry: 7,430
    • chess: 2,185
    • chinese: 4,897
    • christianity: 1,248
    • civicrm: 3,221
    • codegolf: 943
    • codereview: 2,171
    • coffee: 350
    • cogsci: 645
    • computergraphics: 540
    • conlang: 101
    • cooking: 7,951
    • craftcms: 4,533
    • crafts: 438
    • crypto: 4,425
    • cs: 9,478
    • cseducators: 71
    • cstheory: 2,196
    • datascience: 5,045
    • dba: 16,850
    • devops: 961
    • diy: 14,400
    • drones: 190
    • drupal: 24,090
    • dsp: 4,470
    • earthscience: 922
    • ebooks: 323
    • economics: 2,120
    • electronics: 41,717
    • elementaryos: 1,769
    • ell: 30,428
    • emacs: 7,140
    • engineering: 2,314
    • english: 42,415
    • eosio: 626
    • es_stackoverflow: 21,475
    • esperanto: 617
    • ethereum: 9,603
    • expatriates: 973
    • expressionengine: 3,638
    • fitness: 1,833
    • freelancing: 338
    • french: 5,193
    • gamedev: 9,678
    • gaming: 44,899
    • gardening: 4,492
    • genealogy: 487
    • german: 6,715
    • gis: 30,249
    • graphicdesign: 10,563
    • ham: 790
    • hardwarerecs: 647
    • health: 804
    • hermeneutics: 782
    • hinduism: 1,036
    • history: 1,776
    • homebrew: 2,357
    • hsm: 484
    • interpersonal: 199
    • iot: 331
    • iota: 292
    • islam: 1,496
    • italian: 1,356
    • ja_stackoverflow: 9,734
    • japanese: 13,862
    • joomla: 1,875
    • judaism: 6,156
    • korean: 754
    • languagelearning: 135
    • latin: 1,387
    • law: 3,475
    • lifehacks: 934
    • linguistics: 1,507
    • literature: 582
    • magento: 20,537
    • martialarts: 364
    • materials: 338
    • math: 501,019
    • matheducators: 316
    • mathematica: 19,529
    • mathoverflow_net_7z: 23,803
    • mechanics: 4,735
    • meta: 34,161
    • meta_askubuntu: 2,076
    • meta_mathoverflow_net_7z: 333
    • meta_serverfault: 823
    • meta_stackoverflow: 12,641
    • meta_superuser: 1,748
    • moderators: 39
    • monero: 1,443
    • money: 7,996
    • movies: 6,789
    • music: 5,740
    • musicfans: 781
    • mythology: 271
    • networkengineering: 4,637
    • opendata: 1,117
    • opensource: 805
    • or: 586
    • outdoors: 1,503
    • parenting: 815
    • patents: 582
    • pets: 1,081
    • philosophy: 1,505
    • photo: 6,386
    • physics: 35,386
    • pm: 982
    • poker: 431
    • politics: 1,903
    • portuguese: 658
    • proofassistants: 87
    • pt_stackoverflow: 27,650
    • puzzling: 11,959
    • quant: 3,303
    • quantumcomputing: 1,604
    • raspberrypi: 6,794
    • retrocomputing: 1,016
    • reverseengineering: 1,606
    • robotics: 1,020
    • rpg: 9,517
    • ru_stackoverflow: 106,714
    • rus: 8,210
    • russian: 1,960
    • salesforce: 27,962
    • scicomp: 1,403
    • scifi: 15,174
    • security: 11,733
    • serverfault: 81,229
    • sharepoint: 24,934
    • sitecore: 2,691
    • skeptics: 1,043
    • softwareengineering: 10,526
    • softwarerecs: 3,032
    • solana: 602
    • sound: 2,031
    • space: 3,145
    • spanish: 3,049
    • sports: 1,715
    • sqa: 1,944
    • stackapps: 702
    • stackoverflow: 4,269,779
    • stats: 23,102
    • stellar: 373
    • substrate: 812
    • superuser: 128,488
    • sustainability: 240
    • tex: 42,808
    • tezos: 635
    • tor: 887
    • travel: 9,957
    • tridion: 1,769
    • ukrainian: 577
    • unix: 54,338
    • ux: 7,403
    • vegetarianism: 151
    • vi: 4,360
    • webapps: 10,159
    • webmasters: 9,413
    • windowsphone: 1,110
    • woodworking: 677
    • wordpress: 24,270
    • workplace: 4,104
    • worldbuilding: 2,766
    • writers: 1,957
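A per-site tally like the one above can be reproduced from the rows themselves. The sketch below is not the project's stats.py: it assumes each figure counts the rows for one site and that SOURCE has the form "stackexchange-" followed by the site name, as in the «stackexchange-ai» example; `rows_per_site` is a hypothetical helper and the sample rows are invented.

```python
from collections import Counter

PREFIX = "stackexchange-"  # assumed SOURCE prefix, based on «stackexchange-ai»


def rows_per_site(rows):
    """Count rows per site, using the part of SOURCE after the assumed prefix."""
    counts = Counter()
    for row in rows:
        source = row["SOURCE"]
        site = source[len(PREFIX):] if source.startswith(PREFIX) else source
        counts[site] += 1
    return counts


# Invented sample rows; only the SOURCE field matters for the tally.
rows = [
    {"SOURCE": "stackexchange-ai"},
    {"SOURCE": "stackexchange-ai"},
    {"SOURCE": "stackexchange-beer"},
]
for site, count in sorted(rows_per_site(rows).items()):
    print(f"{site}: {count:,}")
```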

License: cc-by-sa-4.0 (see https://archive.org/details/stackexchange for details)