Datasets:

joelito/Multi_Legal_Pile

Source datasets:

original

Annotations creators:

other

Language creators:

found

Size:

10M<n<100M

Multilinguality:

multilingual

Dataset Card for MultiLegalPile: A Large-Scale Multilingual Corpus for the Legal Domain

Dataset Summary

The Multi_Legal_Pile is a large-scale multilingual legal dataset suited for pretraining language models. It spans 24 languages and five legal text types.

Supported Tasks and Leaderboards

The dataset supports the fill-mask task.

Languages

The following languages are supported: bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv

Dataset Structure

The data files are organized as follows: type -> language -> jurisdiction.jsonl.xz

type is one of the following:

  • caselaw
  • contracts
  • legislation
  • other
  • mc4_legal

mc4_legal is a subset of the other type, but it is listed separately so that it can be easily excluded, since it is less permissively licensed than the other types.
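The layout and the licensing note above can be sketched in a few lines. This is a minimal illustration, not code from the dataset repository: the helper names `shard_path` and `permissive_types` are hypothetical, and the exact spelling of jurisdiction file names (for example `Switzerland`) is an assumption.

```python
from pathlib import PurePosixPath

# Text types exactly as listed in this card.
TEXT_TYPES = ("caselaw", "contracts", "legislation", "other", "mc4_legal")


def shard_path(text_type: str, language: str, jurisdiction: str) -> str:
    """Build the relative path type/language/jurisdiction.jsonl.xz."""
    if text_type not in TEXT_TYPES:
        raise ValueError(f"unknown text type: {text_type!r}")
    return str(PurePosixPath(text_type, language, f"{jurisdiction}.jsonl.xz"))


def permissive_types(types=TEXT_TYPES):
    """Return the text types without the less permissively licensed mc4_legal."""
    return [t for t in types if t != "mc4_legal"]


# The jurisdiction name below is a hypothetical file name, used for illustration.
example_path = shard_path("caselaw", "de", "Switzerland")
```

Keeping mc4_legal as its own entry means that excluding it is a simple filter over the type list rather than a per-document license check.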

Use the dataset like this:

from datasets import load_dataset

config = 'en_contracts'  # {language}_{type}
dataset = load_dataset('joelito/Multi_Legal_Pile', config, split='train', streaming=True)

'config' is a combination of language and text_type, e.g. 'en_contracts' or 'de_caselaw'. To load all languages or all text types, use 'all' in place of the language or text_type (e.g., 'all_legislation').
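The naming rule can be wrapped in a small helper that validates its inputs before anything is downloaded. This is a sketch under assumptions: `build_config` and `stream_subset` are hypothetical names, the language and type lists are copied from this card, and whether the loader accepts every combination (including mc4_legal) has not been verified here.

```python
# Language codes and text types as listed in this card.
LANGUAGES = (
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
)
TEXT_TYPES = ("caselaw", "contracts", "legislation", "other", "mc4_legal")


def build_config(language: str = "all", text_type: str = "all") -> str:
    """Return '{language}_{text_type}', where either part may be 'all'."""
    if language != "all" and language not in LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if text_type != "all" and text_type not in TEXT_TYPES:
        raise ValueError(f"unsupported text type: {text_type!r}")
    return f"{language}_{text_type}"


def stream_subset(language: str = "all", text_type: str = "all"):
    """Stream the 'train' split for one config (needs the datasets package and network access)."""
    # Imported lazily so that build_config works without the datasets package.
    from datasets import load_dataset

    return load_dataset(
        "joelito/Multi_Legal_Pile",
        build_config(language, text_type),
        split="train",
        streaming=True,
    )
```

For example, `build_config("de", "caselaw")` yields `'de_caselaw'`, and `build_config(text_type="legislation")` yields `'all_legislation'`.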

Data Instances

The file format is jsonl.xz and there is one split available ("train").
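A jsonl.xz file is xz-compressed text with one JSON object per line, so it can be read with the Python standard library alone. The sketch below round-trips a throwaway file; the helper names and the `text` field are hypothetical, since this card does not document the data fields.

```python
import json
import lzma
import os
import tempfile


def iter_jsonl_xz(path):
    """Yield one decoded JSON object per non-empty line of an xz-compressed JSONL file."""
    with lzma.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_jsonl_xz(path, records):
    """Write records as xz-compressed JSONL, one JSON object per line."""
    with lzma.open(path, mode="wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


# Round-trip demo on a temporary file; the "text" field is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl.xz")
    write_jsonl_xz(demo_path, [{"text": "Art. 1"}, {"text": "Art. 2"}])
    demo_records = list(iter_jsonl_xz(demo_path))
```

Reading line by line keeps memory use flat, which matters for shards that are tens of gigabytes when decompressed.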

The complete dataset (689GB) consists of four large subsets:

  • Native Multi Legal Pile (112GB)
  • Eurlex Resources (179GB)
  • Legal MC4 (106GB)
  • Pile of Law (292GB)

Native Multilingual Legal Pile data

| | Language | Text Type | Jurisdiction | Source | Size (MB) | Words | Documents | Words/Document | URL | License |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bg | legislation | Bulgaria | MARCELL | 8015 | 308946116 | 82777 | 3732 | https://elrc-share.eu/repository/browse/marcell-bulgarian-legislative-subcorpus-v2/946267fe8d8711eb9c1a00155d026706d2c9267e5cdf4d75b5f02168f01906c6/ | CC0-1.0 |
| 1 | cs | caselaw | Czechia | CzCDC Constitutional Court | 11151 | 574336489 | 296652 | 1936 | https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3052 | CC BY-NC 4.0 |
| 2 | cs | caselaw | Czechia | CzCDC Supreme Administrative Court | 11151 | 574336489 | 296652 | 1936 | https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3052 | CC BY-NC 4.0 |
| 3 | cs | caselaw | Czechia | CzCDC Supreme Court | 11151 | 574336489 | 296652 | 1936 | https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-3052 | CC BY-NC 4.0 |
| 4 | da | caselaw | Denmark | DDSC | 3469 | 210730560 | 89702 | 2349 | https://huggingface.co/DDSC | CC BY 4.0 and other, depending on the dataset |
| 5 | da | legislation | Denmark | DDSC | 10736 | 653153146 | 265868 | 2456 | https://huggingface.co/DDSC | CC BY 4.0 and other, depending on the dataset |
| 6 | de | caselaw | Germany | openlegaldata | 31527 | 1785439383 | 596800 | 2991 | https://de.openlegaldata.io/ | ODbL-1.0 |
| 7 | de | caselaw | Switzerland | entscheidsuche | 31527 | 1785439383 | 596800 | 2991 | https://entscheidsuche.ch/ | See description |
| 8 | de | legislation | Germany | openlegaldata | 8934 | 512840663 | 276034 | 1857 | https://de.openlegaldata.io/ | ODbL-1.0 |
| 9 | de | legislation | Switzerland | lexfind | 8934 | 512840663 | 276034 | 1857 | https://www.lexfind.ch/fe/de/search | No information provided |
| 10 | fr | caselaw | Switzerland | entscheidsuche | 18313 | 1170335690 | 435569 | 2686 | https://entscheidsuche.ch/ | See description |
| 11 | fr | caselaw | Belgium | jurportal | 18313 | 1170335690 | 435569 | 2686 | https://juportal.be/home/welkom | See description |
| 12 | fr | caselaw | France | CASS | 18313 | 1170335690 | 435569 | 2686 | https://echanges.dila.gouv.fr/OPENDATA/CASS/ | Open Licence 2.0 |
| 13 | fr | caselaw | Luxembourg | judoc | 18313 | 1170335690 | 435569 | 2686 | https://justice.public.lu/fr.html | See description |
| 14 | it | caselaw | Switzerland | entscheidsuche | 6483 | 406520336 | 156630 | 2595 | https://entscheidsuche.ch/ | See description |
| 15 | en | legislation | Switzerland | lexfind | 36587 | 2537696894 | 657805 | 3857 | https://www.lexfind.ch/fe/de/search | No information provided |
| 16 | en | legislation | UK | uk-lex | 36587 | 2537696894 | 657805 | 3857 | https://zenodo.org/record/6355465 | CC BY 4.0 |
| 17 | fr | legislation | Switzerland | lexfind | 9297 | 600170792 | 243313 | 2466 | https://www.lexfind.ch/fe/fr/search | No information provided |
| 18 | fr | legislation | Belgium | ejustice | 9297 | 600170792 | 243313 | 2466 | https://www.ejustice.just.fgov.be/cgi/welcome.pl | No information provided |
| 19 | it | legislation | Switzerland | lexfind | 8332 | 542579039 | 227968 | 2380 | https://www.lexfind.ch/fe/it/search | No information provided |
| 20 | nl | legislation | Belgium | ejustice | 8484 | 550788527 | 232204 | 2372 | https://www.ejustice.just.fgov.be/cgi/welcome.pl | No information provided |
| 21 | hu | legislation | Hungary | MARCELL | 5744 | 264572303 | 86862 | 3045 | https://elrc-share.eu/repository/browse/marcell-hungarian-legislative-subcorpus-v2/a87295ec8d6511eb9c1a00155d0267065f7e56dc7db34ce5aaae0b48a329daaa/ | CC0-1.0 |
| 22 | pl | legislation | Poland | MARCELL | 5459 | 299334705 | 89264 | 3353 | https://elrc-share.eu/repository/browse/marcell-polish-legislative-subcorpus-v2/dd14fa1c8d6811eb9c1a00155d026706c4718ddc9c6e4a92a88923816ca8b219/ | CC0-1.0 |
| 23 | pt | caselaw | Brazil | RulingBR | 196919 | 12611760973 | 17251236 | 731 | https://github.com/diego-feijo/rulingbr | No information provided |
| 24 | pt | caselaw | Brazil | CRETA | 196919 | 12611760973 | 17251236 | 731 | https://www.kaggle.com/datasets/eliasjacob/brcad5?resource=download&select=language_modeling_texts.parquet | CC BY-NC-SA 4.0 |
| 25 | pt | caselaw | Brazil | CJPG | 196919 | 12611760973 | 17251236 | 731 | https://esaj.tjsp.jus.br/cjsg/consultaCompleta.do?f=1 | No information provided |
| 26 | ro | legislation | Romania | MARCELL | 10464 | 559092153 | 215694 | 2592 | https://elrc-share.eu/repository/browse/marcell-romanian-legislative-subcorpus-v2/2da548428b9d11eb9c1a00155d026706ce94a6b59ffc4b0e9fb5cd9cebe6889e/ | CC0-1.0 |
| 27 | sk | legislation | Slovakia | MARCELL | 5208 | 280182047 | 76760 | 3650 | https://elrc-share.eu/repository/browse/marcell-slovak-legislative-subcorpus-v2/6bdee1d68c8311eb9c1a00155d0267063398d3f1a3af40e1b728468dcbd6efdd/ | CC0-1.0 |
| 28 | sl | legislation | Slovenia | MARCELL | 6057 | 365513763 | 88651 | 4123 | https://elrc-share.eu/repository/browse/marcell-slovenian-legislative-subcorpus-v2/e2a779868d4611eb9c1a00155d026706983c845a30d741b78e051faf91828b0d/ | CC-BY-4.0 |
| total | all | all | all | 1297609 | xxx | 81214262514 | 57305071 | 1417 | | |

Eurlex Resources

See Eurlex Resources for more information.

Legal-MC4

See Legal-MC4 for more information.

Pile-of-Law

See Pile-of-Law for more information.

| Language | Type | Jurisdiction | Source | Size (MB) | Tokens | Documents | Tokens/Document | Part of Multi_Legal_Pile |
|---|---|---|---|---|---|---|---|---|
| en | all | all | all | 503712 | 50547777921 | 9872444 | 5120 | yes |
| en | caselaw | EU | echr | 298 | 28374996 | 8480 | 3346 | yes |
| en | caselaw | Canada | canadian_decisions | 486 | 45438083 | 11343 | 4005 | yes |
| en | caselaw | US | dol_ecab | 942 | 99113541 | 28211 | 3513 | no |
| en | caselaw | US | scotus_oral_arguments | 1092 | 108228951 | 7996 | 13535 | no |
| en | caselaw | US | tax_rulings | 1704 | 166915887 | 54064 | 3087 | no |
| en | caselaw | US | nlrb_decisions | 2652 | 294471818 | 32080 | 9179 | no |
| en | caselaw | US | scotus_filings | 4018 | 593870413 | 63775 | 9311 | yes |
| en | caselaw | US | bva_opinions | 35238 | 4084140080 | 839523 | 4864 | no |
| en | caselaw | US | courtlistener_docket_entry_documents | 139006 | 12713614864 | 1983436 | 6409 | yes |
| en | caselaw | US | courtlistener_opinions | 158110 | 15899704961 | 4518445 | 3518 | yes |
| en | contracts | -- | tos | 4 | 391890 | 50 | 7837 | no |
| en | contracts | US | cfpb_creditcard_contracts | 188 | 25984824 | 2638 | 9850 | yes |
| en | contracts | US | edgar | 28698 | 2936402810 | 987926 | 2972 | yes |
| en | contracts | US | atticus_contracts | 78300 | 7997013703 | 650833 | 12287 | yes |
| en | legislation | US | fre | 2 | 173325 | 68 | 2548 | no |
| en | legislation | US | frcp | 4 | 427614 | 92 | 4647 | no |
| en | legislation | US | eoir | 62 | 6109737 | 2229 | 2741 | no |
| en | legislation | -- | constitutions | 66 | 5984865 | 187 | 32004 | yes |
| en | legislation | US | federal_register | 424 | 39854787 | 5414 | 7361 | yes |
| en | legislation | US | uscode | 716 | 78466325 | 58 | 1352867 | yes |
| en | legislation | EU | euro_parl | 808 | 71344326 | 9672 | 7376 | no |
| en | legislation | US | cfr | 1788 | 160849007 | 243 | 661930 | yes |
| en | legislation | US | us_bills | 3394 | 320723838 | 112483 | 2851 | yes |
| en | legislation | EU | eurlex | 3504 | 401324829 | 142036 | 2825 | no |
| en | legislation | US | state_codes | 18066 | 1858333235 | 217 | 8563747 | yes |
| en | other | -- | bar_exam_outlines | 4 | 346924 | 59 | 5880 | no |
| en | other | US | ftc_advisory_opinions | 4 | 509025 | 145 | 3510 | no |
| en | other | US | olc_memos | 98 | 12764635 | 1384 | 9223 | yes |
| en | other | -- | cc_casebooks | 258 | 24857378 | 73 | 340512 | no |
| en | other | -- | un_debates | 360 | 31152497 | 8481 | 3673 | no |
| en | other | -- | r_legaladvice | 798 | 72605386 | 146671 | 495 | no |
| en | other | US | founding_docs | 1118 | 100390231 | 183664 | 546 | no |
| en | other | US | oig | 5056 | 566782244 | 38954 | 14550 | yes |
| en | other | US | congressional_hearings | 16448 | 1801110892 | 31514 | 57152 | no |
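
The Words/Document and Tokens/Document columns in the tables above appear to be the total count divided by the number of documents, truncated to an integer. That reading is inferred from the published figures rather than stated by the card, and the helper name `per_document` is hypothetical.

```python
def per_document(total_units: int, documents: int) -> int:
    """Integer (floored) average of words or tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return total_units // documents


# Danish legislation row (DDSC): 653153146 words over 265868 documents.
da_legislation = per_document(653153146, 265868)  # 2456

# Pile-of-Law "tos" row: 391890 tokens over 50 documents.
tos = per_document(391890, 50)  # 7837
```

Both results match the tabulated values, whereas rounding to the nearest integer would give 2457 and 7838.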

Data Fields

[More Information Needed]

Data Splits

[More Information Needed]

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

TODO add citation

Contributions

Thanks to @JoelNiklaus for adding this dataset.