Datasets: lexlms/lex_files
Languages: en
Multilinguality: monolingual
Size Categories: 1M<n<10M
Language Creators: found
Annotations Creators: no-annotation
Source Datasets: extended
ArXiv: arxiv:2305.07507
License: cc-by-nc-sa-4.0

The LeXFiles is a new, diverse English multinational legal corpus that we created. It comprises 11 distinct sub-corpora covering legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus contains approx. 19 billion tokens. In comparison, the "Pile of Law" corpus released by Henderson et al. (2022) comprises 32 billion tokens in total, and the majority (26/30) of its sub-corpora come from the United States of America (USA); hence that corpus as a whole is significantly biased towards the US legal system in general, and the federal or state jurisdiction in particular.
Corpus | Corpus alias | Documents | Tokens | Pct. | Sampl. (α=0.5) [2] | Sampl. (α=0.2) [2] |
---|---|---|---|---|---|---|
EU Legislation | eu-legislation | 93.7K | 233.7M | 1.2% | 5.0% | 8.0% |
EU Court Decisions | eu-court-cases | 29.8K | 178.5M | 0.9% | 4.3% | 7.6% |
ECtHR Decisions | ecthr-cases | 12.5K | 78.5M | 0.4% | 2.9% | 6.5% |
UK Legislation | uk-legislation | 52.5K | 143.6M | 0.7% | 3.9% | 7.3% |
UK Court Decisions | uk-court-cases | 47K | 368.4M | 1.9% | 6.2% | 8.8% |
Indian Court Decisions | indian-court-cases | 34.8K | 111.6M | 0.6% | 3.4% | 6.9% |
Canadian Legislation | canadian-legislation | 6K | 33.5M | 0.2% | 1.9% | 5.5% |
Canadian Court Decisions | canadian-court-cases | 11.3K | 33.1M | 0.2% | 1.8% | 5.4% |
U.S. Court Decisions [1] | court-listener | 4.6M | 11.4B | 59.2% | 34.7% | 17.5% |
U.S. Legislation | us-legislation | 518 | 1.4B | 7.4% | 12.3% | 11.5% |
U.S. Contracts | us-contracts | 622K | 5.3B | 27.3% | 23.6% | 15.0% |
Total | lexlms/lex_files | 5.8M | 18.8B | 100% | 100% | 100% |
[1] We consider only U.S. Court Decisions from 1965 onwards (cf. the post-Civil Rights Act era), as a hard threshold to exclude cases relying on severely outdated and, in many cases, harmful legal standards. The rest of the corpora include more recent documents.
[2] Sampling (Sampl.) ratios are computed following the exponential sampling introduced by Lample et al. (2019).
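
As a rough illustration of how the sampling columns can be derived, the sketch below assumes the usual form of exponential sampling, q_i = p_i^α / Σ_j p_j^α, where p_i is a sub-corpus's share of all tokens. The function name `exponential_sampling_ratios` and the `TOKENS_M` dictionary are hypothetical names introduced for this example. The token counts are the rounded values from the table above, so the results may differ from the published percentages in the last digit.

```python
def exponential_sampling_ratios(token_counts, alpha):
    """Return smoothed sampling ratios q_i = p_i**alpha / sum_j p_j**alpha.

    token_counts maps a corpus alias to its token count (any consistent unit).
    alpha < 1 up-weights small sub-corpora and down-weights large ones.
    """
    total = sum(token_counts.values())
    # p_i: each sub-corpus's share of the total, raised to the power alpha
    weights = {name: (count / total) ** alpha for name, count in token_counts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


# Token counts in millions, copied from the table above (rounded values).
TOKENS_M = {
    "eu-legislation": 233.7,
    "eu-court-cases": 178.5,
    "ecthr-cases": 78.5,
    "uk-legislation": 143.6,
    "uk-court-cases": 368.4,
    "indian-court-cases": 111.6,
    "canadian-legislation": 33.5,
    "canadian-court-cases": 33.1,
    "court-listener": 11_400.0,
    "us-legislation": 1_400.0,
    "us-contracts": 5_300.0,
}

if __name__ == "__main__":
    for alpha in (0.5, 0.2):
        ratios = exponential_sampling_ratios(TOKENS_M, alpha)
        print(f"alpha = {alpha}")
        for name, ratio in ratios.items():
            print(f"  {name:<22} {ratio:6.1%}")
```

A lower α flattens the distribution further: under these assumptions, the share of `court-listener` falls from roughly 59% of the raw tokens to about a third at α=0.5 and to under a fifth at α=0.2.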
Additional corpora not considered for pre-training, since they do not represent factual legal knowledge:
Corpus | Corpus alias | Documents | Tokens |
---|---|---|---|
Legal web pages from C4 | legal-c4 | 284K | 340M |
@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and Garneau*, Nicolas and Goanta, Catalina and Katz, Daniel Martin and Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = june,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}