Dataset Card for "LeXFiles"

Dataset Summary

The LeXFiles is a new, diverse English multinational legal corpus that we created. It includes 11 distinct sub-corpora that cover legislation and case law from 6 primarily English-speaking legal systems (EU, CoE, Canada, US, UK, India). The corpus contains approx. 19 billion tokens. In comparison, the "Pile of Law" corpus released by Henderson et al. (2022) comprises 32 billion tokens in total, and the majority (26/30) of its sub-corpora come from the United States of America (USA). As a whole, that corpus is therefore significantly biased towards the US legal system in general, and towards federal or state jurisdiction in particular.

Dataset Specifications

Corpus	Corpus alias	Documents	Tokens	Pct.	Sampl. (a=0.5) [2]	Sampl. (a=0.2) [2]
EU Legislation	eu-legislation	93.7K	233.7M	1.2%	5.0%	8.0%
EU Court Decisions	eu-court-cases	29.8K	178.5M	0.9%	4.3%	7.6%
ECtHR Decisions	ecthr-cases	12.5K	78.5M	0.4%	2.9%	6.5%
UK Legislation	uk-legislation	52.5K	143.6M	0.7%	3.9%	7.3%
UK Court Decisions	uk-court-cases	47K	368.4M	1.9%	6.2%	8.8%
Indian Court Decisions	indian-court-cases	34.8K	111.6M	0.6%	3.4%	6.9%
Canadian Legislation	canadian-legislation	6K	33.5M	0.2%	1.9%	5.5%
Canadian Court Decisions	canadian-court-cases	11.3K	33.1M	0.2%	1.8%	5.4%
U.S. Court Decisions [1]	court-listener	4.6M	11.4B	59.2%	34.7%	17.5%
U.S. Legislation	us-legislation	518	1.4B	7.4%	12.3%	11.5%
U.S. Contracts	us-contracts	622K	5.3B	27.3%	23.6%	15.0%
Total	lexlms/lex_files	5.8M	18.8B	100%	100%	100%

[1] We consider only U.S. Court Decisions from 1965 onwards (cf. post Civil Rights Act), using this year as a hard threshold to exclude cases that rely on severely outdated and in many cases harmful legal standards. The rest of the corpora include more recent documents.

[2] Sampling (Sampl.) ratios are computed following the exponential sampling introduced by Lample et al. (2019).
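
The sampling ratios can be approximated from the token counts in the table. The sketch below assumes the usual form of exponential smoothing, q_i = p_i^a / sum_j p_j^a, where p_i is the token share of sub-corpus i. The function name `exponential_sampling` and the dictionary `TOKENS_M` are illustrative and not part of any released code, and the token counts are the rounded values from the table (in millions).

```python
# Token counts in millions, copied from the table above (rounded values).
TOKENS_M = {
    "eu-legislation": 233.7,
    "eu-court-cases": 178.5,
    "ecthr-cases": 78.5,
    "uk-legislation": 143.6,
    "uk-court-cases": 368.4,
    "indian-court-cases": 111.6,
    "canadian-legislation": 33.5,
    "canadian-court-cases": 33.1,
    "court-listener": 11400.0,
    "us-legislation": 1400.0,
    "us-contracts": 5300.0,
}


def exponential_sampling(token_counts, alpha):
    """Return smoothed sampling ratios q_i = p_i**alpha / sum_j p_j**alpha.

    ASSUMPTION: this is the standard exponential smoothing formula; the
    authors' exact implementation is not shown in this card.
    """
    total = sum(token_counts.values())
    # Raise each sub-corpus share to the power alpha (alpha < 1 flattens
    # the distribution, so small sub-corpora are up-sampled).
    weights = {name: (count / total) ** alpha for name, count in token_counts.items()}
    norm = sum(weights.values())
    return {name: weight / norm for name, weight in weights.items()}


if __name__ == "__main__":
    for alpha in (1.0, 0.5, 0.2):
        ratios = exponential_sampling(TOKENS_M, alpha)
        print(f"a={alpha}")
        for name, ratio in ratios.items():
            print(f"  {name:<22} {100 * ratio:5.1f}%")
```

With a=1.0 the function returns the raw token shares (the Pct. column). With a=0.5 and a=0.2 the results should land close to the two Sampl. columns, with small deviations because the token counts used here are rounded.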

The following additional corpora are not considered for pre-training, since they do not represent factual legal knowledge.

Corpus	Corpus alias	Documents	Tokens
Legal web pages from C4	legal-c4	284K	340M
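
A single sub-corpus can presumably be loaded by its alias. The sketch below rests on two assumptions that this card does not confirm: that the corpus is hosted on the Hugging Face Hub under the identifier `lexlms/lex_files` (the alias in the Total row), and that each corpus alias in the tables is accepted as a configuration name. The helper `load_subcorpus` is hypothetical. Streaming is used so that the full corpus does not have to be downloaded first.

```python
# Corpus aliases copied from the two tables above.
CORPUS_ALIASES = (
    "eu-legislation",
    "eu-court-cases",
    "ecthr-cases",
    "uk-legislation",
    "uk-court-cases",
    "indian-court-cases",
    "canadian-legislation",
    "canadian-court-cases",
    "court-listener",
    "us-legislation",
    "us-contracts",
    "legal-c4",
)


def load_subcorpus(alias, streaming=True):
    """Load one sub-corpus by alias (hypothetical helper, see assumptions)."""
    if alias not in CORPUS_ALIASES:
        raise ValueError(f"Unknown corpus alias: {alias!r}")
    # Deferred import: the alias check above works without the library installed.
    from datasets import load_dataset

    # ASSUMPTION: "lexlms/lex_files" is the Hub repository and the alias is a
    # valid configuration name; adjust both if the hosted dataset differs.
    return load_dataset("lexlms/lex_files", name=alias, streaming=streaming)


if __name__ == "__main__":
    corpus = load_subcorpus("eu-legislation")
    print(corpus)
```

If a configuration name is rejected, the names actually published with the hosted dataset take precedence over the aliases listed here.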

Citation

Ilias Chalkidis*, Nicolas Garneau*, Catalina E.C. Goanta, Daniel Martin Katz, and Anders Søgaard. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Toronto, Canada.

@inproceedings{chalkidis-garneau-etal-2023-lexlms,
    title = {{LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development}},
    author = "Chalkidis*, Ilias and 
              Garneau*, Nicolas and
              Goanta, Catalina and 
              Katz, Daniel Martin and 
              Søgaard, Anders",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics",
    month = june,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2305.07507",
}

Author:

lexlms

Dataset size:

12.28 GB