数据集:
pile-of-law/pile-of-law
任务:
填充掩码语言:
en计算机处理:
monolingual大小:
10M<n<100M语言创建人:
found批注创建人:
no-annotation预印本库:
arxiv:2207.00220许可:
cc-by-nc-sa-4.0We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.
See paper for details.
Mainly English, but some other languages may appear in some portions of the data.
courtListener_docket_entry_documents : Docket entries in U.S. federal courts, including filed briefs from CourtListener RECAP archive.
courtListener_opinions : U.S. court opinions from CourtListener (synchronized as of 12/31/2022).
atticus_contracts : Unannotated contracts from the Atticus Project.
federal_register : The U.S. federal register where agencies file draft rulemaking.
bva_opinions : Bureau of Veterans Appeals opinions.
us_bills : Draft Bills from the United States Congress.
cc_casebooks : Educational Casebooks released under open CC licenses.
tos : Unannotated Terms of Service contracts.
euro_parl : European parliamentary debates.
nlrb_decisions : Decisions from the U.S. National Labor Review Board.
scotus_oral_arguments : U.S. Supreme Court Oral Arguments
cfr : U.S. Code of Federal Regulations
state_codes : U.S. State Codes
scotus_filings : Briefs and filings with the U.S. Supreme Court.
exam_outlines : Exam outlines available openly on the web.
edgar : Contracts filed with the SEC and made available on the SEC's Edgar tool.
cfpb_creditcard_contracts : Credit Card Contracts compiled by the U.S. Consumer Finance Protection Bureau.
constitutions : The World's constitutions.
congressional_hearings : U.S. Congressional hearing transcripts and statements.
oig : U.S. Office of Inspector general reports.
olc_memos : U.S. Office of Legal Counsel memos.
uscode : The United States Code (laws).
founding_docs : Letters from U.S. founders.
ftc_advisory_opinions : Advisory opinions by the Federal Trade Commission.
echr : European Court of Human Rights opinions.
eurlex : European Laws.
tax_rulings : Rulings from U.S. Tax court.
un_debates : U.N. General Debates
fre : U.S. Federal Rules of Evidence
frcp : U.S. Federal Rules of Civil Procedure
canadian_decisions : Canadian Court Opinions from ON and BC.
eoir : U.S. Executive Office for Immigration Review Immigration and Nationality Precedential Decisions
dol_ecab : Department of Labor Employees' Compensation Appeals Board decisions after 2006
r_legaladvice : Filtered data from the r/legaladvice and r/legaladviceofftopic subreddits in the format. Title: [Post Title] Question: [Post Content] Topic: [Post Flair] Answer #[N]: [Top Answers]...
acus_reports : Reports from the Administrative Conference of the United States from 2010-2022.
ed_policy_guidance : Policy guidance documents from the U.S. Department of Education (2001-2022).
uspto_office_actions : Office Actions from the U.S. Patent and Trademark Office from 2019-2022.
icj-pcij : International Court of Justice and Permanent Court of International Justice opinions.
hhs_alj_opinions : Opinions from the U.S. Department of Health and Human Services Administrative Law Judges from 1985-2019.
sec_administrative_proceedings : Significant pleadings, orders and decisions for administrative proceedings from the U.S. Securities and Exchange Commission from 2005-2022.
fmshrc_bluebooks : Bluebooks from the U.S. Federal Mine Safety and Health Review Commission from 1979 (March) - 2022 (August).
resource_contracts : Resource Contracts collected by ResourceContracts.org
medicaid_policy_guidance : Policy guidance documents from the U.S. Department of Health and Human Services (1994-2022).
irs_legal_advice_memos : Legal Advice Memos and Chief Counsel Notices from the U.S. Internal Revenue Service.
doj_guidance : Guidance documents from the U.S. Department of Justice (2020-2022).
1/23 update : Data updated in 2023 included: syncing courtListener opinions, adding ACUS reports, USPTO office actions, Ed Policy Guidance, HHS ALJ opinions, SEC administrative proceedings, FMSHRC Bluebooks, Resource Contracts, and ICJ/PCIJ legal opinions. We also fixed OLC opinions which had some formatting inconsistencies and merged exam outlines into one file, adding some additional exam outlines.
On-disk sizes might vary due to caching and compression, but should be approximately as follows as of 1/7/2023.
% xz --list data/*.xz Strms Blocks Compressed Uncompressed Ratio Check Filename 183 181 9,631.2 KiB 35.0 MiB 0.268 CRC64 data/train.acus_reports.jsonl.xz 1 1 1,024.1 MiB 6,804.7 MiB 0.150 CRC64 data/train.atticus_contracts.0.jsonl.xz 1 1 1,024.1 MiB 6,781.1 MiB 0.151 CRC64 data/train.atticus_contracts.1.jsonl.xz 1 1 1,024.1 MiB 6,790.1 MiB 0.151 CRC64 data/train.atticus_contracts.2.jsonl.xz 1 1 1,024.1 MiB 6,759.2 MiB 0.152 CRC64 data/train.atticus_contracts.3.jsonl.xz 1 1 139.9 MiB 925.0 MiB 0.151 CRC64 data/train.atticus_contracts.4.jsonl.xz 1 1 1,564.6 MiB 12.5 GiB 0.123 CRC64 data/train.bva.jsonl.xz 1 1 29.8 MiB 154.3 MiB 0.193 CRC64 data/train.canadian_decisions.jsonl.xz 1 1 18.5 MiB 82.6 MiB 0.224 CRC64 data/train.cc_casebooks.jsonl.xz 1 1 3,427.3 KiB 67.2 MiB 0.050 CRC64 data/train.cfpb_cc.jsonl.xz 1 1 72.7 MiB 582.6 MiB 0.125 CRC64 data/train.cfr.jsonl.xz 1 1 1,056.1 MiB 4,941.9 MiB 0.214 CRC64 data/train.congressional_hearings.jsonl.xz 1 1 3,272.4 KiB 21.3 MiB 0.150 CRC64 data/train.constitutions.jsonl.xz 1 1 1,024.1 MiB 13.0 GiB 0.077 CRC64 data/train.courtlistenerdocketentries.0.jsonl.xz 1 1 1,024.3 MiB 13.3 GiB 0.075 CRC64 data/train.courtlistenerdocketentries.1.jsonl.xz 1 1 1,024.1 MiB 12.4 GiB 0.080 CRC64 data/train.courtlistenerdocketentries.2.jsonl.xz 1 1 635.2 MiB 8,671.6 MiB 0.073 CRC64 data/train.courtlistenerdocketentries.3.jsonl.xz 1 1 953.7 MiB 4,575.7 MiB 0.208 CRC64 data/train.courtlisteneropinions.0.jsonl.xz 1 1 953.7 MiB 4,356.2 MiB 0.219 CRC64 data/train.courtlisteneropinions.1.jsonl.xz 1 1 953.7 MiB 4,315.6 MiB 0.221 CRC64 data/train.courtlisteneropinions.10.jsonl.xz 1 1 953.7 MiB 4,650.3 MiB 0.205 CRC64 data/train.courtlisteneropinions.11.jsonl.xz 1 1 953.7 MiB 4,836.3 MiB 0.197 CRC64 data/train.courtlisteneropinions.12.jsonl.xz 1 1 953.7 MiB 4,644.9 MiB 0.205 CRC64 data/train.courtlisteneropinions.13.jsonl.xz 1 1 953.7 MiB 4,657.5 MiB 0.205 CRC64 data/train.courtlisteneropinions.14.jsonl.xz 1 1 539.2 MiB 2,621.8 MiB 0.206 CRC64 data/train.courtlisteneropinions.15.jsonl.xz 1 1 953.7 MiB 4,335.3 MiB 0.220 CRC64 data/train.courtlisteneropinions.2.jsonl.xz 1 1 953.7 MiB 4,352.0 MiB 0.219 CRC64 data/train.courtlisteneropinions.3.jsonl.xz 1 1 953.7 MiB 4,575.9 MiB 0.208 CRC64 data/train.courtlisteneropinions.4.jsonl.xz 1 1 953.7 MiB 4,382.6 MiB 0.218 CRC64 data/train.courtlisteneropinions.5.jsonl.xz 1 1 953.7 MiB 4,352.3 MiB 0.219 CRC64 data/train.courtlisteneropinions.6.jsonl.xz 1 1 953.7 MiB 4,462.4 MiB 0.214 CRC64 data/train.courtlisteneropinions.7.jsonl.xz 1 1 953.7 MiB 4,604.0 MiB 0.207 CRC64 data/train.courtlisteneropinions.8.jsonl.xz 1 1 953.7 MiB 4,612.0 MiB 0.207 CRC64 data/train.courtlisteneropinions.9.jsonl.xz 335 335 6,047.4 KiB 24.1 MiB 0.245 CRC64 data/train.doj_guidance.jsonl.xz 1 1 41.1 MiB 305.6 MiB 0.135 CRC64 data/train.dol_ecab.jsonl.xz 1 1 19.1 MiB 100.5 MiB 0.190 CRC64 data/train.echr.jsonl.xz 508 507 1,502.0 KiB 4,716.7 KiB 0.318 CRC64 data/train.ed_policy_guidance.jsonl.xz 1 1 1,372.0 MiB 9,032.6 MiB 0.152 CRC64 data/train.edgar.jsonl.xz 1 1 3,896.6 KiB 18.6 MiB 0.205 CRC64 data/train.eoir.jsonl.xz 1 1 140.3 MiB 1,154.7 MiB 0.121 CRC64 data/train.eurlex.jsonl.xz 1 1 51.4 MiB 239.4 MiB 0.215 CRC64 data/train.euro_parl.jsonl.xz 1 1 355.3 KiB 1,512.5 KiB 0.235 CRC64 data/train.examoutlines.jsonl.xz 1 1 20.7 MiB 131.7 MiB 0.157 CRC64 data/train.federal_register.jsonl.xz 396 396 43.9 MiB 175.7 MiB 0.250 CRC64 data/train.fmshrc.jsonl.xz 1 1 73.4 MiB 341.7 MiB 0.215 CRC64 data/train.founding_docs.jsonl.xz 1 1 324.2 KiB 1,459.4 KiB 0.222 CRC64 data/train.frcp.jsonl.xz 1 1 116.1 KiB 484.9 KiB 0.239 CRC64 data/train.fre.jsonl.xz 1 1 297.3 KiB 1,245.0 KiB 0.239 CRC64 data/train.ftc_advisory_opinions.jsonl.xz 2,084 2,083 13.4 MiB 42.2 MiB 0.318 CRC64 data/train.hhs_alj.jsonl.xz 1 1 29.5 MiB 157.4 MiB 0.188 CRC64 data/train.ijc.jsonl.xz 442 442 7,904.4 KiB 35.8 MiB 0.216 CRC64 data/train.irs_legal_advice_memos.jsonl.xz 658 658 3,403.1 KiB 10.6 MiB 0.314 CRC64 data/train.medicaid_policy_guidance.jsonl.xz 1 1 170.7 MiB 788.9 MiB 0.216 CRC64 data/train.nlrb_decisions.jsonl.xz 1 1 218.4 MiB 1,580.3 MiB 0.138 CRC64 data/train.oig.jsonl.xz 1 1 5,857.4 KiB 31.5 MiB 0.182 CRC64 data/train.olc_memos.jsonl.xz 1 1 58.6 MiB 234.5 MiB 0.250 CRC64 data/train.r_legaldvice.jsonl.xz 1,639 1,639 43.7 MiB 188.1 MiB 0.232 CRC64 data/train.resource_contracts.jsonl.xz 1 1 242.6 MiB 1,241.6 MiB 0.195 CRC64 data/train.scotus_docket_entries.jsonl.xz 1 1 68.5 MiB 323.2 MiB 0.212 CRC64 data/train.scotus_oral.jsonl.xz 10,805 10,805 40.7 MiB 118.4 MiB 0.344 CRC64 data/train.sec.jsonl.xz 1 1 705.0 MiB 5,019.9 MiB 0.140 CRC64 data/train.state_code.jsonl.xz 1 1 75.2 MiB 540.8 MiB 0.139 CRC64 data/train.taxrulings.jsonl.xz 1 1 273.6 KiB 1,318.5 KiB 0.207 CRC64 data/train.tos.jsonl.xz 1 1 22.6 MiB 108.1 MiB 0.209 CRC64 data/train.undebates.jsonl.xz 1 1 167.6 MiB 1,119.6 MiB 0.150 CRC64 data/train.us_bills.jsonl.xz 1 1 25.3 MiB 196.1 MiB 0.129 CRC64 data/train.uscode.jsonl.xz 1 1 1,713.2 MiB 33.7 GiB 0.050 CRC64 data/train.uspto_oab.jsonl.xz 54 54 2,960.9 KiB 11.0 MiB 0.264 CRC64 data/validation.acus_reports.jsonl.xz 1 1 1,024.1 MiB 6,797.1 MiB 0.151 CRC64 data/validation.atticus_contracts.0.jsonl.xz 1 1 374.6 MiB 2,471.7 MiB 0.152 CRC64 data/validation.atticus_contracts.1.jsonl.xz 1 1 523.0 MiB 4,258.9 MiB 0.123 CRC64 data/validation.bva.jsonl.xz 1 1 9.8 MiB 50.5 MiB 0.195 CRC64 data/validation.canadian_decisions.jsonl.xz 1 1 4,281.5 KiB 19.1 MiB 0.219 CRC64 data/validation.cc_casebooks.jsonl.xz 1 1 1,532.6 KiB 19.6 MiB 0.077 CRC64 data/validation.cfpb_cc.jsonl.xz 1 1 23.3 MiB 190.4 MiB 0.122 CRC64 data/validation.cfr.jsonl.xz 1 1 347.4 MiB 1,620.7 MiB 0.214 CRC64 data/validation.congressional_hearings.jsonl.xz 1 1 1,102.4 KiB 6,733.0 KiB 0.164 CRC64 data/validation.constitutions.jsonl.xz 1 1 1,024.1 MiB 10.7 GiB 0.094 CRC64 data/validation.courtlistenerdocketentries.0.jsonl.xz 1 1 473.7 MiB 5,225.2 MiB 0.091 CRC64 data/validation.courtlistenerdocketentries.1.jsonl.xz 1 1 953.7 MiB 4,391.3 MiB 0.217 CRC64 data/validation.courtlisteneropinions.0.jsonl.xz 1 1 953.7 MiB 4,406.9 MiB 0.216 CRC64 data/validation.courtlisteneropinions.1.jsonl.xz 1 1 953.8 MiB 4,436.7 MiB 0.215 CRC64 data/validation.courtlisteneropinions.2.jsonl.xz 1 1 953.7 MiB 4,476.9 MiB 0.213 CRC64 data/validation.courtlisteneropinions.3.jsonl.xz 1 1 953.7 MiB 4,618.0 MiB 0.207 CRC64 data/validation.courtlisteneropinions.4.jsonl.xz 1 1 238.5 MiB 1,147.4 MiB 0.208 CRC64 data/validation.courtlisteneropinions.5.jsonl.xz 100 100 1,778.7 KiB 7,371.5 KiB 0.241 CRC64 data/validation.doj_guidance.jsonl.xz 1 1 13.8 MiB 101.5 MiB 0.136 CRC64 data/validation.dol_ecab.jsonl.xz 1 1 4,132.1 KiB 20.8 MiB 0.194 CRC64 data/validation.echr.jsonl.xz 174 173 490.5 KiB 1,564.9 KiB 0.313 CRC64 data/validation.ed_policy_guidance.jsonl.xz 1 1 453.6 MiB 2,978.9 MiB 0.152 CRC64 data/validation.edgar.jsonl.xz 1 1 1,340.0 KiB 6,294.8 KiB 0.213 CRC64 data/validation.eoir.jsonl.xz 1 1 49.1 MiB 393.7 MiB 0.125 CRC64 data/validation.eurlex.jsonl.xz 1 1 17.0 MiB 79.0 MiB 0.215 CRC64 data/validation.euro_parl.jsonl.xz 1 1 103.7 KiB 547.9 KiB 0.189 CRC64 data/validation.examoutlines.jsonl.xz 1 1 7,419.0 KiB 45.7 MiB 0.158 CRC64 data/validation.federal_register.jsonl.xz 120 120 13.5 MiB 53.9 MiB 0.250 CRC64 data/validation.fmshrc.jsonl.xz 1 1 25.3 MiB 113.2 MiB 0.224 CRC64 data/validation.founding_docs.jsonl.xz 1 1 63.5 KiB 248.8 KiB 0.255 CRC64 data/validation.frcp.jsonl.xz 1 1 58.4 KiB 226.7 KiB 0.257 CRC64 data/validation.fre.jsonl.xz 1 1 117.4 KiB 419.1 KiB 0.280 CRC64 data/validation.ftc_advisory_opinions.jsonl.xz 722 721 4,900.2 KiB 15.1 MiB 0.318 CRC64 data/validation.hhs_alj.jsonl.xz 1 1 10.0 MiB 52.3 MiB 0.191 CRC64 data/validation.ijc.jsonl.xz 161 161 3,791.0 KiB 17.7 MiB 0.209 CRC64 data/validation.irs_legal_advice_memos.jsonl.xz 214 214 1,101.1 KiB 3,411.1 KiB 0.323 CRC64 data/validation.medicaid_policy_guidance.jsonl.xz 1 1 55.8 MiB 257.8 MiB 0.217 CRC64 data/validation.nlrb_decisions.jsonl.xz 1 1 80.0 MiB 603.7 MiB 0.132 CRC64 data/validation.oig.jsonl.xz 1 1 1,826.2 KiB 9,874.6 KiB 0.185 CRC64 data/validation.olc_memos.jsonl.xz 1 1 19.7 MiB 78.7 MiB 0.251 CRC64 data/validation.r_legaldvice.jsonl.xz 584 584 15.3 MiB 63.5 MiB 0.241 CRC64 data/validation.resource_contracts.jsonl.xz 1 1 86.4 MiB 422.5 MiB 0.204 CRC64 data/validation.scotus_docket_entries.jsonl.xz 1 1 23.1 MiB 109.0 MiB 0.212 CRC64 data/validation.scotus_oral.jsonl.xz 3,559 3,559 13.0 MiB 37.7 MiB 0.344 CRC64 data/validation.sec.jsonl.xz 1 1 371.8 MiB 2,678.4 MiB 0.139 CRC64 data/validation.state_code.jsonl.xz 1 1 24.8 MiB 177.4 MiB 0.140 CRC64 data/validation.taxrulings.jsonl.xz 1 1 92.7 KiB 381.6 KiB 0.243 CRC64 data/validation.tos.jsonl.xz 1 1 7,705.6 KiB 35.5 MiB 0.212 CRC64 data/validation.undebates.jsonl.xz 1 1 53.8 MiB 356.3 MiB 0.151 CRC64 data/validation.us_bills.jsonl.xz 1 1 15.2 MiB 117.5 MiB 0.129 CRC64 data/validation.uscode.jsonl.xz 1 1 885.5 MiB 11.2 GiB 0.077 CRC64 data/validation.uspto_oab.jsonl.xz ------------------------------------------------------------------------------- 22,839 22,833 41.0 GiB 291.5 GiB 0.141 CRC64 119 files
There is a train/validation split for each subset of the data. 75%/25%. Note, we do not use the validation set for any downstream tasks nor do we filter out any data from downstream tasks. Please filter as needed before training models or feel free to use a different dataset split.
We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives. As such, data sources are curated to inform: (1) legal analysis, knowledge, or understanding; (2) argument formation; (3) privacy filtering standards. Sources like codes and laws tend to inform (1). Transcripts and court filings tend to inform (2). Opinions tend to inform (1) and (3).
We do not normalize the data, but we provide dataset creation code and relevant urls in https://github.com/Breakend/PileOfLaw
Who are the source language producers?Varied (see sources above).
This dataset may contain personal and sensitive information. However, this has been previously filtered by the relevant government and federal agencies that weigh the harms of revealing this information against the benefits of transparency. If you encounter something particularly harmful, please file a takedown request with the upstream source and notify us in the communities tab. We will then remove the content. We cannot enable more restrictive licensing because upstream sources may restrict using a more restrictive license. However, we ask that all users of this data respect the upstream licenses and restrictions. Per the standards of CourtListener, we do not allow indexing of this data by search engines and we ask that others do not also. Please do not turn on anything that allows the data to be easily indexed.
We hope that this dataset will provide more mechanisms for doing data work. As we describe in the paper, the internal variation allows contextual privacy rules to be learned. If robust mechanisms for this are developed they can applied more broadly. This dataset can also potentially be used for legal language model pretraining. As discussed in ``On the Opportunities and Risks of Foundation Models'', legal language models can help improve access to justice in various ways. But they can also be used in potentially harmful ways. While such models are not ready for most production environments and are the subject of significant research, we ask that model creators using this data, particularly when creating generative models, consider the impacts of their model and make a good faith effort to weigh the benefits against the harms of their method. Our license and many of the sub-licenses also restrict commercial usage.
The data reflects the biases of governments and courts. As we discuss in our work, these can be significant, though more recent text will likely be less overtly toxic. Please see the above statement and embark on any model uses responsibly.
We mainly focus on U.S. and English-speaking legal sources, though we include some European and Canadian resources.
CreativeCommons Attribution-NonCommercial-ShareAlike 4.0 International. But individual sources may have other licenses. See paper for details. Some upstream data sources request that indexing be disabled. As such please do not re-host any data in a way that can be indexed by search engines.
We do not make any representation that the legal information provided here is accurate. It is meant for research purposes only. For the authoritative and updated source of information please refer directly to the governing body which provides the latest laws, rules, and regulations relevant to you.
Pile of Law follows the notice and takedown procedures in the Digital Millennium Copyright Act (DMCA), 17 U.S.C. Section 512.
If you believe content on Pile of Law violates your copyright, please immediately notify its operators by sending a message with the information described below. Please use the subject "Copyright" in your message. If Pile of Law's operators act in response to an infringement notice, they will make a good-faith attempt to contact the person who contributed the content using the most recent email address that person provided to Pile of Law.
Under the DMCA, you may be held liable for damages based on material misrepresentations in your infringement notice. You must also make a good-faith evaluation of whether the use of your content is a fair use, because fair uses are not infringing. See 17 U.S.C. Section 107 and Lenz v. Universal Music Corp., No. 13-16106 (9th Cir. Sep. 14, 2015). If you are not sure if the content you want to report infringes your copyright, you should first contact a lawyer.
The DMCA requires that all infringement notices must include all of the following:
Pile of Law will respond to all DMCA-compliant infringement notices, including, as required or appropriate, by removing the offending material or disabling all links to it.
All received infringement notices may be posted in full to the Lumen database (previously known as the Chilling Effects Clearinghouse).
All takedown requests with the above information should be posted to the Communities tab.
This removal notice has been modified from the (CourtListener DMCA takedown notice)[ https://www.courtlistener.com/terms/] .
For a citation to this work:
@misc{hendersonkrass2022pileoflaw, url = {https://arxiv.org/abs/2207.00220}, author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.}, title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset}, publisher = {arXiv}, year = {2022} }
Since this dataset also includes several other data sources with citations, please refer to our paper and cite the additional relevant work in addition to our own work.