Dataset: maastrichtlawtech/bsard
Tasks: Text Retrieval
Sub-tasks: document-retrieval
Languages: fr
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotations creators: expert-generated
Source datasets: original
ArXiv: arxiv:2108.11792
License: cc-by-nc-sa-4.0

The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0 is a native French corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.
The text in the dataset is in French, as spoken in Wallonia and the Brussels-Capital Region. The associated BCP-47 code is fr-BE.
A typical data point comprises a question, with additional category, subcategory, and extra_description fields that elaborate on it, and a list of article_ids from the corpus of statutory articles that are relevant to the question.
An example from the BSARD test set looks as follows:
{
  'id': '724',
  'question': 'La police peut-elle me fouiller pour chercher du cannabis ?',
  'category': 'Justice',
  'subcategory': 'Petite délinquance',
  'extra_description': 'Détenir, acheter et vendre du cannabis',
  'article_ids': '13348'
}
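In the example above, article_ids is stored as a string. A minimal sketch of turning that field into a list of integer IDs follows; the helper name parse_article_ids is hypothetical, and the comma separator for questions with several relevant articles is an assumption that the source does not confirm.

```python
from typing import Dict, List


def parse_article_ids(raw: str) -> List[int]:
    """Split a raw article_ids string into integer IDs.

    Assumption: multiple IDs are comma-separated (not confirmed by the card).
    """
    return [int(part) for part in raw.split(",") if part.strip()]


# The test-set example shown above, copied verbatim.
example: Dict[str, str] = {
    "id": "724",
    "question": "La police peut-elle me fouiller pour chercher du cannabis ?",
    "category": "Justice",
    "subcategory": "Petite délinquance",
    "extra_description": "Détenir, acheter et vendre du cannabis",
    "article_ids": "13348",
}

relevant_ids = parse_article_ids(example["article_ids"])
print(relevant_ids)  # [13348]
```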
In "questions_fr_train.csv" and "questions_fr_test.csv":
In "articles_fr.csv":
The dataset is split into a train set and a test set. The number of questions in each set is given below:
| | Train | Test |
|---|---|---|
| BSARD | 886 | 222 |
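As a quick check on the table above, the split sizes can be summed and the test share computed. This is plain arithmetic on the two counts from the table; the variable names are illustrative only.

```python
# Question counts taken from the table above.
split_sizes = {"train": 886, "test": 222}

total = sum(split_sizes.values())
test_share = split_sizes["test"] / total

print(total)                # 1108
print(f"{test_share:.1%}")  # 20.0%
```

The total of 1,108 matches the "about 1,100 legal questions" stated in the description, with roughly one question in five held out for testing.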
The dataset is intended for researchers who build and evaluate models that retrieve law articles relevant to an input legal question. It should not be regarded as a reliable source of legal information at this point in time, as both the questions and articles correspond to an outdated version of Belgian law from May 2021 (the time of dataset collection). For legal information, users are advised to consult official legal resources that are updated daily (e.g., the Belgian Official Gazette).
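For the evaluation use case described above, one common way to score a retrieval model is recall at a cutoff k: the fraction of a question's relevant articles that appear among the top k retrieved ones. The sketch below is an illustration under that assumption, not the dataset's official evaluation protocol; the function name and the ranked list are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[int], relevant_ids: Iterable[int], k: int) -> float:
    """Fraction of relevant article IDs found in the top-k ranked IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no relevant articles: define recall as 0 to avoid dividing by zero
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


# Hypothetical ranking returned by some retriever for question 724,
# whose only relevant article is 13348.
ranking = [5021, 13348, 877]
print(recall_at_k(ranking, [13348], k=1))  # 0.0
print(recall_at_k(ranking, [13348], k=2))  # 1.0
```

Averaging this value over all 222 test questions gives a single score for comparing retrievers.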
BSARD was created in four stages: (i) compiling a large corpus of Belgian law articles, (ii) gathering legal questions with references to relevant law articles, (iii) refining these questions, and (iv) matching the references to the corresponding articles from the corpus.
Who are the source language producers?
Speakers were not directly approached for inclusion in this dataset and thus could not be asked for demographic information. Questions were collected, anonymized, and reformulated by Droits Quotidiens. Therefore, no direct information about the speakers' age and gender distribution or socioeconomic status is available. However, it is expected that most, but not all, of the speakers are adults (18+ years), speak French as a native language, and live in Wallonia or the Brussels-Capital Region.
Each year, Droits Quotidiens, a Belgian organization whose mission is to clarify the law for laypeople, receives and collects around 4,000 emails from Belgian citizens asking for advice on a personal legal issue. In practice, their legal clarification process consists of four steps. First, they identify the most frequently asked questions on a common legal issue. Then, they define a new anonymized "model" question on that issue expressed in natural language terms, i.e., phrased as closely as possible to how a layperson would have asked it. Next, they search the Belgian law for articles that help answer the model question and reference them.
Who are the annotators?
A total of six Belgian jurists from Droits Quotidiens contributed to annotating the questions. All have a law degree from a Belgian university and years of experience in providing legal advice and clarifications of the law. They range in age from 30 to 60 years, include one man and five women, gave their ethnicity as white European, speak French as a native language, and represent the upper middle class based on income levels.
The questions represent informal, asynchronous, edited, written language that does not exceed 265 words. None of them contained hateful, aggressive, or inappropriate language as they were all reviewed and reworded by Droits Quotidiens to be neutral, anonymous, and comprehensive. The legal articles represent strong, formal, written language that can contain up to 39,570 words.
In addition to helping advance the state of the art in retrieving statutes relevant to a legal question, BSARD-based models could improve the efficiency of legal information retrieval in the context of legal research, thereby enabling researchers to devote themselves to more thoughtful parts of their research. Furthermore, BSARD can become a starting point for new open-source legal information search tools, so that the socially weaker parties to disputes can benefit from a free professional assistance service.
[More Information Needed]
First, the corpus of articles is limited to those collected from 32 Belgian codes, which does not cover the entire Belgian law, as thousands of articles from decrees, directives, and ordinances are missing. During the dataset construction, all references to these uncollected articles were ignored, which causes some questions to end up with only a fraction of their initial number of relevant articles. This information loss implies that the answer contained in the remaining relevant articles might be incomplete, although it is still appropriate.
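The effect of ignoring references to uncollected articles can be illustrated with a small sketch. Everything here is hypothetical: the function name, the article IDs, and the corpus contents are made up for illustration and do not come from the dataset.

```python
from typing import Iterable, List, Sequence, Tuple


def keep_collected(referenced_ids: Sequence[int], corpus_ids: Iterable[int]) -> Tuple[List[int], int]:
    """Keep only references that exist in the corpus; also return how many were dropped."""
    corpus = set(corpus_ids)
    kept = [article_id for article_id in referenced_ids if article_id in corpus]
    return kept, len(referenced_ids) - len(kept)


# Hypothetical corpus built from the collected codes.
corpus_ids = {101, 102, 103}
# Hypothetical references for one question; 900 and 901 stand for
# articles from decrees or ordinances that were never collected.
referenced = [101, 900, 103, 901]

kept, dropped = keep_collected(referenced, corpus_ids)
print(kept, dropped)  # [101, 103] 2
```

In this made-up case the question keeps only half of its original references, which is the kind of partial labeling the limitation above describes.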
Additionally, it is essential to note that not all legal questions can be answered with statutes alone. For instance, the question “Can I evict my tenants if they make too much noise?” might not have a detailed answer within the statutory law that quantifies a specific noise threshold at which eviction is allowed. Instead, the landlord should probably rely more on case law and find precedents similar to their current situation (e.g., the tenant makes two parties a week until 2 am). Hence, some questions are better suited than others to the statutory article retrieval task, and the domain of the less suitable ones remains to be determined.
The dataset was created by Antoine Louis during work done at the Law & Tech lab of Maastricht University, with the help of jurists from Droits Quotidiens.
BSARD is licensed under the CC BY-NC-SA 4.0 license.
@inproceedings{louis2022statutory,
  title = {A Statutory Article Retrieval Dataset in French},
  author = {Louis, Antoine and Spanakis, Gerasimos},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  month = may,
  year = {2022},
  address = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
  url = {},
  doi = {},
  pages = {To appear},
}
Thanks to @antoiloui for adding this dataset.