Dataset:

maastrichtlawtech/bsard

Tasks:

Text Retrieval

Sub-tasks:

document-retrieval

Languages:

Multilinguality:

monolingual

Size:

1K<n<10K

Language Creators:

found

Annotations Creators:

expert-generated

Source Datasets:

original

ArXiv:

arxiv:2108.11792

License:

cc-by-nc-sa-4.0

Dataset Card for BSARD

Dataset Summary

The Belgian Statutory Article Retrieval Dataset (BSARD) v1.0 is a French native corpus for studying statutory article retrieval. BSARD consists of more than 22,600 statutory articles from Belgian law and about 1,100 legal questions posed by Belgian citizens and labeled by experienced jurists with relevant articles from the corpus.

Supported Tasks and Leaderboards

document-retrieval: The dataset can be used to train a model for ad hoc legal information retrieval. An IR model is presented with a short user query written in natural language and asked to retrieve relevant legal information from a knowledge source (such as statutory articles). Model performance is measured by recall against the reference articles, that is, the share of relevant articles that appear among the retrieved results. A BERT-based model trained as a dense retriever over the statutory articles achieves a recall@100 of 74.8%.
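As an illustration of the metric, the sketch below computes recall@k for one question and averages it over a set of questions. It is a minimal example, not the evaluation script behind the reported score: the function names, the toy IDs, and the choice to average per question are all assumptions.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant article IDs found among the top-k retrieved IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(relevant & top_k) / len(relevant)


def mean_recall_at_k(rankings, gold, k):
    """Average recall@k over questions; both dicts are keyed by question ID."""
    scores = [recall_at_k(rankings[qid], gold[qid], k) for qid in gold]
    return sum(scores) / len(scores)


# Toy data (assumed): two questions with made-up rankings and article IDs.
gold = {"q1": [13348], "q2": [5, 7]}
rankings = {"q1": [42, 13348, 9], "q2": [7, 1, 2]}
print(mean_recall_at_k(rankings, gold, k=2))
```

With k=2, the first question has its single relevant article in the top two results and the second has one of its two, so the mean is the average of 1.0 and 0.5.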

Languages

The text in the dataset is in French, as spoken in Wallonia and the Brussels-Capital Region. The associated BCP-47 code is fr-BE.

Dataset Structure

Data Instances

A typical data point comprises a question, with additional category, subcategory, and extra_description fields that elaborate on it, and a list of article_ids from the corpus of statutory articles that are relevant to the question.

An example from the BSARD test set looks as follows:

{
 'id': '724',
 'question': 'La police peut-elle me fouiller pour chercher du cannabis ?',
 'category': 'Justice',
 'subcategory': 'Petite délinquance',
 'extra_description': 'Détenir, acheter et vendre du cannabis',
 'article_ids': '13348'
}

Data Fields

In "questions_fr_train.csv" and "questions_fr_test.csv":
- id : an int32 feature corresponding to a unique ID number for the question.
- question : a string feature corresponding to the question.
- category : a string feature corresponding to the general topic of the question.
- subcategory : a string feature corresponding to the sub-topic of the question.
- extra_description : a string feature corresponding to the extra categorization tags of the question.
- article_ids : a string feature of comma-separated article IDs relevant to the question.

In "articles_fr.csv":
- id : an int32 feature corresponding to a unique ID number for the article.
- article : a string feature corresponding to the full article.
- code : a string feature corresponding to the law code to which the article belongs.
- article_no : a string feature corresponding to the article number in the code.
- description : a string feature corresponding to the concatenated headings of the article.
- law_type : a string feature whose value is either "regional" or "national".
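Because article_ids is stored as a comma-separated string, it usually has to be split into integers before use. The sketch below shows one way to do that with the standard csv module. It reads from an in-memory stand-in rather than the real file, keeps only three of the columns, and its second row is invented for illustration; the helper name parse_article_ids is likewise an assumption.

```python
import csv
import io

# In-memory stand-in for questions_fr_test.csv (assumed subset of columns).
# The second row is made up to show a multi-ID value.
sample = io.StringIO(
    "id,question,article_ids\n"
    "724,La police peut-elle me fouiller pour chercher du cannabis ?,13348\n"
    '9001,Question fictive,"12,34,56"\n'
)


def parse_article_ids(raw):
    """Turn a comma-separated string such as '12,34,56' into [12, 34, 56]."""
    return [int(part) for part in raw.split(",") if part.strip()]


questions = [
    {
        "id": int(row["id"]),
        "question": row["question"],
        "article_ids": parse_article_ids(row["article_ids"]),
    }
    for row in csv.DictReader(sample)
]
print(questions[1]["article_ids"])
```

To run this against the actual files, the io.StringIO object would be replaced by an open file handle, assuming the CSVs use a comma delimiter and standard quoting.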

Data Splits

The dataset is split into a training set and a test set. The number of questions in each set is given below:

Dataset	Train	Test
BSARD	886	222

Dataset Creation

Curation Rationale

The dataset is intended for researchers building and evaluating models that retrieve law articles relevant to an input legal question. It should not be regarded as a reliable source of legal information, as both the questions and articles reflect Belgian law as of May 2021 (the time of dataset collection), which is now outdated. Users seeking legal information are advised to consult official legal resources that are updated daily (e.g., the Belgian Official Gazette).

Source Data

Initial Data Collection and Normalization

BSARD was created in four stages: (i) compiling a large corpus of Belgian law articles, (ii) gathering legal questions with references to relevant law articles, (iii) refining these questions, and (iv) matching the references to the corresponding articles from the corpus.
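Stage (iv) can be pictured as a lookup from a cited (code, article number) pair to a corpus ID. The sketch below is an illustration only and not the authors' pipeline: the reference format, the sample rows, and the function name match_references are hypothetical. Only the column names id, code, and article_no come from articles_fr.csv. Unmatched references are dropped, which mirrors how references to uncollected articles were ignored during construction.

```python
# Hypothetical corpus rows using the articles_fr.csv column names.
corpus = [
    {"id": 1, "code": "Code Civil", "article_no": "1382"},
    {"id": 2, "code": "Code Pénal", "article_no": "461"},
]

# Index the corpus by (code, article number) for constant-time lookup.
index = {(row["code"], row["article_no"]): row["id"] for row in corpus}


def match_references(references):
    """Map (code, article_no) pairs to corpus IDs, skipping unknown pairs."""
    return [index[ref] for ref in references if ref in index]


# The third reference points outside the corpus and is dropped.
refs = [("Code Civil", "1382"), ("Code Pénal", "461"), ("Décret fictif", "3")]
print(match_references(refs))
```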

Who are the source language producers?

Speakers were not directly approached for inclusion in this dataset and thus could not be asked for demographic information. Questions were collected, anonymized, and reformulated by Droits Quotidiens. Therefore, no direct information about the speakers’ age and gender distribution, or socioeconomic status is available. However, it is expected that most, but not all, of the speakers are adults (18+ years), speak French as a native language, and live in Wallonia or the Brussels-Capital Region.

Annotations

Annotation process

Each year, Droits Quotidiens, a Belgian organization whose mission is to clarify the law for laypeople, receives and collects around 4,000 emails from Belgian citizens asking for advice on a personal legal issue. In practice, their legal clarification process consists of four steps. First, they identify the most frequently asked questions on a common legal issue. Then, they define a new anonymized "model" question on that issue expressed in natural language terms, i.e., phrased as closely as possible to how a layperson would ask it. Next, they search the Belgian law for articles that help answer the model question and reference them. Finally, they answer the question in terms a layperson can understand, using the referenced articles.

Who are the annotators?

A total of six Belgian jurists from Droits Quotidiens contributed to annotating the questions. All have a law degree from a Belgian university and years of experience in providing legal advice and clarifications of the law. They range in age from 30 to 60 years, comprise one man and five women, gave their ethnicity as white European, speak French as a native language, and represent the upper middle class based on income levels.

Personal and Sensitive Information

The questions represent informal, asynchronous, edited, written language that does not exceed 265 words. None of them contained hateful, aggressive, or inappropriate language as they were all reviewed and reworded by Droits Quotidiens to be neutral, anonymous, and comprehensive. The legal articles represent strong, formal, written language that can contain up to 39,570 words.

Considerations for Using the Data

Social Impact of Dataset

In addition to helping advance the state of the art in retrieving statutes relevant to a legal question, BSARD-based models could improve the efficiency of the legal information retrieval process in the context of legal research, thereby enabling researchers to devote themselves to more thoughtful parts of their research. Furthermore, BSARD can become a starting point for new open-source legal information search tools, so that the socially weaker parties to disputes can benefit from free professional assistance.

Discussion of Biases

[More Information Needed]

Other Known Limitations

First, the corpus of articles is limited to those collected from 32 Belgian codes, which does not cover the entirety of Belgian law: thousands of articles from decrees, directives, and ordinances are missing. During dataset construction, all references to these uncollected articles were ignored, so some questions ended up with only a fraction of their initial number of relevant articles. This information loss means that the answer contained in the remaining relevant articles might be incomplete, although it is still appropriate.

Additionally, it is essential to note that not all legal questions can be answered with statutes alone. For instance, the question “Can I evict my tenants if they make too much noise?” might not have a detailed answer within the statutory law that quantifies a specific noise threshold at which eviction is allowed. Instead, the landlord should probably rely more on case law and find precedents similar to their current situation (e.g., the tenant makes two parties a week until 2 am). Hence, some questions are better suited than others to the statutory article retrieval task, and the domain of the less suitable ones remains to be determined.

Additional Information

Dataset Curators

The dataset was created by Antoine Louis during work done at the Law & Tech lab of Maastricht University, with the help of jurists from Droits Quotidiens.

Licensing Information

BSARD is licensed under the CC BY-NC-SA 4.0 license.

Citation Information

@inproceedings{louis2022statutory,
  title = {A Statutory Article Retrieval Dataset in French},
  author = {Louis, Antoine and Spanakis, Gerasimos},
  booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  month = may,
  year = {2022},
  address = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics},
  url = {},
  doi = {},
  pages = {To appear},
}

Contributions

Thanks to @antoiloui for adding this dataset.

Author:

maastrichtlawtech

Dataset size:

49.69 MB