Dataset: dennlinger/wiki-paragraphs
Language: en
Multilinguality: monolingual
Size: 10M<n<100M
Language creators: crowdsourced
Annotation creators: machine-generated
Source datasets: original
Preprint: arxiv:2012.03619
License: cc-by-sa-3.0
The wiki-paragraphs dataset is constructed by automatically sampling two paragraphs from a Wikipedia article. If they come from the same section, they are considered a "semantic match"; otherwise they are considered "dissimilar". Dissimilar paragraphs can in theory also be sampled from other documents, but this did not show any improvement in the evaluation of the linked work. The alignment is in no way meant as an accurate depiction of similarity, but it allows large amounts of samples to be mined quickly.
The dataset can be used for "same-section classification", which is a binary classification task (either two sentences/paragraphs belong to the same section or not). This can be combined with document-level coherency measures, where we can check how many misclassifications appear within a single document. Please refer to our paper for more details.
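As a rough illustration of the document-level check mentioned above, the sketch below counts wrong same-section predictions per document. It assumes each pair carries a document identifier, which the dataset instances themselves do not include; `doc_ids`, `predictions` and the function name are hypothetical, and this is not necessarily the measure used in the paper.

```python
from collections import Counter


def misclassifications_per_document(doc_ids, labels, predictions):
    """Count wrong same-section predictions for each document.

    doc_ids is a hypothetical per-pair document identifier; it is not
    a field of the dataset and would have to be tracked separately.
    """
    errors = Counter()
    for doc_id, gold, pred in zip(doc_ids, labels, predictions):
        # Adding 0 for correct predictions keeps error-free documents in the result.
        errors[doc_id] += int(gold != pred)
    return dict(errors)


# Toy example with made-up labels and predictions.
doc_ids = ["doc_a", "doc_a", "doc_a", "doc_b", "doc_b"]
labels = [1, 0, 1, 1, 0]
predictions = [1, 1, 0, 1, 0]
errors = misclassifications_per_document(doc_ids, labels, predictions)
```

A document with many errors relative to its number of pairs would then be a candidate for closer inspection.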
The data was extracted from English Wikipedia and is therefore predominantly in English.
A single instance contains three attributes:
{
  "sentence1": "<Sentence from the first paragraph>",
  "sentence2": "<Sentence from the second paragraph>",
  "label": 0/1  # 1 indicates two belong to the same section
}
We provide train, validation and test splits, created as an 80/10/10 split of a randomly shuffled original data source. In total, we provide 25375583 training pairs, as well as 3163685 instances each for validation and test.
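A shuffled 80/10/10 split of this kind could look roughly like the following sketch. The function name, the fixed seed and the use of integer arithmetic for the split sizes are assumptions made for illustration; the original splitting code is not described here.

```python
import random


def split_80_10_10(pairs, seed=0):
    """Shuffle the pairs and cut them into train/validation/test (80/10/10)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test split
    return train, validation, test


# Toy example on 100 placeholder items.
train, validation, test = split_80_10_10(range(100))
```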
The original idea was applied to self-segmentation of Terms of Service documents. Because such documents are domain-specific, we wanted to provide a more generally applicable model trained on Wikipedia data. The dataset is meant as a cheap-to-acquire pre-training strategy for large-scale experimentation with semantic similarity for long texts (paragraph-level). Based on our experiments, it is not necessarily sufficient by itself to replace traditional hand-labeled semantic similarity datasets.
The data was collected based on the articles considered in the Wiki-727k dataset by Koshorek et al. The dump of their dataset can be found through the respective GitHub repository. Note that we did not use the pre-processed data, but only the information on which articles were considered; these articles were re-acquired from Wikipedia at a more recent state, because paragraph information was not retained by the original Wiki-727k authors. We did not verify the particular focus of the considered pages.
Who are the source language producers?
We do not have any further information on the contributors; these are volunteers contributing to en.wikipedia.org.
No manual annotation was added to the dataset. We automatically sampled two paragraphs from within the same article; if these belong to the same section, they were assigned a label indicating "similarity" (1), otherwise the label indicates that they do not belong to the same section (0). We sample three positive and three negative samples per section, per article.
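The labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' sampling code: it assumes an article is given as a list of sections, each a list of paragraph strings, and it samples independently per draw, so duplicate pairs are possible. The names `sample_pairs` and `per_section` are hypothetical.

```python
import random


def sample_pairs(sections, per_section=3, seed=0):
    """Sample labeled paragraph pairs from one article.

    sections: list of sections, each a list of paragraph strings (assumed layout).
    Label 1 means both paragraphs come from the same section, 0 otherwise.
    """
    rng = random.Random(seed)
    pairs = []
    for idx, section in enumerate(sections):
        # All paragraphs of the same article that lie outside the current section.
        others = [p for j, s in enumerate(sections) if j != idx for p in s]
        for _ in range(per_section):
            if len(section) >= 2:
                first, second = rng.sample(section, 2)
                pairs.append({"sentence1": first, "sentence2": second, "label": 1})
            if section and others:
                pairs.append({
                    "sentence1": rng.choice(section),
                    "sentence2": rng.choice(others),
                    "label": 0,
                })
    return pairs


# Toy article with two sections.
article = [
    ["History p1", "History p2", "History p3"],
    ["Geography p1", "Geography p2"],
]
pairs = sample_pairs(article)
```

With the toy article above, each of the two sections contributes three positive and three negative pairs.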
Who are the annotators?
No annotators were involved in the process.
We did not modify the original Wikipedia text in any way. Given that personal information, such as dates of birth (e.g., for a person of interest), may appear on Wikipedia, such information may also be contained in our dataset.
The purpose of the dataset is to serve as a pre-training addition for semantic similarity learning.
Systems building on this dataset should consider additional, manually annotated data before being used in production.
To our knowledge, there are some works indicating that men are several times more likely to have a Wikipedia page created about them (especially in historical contexts). Therefore, a slight bias towards over-representation of men might remain in this dataset.
As previously stated, the automatically extracted semantic similarity is not perfect; it should be treated as such.
The dataset was originally developed as a practical project by Lucienne-Sophie Marmé under the supervision of Dennis Aumiller. Contributions to the original sampling strategy were made by Satya Almasian and Michael Gertz.
Wikipedia data is available under the CC-BY-SA 3.0 license.
@inproceedings{DBLP:conf/icail/AumillerAL021,
  author    = {Dennis Aumiller and Satya Almasian and Sebastian Lackner and Michael Gertz},
  editor    = {Juliano Maranh{\~{a}}o and Adam Zachary Wyner},
  title     = {Structural text segmentation of legal documents},
  booktitle = {{ICAIL} '21: Eighteenth International Conference for Artificial Intelligence and Law, S{\~{a}}o Paulo Brazil, June 21 - 25, 2021},
  pages     = {2--11},
  publisher = {{ACM}},
  year      = {2021},
  url       = {https://doi.org/10.1145/3462757.3466085},
  doi       = {10.1145/3462757.3466085}
}