Dataset: newsph_nli
Tasks: text-classification
Languages: tl
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: found
Annotations creators: machine-generated
Source datasets: original
Preprint: arxiv:2010.11574
License: unknown

First benchmark dataset for sentence entailment in the low-resource Filipino language. Constructed by exploiting the structure of news articles. Contains 600,000 premise-hypothesis pairs, in a 70-15-15 split for training, validation, and testing.
The dataset contains news articles in Filipino (Tagalog) scraped from all major Philippine news sites online.
Sample data: { "premise": "Alam ba ninyo ang ginawa ni Erap na noon ay lasing na lasing na rin?", "hypothesis": "Ininom niya ang alak na pinagpulbusan!", "label": "0" }
The dataset contains 600,000 premise-hypothesis pairs in a 70-15-15 split for training, validation, and testing.
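As a rough illustration of what a 70-15-15 split means in practice, the sketch below partitions a list of pairs into three parts. The function name `split_70_15_15` and the seeded shuffle are assumptions made for illustration; the card does not describe how the authors performed the split.

```python
import random


def split_70_15_15(pairs, seed=0):
    """Shuffle the pairs and cut them into train/validation/test parts."""
    shuffled = list(pairs)
    # Seeded shuffle so the partition is reproducible (an assumption,
    # not something the dataset card specifies).
    random.Random(seed).shuffle(shuffled)

    n = len(shuffled)
    n_train = n * 70 // 100
    n_valid = n * 15 // 100

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # The test part takes whatever remains after train and validation.
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Applied to 600,000 pairs, these proportions give 420,000 training, 90,000 validation, and 90,000 test pairs.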
We propose the use of news articles for automatically creating benchmark datasets for natural language inference (NLI) for two reasons. First, news articles commonly use single-sentence paragraphing, meaning every paragraph in a news article is limited to a single sentence. Second, straight news articles follow the “inverted pyramid” structure, where every succeeding paragraph builds upon the premise of those that came before it, with the most important information on top and the least important towards the end.
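The two properties above suggest a simple way to generate pairs. The following is a minimal sketch, not the authors' implementation. It assumes each article is already a list of single-sentence paragraphs and treats each paragraph and its successor as a positive premise-hypothesis pair. As a further assumption, it creates negatives by pairing a premise with a paragraph drawn from a different article. The function names and the integer label values 0 and 1 are hypothetical.

```python
import random


def consecutive_pairs(paragraphs):
    """Pair each paragraph with the one that directly follows it."""
    return list(zip(paragraphs, paragraphs[1:]))


def build_pairs(articles, seed=0):
    """Build labelled premise-hypothesis examples from a list of articles.

    Each article is a list of single-sentence paragraphs.
    Label 0: hypothesis is the next paragraph of the same article.
    Label 1: hypothesis is sampled from a different article (assumed
             negative-sampling scheme, for illustration only).
    """
    rng = random.Random(seed)
    examples = []
    for idx, paragraphs in enumerate(articles):
        # Indices of the other non-empty articles, used for negatives.
        others = [j for j, a in enumerate(articles) if j != idx and a]
        for premise, hypothesis in consecutive_pairs(paragraphs):
            examples.append(
                {"premise": premise, "hypothesis": hypothesis, "label": 0}
            )
            if others:
                distractor = rng.choice(articles[rng.choice(others)])
                examples.append(
                    {"premise": premise, "hypothesis": distractor, "label": 1}
                )
    return examples
```

An article with fewer than two paragraphs contributes no pairs, because it has no consecutive paragraphs to pair.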
To create the dataset, we scrape news articles from all major Philippine news sites online. We collect a total of 229,571 straight news articles, which we then lightly preprocess to remove extraneous unicode characters and correct minor misspellings. No further preprocessing is done, in order to preserve information in the data.
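The card does not say which characters count as extraneous, so the following is only a sketch of what such a light cleaning pass might look like. It assumes that invisible control and format characters (for example zero-width spaces and byte-order marks) are the ones to drop, and it does not attempt the misspelling correction. The function name `light_clean` is hypothetical.

```python
import unicodedata


def light_clean(text):
    """Normalize text and drop invisible control/format characters."""
    # NFKC folds compatibility forms, e.g. a no-break space becomes a
    # plain space, while keeping letters such as "ñ" intact.
    text = unicodedata.normalize("NFKC", text)
    # Keep whitespace and every character outside the Unicode "C"
    # (control, format, unassigned, private-use, surrogate) categories.
    kept = [
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    ]
    # Collapse runs of whitespace into single spaces and trim the ends.
    return " ".join("".join(kept).split())
```

Everything else, including punctuation and casing, is left untouched, in keeping with the stated goal of preserving information in the data.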
Who are the source language producers?
The dataset was created by Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John Velasco, and Charibeth Cheng from De La Salle University and the University of the Philippines.
Who are the annotators?
Jan Christian Blaise Cruz, Jose Kristian Resabal, James Lin, Dan John Velasco, and Charibeth Cheng
[Jan Christian Blaise Cruz](mailto:jan_christian_cruz@dlsu.edu.ph)
@article{cruz2020investigating,
  title={Investigating the True Performance of Transformers in Low-Resource Languages: A Case Study in Automatic Corpus Creation},
  author={Jan Christian Blaise Cruz and Jose Kristian Resabal and James Lin and Dan John Velasco and Charibeth Cheng},
  journal={arXiv preprint arXiv:2010.11574},
  year={2020}
}
Thanks to @anaerobeth for adding this dataset.