Dataset:
IlyaGusev/headline_cause
Tasks:
Text Classification
Multilinguality:
multilingual
Size:
10K<n<100K
Language creators:
found
Annotation creators:
crowdsourced
Source datasets:
original
ArXiv:
arxiv:2108.12626
Other:
causal-reasoning
License:
cc0-1.0

A dataset for detecting implicit causal relations between pairs of news headlines. The dataset includes over 5000 headline pairs from English news and over 9000 headline pairs from Russian news, labeled through crowdsourcing. The pairs range from totally unrelated, or belonging to the same general topic, to pairs with causation and refutation relations.
Loading Russian Simple task:
from datasets import load_dataset
dataset = load_dataset("IlyaGusev/headline_cause", "ru_simple")
This dataset consists of two parts, Russian and English.
Every data instance contains a URL, a title, and a timestamp for each of the two headlines. The label is presented in three fields: the 'result' field is a textual label, the 'label' field is a numeric label, and the 'agreement' field shows the majority vote agreement between annotators. Additional information includes the instance ID and a 'has_link' flag indicating whether there is a link between the two articles.
{
    'left_url': 'https://www.kommersant.ru/doc/4347456',
    'right_url': 'https://tass.ru/kosmos/8488527',
    'left_title': 'NASA: информация об отказе сотрудничать с Россией по освоению Луны некорректна',
    'right_title': 'NASA назвало некорректными сообщения о нежелании США включать РФ в соглашение по Луне',
    'left_timestamp': datetime.datetime(2020, 5, 15, 19, 46, 20),
    'right_timestamp': datetime.datetime(2020, 5, 15, 19, 21, 36),
    'label': 0,
    'result': 'not_cause',
    'agreement': 1.0,
    'id': 'ru_tg_101',
    'has_link': True
}
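As a minimal sketch of how these fields might be used, the snippet below works on a plain dictionary shaped like the instance above. The helper names `is_confident` and `publication_gap`, and the 0.7 agreement threshold, are illustrative assumptions and are not part of the dataset or its tooling.

```python
from datetime import datetime, timedelta

# A record shaped like the example instance above (titles omitted for brevity).
sample = {
    "left_url": "https://www.kommersant.ru/doc/4347456",
    "right_url": "https://tass.ru/kosmos/8488527",
    "left_timestamp": datetime(2020, 5, 15, 19, 46, 20),
    "right_timestamp": datetime(2020, 5, 15, 19, 21, 36),
    "label": 0,
    "result": "not_cause",
    "agreement": 1.0,
    "id": "ru_tg_101",
    "has_link": True,
}


def is_confident(record: dict, min_agreement: float = 0.7) -> bool:
    """Return True if annotator agreement meets the (hypothetical) threshold."""
    return record["agreement"] >= min_agreement


def publication_gap(record: dict) -> timedelta:
    """Left timestamp minus right timestamp (negative if the left headline is earlier)."""
    return record["left_timestamp"] - record["right_timestamp"]


print(is_confident(sample))
print(publication_gap(sample))
```

When working with the `datasets` library, a predicate such as `is_confident` could be passed to `Dataset.filter` to keep only pairs with high agreement.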
Dataset | Split | Number of Instances |
---|---|---|
ru_simple | train | 7,641 |
ru_simple | validation | 955 |
ru_simple | test | 957 |
en_simple | train | 4,332 |
en_simple | validation | 542 |
en_simple | test | 542 |
ru_full | train | 5,713 |
ru_full | validation | 715 |
ru_full | test | 715 |
en_full | train | 2,009 |
en_full | validation | 251 |
en_full | test | 252 |
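The split sizes can be cross-checked with a few lines of arithmetic. The sketch below copies the numbers from the table into a dictionary; the names `SPLIT_SIZES`, `total_instances`, and `train_fraction` are chosen here for illustration only.

```python
# Split sizes copied from the table above.
SPLIT_SIZES = {
    "ru_simple": {"train": 7641, "validation": 955, "test": 957},
    "en_simple": {"train": 4332, "validation": 542, "test": 542},
    "ru_full": {"train": 5713, "validation": 715, "test": 715},
    "en_full": {"train": 2009, "validation": 251, "test": 252},
}


def total_instances(config: str) -> int:
    """Sum of train, validation, and test sizes for one configuration."""
    return sum(SPLIT_SIZES[config].values())


def train_fraction(config: str) -> float:
    """Share of a configuration's instances that fall in the train split."""
    return SPLIT_SIZES[config]["train"] / total_instances(config)


for name in SPLIT_SIZES:
    print(f"{name}: {total_instances(name)} instances, "
          f"{train_fraction(name):.1%} in train")
```

The totals for `ru_simple` (9,553) and `en_simple` (5,416) agree with the "over 9000" and "over 5000" pair counts given in the description, and each configuration is split roughly 80/10/10.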
Who are the source language producers?
[More Information Needed]
Every candidate pair was annotated on Yandex Toloka, a crowdsourcing platform. The task was to determine the relationship between two headlines, A and B. There were seven possible options: the titles are almost the same, A causes B, B causes A, A refutes B, B refutes A, A is linked with B in another way, A is not linked to B. The annotation guideline was in Russian for Russian news and in English for English news.
Guidelines:
Ten workers annotated every pair. The total annotation budget was $870, with an estimated hourly wage of 45 cents paid to participants. Annotation management was semi-automatic. Scripts are available in the GitHub repository.
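To illustrate how ten votes per pair could yield a label and an agreement score, here is a minimal sketch of plain majority voting. It is an assumption about the general technique, not the authors' aggregation code (which lives in their scripts); the function name `majority_vote` and the vote values are made up for the example.

```python
from collections import Counter


def majority_vote(votes: list[str]) -> tuple[str, float]:
    """Return the most common label and the share of votes it received."""
    if not votes:
        raise ValueError("votes must not be empty")
    # On ties, Counter.most_common keeps the label that was seen first.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


# Ten hypothetical worker votes for one headline pair.
votes = ["not_cause"] * 8 + ["left_right"] * 2
label, agreement = majority_vote(votes)
print(label, agreement)
```

Under this scheme, an agreement of 1.0, as in the example instance, would mean that all workers chose the same option.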
Who are the annotators?
The annotators were Yandex Toloka workers: 457 for the Russian part and 180 for the English part.
The dataset is not anonymized, so individuals' names can be found in the dataset. Information about the original author is not included in the dataset. No information about annotators is included except a platform worker ID.
We do not see any direct malicious applications of our work. The data probably do not contain offensive content, as news agencies usually do not produce it, and a keyword search returned nothing. However, there are news documents in the dataset on several sensitive topics.
The data was collected by Ilya Gusev.
@misc{gusev2021headlinecause,
    title={HeadlineCause: A Dataset of News Headlines for Detecting Causalities},
    author={Ilya Gusev and Alexey Tikhonov},
    year={2021},
    eprint={2108.12626},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}