Dataset:

allenai/scico

Tasks:

Token classification

Sub-tasks:

coreference-resolution

Languages:

English

Multilinguality:

monolingual

Annotations creators:

domain experts

Other:

cross-document-coreference-resolution structure-prediction

License:

apache-2.0

Dataset Card for SciCo

Dataset Summary

SciCo consists of clusters of mentions in context and a hierarchy over them. The corpus is drawn from computer science papers, and the concept mentions are methods and tasks from across CS. Scientific concepts pose significant challenges: they often take diverse forms (e.g., class-conditional image synthesis and categorical image generation) or are ambiguous (e.g., network architecture in AI vs. systems research). To build SciCo, we develop a new candidate generation approach built on three resources: a low-coverage KB (https://paperswithcode.com/), a noisy hypernym extractor, and curated candidates.

Supported Tasks and Leaderboards

More Information Needed

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

More Information Needed

Data Fields

flatten_tokens: a single list of all tokens in the topic
flatten_mentions: array of mentions, each represented as [start, end, cluster_id]
tokens: array of paragraphs
doc_ids: doc_id of each paragraph in tokens
metadata: metadata of each doc_id
sentences: sentence boundaries for each paragraph in tokens, as [start, end]
mentions: array of mentions, each represented as [paragraph_id, start, end, cluster_id]
relations: array of binary relations between cluster_ids, as [parent, child]
id: id of the topic
hard_10 and hard_20 (test set only): flags marking the 10% or 20% hardest topics, based on Levenshtein similarity
source: source of the topic: PapersWithCode (pwc), hypernym, or curated
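
As a rough illustration of how these fields fit together, the sketch below groups mentions into clusters, reads the parent/child hierarchy from relations, and converts paragraph-level mentions into flat offsets. The toy topic is invented for illustration and is not taken from SciCo. The helper names (mention_text, clusters, children_of, to_flat_mentions) are hypothetical, and the sketch assumes that end indices are inclusive and that flatten_tokens is the concatenation of the paragraphs in tokens, in order; check both assumptions against the actual data before relying on them.

```python
from collections import defaultdict

# Toy topic following the field layout described above.
# All values are invented for illustration, not real SciCo data.
topic = {
    "id": 0,
    "tokens": [
        ["We", "study", "image", "generation", "."],
        ["Our", "model", "performs", "class-conditional", "image", "synthesis", "."],
    ],
    # Each mention: [paragraph_id, start, end, cluster_id]
    "mentions": [
        [0, 2, 3, 10],
        [1, 3, 5, 11],
    ],
    # Each relation: [parent, child] over cluster ids
    "relations": [[10, 11]],
}


def mention_text(topic, mention, end_inclusive=True):
    """Return the surface string of one mention.

    Assumption: `end` is an inclusive token index.
    """
    paragraph_id, start, end, _cluster_id = mention
    stop = end + 1 if end_inclusive else end
    return " ".join(topic["tokens"][paragraph_id][start:stop])


def clusters(topic):
    """Group mention strings by cluster_id."""
    grouped = defaultdict(list)
    for mention in topic["mentions"]:
        grouped[mention[3]].append(mention_text(topic, mention))
    return dict(grouped)


def children_of(topic):
    """Map each parent cluster_id to the list of its child cluster_ids."""
    tree = defaultdict(list)
    for parent, child in topic["relations"]:
        tree[parent].append(child)
    return dict(tree)


def to_flat_mentions(topic):
    """Convert [paragraph_id, start, end, cluster_id] to [start, end, cluster_id].

    Assumption: flat offsets count tokens across paragraphs in order.
    """
    offsets, total = [], 0
    for paragraph in topic["tokens"]:
        offsets.append(total)
        total += len(paragraph)
    return [[offsets[p] + s, offsets[p] + e, c] for p, s, e, c in topic["mentions"]]


print(clusters(topic))
print(children_of(topic))
print(to_flat_mentions(topic))
```

In this toy topic, cluster 10 is the parent of cluster 11, mirroring the kind of hierarchy that relations encodes over coreference clusters.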

Data Splits

Statistic	Train	Validation	Test
Topics	221	100	200
Documents	9013	4120	8237
Mentions	10925	4874	10424
Clusters	4080	1867	3711
Relations	2514	1747	2379
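
For a quick sense of scale, the counts in the table above can be turned into per-split ratios and overall totals. This is a small sketch over the numbers copied from the table; the dictionary layout and the ratio helper are illustrative choices, not part of the dataset.

```python
# Counts copied from the Data Splits table above.
splits = {
    "train": {"topics": 221, "documents": 9013, "mentions": 10925,
              "clusters": 4080, "relations": 2514},
    "validation": {"topics": 100, "documents": 4120, "mentions": 4874,
                   "clusters": 1867, "relations": 1747},
    "test": {"topics": 200, "documents": 8237, "mentions": 10424,
             "clusters": 3711, "relations": 2379},
}


def ratio(split, numerator, denominator):
    """Average number of `numerator` items per `denominator` item in a split."""
    return splits[split][numerator] / splits[split][denominator]


# Totals across all three splits.
totals = {key: sum(counts[key] for counts in splits.values())
          for key in splits["train"]}

for name in splits:
    print(
        f"{name}: "
        f"{ratio(name, 'mentions', 'clusters'):.2f} mentions per cluster, "
        f"{ratio(name, 'clusters', 'topics'):.2f} clusters per topic"
    )
print(totals)
```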

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

Additional Information

Dataset Curators

This dataset was initially created by Arie Cattan, Sophie Johnson, Daniel Weld, Ido Dagan, Iz Beltagy, Doug Downey and Tom Hope, while Arie was an intern at the Allen Institute for Artificial Intelligence.

Licensing Information

This dataset is distributed under the Apache License 2.0.

Citation Information

@inproceedings{
    cattan2021scico,
    title={SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts},
    author={Arie Cattan and Sophie Johnson and Daniel S. Weld and Ido Dagan and Iz Beltagy and Doug Downey and Tom Hope},
    booktitle={3rd Conference on Automated Knowledge Base Construction},
    year={2021},
    url={https://openreview.net/forum?id=OFLbgUP04nC}
}

Contributions

Thanks to @ariecattan for adding this dataset.

Author:

allenai

Dataset size:

9.53 MB