Dataset:
allenai/scico
SciCo consists of clusters of mentions in context and a hierarchy over them. The corpus is drawn from computer science papers, and the concept mentions are methods and tasks from across CS. Scientific concepts pose significant challenges: they often take diverse forms (e.g., class-conditional image synthesis and categorical image generation) or are ambiguous (e.g., network architecture in AI vs. systems research). To build SciCo, we develop a new candidate generation approach built on three resources: a low-coverage KB (https://paperswithcode.com/), a noisy hypernym extractor, and curated candidates.
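To make the "clusters of mentions plus a hierarchy" structure concrete, here is a minimal sketch on a hand-written toy topic. The field names (`tokens`, `mentions`, `relations`) and the tuple layouts are assumptions chosen for illustration and may not match the released schema; inspect a real example (for instance after loading the data with `load_dataset("allenai/scico")` from the Hugging Face `datasets` library) before relying on them.

```python
from collections import defaultdict

# Hypothetical toy topic. The layout below is an assumption for illustration,
# not a confirmed description of the released schema.
topic = {
    # One token list per document (here, a single sentence each).
    "tokens": [
        ["We", "study", "class-conditional", "image", "synthesis", "."],
        ["Categorical", "image", "generation", "is", "hard", "."],
        ["Image", "generation", "models", "have", "improved", "."],
    ],
    # Each mention: (document index, start token, end token inclusive, cluster id).
    "mentions": [(0, 2, 4, 1), (1, 0, 2, 1), (2, 0, 1, 0)],
    # Each relation: (parent cluster id, child cluster id).
    "relations": [(0, 1)],
}


def group_mentions(topic):
    """Map each cluster id to the surface forms of its mentions."""
    clusters = defaultdict(list)
    for doc, start, end, cluster_id in topic["mentions"]:
        span = " ".join(topic["tokens"][doc][start:end + 1])
        clusters[cluster_id].append(span)
    return dict(clusters)


def children_of(topic):
    """Map each parent cluster id to the list of its child cluster ids."""
    children = defaultdict(list)
    for parent, child in topic["relations"]:
        children[parent].append(child)
    return dict(children)


clusters = group_mentions(topic)
hierarchy = children_of(topic)

for parent, kids in hierarchy.items():
    for kid in kids:
        print(f"{clusters[parent]} is a parent of {clusters[kid]}")
```

In this toy example the two paraphrases from the description above share one cluster, and the broader concept "Image generation" sits above that cluster in the hierarchy.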
The text in the dataset is in English.
| | Train | Validation | Test |
|---|---|---|---|
| Topic | 221 | 100 | 200 |
| Documents | 9013 | 4120 | 8237 |
| Mentions | 10925 | 4874 | 10424 |
| Clusters | 4080 | 1867 | 3711 |
| Relations | 2514 | 1747 | 2379 |
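As a rough sanity check on the split sizes, the counts in the table can be turned into per-split averages. The sketch below uses only the numbers from the table; the variable names are illustrative.

```python
# Counts copied from the split statistics table.
splits = {
    "train": {"topics": 221, "documents": 9013, "mentions": 10925, "clusters": 4080},
    "validation": {"topics": 100, "documents": 4120, "mentions": 4874, "clusters": 1867},
    "test": {"topics": 200, "documents": 8237, "mentions": 10424, "clusters": 3711},
}

# Average number of mentions per cluster and documents per topic, per split.
mentions_per_cluster = {
    name: counts["mentions"] / counts["clusters"] for name, counts in splits.items()
}
docs_per_topic = {
    name: counts["documents"] / counts["topics"] for name, counts in splits.items()
}

for name in splits:
    print(
        f"{name}: {mentions_per_cluster[name]:.2f} mentions/cluster, "
        f"{docs_per_topic[name]:.1f} documents/topic"
    )
```

The averages are similar across the three splits, at a little under three mentions per cluster and about 41 documents per topic.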
This dataset was initially created by Arie Cattan, Sophie Johnson, Daniel Weld, Ido Dagan, Iz Beltagy, Doug Downey and Tom Hope, while Arie was an intern at the Allen Institute for Artificial Intelligence.
This dataset is distributed under Apache License 2.0.
@inproceedings{cattan2021scico,
  title={SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts},
  author={Arie Cattan and Sophie Johnson and Daniel S. Weld and Ido Dagan and Iz Beltagy and Doug Downey and Tom Hope},
  booktitle={3rd Conference on Automated Knowledge Base Construction},
  year={2021},
  url={https://openreview.net/forum?id=OFLbgUP04nC}
}
Thanks to @ariecattan for adding this dataset.