Dataset Card for SciCo

Dataset Summary

SciCo consists of clusters of mentions in context and a hierarchy over those clusters. The corpus is drawn from computer science papers, and the concept mentions are methods and tasks from across CS. Scientific concepts pose significant challenges: they often take diverse forms (e.g., class-conditional image synthesis and categorical image generation) or are ambiguous (e.g., network architecture in AI vs. systems research). To build SciCo, we developed a new candidate generation approach built on three resources: a low-coverage KB (https://paperswithcode.com/), a noisy hypernym extractor, and curated candidates.

Supported Tasks and Leaderboards

More Information Needed

Languages

The text in the dataset is in English.

Dataset Structure

Data Instances

More Information Needed

Data Fields

  • flatten_tokens: a single list of all tokens in the topic
  • flatten_mentions: array of mentions, each represented as [start, end, cluster_id]
  • tokens: array of paragraphs
  • doc_ids: doc_id of each paragraph in tokens
  • metadata: metadata of each doc_id
  • sentences: sentence boundaries for each paragraph in tokens, as [start, end]
  • mentions: array of mentions, each represented as [paragraph_id, start, end, cluster_id]
  • relations: array of binary relations between cluster_ids, as [parent, child]
  • id: id of the topic
  • hard_10 and hard_20 (only in the test set): flags marking the 10% or 20% hardest topics, based on Levenshtein similarity
  • source: source of this topic, one of PapersWithCode (pwc), hypernym, or curated
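
The fields above can be combined to recover mention strings, group them into clusters, and read the hierarchy. The following is a minimal sketch on a hypothetical toy topic: the token and index values are invented for illustration, the helper names (`mention_text`, `group_by_cluster`, `children_by_parent`) are not part of any SciCo tooling, and it assumes that the `end` index of a mention is inclusive.

```python
from collections import defaultdict

# Hypothetical toy topic that mimics the documented fields.
# The values are invented for illustration and are not taken from SciCo.
topic = {
    "id": 0,
    "tokens": [
        ["we", "study", "class-conditional", "image", "synthesis", "."],
        ["categorical", "image", "generation", "is", "hard", "."],
        ["generative", "models", "are", "popular", "."],
    ],
    # Each mention is [paragraph_id, start, end, cluster_id].
    "mentions": [
        [0, 2, 4, 10],
        [1, 0, 2, 10],
        [2, 0, 1, 20],
    ],
    # Each relation is [parent, child] over cluster ids.
    "relations": [[20, 10]],
}


def mention_text(topic, mention):
    """Return the surface string of one mention.

    ASSUMPTION: `end` is inclusive, hence the `end + 1` in the slice.
    """
    paragraph_id, start, end, _cluster_id = mention
    return " ".join(topic["tokens"][paragraph_id][start:end + 1])


def group_by_cluster(topic):
    """Map each cluster_id to the list of its mention strings."""
    clusters = defaultdict(list)
    for mention in topic["mentions"]:
        clusters[mention[3]].append(mention_text(topic, mention))
    return dict(clusters)


def children_by_parent(topic):
    """Map each parent cluster_id to the list of its child cluster_ids."""
    children = defaultdict(list)
    for parent, child in topic["relations"]:
        children[parent].append(child)
    return dict(children)


clusters = group_by_cluster(topic)
hierarchy = children_by_parent(topic)

# Print every parent/child pair with the mention strings of both clusters.
for parent, kids in hierarchy.items():
    for child in kids:
        print(clusters[parent], "->", clusters[child])
```

The same slicing logic should apply to flatten_mentions and flatten_tokens, except that the indices refer to the single flattened token list and there is no paragraph_id.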

Data Splits

             Train   Validation    Test
Topics         221          100     200
Documents     9013         4120    8237
Mentions     10925         4874   10424
Clusters      4080         1867    3711
Relations     2514         1747    2379
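
To compare the splits, it can help to derive per-split ratios from the counts above, such as the average number of mentions per cluster. The snippet below is a small sketch that copies the counts from the table; the `ratio` helper is a hypothetical name introduced here for illustration.

```python
# Counts copied from the Data Splits table.
splits = {
    "train": {"topics": 221, "documents": 9013, "mentions": 10925,
              "clusters": 4080, "relations": 2514},
    "validation": {"topics": 100, "documents": 4120, "mentions": 4874,
                   "clusters": 1867, "relations": 1747},
    "test": {"topics": 200, "documents": 8237, "mentions": 10424,
             "clusters": 3711, "relations": 2379},
}


def ratio(split, numerator, denominator):
    """Return numerator / denominator for one split, rounded to 2 decimals."""
    counts = splits[split]
    return round(counts[numerator] / counts[denominator], 2)


for name in splits:
    print(
        name,
        "mentions per cluster:", ratio(name, "mentions", "clusters"),
        "clusters per topic:", ratio(name, "clusters", "topics"),
    )
```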

Dataset Creation

Curation Rationale

More Information Needed

Source Data

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

Additional Information

Dataset Curators

This dataset was initially created by Arie Cattan, Sophie Johnson, Daniel Weld, Ido Dagan, Iz Beltagy, Doug Downey and Tom Hope, while Arie was an intern at the Allen Institute for Artificial Intelligence.

Licensing Information

This dataset is distributed under the Apache License 2.0.

Citation Information

@inproceedings{
    cattan2021scico,
    title={SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts},
    author={Arie Cattan and Sophie Johnson and Daniel S. Weld and Ido Dagan and Iz Beltagy and Doug Downey and Tom Hope},
    booktitle={3rd Conference on Automated Knowledge Base Construction},
    year={2021},
    url={https://openreview.net/forum?id=OFLbgUP04nC}
}

Contributions

Thanks to @ariecattan for adding this dataset.