Dataset: copenlu/spiced
The Scientific Paraphrase and Information ChangE Dataset (SPICED) is a dataset of paired scientific findings from scientific papers, news media, and Twitter. Pairs are of two types: <paper, news> and <paper, tweet>. Each pair is labeled for the degree of information similarity between the findings described by the two sentences, on a scale from 1 to 5; this is called the Information Matching Score (IMS). The data was curated from S2ORC and matched to news articles and Tweets using Altmetric. Instances were annotated by experts recruited via the Prolific platform, using the Potato annotation tool. Please use the following citation when using this dataset:
@inproceedings{modeling-information-change,
  title = {{Modeling Information Change in Science Communication with Semantically Matched Paraphrases}},
  author = {Wright, Dustin and Pei, Jiaxin and Jurgens, David and Augenstein, Isabelle},
  booktitle = {Proceedings of EMNLP},
  publisher = {Association for Computational Linguistics},
  year = {2022}
}
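For convenience, here is a minimal sketch of loading the dataset from the Hugging Face Hub. The split name used below is an assumption; inspect the printed dataset info for the actual splits and fields.

```python
# A minimal sketch of loading SPICED from the Hugging Face Hub.
# The "train" split and the field layout are assumptions; print the
# dataset object to confirm the actual schema before relying on it.
from datasets import load_dataset

ds = load_dataset("copenlu/spiced")
print(ds)                 # shows the available splits and column names
example = ds["train"][0]  # assumes a "train" split exists
print(example)
```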
The task is to predict the IMS between two scientific sentences: a scalar between 1 and 5. Preferred metrics are mean-squared error and Pearson correlation.
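As an illustration, here is a minimal evaluation sketch using scipy and scikit-learn; the `gold` and `pred` arrays are hypothetical stand-ins for gold IMS labels and model predictions.

```python
# A minimal sketch of the preferred evaluation metrics, assuming
# `gold` and `pred` are arrays of IMS values on the 1-5 scale.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error

gold = np.array([4.5, 2.0, 3.5, 1.0])  # hypothetical gold IMS labels
pred = np.array([4.0, 2.5, 3.0, 1.5])  # hypothetical model predictions

mse = mean_squared_error(gold, pred)
r, _ = pearsonr(gold, pred)
print(f"MSE: {mse:.3f}, Pearson r: {r:.3f}")
```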
Language: English
For the full details of how the dataset was created, please refer to our EMNLP 2022 paper.
Science communication is a complex process of translation from highly technical scientific language to common language that lay people can understand. At the same time, the general public relies on good science communication in order to inform critical decisions about their health and behavior. SPICED was curated in order to provide a training dataset and benchmark for machine learning models to measure changes in scientific information at different stages of the science communication pipeline.
Scientific text: S2ORC
News articles and Tweets: collected through Altmetric
Who are the source language producers?
Scientists, journalists, and Twitter users.
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
Models trained on SPICED can be used for large-scale analyses of science communication. They can match the same finding as it is discussed in different media, and reveal trends in how reporting differs at different stages of the science communication pipeline. We hope this will help build tools that improve science communication.
The dataset is restricted to computer science, medicine, biology, and psychology, which may bias the topics on which models perform well.
While some context is available, we do not release the full text of news articles and scientific papers, which may contain further context that would help with learning the task. We do, however, provide the paper DOIs and links to the original news articles in case the full text is desired.
Dataset curators: Dustin Wright, Jiaxin Pei, David Jurgens, and Isabelle Augenstein
License: MIT
Thanks to @dwright37 for adding this dataset.