Dataset Card for RAFT

Dataset Summary

The Real-world Annotated Few-shot Tasks (RAFT) dataset is an aggregation of English-language datasets found in the real world. Associated with each dataset is a binary or multiclass classification task, intended to improve our understanding of how language models perform on tasks that have concrete, real-world value. Only 50 labeled examples are provided in each dataset.

Supported Tasks and Leaderboards

  • text-classification : Each subtask in RAFT is a text classification task, and the provided train and test sets can be used to submit to the RAFT Leaderboard. To prevent overfitting and tuning on a held-out test set, the leaderboard is only evaluated once per week. Each task's macro-F1 score is calculated, and those scores are then averaged to produce the overall leaderboard score, as the sketch below illustrates.
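
A toy illustration of this scoring scheme, assuming nothing beyond scikit-learn; the task names and label arrays below are hypothetical stand-ins, not real submission data:

# Leaderboard scoring sketch: macro-F1 per task, then a plain average.
from sklearn.metrics import f1_score

task_results = {
    "ade_corpus_v2": ([1, 2, 2, 1], [1, 2, 1, 1]),  # (references, predictions)
    "banking_77": ([3, 7, 7, 2], [3, 7, 2, 2]),
}
per_task_f1 = {
    task: f1_score(refs, preds, average="macro")
    for task, (refs, preds) in task_results.items()
}
overall_score = sum(per_task_f1.values()) / len(per_task_f1)
print(per_task_f1, overall_score)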

Languages

RAFT is entirely in American English (en-US).

Dataset Structure

Data Instances

Dataset | First Example

Ade Corpus V2
  Sentence: No regional side effects were noted. | ID: 0 | Label: 2
Banking 77
  Query: Is it possible for me to change my PIN number? | ID: 0 | Label: 23
NeurIPS Impact Statement Risks
  Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation... | Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf... | Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p... | ID: 0 | Label: 1
One Stop English
  Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ... | ID: 0 | Label: 3
Overruling
  Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree.... | ID: 0 | Label: 2
Semiconductor Org Types
  Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver... | Organization name: North Carolina State Univ., Raleigh, NC, USA | ID: 0 | Label: 3
Systematic Review Inclusion
  Title: Prototyping and transforming facial textures for perception research... | Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro... | Authors: Tiddeman, B.; Burt, M.; Perrett, D. | Journal: IEEE Comput Graphics Appl | ID: 0 | Label: 2
TAI Safety Research
  Title: Malign generalization without internal search | Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex... | Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search... | Publication Year: 2020 | Item Type: blogPost | Author: Barnett, Matthew | Publication Title: AI Alignment Forum | ID: 0 | Label: 1
Terms Of Service
  Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen... | ID: 0 | Label: 2
Tweet Eval Hate
  Tweet: New to Twitter-- any men on here know what the process is to get #verified?... | ID: 0 | Label: 2
Twitter Complaints
  Tweet text: @HMRCcustomers No this is my first job | ID: 0 | Label: 2

Data Fields

The ID field is used for indexing data points. It will be used to match your submissions with the true test labels, so you must include it in your submission. All other columns contain textual data; some of it includes links and URLs.

All output fields are designated with the "Label" column header. A value of 0 in this column indicates that the entry is unlabeled, and should only appear in the unlabeled test set; the remaining values correspond to the task's classes. To recover their textual values for a given dataset:

import datasets

# Load the training split of the dataset.
dataset = datasets.load_dataset("ought/raft", "ade_corpus_v2", split="train")
# Get the object that holds information about the "Label" feature.
label_info = dataset.features["Label"]
# Use the int2str method to access the textual labels.
print([label_info.int2str(i) for i in (0, 1, 2)])
# ['Unlabeled', 'ADE-related', 'not ADE-related']
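
For submissions, predictions on the unlabeled test split must be paired with the corresponding ID values. Below is a minimal sketch of writing such a file; it assumes a CSV with ID and Label columns, and predict_label is a hypothetical stand-in for an actual model:

import csv
import datasets

# Load the unlabeled test split of one RAFT subtask.
test = datasets.load_dataset("ought/raft", "ade_corpus_v2", split="test")
label_info = test.features["Label"]

def predict_label(example):
    # Hypothetical stand-in for a real model; always predicts class 1.
    return 1

with open("ade_corpus_v2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Label"])
    for example in test:
        writer.writerow([example["ID"], label_info.int2str(predict_label(example))])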

Data Splits

There are two splits provided: train data and unlabeled test data.

The training examples were chosen at random. No attempt was made to ensure that classes were balanced or proportional in the training data; indeed, the Banking 77 task has 77 classes, so not all of them can appear among its 50 training examples.

Dataset | Train Size | Test Size
Ade Corpus V2 | 50 | 5000
Banking 77 | 50 | 5000
NeurIPS Impact Statement Risks | 50 | 150
One Stop English | 50 | 516
Overruling | 50 | 2350
Semiconductor Org Types | 50 | 449
Systematic Review Inclusion | 50 | 2243
TAI Safety Research | 50 | 1639
Terms Of Service | 50 | 5000
Tweet Eval Hate | 50 | 2966
Twitter Complaints | 50 | 3399
Total | 550 | 28712
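
The sizes above can be checked directly against the Hub. A small sketch, assuming only the datasets library:

import datasets

# Print train/test sizes for every RAFT subtask.
for config in datasets.get_dataset_config_names("ought/raft"):
    ds = datasets.load_dataset("ought/raft", config)
    print(config, len(ds["train"]), len(ds["test"]))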

Dataset Creation

Curation Rationale

Generally speaking, the rationale behind RAFT was to create a benchmark for evaluating NLP models that didn't consist of contrived or artificial data sources, and whose tasks weren't originally assembled for the purpose of testing NLP models. However, each individual dataset in RAFT was collected independently. The majority of datasets were collected second-hand from existing curated sources. The datasets that we curated ourselves are:

  • NeurIPS impact statement risks
  • Semiconductor org types
  • TAI Safety Research

Each of these three datasets was sourced from our existing collaborators at Ought. They had used our service, Elicit, to analyze their dataset in the past, and we contacted them to include their dataset and the associated classification task in the benchmark. For all datasets, more information is provided in our paper. For the ones we did not curate, we provide a link to the dataset; for the ones we did, we provide a datasheet that elaborates on many of the topics here in greater detail.

For the three datasets that we introduced:

  • NeurIPS impact statement risks The dataset was created to evaluate the then-new requirement for authors to include an "impact statement" in their 2020 NeurIPS papers. Had it been successful? What kinds of things did authors mention most? How long were impact statements on average? Etc.
  • Semiconductor org types The dataset was originally created to better understand which countries’ organisations have contributed most to semiconductor R&D over the past 25 years, using three main conferences. To estimate the share of academic and private-sector contributions, the organisations were classified as “university”, “research institute” or “company”.
  • TAI Safety Research The primary motivations for assembling this database were to: (1) Aid potential donors in assessing organizations focusing on TAI safety by collecting and analyzing their research output. (2) Assemble a comprehensive bibliographic database that can be used as a base for future projects, such as a living review of the field.

For the following sections, we will only describe the datasets we introduce. All other dataset details, and more details on the ones described here, can be found in our paper.

Source Data

Initial Data Collection and Normalization
  • NeurIPS impact statement risks The data was directly observable (raw text scraped) for the most part, although some data was taken from previous datasets (which themselves had taken it from raw text). The data was validated, but only in part, by human reviewers; the accompanying datasheet gives full details.
  • Semiconductor org types We used the IEEE API to obtain institutions that contributed papers to semiconductor conferences in the last 25 years. The dataset is a random sample of 500 of these institutions, each paired with a corresponding conference paper title. The three conferences were the International Solid-State Circuits Conference (ISSCC), the Symposia on VLSI Technology and Circuits (VLSI), and the International Electron Devices Meeting (IEDM).
  • TAI Safety Research We asked TAI safety organizations for what their employees had written, emailed some individual authors, and searched Google Scholar. See the LessWrong post for more details: https://www.lesswrong.com/posts/4DegbDJJiMX2b3EKm/tai-safety-bibliographic-database
Who are the source language producers?
  • NeurIPS impact statement risks Language generated by NeurIPS 2020 impact statement authors, generally the authors of submitted papers.
  • Semiconductor org types Language generated from the IEEE API: generally machine-formatted organisation names and academic paper titles.
  • TAI Safety Research Language generated by authors of TAI safety research publications.

Annotations

Annotation process
  • NeurIPS impact statement risks Annotations were entered directly into a Google Spreadsheet with instructions, labeled training examples, and unlabeled testing examples.
  • Semiconductor org types Annotations were entered directly into a Google Spreadsheet with instructions, labeled training examples, and unlabeled testing examples.
  • TAI Safety Research N/A
Who are the annotators?
  • NeurIPS impact statement risks Contractors paid by Ought performed the labeling of whether impact statements mention harmful applications. A majority vote was taken from 3 annotators.
  • Semiconductor org types Contractors paid by Ought performed the labeling of organization types. A majority vote was taken from 3 annotators (a sketch of this aggregation follows the list).
  • TAI Safety Research The dataset curators annotated the dataset by hand.
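
Where three annotators labeled each example, a simple majority determined the final label. A minimal sketch of that aggregation (the vote lists below are hypothetical):

from collections import Counter

# Each inner list holds the three annotators' votes for one example.
annotations = [
    ["company", "company", "university"],
    ["research institute", "research institute", "company"],
]
majority_labels = [Counter(votes).most_common(1)[0][0] for votes in annotations]
print(majority_labels)  # ['company', 'research institute']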

Personal and Sensitive Information

It is worth mentioning that the Tweet Eval Hate dataset, by necessity, contains highly offensive content.

  • NeurIPS impact statement risks The dataset contains authors' names. These were scraped from publicly available scientific papers submitted to NeurIPS 2020.
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Considerations for Using the Data

Social Impact of Dataset

  • NeurIPS impact statement risks N/A
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Discussion of Biases

  • NeurIPS impact statement risks N/A
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Other Known Limitations

  • NeurIPS impact statement risks This dataset has limitations that should be taken into consideration when using it. In particular, the method used to collect broader impact statements involved automated downloads, conversions, and scraping, and was not error-proof. Although care was taken to identify and correct as many errors as possible, not all texts have been reviewed by a human, so some of the broader impact statements in the dataset may be truncated or otherwise incorrectly extracted from their original articles.
  • Semiconductor org types N/A
  • TAI Safety Research Don't use it to create a dangerous AI that could bring the end of days.

Additional Information

Dataset Curators

The overall RAFT curators are Neel Alex, Eli Lifland, and Andreas Stuhlmüller.

  • NeurIPS impact statement risks Volunteers working with researchers affiliated with Oxford's Future of Humanity Institute (Carolyn Ashurst, now at The Alan Turing Institute) created the impact statements dataset.
  • Semiconductor org types The data science unit of Stiftung Neue Verantwortung (Berlin).
  • TAI Safety Research Angelica Deibel and Jess Riedel. We did not do it on behalf of any entity.

Licensing Information

RAFT aggregates many other datasets, each of which is provided under its own license. Generally, those licenses permit research and commercial use.

Dataset | License
Ade Corpus V2 | Unlicensed
Banking 77 | CC BY 4.0
NeurIPS Impact Statement Risks | MIT License / CC BY 4.0
One Stop English | CC BY-SA 4.0
Overruling | Unlicensed
Semiconductor Org Types | CC BY-NC 4.0
Systematic Review Inclusion | CC BY 4.0
TAI Safety Research | CC BY-SA 4.0
Terms Of Service | Unlicensed
Tweet Eval Hate | Unlicensed
Twitter Complaints | Unlicensed

Citation Information

[More Information Needed]

Contributions

Thanks to @neel-alex, @uvafan, and @lewtun for adding this dataset.