Dataset Card for RAFT

Dataset Summary

The Real-world Annotated Few-shot Tasks (RAFT) dataset is an aggregation of English-language datasets found in the real world. Associated with each dataset is a binary or multiclass classification task, intended to improve our understanding of how language models perform on tasks that have concrete, real-world value. Only 50 labeled examples are provided in each dataset.

Supported Tasks and Leaderboards

  • text-classification : Each subtask in RAFT is a text classification task, and the provided train and test sets can be used to submit to the RAFT Leaderboard. To prevent overfitting and tuning on a held-out test set, the leaderboard is only evaluated once per week. Each task's macro-F1 score is calculated, and those scores are then averaged to produce the overall leaderboard score, as the sketch below illustrates.
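
A toy illustration of this scoring scheme, assuming nothing beyond scikit-learn; the task names and label arrays below are hypothetical stand-ins, not real submission data:

# Leaderboard scoring sketch: macro-F1 per task, then a plain average.
from sklearn.metrics import f1_score

task_results = {
    "ade_corpus_v2": ([1, 2, 2, 1], [1, 2, 1, 1]),  # (references, predictions)
    "banking_77": ([3, 7, 7, 2], [3, 7, 2, 2]),
}
per_task_f1 = {
    task: f1_score(refs, preds, average="macro")
    for task, (refs, preds) in task_results.items()
}
overall_score = sum(per_task_f1.values()) / len(per_task_f1)
print(per_task_f1, overall_score)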

Languages

RAFT is entirely in American English (en-US).

Dataset Structure

Data Instances

Dataset | First Example

Ade Corpus V2
  Sentence: No regional side effects were noted. | ID: 0 | Label: 2
Banking 77
  Query: Is it possible for me to change my PIN number? | ID: 0 | Label: 23
NeurIPS Impact Statement Risks
  Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation... | Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf... | Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p... | ID: 0 | Label: 1
One Stop English
  Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ... | ID: 0 | Label: 3
Overruling
  Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree.... | ID: 0 | Label: 2
Semiconductor Org Types
  Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver... | Organization name: North Carolina State Univ., Raleigh, NC, USA | ID: 0 | Label: 3
Systematic Review Inclusion
  Title: Prototyping and transforming facial textures for perception research... | Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro... | Authors: Tiddeman, B.; Burt, M.; Perrett, D. | Journal: IEEE Comput Graphics Appl | ID: 0 | Label: 2
TAI Safety Research
  Title: Malign generalization without internal search | Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex... | Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search... | Publication Year: 2020 | Item Type: blogPost | Author: Barnett, Matthew | Publication Title: AI Alignment Forum | ID: 0 | Label: 1
Terms Of Service
  Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen... | ID: 0 | Label: 2
Tweet Eval Hate
  Tweet: New to Twitter-- any men on here know what the process is to get #verified?... | ID: 0 | Label: 2
Twitter Complaints
  Tweet text: @HMRCcustomers No this is my first job | ID: 0 | Label: 2

Data Fields

The ID field is used for indexing data points. It will be used to match your submissions with the true test labels, so you must include it in your submission. All other columns contain textual data; some of it includes links and URLs.

All output fields are designated with the "Label" column header. A value of 0 in this column indicates that the entry is unlabeled, and should only appear in the unlabeled test set; the remaining values correspond to the task's classes. To recover their textual values for a given dataset:

import datasets

# Load the training split of the dataset.
dataset = datasets.load_dataset("ought/raft", "ade_corpus_v2", split="train")
# Get the object that holds information about the "Label" feature.
label_info = dataset.features["Label"]
# Use the int2str method to access the textual labels.
print([label_info.int2str(i) for i in (0, 1, 2)])
# ['Unlabeled', 'ADE-related', 'not ADE-related']
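
For submissions, predictions on the unlabeled test split must be paired with the corresponding ID values. Below is a minimal sketch of writing such a file; it assumes a CSV with ID and Label columns, and predict_label is a hypothetical stand-in for an actual model:

import csv
import datasets

# Load the unlabeled test split of one RAFT subtask.
test = datasets.load_dataset("ought/raft", "ade_corpus_v2", split="test")
label_info = test.features["Label"]

def predict_label(example):
    # Hypothetical stand-in for a real model; always predicts class 1.
    return 1

with open("ade_corpus_v2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "Label"])
    for example in test:
        writer.writerow([example["ID"], label_info.int2str(predict_label(example))])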

Data Splits

There are two splits provided: train data and unlabeled test data.

The training examples were chosen at random. No attempt was made to ensure that classes were balanced or proportional in the training data; indeed, the Banking 77 task has 77 classes, so not all of them can appear among its 50 training examples.

Dataset | Train Size | Test Size
Ade Corpus V2 | 50 | 5000
Banking 77 | 50 | 5000
NeurIPS Impact Statement Risks | 50 | 150
One Stop English | 50 | 516
Overruling | 50 | 2350
Semiconductor Org Types | 50 | 449
Systematic Review Inclusion | 50 | 2243
TAI Safety Research | 50 | 1639
Terms Of Service | 50 | 5000
Tweet Eval Hate | 50 | 2966
Twitter Complaints | 50 | 3399
Total | 550 | 28712
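
The sizes above can be checked directly against the Hub. A small sketch, assuming only the datasets library:

import datasets

# Print train/test sizes for every RAFT subtask.
for config in datasets.get_dataset_config_names("ought/raft"):
    ds = datasets.load_dataset("ought/raft", config)
    print(config, len(ds["train"]), len(ds["test"]))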

Dataset Creation

Curation Rationale

Generally speaking, the rationale behind RAFT was to create a benchmark for evaluating NLP models that didn't consist of contrived or artificial data sources, and whose tasks weren't originally assembled for the purpose of testing NLP models. However, each individual dataset in RAFT was collected independently. The majority of datasets were collected second-hand from existing curated sources. The datasets that we curated ourselves are:

  • NeurIPS impact statement risks
  • Semiconductor org types
  • TAI Safety Research

Each of these three datasets was sourced from our existing collaborators at Ought. They had used our service, Elicit, to analyze their dataset in the past, and we contacted them to include their dataset and the associated classification task in the benchmark. For all datasets, more information is provided in our paper. For the ones we did not curate, we provide a link to the dataset; for the ones we did, we provide a datasheet that elaborates on many of the topics here in greater detail.

For the three datasets that we introduced:

  • NeurIPS impact statement risks The dataset was created to evaluate the then-new requirement for authors to include an "impact statement" in their 2020 NeurIPS papers. Had it been successful? What kinds of things did authors mention most? How long were impact statements on average? Etc.
  • Semiconductor org types The dataset was originally created to better understand which countries’ organisations have contributed most to semiconductor R&D over the past 25 years, using three main conferences. To estimate the share of academic and private-sector contributions, the organisations were classified as “university”, “research institute” or “company”.
  • TAI Safety Research The primary motivations for assembling this database were to: (1) Aid potential donors in assessing organizations focusing on TAI safety by collecting and analyzing their research output. (2) Assemble a comprehensive bibliographic database that can be used as a base for future projects, such as a living review of the field.

For the following sections, we will only describe the datasets we introduce. All other dataset details, and more details on the ones described here, can be found in our paper.

Source Data

Initial Data Collection and Normalization
  • NeurIPS impact statement risks The data was directly observable (raw text scraped) for the most part, although some data was taken from previous datasets (which themselves had taken it from raw text). The data was validated, but only in part, by human reviewers; the accompanying datasheet gives full details.
  • Semiconductor org types We used the IEEE API to obtain institutions that contributed papers to semiconductor conferences in the last 25 years. The dataset is a random sample of 500 of these institutions, each paired with a corresponding conference paper title. The three conferences were the International Solid-State Circuits Conference (ISSCC), the Symposia on VLSI Technology and Circuits (VLSI), and the International Electron Devices Meeting (IEDM).
  • TAI Safety Research We asked TAI safety organizations for what their employees had written, emailed some individual authors, and searched Google Scholar. See the LessWrong post for more details: https://www.lesswrong.com/posts/4DegbDJJiMX2b3EKm/tai-safety-bibliographic-database
Who are the source language producers?
  • NeurIPS impact statement risks Language generated by NeurIPS 2020 impact statement authors, generally the authors of submitted papers.
  • Semiconductor org types Language generated from the IEEE API: generally machine-formatted organisation names and academic paper titles.
  • TAI Safety Research Language generated by authors of TAI safety research publications.

Annotations

Annotation process
  • NeurIPS impact statement risks Annotations were entered directly into a Google Spreadsheet with instructions, labeled training examples, and unlabeled testing examples.
  • Semiconductor org types Annotations were entered directly into a Google Spreadsheet with instructions, labeled training examples, and unlabeled testing examples.
  • TAI Safety Research N/A
Who are the annotators?
  • NeurIPS impact statement risks Contractors paid by Ought performed the labeling of whether impact statements mention harmful applications. A majority vote was taken from 3 annotators.
  • Semiconductor org types Contractors paid by Ought performed the labeling of organization types. A majority vote was taken from 3 annotators (a sketch of this aggregation follows the list).
  • TAI Safety Research The dataset curators annotated the dataset by hand.
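
Where three annotators labeled each example, a simple majority determined the final label. A minimal sketch of that aggregation (the vote lists below are hypothetical):

from collections import Counter

# Each inner list holds the three annotators' votes for one example.
annotations = [
    ["company", "company", "university"],
    ["research institute", "research institute", "company"],
]
majority_labels = [Counter(votes).most_common(1)[0][0] for votes in annotations]
print(majority_labels)  # ['company', 'research institute']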

Personal and Sensitive Information

It is worth mentioning that the Tweet Eval Hate dataset, by necessity, contains highly offensive content.

  • NeurIPS impact statement risks The dataset contains authors' names. These were scraped from publicly available scientific papers submitted to NeurIPS 2020.
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Considerations for Using the Data

Social Impact of Dataset

  • NeurIPS impact statement risks N/A
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Discussion of Biases

  • NeurIPS impact statement risks N/A
  • Semiconductor org types N/A
  • TAI Safety Research N/A

Other Known Limitations

  • NeurIPS impact statement risks This dataset has limitations that should be taken into consideration when using it. In particular, the method used to collect broader impact statements involved automated downloads, conversions, and scraping, and was not error-proof. Although care was taken to identify and correct as many errors as possible, not all texts have been reviewed by a human, so some of the broader impact statements in the dataset may be truncated or otherwise incorrectly extracted from their original articles.
  • Semiconductor org types N/A
  • TAI Safety Research Don't use it to create a dangerous AI that could bring the end of days.

Additional Information

Dataset Curators

The overall RAFT curators are Neel Alex, Eli Lifland, and Andreas Stuhlmüller.

  • NeurIPS impact statement risks Volunteers working with researchers affiliated with Oxford's Future of Humanity Institute (Carolyn Ashurst, now at The Alan Turing Institute) created the impact statements dataset.
  • Semiconductor org types The data science unit of Stiftung Neue Verantwortung (Berlin).
  • TAI Safety Research Angelica Deibel and Jess Riedel. We did not do it on behalf of any entity.

Licensing Information

RAFT aggregates many other datasets, each of which is provided under its own license. Generally, those licenses permit research and commercial use.

Dataset | License
Ade Corpus V2 | Unlicensed
Banking 77 | CC BY 4.0
NeurIPS Impact Statement Risks | MIT License / CC BY 4.0
One Stop English | CC BY-SA 4.0
Overruling | Unlicensed
Semiconductor Org Types | CC BY-NC 4.0
Systematic Review Inclusion | CC BY 4.0
TAI Safety Research | CC BY-SA 4.0
Terms Of Service | Unlicensed
Tweet Eval Hate | Unlicensed
Twitter Complaints | Unlicensed

Citation Information

[More Information Needed]

Contributions

Thanks to @neel-alex, @uvafan, and @lewtun for adding this dataset.