Dataset:
ought/raft
The Real-world Annotated Few-shot Tasks (RAFT) dataset is an aggregation of English-language datasets found in the real world. Associated with each dataset is a binary or multiclass classification task, intended to improve our understanding of how language models perform on tasks that have concrete, real-world value. Only 50 labeled examples are provided in each dataset.
RAFT is entirely in American English (en-US).
Dataset | First Example |
---|---|
Ade Corpus V2 | Sentence: No regional side effects were noted.<br>ID: 0<br>Label: 2 |
Banking 77 | Query: Is it possible for me to change my PIN number?<br>ID: 0<br>Label: 23 |
NeurIPS Impact Statement Risks | Paper title: Auto-Panoptic: Cooperative Multi-Component Architecture Search for Panoptic Segmentation...<br>Paper link: https://proceedings.neurips.cc/paper/2020/file/ec1f764517b7ffb52057af6df18142b7-Paper.pdf...<br>Impact statement: This work makes the first attempt to search for all key components of panoptic pipeline and manages to accomplish this via the p...<br>ID: 0<br>Label: 1 |
One Stop English | Article: For 85 years, it was just a grey blob on classroom maps of the solar system. But, on 15 July, Pluto was seen in high resolution ...<br>ID: 0<br>Label: 3 |
Overruling | Sentence: in light of both our holding today and previous rulings in johnson, dueser, and gronroos, we now explicitly overrule dupree....<br>ID: 0<br>Label: 2 |
Semiconductor Org Types | Paper title: 3Gb/s AC-coupled chip-to-chip communication using a low-swing pulse receiver...<br>Organization name: North Carolina State Univ.,Raleigh,NC,USA<br>ID: 0<br>Label: 3 |
Systematic Review Inclusion | Title: Prototyping and transforming facial textures for perception research...<br>Abstract: Wavelet based methods for prototyping facial textures for artificially transforming the age of facial images were described. Pro...<br>Authors: Tiddeman, B.; Burt, M.; Perrett, D.<br>Journal: IEEE Comput Graphics Appl<br>ID: 0<br>Label: 2 |
TAI Safety Research | Title: Malign generalization without internal search<br>Abstract Note: In my last post, I challenged the idea that inner alignment failures should be explained by appealing to agents which perform ex...<br>Url: https://www.alignmentforum.org/posts/ynt9TD6PrYw6iT49m/malign-generalization-without-internal-search...<br>Publication Year: 2020<br>Item Type: blogPost<br>Author: Barnett, Matthew<br>Publication Title: AI Alignment Forum<br>ID: 0<br>Label: 1 |
Terms Of Service | Sentence: Crowdtangle may change these terms of service, as described above, notwithstanding any provision to the contrary in any agreemen...<br>ID: 0<br>Label: 2 |
Tweet Eval Hate | Tweet: New to Twitter-- any men on here know what the process is to get #verified?...<br>ID: 0<br>Label: 2 |
Twitter Complaints | Tweet text: @HMRCcustomers No this is my first job<br>ID: 0<br>Label: 2 |
The ID field is used for indexing data points. It will be used to match your submissions with the true test labels, so you must include it in your submission. All other columns contain textual data. Some contain links and URLs to websites on the internet.
All output fields are designated with the "Label" column header. A value of 0 in this column indicates that the entry is unlabeled, and should only appear in the unlabeled test set. Every other value is a task-specific class label. To get the textual values for a given dataset:

    import datasets

    # Load the dataset
    dataset = datasets.load_dataset("ought/raft", "ade_corpus_v2")

    # First, get the object that holds information about the "Label" feature.
    # Loading without a split returns a DatasetDict, so read the features from one split.
    label_info = dataset["train"].features["Label"]

    # Use the int2str method to access the textual labels.
    print([label_info.int2str(i) for i in (0, 1, 2)])
    # ['Unlabeled', 'ADE-related', 'not ADE-related']

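Because the ID field is what ties a prediction back to its true test label, any prediction file needs to carry it through unchanged. The sketch below shows one way to do that. It is not the official submission format: the two-column CSV layout, the `write_predictions` helper, the `always_negative` classifier, and the toy rows are all assumptions made for illustration.

```python
import csv
import io


def write_predictions(rows, predict, out):
    """Write one (ID, Label) row per test example.

    rows    -- iterable of dicts that each carry an "ID" key, as RAFT rows do
    predict -- hypothetical callable mapping a row to a textual label
    out     -- writable text file object
    """
    writer = csv.writer(out)
    writer.writerow(["ID", "Label"])  # assumed header, not confirmed by the card
    for row in rows:
        # Copy the ID through untouched so predictions can be matched later.
        writer.writerow([row["ID"], predict(row)])


# Made-up stand-ins for unlabeled test rows (Label 0 means "Unlabeled").
test_rows = [
    {"ID": 0, "Sentence": "A made-up first sentence.", "Label": 0},
    {"ID": 1, "Sentence": "A made-up second sentence.", "Label": 0},
]


def always_negative(row):
    """Placeholder classifier that ignores its input."""
    return "not ADE-related"


buffer = io.StringIO()
write_predictions(test_rows, always_negative, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example free of file-system side effects; passing an open file instead would produce the same rows on disk.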
There are two splits provided: train data and unlabeled test data.
The training examples were chosen at random. No attempt was made to ensure that classes were balanced or proportional in the training data. Indeed, the Banking 77 task has 77 different classes, so its 50 training examples cannot cover all of them.
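The coverage point for Banking 77 is a counting argument: 50 examples can show at most 50 distinct classes, so at least 27 of the 77 must be absent from the training split. The sketch below checks this on randomly generated labels rather than the real data. The `class_coverage` helper and the sampled labels are hypothetical.

```python
from collections import Counter
import random


def class_coverage(labels, num_classes):
    """Return (classes seen, classes missing) for a list of integer labels."""
    seen = len(Counter(labels))
    return seen, num_classes - seen


NUM_CLASSES = 77  # Banking 77
TRAIN_SIZE = 50   # labeled examples per RAFT task

random.seed(0)
# Hypothetical labels drawn uniformly from 1..77 (0 is reserved for "Unlabeled").
sample_labels = [random.randint(1, NUM_CLASSES) for _ in range(TRAIN_SIZE)]

seen, missing = class_coverage(sample_labels, NUM_CLASSES)
print(f"classes seen: {seen}, classes missing: {missing}")

# However the labels are drawn, 50 examples leave at least 77 - 50 = 27 classes unseen.
assert missing >= NUM_CLASSES - TRAIN_SIZE
```

Running the same helper on the `Label` column of a real training split would show which classes a few-shot method never sees an example of.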
Dataset | Train Size | Test Size |
---|---|---|
Ade Corpus V2 | 50 | 5000 |
Banking 77 | 50 | 5000 |
NeurIPS Impact Statement Risks | 50 | 150 |
One Stop English | 50 | 516 |
Overruling | 50 | 2350 |
Semiconductor Org Types | 50 | 449 |
Systematic Review Inclusion | 50 | 2243 |
TAI Safety Research | 50 | 1639 |
Terms Of Service | 50 | 5000 |
Tweet Eval Hate | 50 | 2966 |
Twitter Complaints | 50 | 3399 |
Total | 550 | 28712 |
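
As a quick arithmetic check on the table above, the totals can be recomputed from the per-dataset sizes. The dictionary below simply transcribes the "Test Size" column; the variable names are illustrative.

```python
# Test-set sizes transcribed from the table above.
test_sizes = {
    "Ade Corpus V2": 5000,
    "Banking 77": 5000,
    "NeurIPS Impact Statement Risks": 150,
    "One Stop English": 516,
    "Overruling": 2350,
    "Semiconductor Org Types": 449,
    "Systematic Review Inclusion": 2243,
    "TAI Safety Research": 1639,
    "Terms Of Service": 5000,
    "Tweet Eval Hate": 2966,
    "Twitter Complaints": 3399,
}

TRAIN_SIZE = 50  # every task ships exactly 50 labeled examples

train_total = TRAIN_SIZE * len(test_sizes)
test_total = sum(test_sizes.values())

print(train_total)  # 550
print(test_total)   # 28712
```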
Generally speaking, the rationale behind RAFT was to create a benchmark for evaluating NLP models that does not rely on contrived or artificial data sources, and whose tasks were not originally assembled for the purpose of testing NLP models. However, each individual dataset in RAFT was collected independently. For the majority of datasets, we only collected them second-hand from existing curated sources. The datasets that we curated are:
Each of these three datasets was sourced from our existing collaborators at Ought. They had used our service, Elicit, to analyze their dataset in the past, and we contacted them to include their dataset and the associated classification task in the benchmark. For all datasets, more information is provided in our paper. For the ones which we did not curate, we provide a link to the dataset. For the ones which we did, we provide a datasheet that elaborates on many of the topics here in greater detail.
For the three datasets that we introduced:
For the following sections, we will only describe the datasets we introduce. All other dataset details, and more details on the ones described here, can be found in our paper.
It is worth mentioning that the Tweet Eval Hate dataset, by necessity, contains highly offensive content.
The overall RAFT curators are Neel Alex, Eli Lifland, and Andreas Stuhlmüller.
RAFT aggregates many other datasets, each of which is provided under its own license. Generally, those licenses permit research and commercial use.
Dataset | License |
---|---|
Ade Corpus V2 | Unlicensed |
Banking 77 | CC BY 4.0 |
NeurIPS Impact Statement Risks | MIT License/CC BY 4.0 |
One Stop English | CC BY-SA 4.0 |
Overruling | Unlicensed |
Semiconductor Org Types | CC BY-NC 4.0 |
Systematic Review Inclusion | CC BY 4.0 |
TAI Safety Research | CC BY-SA 4.0 |
Terms Of Service | Unlicensed |
Tweet Eval Hate | Unlicensed |
Twitter Complaints | Unlicensed |
Thanks to @neel-alex, @uvafan, and @lewtun for adding this dataset.