Dataset:
HuggingFaceH4/stack-exchange-preferences
This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. Importantly, the questions have been filtered to fit the following criterion for preference models (following closely from Askell et al. 2021): each question has at least two answers. This data could also be used for instruction fine-tuning and language model training.
Each question is grouped with its answers, and each answer is assigned a score following the Anthropic paper:
score = log2 (1 + upvotes) rounded to the nearest integer, plus 1 if the answer was accepted by the questioner (we assign a score of −1 if the number of upvotes is negative).
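As a minimal sketch of the scoring rule above: the function name `pm_score` and its parameters are illustrative, not part of the released tooling, and the handling of an accepted answer with negative votes (returning −1 without the acceptance bonus) is an assumption, since the text does not spell out that case.

```python
import math


def pm_score(upvotes: int, accepted: bool) -> int:
    """Sketch of the preference-model score described in the text."""
    if upvotes < 0:
        # Assumption: a negative vote count yields -1 even if the answer was accepted.
        return -1
    # log2(1 + upvotes), rounded to the nearest integer
    score = round(math.log2(1 + upvotes))
    # +1 bonus if the questioner accepted the answer
    return score + 1 if accepted else score


if __name__ == "__main__":
    for upvotes, accepted in [(0, False), (3, True), (10, False), (-2, True)]:
        print(upvotes, accepted, pm_score(upvotes, accepted))
```

Because the score grows logarithmically, large differences in raw upvotes collapse into small integer gaps, which is why many answers to the same question can end up with identical scores.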
Some important notes when using this dataset for preference model pretraining (PMP), which can be ignored for other uses:
Subsequently, we created a binary dataset by applying a ‘binarization’ procedure to the ranked dataset. That is, for every ranked pair A > B, we transform it into two independent binary comparisons:
GOOD:A > BAD:A
BAD:B > GOOD:B
To see all the Stack Exchange sites used in this data, please see this file.
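The binarization step can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name `binarize` is hypothetical, and representing the GOOD/BAD labels as plain string prefixes is an assumption made only to mirror the notation in the text.

```python
from typing import List, Tuple


def binarize(ranked_pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn each ranked pair (A, B), with A preferred over B, into two
    independent (preferred, rejected) binary comparisons."""
    comparisons = []
    for better, worse in ranked_pairs:
        # GOOD:A > BAD:A
        comparisons.append((f"GOOD:{better}", f"BAD:{better}"))
        # BAD:B > GOOD:B
        comparisons.append((f"BAD:{worse}", f"GOOD:{worse}"))
    return comparisons


if __name__ == "__main__":
    for preferred, rejected in binarize([("A", "B")]):
        print(f"{preferred} > {rejected}")
```

Each ranked pair therefore contributes two binary comparisons, so every sample in the pair is used twice.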
Unfortunately, sharing the binarized data directly without metadata violates the license, so we have shared a script for binarization.
Here is a script from our internal tooling used to create a binarized dataset:
```python
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import random
from argparse import ArgumentParser
from pathlib import Path

import numpy as np
from datasets import Dataset, concatenate_datasets, load_dataset

# Helper from the internal h4 tooling this script was taken from
from h4.data.utils import save_dataset_shards


H4_DIR = Path(__file__).resolve().parents[3]
DATA_DIR = H4_DIR / "data"

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--debug", action="store_true", help="Added print statements / limit data size for debugging")
    parser.add_argument(
        "--output_dir",
        default=f"{DATA_DIR}/pmp-binarized",
        type=str,
        help="Where to save the processed dataset",
    )
    parser.add_argument(
        "--exchange_name",
        type=str,
        default=None,
        help="Optional argument to specify a specific subsection of the dataset",
    )
    parser.add_argument(
        "--binary_score", type=int, default=8, help="Score assigned to binarized pairs for preference data."
    )
    parser.add_argument(
        "--stream_data", action="store_true", help="Optionally stream data, which can be useful with weaker computers"
    )
    parser.set_defaults(debug=False, stream_data=False)  # default will process full dataset

    args = parser.parse_args()
    specific_exchange = args.exchange_name
    stream_dataset = args.stream_data
    binary_score = args.binary_score

    # Restrict loading to one exchange's subdirectory if requested
    if specific_exchange:
        data_dir = "data/" + args.exchange_name
    else:
        data_dir = None

    if args.debug:
        data_len_limit = 10000
    else:
        data_len_limit = np.inf

    dataset = load_dataset(
        "HuggingFaceH4/pmp-stack-exchange",
        data_dir=data_dir,
        split="train",
        streaming=stream_dataset,
    )

    pmp_data = []
    for i, d in enumerate(iter(dataset)):
        # check debug limit, quit if in debug mode (don't save)
        if i > data_len_limit:
            print("Early exit for debug mode!")
            print(pmp_data)
            break

        question = d["question"]
        answers = d["answers"]
        num_answers = len(answers)

        answer_scores = [a["pm_score"] for a in answers]
        # A pair needs two distinct scores; skip questions where all answers tie
        if len(np.unique(answer_scores)) < 2:
            print(f"PM Scores are {answer_scores}, skipping this question {i}")
        else:
            # Sample 2 unique scores for binarization
            dif_scores = False
            while not dif_scores:
                # print("infinite loop...?")
                two_answers = random.sample(answers, 2)

                if two_answers[0]["pm_score"] != two_answers[1]["pm_score"]:
                    dif_scores = True

            answer_0 = two_answers[0]
            answer_1 = two_answers[1]
            text_0 = "Question: " + question + "\n" + "Answer: " + answer_0["text"]
            text_1 = "Question: " + question + "\n" + "Answer: " + answer_1["text"]
            # As written, both sampled answers receive the same fixed score
            # (--binary_score); their relative order by pm_score is not recorded.
            score_0 = binary_score
            score_1 = binary_score

            pmp_data.append({"context": text_0, "score": score_0})
            pmp_data.append({"context": text_1, "score": score_1})

    # Save binarized data
    sublist_len = 100000

    print(f"Dataset length is {len(pmp_data)}")
    # bypass known issue in arrow https://issues.apache.org/jira/browse/ARROW-17137
    print(f"Processed dataset length > {sublist_len}, processing to HF dataset in chunks")
    chunks = [pmp_data[x : x + sublist_len] for x in range(0, len(pmp_data), sublist_len)]
    ds_chunks = [Dataset.from_list(ch) for ch in chunks]
    ds = concatenate_datasets(ds_chunks)

    save_dataset_shards(ds, args.output_dir, subset="stackexchange", shard_size="100MB")
```
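The script above draws one random pair per question and retries until the two scores differ. As a hedged alternative sketch, the ranked pairs for a single record can also be enumerated deterministically. The record layout (`answers`, `pm_score`, `text`) is taken from the script; the function name `ranked_pairs` and the example record are hypothetical and not part of the dataset tooling.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ranked_pairs(record: Dict) -> List[Tuple[str, str]]:
    """Return (better_text, worse_text) for every pair of answers
    whose pm_score values differ; tied pairs are skipped."""
    pairs = []
    for a, b in combinations(record["answers"], 2):
        if a["pm_score"] == b["pm_score"]:
            continue  # no preference can be inferred from a tie
        better, worse = (a, b) if a["pm_score"] > b["pm_score"] else (b, a)
        pairs.append((better["text"], worse["text"]))
    return pairs


if __name__ == "__main__":
    # Hypothetical record, shaped like the entries the script iterates over
    example = {
        "question": "How do I reverse a list?",
        "answers": [
            {"text": "Use reversed()", "pm_score": 3},
            {"text": "Use a loop", "pm_score": 1},
            {"text": "Use slicing", "pm_score": 3},
        ],
    }
    for better, worse in ranked_pairs(example):
        print(f"{better!r} > {worse!r}")
```

Enumerating all pairs yields more comparisons per question than sampling one pair, at the cost of over-representing questions with many answers, so whether it suits a given training setup is a judgment call.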
This dataset is intended to be English only, though other languages may be present. Some Stack Exchanges that are omitted include:
Spanish: es.meta.stackoverflow.com, es.stackoverflow.com
Japanese: ja.meta.stackoverflow.com, ja.stackoverflow.com
Portuguese: pt.stackoverflow.com, pt.meta.stackoverflow.com
Russian: ru.stackoverflow, ru.meta.stackoverflow
License: https://creativecommons.org/licenses/by-sa/4.0/
The cc-by-sa 4.0 licensing, while intentionally permissive, does require attribution:
Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). Specifically the attribution requirements are as follows:
@online{h4stackexchange,
  author = {Lambert, Nathan and Tunstall, Lewis and Rajani, Nazneen and Thrush, Tristan},
  title = {HuggingFace H4 Stack Exchange Preference Dataset},
  year = 2023,
  url = {https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences},
}