Model: stanfordnlp/SteamSHP-flan-t5-xl
SteamSHP-XL is a preference model trained to predict -- given some context and two possible responses -- which response humans will find more helpful. It can be used for NLG evaluation or as a reward model for RLHF.
It is a FLAN-T5-xl model (3B parameters) finetuned on the Stanford Human Preferences (SHP) dataset and the helpfulness data in Anthropic's HH-RLHF dataset.
There is a smaller variant called SteamSHP-Large that was made by finetuning FLAN-T5-large (780M parameters). Despite being 1/4 of the size, it is on average only 0.75 points less accurate on the SHP + Anthropic test data (across all domains).
The input text should be of the format:
```
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) }

RESPONSE A: { first possible continuation (not containing any newlines \n) }

RESPONSE B: { second possible continuation (not containing any newlines \n) }

Which response is better? RESPONSE
```
The output generated by SteamSHP-XL will be either A or B.
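For example, a small helper along these lines can assemble the template; `make_input` is our own hypothetical function, not part of the model's codebase:

```python
def make_input(post: str, resp_a: str, resp_b: str) -> str:
    # Hypothetical helper: the fields themselves must not contain raw
    # newlines, so collapse them to spaces before formatting.
    clean = lambda s: " ".join(s.split())
    return (f"POST: {clean(post)} \n\n RESPONSE A: {clean(resp_a)}\n\n "
            f"RESPONSE B: {clean(resp_b)}\n\n Which response is better? RESPONSE")
```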
Here's how to use the model:
```python
>> from transformers import T5ForConditionalGeneration, T5Tokenizer
>> device = 'cuda'  # if you have a GPU
>> tokenizer = T5Tokenizer.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl')
>> model = T5ForConditionalGeneration.from_pretrained('stanfordnlp/SteamSHP-flan-t5-xl').to(device)
>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: Lime marmalade lol\n\n Which response is better? RESPONSE"
>> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
>> y = model.generate(x, max_new_tokens=1)
>> tokenizer.batch_decode(y, skip_special_tokens=True)
['A']
```
If the input exceeds the 512-token limit, you can use pysbd to break the input up into sentences and include only what fits into 512 tokens. When trying to fit an example into 512 tokens, we recommend truncating the context as much as possible and leaving the responses as untouched as possible.
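A minimal sketch of that truncation strategy, reusing the `tokenizer` loaded above; the helper `truncate_to_fit` is our own, and dropping sentences from the front of the context (keeping the most recent ones) is our assumption, since the card does not say which end to trim:

```python
import pysbd

seg = pysbd.Segmenter(language="en", clean=False)

def truncate_to_fit(post, resp_a, resp_b, max_tokens=512):
    # Drop sentences from the front of the context until the full input
    # fits; the responses are left untouched, per the recommendation above.
    sents = seg.segment(post)
    while sents:
        text = (f"POST: {' '.join(sents)} \n\n RESPONSE A: {resp_a}\n\n "
                f"RESPONSE B: {resp_b}\n\n Which response is better? RESPONSE")
        if len(tokenizer(text).input_ids) <= max_tokens:
            return text
        sents = sents[1:]
    return None  # the responses alone exceed the limit
```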
If you want to use SteamSHP-XL as a reward model -- to get a score for a single response -- then you need to structure the input such that RESPONSE A is what you want to score and RESPONSE B is just an empty input:
```
POST: { the context, such as the 'history' column in SHP (not containing any newlines \n) }

RESPONSE A: { continuation (not containing any newlines \n) }

RESPONSE B: .

Which response is better? RESPONSE
```
Then calculate the probability assigned to the label A. This probability (or the logit, depending on what you want) is the score for the response:
```python
>> import torch
>> input_text = "POST: Instacart gave me 50 pounds of limes instead of 5 pounds... what the hell do I do with 50 pounds of limes? I've already donated a bunch and gave a bunch away. I'm planning on making a bunch of lime-themed cocktails, but... jeez. Ceviche? \n\n RESPONSE A: Lime juice, and zest, then freeze in small quantities.\n\n RESPONSE B: .\n\n Which response is better? RESPONSE"
>> x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
>> outputs = model.generate(x, return_dict_in_generate=True, output_scores=True, max_new_tokens=1)
>> (torch.exp(outputs.scores[0][:, 71]) / torch.exp(outputs.scores[0]).sum(axis=1)).item()  # index 71 corresponds to the token for 'A'
0.819
```
The probability will almost always be high (in the range of 0.8 to 1.0), since RESPONSE B is just a null input. Therefore you may want to normalize the probability.
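The card does not prescribe a normalization scheme; one simple possibility (our assumption, not the authors' method) is to min-max rescale the raw probabilities across a set of candidate responses:

```python
def normalize(scores):
    # Min-max rescaling over a candidate set; one possible scheme,
    # not prescribed by the model card.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.5 for s in scores]
```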
You can also compare the two probabilities assigned independently to each response (given the same context) to infer the preference label. For example, if one response has probability 0.95 and the other has 0.80, the former will be preferred. Inferring the preference label in this way only leads to a 0.006 drop in accuracy on the SHP + HH-RLHF test data on average across all domains, meaning that there's only a very small penalty for using SteamSHP-XL as a reward model instead of as a preference model.
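A sketch of that comparison, reusing the `tokenizer`, `model`, and `device` set up above; the helper `score_response` is our own wrapper around the scoring recipe shown earlier:

```python
import torch

def score_response(post: str, response: str) -> float:
    # Pair the response against an empty RESPONSE B and read off the
    # probability the model assigns to the label 'A'.
    input_text = (f"POST: {post} \n\n RESPONSE A: {response}\n\n "
                  "RESPONSE B: .\n\n Which response is better? RESPONSE")
    x = tokenizer([input_text], return_tensors='pt').input_ids.to(device)
    outputs = model.generate(x, return_dict_in_generate=True,
                             output_scores=True, max_new_tokens=1)
    probs = torch.softmax(outputs.scores[0], dim=1)
    # Look up the token id for 'A' instead of hardcoding it (71 here).
    a_id = tokenizer('A', add_special_tokens=False).input_ids[0]
    return probs[0, a_id].item()

# Example: infer the preference label from two independent scores.
post = "Instacart gave me 50 pounds of limes instead of 5 pounds..."
score_a = score_response(post, "Lime juice, and zest, then freeze in small quantities.")
score_b = score_response(post, "Lime marmalade lol")
label = 'A' if score_a >= score_b else 'B'
```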
SteamSHP-XL was finetuned on only 125K of the 392K training examples that were available, since we found that:
We evaluated the model on the SHP and HH-RLHF test data using accuracy, but only on examples that could be truncated to fit within 500 tokens (18,621 of the 20,753 available test examples). SteamSHP-XL gets an average of 72.8% accuracy across all domains:
Domain | Accuracy |
---|---|
askculinary | 0.7199 |
askhr | 0.7743 |
askdocs | 0.7210 |
askanthropology | 0.7594 |
asksciencefiction | 0.7283 |
askacademia | 0.7442 |
askengineers | 0.7183 |
legaladvice | 0.8068 |
explainlikeimfive | 0.7392 |
askbaking | 0.6741 |
askphysics | 0.8000 |
askscience | 0.7114 |
askphilosophy | 0.6907 |
askvet | 0.7742 |
changemyview | 0.7043 |
askcarguys | 0.7568 |
askhistorians | 0.7476 |
asksocialscience | 0.7308 |
anthropic (helpfulness) | 0.7310 |
ALL (unweighted) | 0.7278 |
As mentioned previously, you can also use SteamSHP as a reward model and infer the preference label from the probability assigned to each response independently. Doing so leads to a 0.006 drop in accuracy on the test data (on average across all domains), meaning there is only a small penalty.
SteamSHP is trained to predict which of two responses humans will find more helpful, not which response is less harmful. It should not be used to detect toxicity, make ethical judgments, or serve a similar purpose.
Biases and misinformation in the datasets used to train SteamSHP may also be propagated downstream to the model predictions. Although SHP filtered out posts with NSFW (over 18) content and chose subreddits that were well-moderated and had policies against harassment and bigotry, some of the data may still contain discriminatory or harmful language. The responses that humans collectively found more helpful are also not guaranteed to be more factual.
The people whose preferences are captured in SHP and HH-RLHF are not representative of the broader population. Although specific demographic information is not available, overall, the Reddit users whose preferences are captured in SHP are disproportionately male and from developed, Western, and English-speaking countries (Pew Research).
Past work by Anthropic has found that models optimized for human preference can be obsequious, at the expense of the truth.
Please contact kawin@stanford.edu if you have any questions about the model. This model was created by Kawin Ethayarajh, Heidi (Chenyu) Zhang, Yizhong Wang, and Dan Jurafsky.
We will have a paper out soon, but until then, please cite:
```bibtex
@InProceedings{pmlr-v162-ethayarajh22a,
  title     = {Understanding Dataset Difficulty with $\mathcal{V}$-Usable Information},
  author    = {Ethayarajh, Kawin and Choi, Yejin and Swayamdipta, Swabha},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {5988--6008},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Szepesvari, Csaba and Niu, Gang and Sabato, Sivan},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
}
```