数据集:
Anthropic/model-written-evals
语言:
en计算机处理:
monolingual大小:
100K<n<1M语言创建人:
machine-generated批注创建人:
machine-generated源数据集:
original许可:
cc-by-4.0This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations."
We intend the datasets to be useful to:
The evaluations were generated to be asked to dialogue agents (e.g., a model finetuned explicitly respond to a user's utterances, or a pretrained language model prompted to behave like a dialogue agent). However, it is possible to adapt the data to test other kinds of models as well.
We describe each of our collections of datasets below:
Please see our paper for additional details on the datasets, how we generated them, human validation metrics, and other analyses of the datasets.
Disclaimer : As discussed in our paper, some data contains content that includes social biases and stereotypes. The data may also contain other forms of harmful or offensive content. The views expressed in the data do not reflect the views of Anthropic or any of its employees.
For questions, please email ethan at anthropic dot com
If you would like to cite our work or data, you may use the following bibtex citation:
@misc{perez2022discovering, doi = {10.48550/ARXIV.2212.09251}, url = {https://arxiv.org/abs/2212.09251}, author = {Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and Tran-Johnson, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noemí and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and {El Showk}, Sheer and Lanham, Tamera and Telleen-Lawton, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and Hatfield-Dodds, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared}, keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, title = {Discovering Language Model Behaviors with Model-Written Evaluations}, publisher = {arXiv}, year = {2022}, copyright = {arXiv.org perpetual, non-exclusive license} }