中文

Model-Written Evaluation Datasets

This repository includes datasets written by language models, used in our paper on "Discovering Language Model Behaviors with Model-Written Evaluations."

We intend the datasets to be useful to:

  • Those who are interested in understanding the quality and properties of model-generated data
  • Those who wish to use our datasets to evaluate other models for the behaviors we examined in our work (e.g., related to model persona, sycophancy, advanced AI risks, and gender bias)
  • The evaluations were generated to be asked to dialogue agents (e.g., a model finetuned explicitly respond to a user's utterances, or a pretrained language model prompted to behave like a dialogue agent). However, it is possible to adapt the data to test other kinds of models as well.

    We describe each of our collections of datasets below:

  • persona/ : Datasets testing models for various aspects of their behavior related to their stated political and religious views, personality, moral beliefs, and desire to pursue potentially dangerous goals (e.g., self-preservation or power-seeking).
  • sycophancy/ : Datasets testing models for whether or not they repeat back a user's view to various questions (in philosophy, NLP research, and politics)
  • advanced-ai-risk/ : Datasets testing models for various behaviors related to catastrophic risks from advanced AI systems (e.g., ). These datasets were generated in a few-shot manner. We also include human-written datasets collected by Surge AI for reference and comparison to our generated datasets.
  • winogenerated/ : Our larger, model-generated version of the Winogender Dataset ( Rudinger et al., 2018 ). We also include the names of occupation titles that we generated, to create the dataset (alongside occupation gender statistics from the Bureau of Labor Statistics)
  • Please see our paper for additional details on the datasets, how we generated them, human validation metrics, and other analyses of the datasets.

    Disclaimer : As discussed in our paper, some data contains content that includes social biases and stereotypes. The data may also contain other forms of harmful or offensive content. The views expressed in the data do not reflect the views of Anthropic or any of its employees.

    Contact

    For questions, please email ethan at anthropic dot com

    Bibtex Citation

    If you would like to cite our work or data, you may use the following bibtex citation:

    @misc{perez2022discovering,
      doi = {10.48550/ARXIV.2212.09251},
      url = {https://arxiv.org/abs/2212.09251},
      author = {Perez, Ethan and Ringer, Sam and Lukošiūtė, Kamilė and Nguyen, Karina and Chen, Edwin and Heiner, Scott and Pettit, Craig and Olsson, Catherine and Kundu, Sandipan and Kadavath, Saurav and Jones, Andy and Chen, Anna and Mann, Ben and Israel, Brian and Seethor, Bryan and McKinnon, Cameron and Olah, Christopher and Yan, Da and Amodei, Daniela and Amodei, Dario and Drain, Dawn and Li, Dustin and Tran-Johnson, Eli and Khundadze, Guro and Kernion, Jackson and Landis, James and Kerr, Jamie and Mueller, Jared and Hyun, Jeeyoon and Landau, Joshua and Ndousse, Kamal and Goldberg, Landon and Lovitt, Liane and Lucas, Martin and Sellitto, Michael and Zhang, Miranda and Kingsland, Neerav and Elhage, Nelson and Joseph, Nicholas and Mercado, Noemí and DasSarma, Nova and Rausch, Oliver and Larson, Robin and McCandlish, Sam and Johnston, Scott and Kravec, Shauna and {El Showk}, Sheer and Lanham, Tamera and Telleen-Lawton, Timothy and Brown, Tom and Henighan, Tom and Hume, Tristan and Bai, Yuntao and Hatfield-Dodds, Zac and Clark, Jack and Bowman, Samuel R. and Askell, Amanda and Grosse, Roger and Hernandez, Danny and Ganguli, Deep and Hubinger, Evan and Schiefer, Nicholas and Kaplan, Jared},
      keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
      title = {Discovering Language Model Behaviors with Model-Written Evaluations},
      publisher = {arXiv},
      year = {2022},
      copyright = {arXiv.org perpetual, non-exclusive license}
    }