Dataset:
ekinakyurek/ftrace
[PAPER] FTRACE is a zero-shot information retrieval benchmark devised for tracing a language model’s predictions back to training examples. In the accompanying paper, we evaluate commonly studied influence methods, including gradient-based (TracIn) and embedding-based approaches. The dataset contains two parts. First, the factual queries whose knowledge we trace are extracted from existing LAMA queries (Petroni et al., 2019). Second, Wikidata sentences are extracted from the T-REx corpus (Elsahar et al., 2018). We annotate the extracted sentences with their stated facts, and these facts can be matched with the facts in the query set. In both parts, we provide (input, target) pairs as a masked language modeling task; see the examples below. However, the same data can be used in other formats, for example auto-regressive completion, by processing the inputs_pretokenized and targets_pretokenized fields.
An example of 'abstract' looks as follows.
{"inputs_pretokenized": "The name Austroasiatic comes from the Latin words for \"south\" and \"Asia\", hence \"<extra_id_0>\".", "targets_pretokenized": "<extra_id_0> South Asia", "page_uri": "Q33199", "masked_uri": "Q771405", "masked_type": "subject", "example_uris": "Q33199-1-Q48-Q771405-1", "facts": "P361,Q48,Q771405;P30,Q48,Q771405", "id": 8}
Queries
An example of 'query' looks as follows.
{"inputs_pretokenized": "Paul Ehrlich used to work in <extra_id_0> .", "targets_pretokenized": "<extra_id_0> Frankfurt", "uuid": "5b063008-a8ba-4064-9f59-e70102bb8c50", "obj_uri": "Q1794", "sub_uri": "Q57089", "predicate_id": "P937", "obj_surface": "Frankfurt", "sub_surface": "Paul Ehrlich"}
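The matching between abstract facts and query facts mentioned above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes each entry in `facts` is ordered as predicate, object, subject; that order is inferred only from the abstract example, where `masked_type` is "subject" and `masked_uri` (Q771405) is the third element of each fact. The helper names `parse_facts`, `query_fact`, and `states_query_fact` are hypothetical.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str, str]


def parse_facts(facts: str) -> Set[Fact]:
    """Split a string like 'P361,Q48,Q771405;P30,Q48,Q771405' into triples."""
    triples: Set[Fact] = set()
    for item in facts.split(";"):
        parts = tuple(p.strip() for p in item.split(","))
        if len(parts) == 3:  # skip empty or malformed entries
            triples.add(parts)
    return triples


def query_fact(query: Dict[str, str]) -> Fact:
    # ASSUMPTION: abstract facts are ordered (predicate, object, subject),
    # as inferred from the example record; verify against the real data.
    return (query["predicate_id"], query["obj_uri"], query["sub_uri"])


def states_query_fact(abstract: Dict[str, str], query: Dict[str, str]) -> bool:
    """True if the abstract is annotated with the fact the query asks about."""
    return query_fact(query) in parse_facts(abstract["facts"])


# Records taken from the examples above (only the fields used here).
abstract = {"facts": "P361,Q48,Q771405;P30,Q48,Q771405"}
query = {"predicate_id": "P937", "obj_uri": "Q1794", "sub_uri": "Q57089"}

print(states_query_fact(abstract, query))  # the two example records do not share a fact
```

Comparing whole triples rather than single URIs keeps an abstract that merely mentions the same entity from counting as a match.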
The data fields are the same among all splits.
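As a rough sketch of how the masked (input, target) pairs could be turned into an auto-regressive format, the snippet below fills each `<extra_id_N>` sentinel in `inputs_pretokenized` with the corresponding text from `targets_pretokenized`, then splits the filled sentence at the first mask. The function names and the choice to split at the first sentinel are illustrative assumptions, not the processing used in the paper.

```python
import re
from typing import Dict, Tuple

SENTINEL = re.compile(r"<extra_id_\d+>")


def parse_targets(targets: str) -> Dict[str, str]:
    """Map each sentinel in the target string to the text that follows it."""
    sentinels = SENTINEL.findall(targets)
    pieces = SENTINEL.split(targets)[1:]  # text after each sentinel
    return {s: t.strip() for s, t in zip(sentinels, pieces)}


def fill_mask(inputs: str, targets: str) -> str:
    """Replace every sentinel in the input with its target text."""
    fills = parse_targets(targets)
    return SENTINEL.sub(lambda m: fills.get(m.group(0), m.group(0)), inputs)


def to_autoregressive(inputs: str, targets: str) -> Tuple[str, str]:
    """Return (prompt, completion), split at the first masked position."""
    full = fill_mask(inputs, targets)
    first = SENTINEL.search(inputs)
    if first is None:  # nothing masked: the whole text is the prompt
        return full, ""
    # Text before the first sentinel is identical in `inputs` and `full`.
    return full[: first.start()].rstrip(), full[first.start():]


prompt, completion = to_autoregressive(
    "Paul Ehrlich used to work in <extra_id_0> .",
    "<extra_id_0> Frankfurt",
)
print(repr(prompt))      # 'Paul Ehrlich used to work in'
print(repr(completion))  # 'Frankfurt .'
```

When the mask is not at the end of the sentence, the completion also contains the text after the mask, so a left-to-right model would be scored on more than the masked span; whether that is acceptable depends on the evaluation setup.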
name | train |
---|---|
Abstracts | 1560453 |
Queries | 31479 |
LAMA: https://github.com/facebookresearch/LAMA
T-REx: https://hadyelsahar.github.io/t-rex/
The parts of this dataset are available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) and the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
The main paper should be cited as follows:
@misc{https://doi.org/10.48550/arxiv.2205.11482,
  doi = {10.48550/ARXIV.2205.11482},
  url = {https://arxiv.org/abs/2205.11482},
  author = {Akyürek, Ekin and Bolukbasi, Tolga and Liu, Frederick and Xiong, Binbin and Tenney, Ian and Andreas, Jacob and Guu, Kelvin},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
  title = {Tracing Knowledge in Language Models Back to the Training Data},
  publisher = {arXiv},
  year = {2022}
}
Please also cite Petroni et al., 2019 for the query set, and Elsahar et al., 2018 for the abstract set.
@inproceedings{petroni2019language,
  title={Language Models as Knowledge Bases?},
  author={Petroni, F. and Rockt{\"{a}}schel, T. and Miller, A. H. and Lewis, P. and Bakhtin, A. and Wu, Y. and Riedel, S.},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2019}
}
@inproceedings{elsahar2018t,
  title={T-REx: A large scale alignment of natural language with knowledge base triples},
  author={Elsahar, Hady and Vougiouklis, Pavlos and Remaci, Arslen and Gravier, Christophe and Hare, Jonathon and Laforest, Frederique and Simperl, Elena},
  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}