数据集:

ekinakyurek/ftrace

中文

Dataset Card for "FTRACE"

Dataset Summary

[PAPER] FTRACE is a zero-shot infromation retrieval benchmark deviced for tracing a language model’s predictions back to training examples. In the accompanying paper, we evaluate commonly studied influence methods, including gradient-based (TracIn) and embedding-based approaches. The dataset contains two parts. First, factual queries for that we trace the knowledge are extracted from existing LAMA queries (Petroni et al., 2019). Second, Wikidata sentences are extracted from TREx corpus (Elsahar et al., 2018). We annotate the extracted sentences with their stated facts, and these facts can be mathed with the facts in query set. In both parts, we provide (input, target) pairs as a masked language modeling task -- see examples in the below. However, one can use the same data in other formalities for example auto-regressive completion via a processing of input_pretokenized and targets_pretokenized field.

Supported Tasks and Leaderboards

More Information Needed

Languages

More Information Needed

Dataset Structure

Data Instances

Abstracts
  • Size of downloaded dataset files: 112 MB
  • Size of the generated dataset: 884 MB
  • Total amount of disk used: 996 MB

An example of 'abstract' looks as follows.

{"inputs_pretokenized": "The name Austroasiatic comes from the Latin words for \"south\" and \"Asia\", hence \"<extra_id_0>\".", 
 "targets_pretokenized": "<extra_id_0> South Asia", 
 "page_uri": "Q33199", 
 "masked_uri": "Q771405", 
 "masked_type": "subject", 
 "example_uris": "Q33199-1-Q48-Q771405-1", 
 "facts": "P361,Q48,Q771405;P30,Q48,Q771405", 
 "id": 8}
Queries
  • Size of downloaded dataset files: 1.7 MB
  • Size of the generated dataset: 8.9 MB
  • Total amount of disk used: 10.6 MB

An example of 'query' looks as follows.

{"inputs_pretokenized": "Paul Ehrlich used to work in <extra_id_0> .", 
"targets_pretokenized": "<extra_id_0> Frankfurt", 
"uuid": "5b063008-a8ba-4064-9f59-e70102bb8c50", 
"obj_uri": "Q1794", 
"sub_uri": "Q57089", 
"predicate_id": "P937",
"obj_surface": "Frankfurt", 
"sub_surface": "Paul Ehrlich"}

Data Fields

The data fields are the same among all splits.

Abstracts
  • inputs_pretokenized : a string feature.
  • targets_pretokenized : a string feature.
  • masked_uri : a string feature.
  • masked_type : a string feature.
  • facts : a string feature.
  • id : a string feature.
  • example_uris : a string feature.
  • page_uri : a string feature.
Queries
  • inputs_pretokenized : a string feature.
  • targets_pretokenized : a string feature.
  • obj_surface : a string feature.
  • sub_surface : a string feature.
  • obj_uri : a string feature.
  • sub_uri : a string feature.
  • predicate_id : a string feature.
  • uuid : a string feature.

Data Splits

name train
Abstracts 1560453
Queries 31479

Dataset Creation

Curation Rationale

More Information Needed

Source Data

LAMA: https://github.com/facebookresearch/LAMA TRex: https://hadyelsahar.github.io/t-rex/

Initial Data Collection and Normalization

More Information Needed

Who are the source language producers?

More Information Needed

Annotations

Annotation process

More Information Needed

Who are the annotators?

More Information Needed

Personal and Sensitive Information

More Information Needed

Considerations for Using the Data

Social Impact of Dataset

More Information Needed

Discussion of Biases

More Information Needed

Other Known Limitations

More Information Needed

Additional Information

Dataset Curators

More Information Needed

Licensing Information

The parts of this dataset are available under the Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0) and The Creative Commons Attribution-Noncommercial 4.0 International License

Citation Information

The main paper should be cited as follow:

@misc{https://doi.org/10.48550/arxiv.2205.11482,
  doi = {10.48550/ARXIV.2205.11482},
  url = {https://arxiv.org/abs/2205.11482},
  author = {Akyürek, Ekin and Bolukbasi, Tolga and Liu, Frederick and Xiong, Binbin and Tenney, Ian and Andreas, Jacob and Guu, Kelvin},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Tracing Knowledge in Language Models Back to the Training Data},
  publisher = {arXiv},
  year = {2022}, 
}

Please also cite Petroni et al., 2019 for the query set, and Elsahar et al., 2018 for the abstract set.

@inproceedings{petroni2019language,
  title={Language Models as Knowledge Bases?},
  author={F. Petroni, T. Rockt{\"{a}}schel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu and S. Riedel},
  booktitle={In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019},
  year={2019}
}
@inproceedings{elsahar2018t,
  title={T-rex: A large scale alignment of natural language with knowledge base triples},
  author={Elsahar, Hady and Vougiouklis, Pavlos and Remaci, Arslen and Gravier, Christophe and Hare, Jonathon and Laforest, Frederique and Simperl, Elena},
  booktitle={Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year={2018}
}

Contributions