数据集:
lama
@inproceedings{petroni2020how, title={How Context Affects Language Models' Factual Predictions}, author={Fabio Petroni and Patrick Lewis and Aleksandra Piktus and Tim Rockt{"a}schel and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel}, booktitle={Automated Knowledge Base Construction}, year={2020}, url={ https://openreview.net/forum?id=025X0zPfn} }
This dataset provides the data for LAMA. The dataset include a subset of Google_RE ( https://code.google.com/archive/p/relation-extraction-corpus/ ), TRex (subset of wikidata triples), Conceptnet ( https://github.com/commonsense/conceptnet5/wiki ) and Squad. There are configs for each of "google_re", "trex", "conceptnet" and "squad", respectively.
The dataset includes some cleanup, and addition of a masked sentence and associated answers for the [MASK] token. The accuracy in predicting the [MASK] token shows how well the language model knows facts and common sense information. The [MASK] tokens are only for the "object" slots.
This version of the dataset includes "negated" sentences as well as the masked sentence. Also, certain of the config includes "template" and "template_negated" fields of the form "[X] some text [Y]", where [X] and [Y] are the subject and object slots respectively of certain relations.
See the paper for more details. For more information, also see: https://github.com/facebookresearch/LAMA
en
The trex config has the following fields:
{'description': 'the item (an institution, law, public office ...) or statement belongs to or has power over or applies to the value (a territorial jurisdiction: a country, state, municipality, ...)', 'label': 'applies to jurisdiction', 'masked_sentence': 'It is known as a principality as it is a monarchy headed by two Co-Princes – the Spanish/Roman Catholic Bishop of Urgell and the President of [MASK].', 'obj_label': 'France', 'obj_surface': 'France', 'obj_uri': 'Q142', 'predicate_id': 'P1001', 'sub_label': 'president of the French Republic', 'sub_surface': 'President', 'sub_uri': 'Q191954', 'template': '[X] is a legal term in [Y] .', 'template_negated': '[X] is not a legal term in [Y] .', 'type': 'N-M', 'uuid': '3fe3d4da-9df9-45ba-8109-784ce5fba38a'}
The conceptnet config has the following fields:
{'masked_sentence': 'One of the things you do when you are alive is [MASK].', 'negated': '', 'obj': 'think', 'obj_label': 'think', 'pred': 'HasSubevent', 'sub': 'alive', 'uuid': 'd4f11631dde8a43beda613ec845ff7d1'}
The squad config has the following fields:
{'id': '56be4db0acb8001400a502f0_0', 'masked_sentence': 'To emphasize the 50th anniversary of the Super Bowl the [MASK] color was used.', 'negated': "['To emphasize the 50th anniversary of the Super Bowl the [MASK] color was not used.']", 'obj_label': 'gold', 'sub_label': 'Squad'}
The google_re config has the following fields:
{'evidences': '[{\'url\': \'http://en.wikipedia.org/wiki/Peter_F._Martin\', \'snippet\': "Peter F. Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives. He has represented the 75th District Newport since 6 January 2009. He is currently serves on the House Committees on Judiciary, Municipal Government, and Veteran\'s Affairs. During his first term of office he served on the House Committees on Small Business and Separation of Powers & Government Oversight. In August 2010, Representative Martin was appointed as a Commissioner on the Atlantic States Marine Fisheries Commission", \'considered_sentences\': [\'Peter F Martin (born 1941) is an American politician who is a Democratic member of the Rhode Island House of Representatives .\']}]', 'judgments': "[{'rater': '18349444711114572460', 'judgment': 'yes'}, {'rater': '17595829233063766365', 'judgment': 'yes'}, {'rater': '4593294093459651288', 'judgment': 'yes'}, {'rater': '7387074196865291426', 'judgment': 'yes'}, {'rater': '17154471385681223613', 'judgment': 'yes'}]", 'masked_sentence': 'Peter F Martin (born [MASK]) is an American politician who is a Democratic member of the Rhode Island House of Representatives .', 'obj': '1941', 'obj_aliases': '[]', 'obj_label': '1941', 'obj_w': 'None', 'pred': '/people/person/date_of_birth', 'sub': '/m/09gb0bw', 'sub_aliases': '[]', 'sub_label': 'Peter F. Martin', 'sub_w': 'None', 'template': '[X] (born [Y]).', 'template_negated': '[X] (not born [Y]).', 'uuid': '18af2dac-21d3-4c42-aff5-c247f245e203'}
The trex config has the following fields:
The conceptnet config has the following fields:
The squad config has the following fields:
The google_re config has the following fields:
There are no data splits.
This dataset was gathered and created to probe what language models understand.
See the reaserch paper and website for more detail. The dataset was created gathered from various other datasets with cleanups for probing.
Who are the source language producers?The LAMA authors and the original authors of the various configs.
Human annotations under the original datasets (conceptnet), and various machine annotations.
Who are the annotators?Human annotations and machine annotations.
Unkown, but likely names of famous people.
The goal for the work is to probe the understanding of language models.
Since the data is from human annotators, there is likely to be baises.
[More Information Needed]
The original documentation for the datafields are limited.
The authors of LAMA at Facebook and the authors of the original datasets.
The Creative Commons Attribution-Noncommercial 4.0 International License. see https://github.com/facebookresearch/LAMA/blob/master/LICENSE
@inproceedings{petroni2019language, title={Language Models as Knowledge Bases?}, author={F. Petroni, T. Rockt{"{a}}schel, A. H. Miller, P. Lewis, A. Bakhtin, Y. Wu and S. Riedel}, booktitle={In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019}, year={2019} }
@inproceedings{petroni2020how, title={How Context Affects Language Models' Factual Predictions}, author={Fabio Petroni and Patrick Lewis and Aleksandra Piktus and Tim Rockt{"a}schel and Yuxiang Wu and Alexander H. Miller and Sebastian Riedel}, booktitle={Automated Knowledge Base Construction}, year={2020}, url={ https://openreview.net/forum?id=025X0zPfn} }
Thanks to @ontocord for adding this dataset.