Dataset: m_lama
Multilinguality: translation
Size: 100K<n<1M
Source datasets: extended|lama
ArXiv: arxiv:2102.00894
Tags: probing
License: cc-by-nc-sa-4.0

This dataset provides the data for mLAMA, a multilingual version of LAMA. Regarding LAMA see https://github.com/facebookresearch/LAMA . For mLAMA the TREx and GoogleRE parts of LAMA were considered and machine translated using Google Translate as well as the Wikidata and Google Knowledge Graph APIs. The machine-translated templates were checked for validity, i.e., whether they contain exactly one '[X]' and one '[Y]'.
This data can be used to create fill-in-the-blank queries such as "Paris is the capital of [MASK]" across 53 languages. For more details see the website http://cistern.cis.lmu.de/mlama/ or the GitHub repository https://github.com/norakassner/mlama .
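A minimal sketch of how such a query could be built with the Hugging Face datasets library is shown below; it assumes the dataset resolves on the Hub under the identifier m_lama given above and uses the field names from the example instance shown further down.

```python
# Minimal sketch, assuming the dataset is available on the Hugging Face Hub as "m_lama"
# and exposes the fields shown in the example instance below (template, sub_label, obj_label).
from datasets import load_dataset

dataset = load_dataset("m_lama")

record = dataset["test"][0]
# Build a fill-in-the-blank query: [X] is replaced by the subject label,
# [Y] by the mask token of the language model to be probed.
query = record["template"].replace("[X]", record["sub_label"]).replace("[Y]", "[MASK]")

print(query)                  # e.g. "President van Frankryk is 'n wettige term in [MASK]."
print(record["obj_label"])    # gold answer for the blank, e.g. "Frankryk"
```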
Language model knowledge probing.
This dataset contains data in 53 languages: af,ar,az,be,bg,bn,ca,ceb,cs,cy,da,de,el,en,es,et,eu,fa,fi,fr,ga,gl,he,hi,hr,hu,hy,id,it,ja,ka,ko,la,lt,lv,ms,nl,pl,pt,ro,ru,sk,sl,sq,sr,sv,ta,th,tr,uk,ur,vi,zh
For each of the 53 languages and each of the 43 relations/predicates there is a set of triples.
For each language and relation there are triples that consist of an object, a predicate, and a subject. For each predicate a template is available. An example, dataset["test"][0], is given here:
{ 'language': 'af', 'lineid': 0, 'obj_label': 'Frankryk', 'obj_uri': 'Q142', 'predicate_id': 'P1001', 'sub_label': 'President van Frankryk', 'sub_uri': 'Q191954', 'template': "[X] is 'n wettige term in [Y].", 'uuid': '3fe3d4da-9df9-45ba-8109-784ce5fba38a' }
Each instance has the following fields: language, lineid, obj_label, obj_uri, predicate_id, sub_label, sub_uri, template, and uuid.
There is only one data split, labelled 'test'.
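The triples for one language and one relation can be selected with an ordinary filter over the test split. A hedged sketch, where 'en' and 'P1001' are merely illustrative values taken from the language list and the example above:

```python
# Sketch: select all triples for one language/predicate pair.
# "en" and "P1001" are illustrative; any of the 53 language codes listed above
# and any of the 43 predicate IDs can be substituted.
from datasets import load_dataset

dataset = load_dataset("m_lama")
subset = dataset["test"].filter(
    lambda ex: ex["language"] == "en" and ex["predicate_id"] == "P1001"
)

print(len(subset))             # number of (subject, predicate, object) triples for this pair
print(subset[0]["template"])   # the shared template for the predicate, containing [X] and [Y]
```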
The dataset was translated into 53 languages to investigate the knowledge contained in pretrained language models across languages.
The data has several sources:
- LAMA ( https://github.com/facebookresearch/LAMA ), licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
- T-REx ( https://hadyelsahar.github.io/t-rex/ ), licensed under the Creative Commons Attribution-ShareAlike 4.0 International License
- Google-RE ( https://github.com/google-research-datasets/relation-extraction-corpus )
- Wikidata ( https://www.wikidata.org/ ), licensed under the Creative Commons CC0 License and the Creative Commons Attribution-ShareAlike License
Who are the source language producers?
See the links above.
Crowdsourced (Wikidata) and machine-translated.
Who are the annotators?
Unknown.
Names of (most likely) famous people who have entries in Google Knowledge Graph or Wikidata.
Data was created through machine translation and automatic processes.
[More Information Needed]
[More Information Needed]
Not all triples are available in all languages.
The authors of the mLAMA paper and the authors of the original datasets.
The dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0): https://creativecommons.org/licenses/by-nc-sa/4.0/
@article{kassner2021multilingual,
  author        = {Nora Kassner and Philipp Dufter and Hinrich Sch{\"{u}}tze},
  title         = {Multilingual {LAMA:} Investigating Knowledge in Multilingual Pretrained Language Models},
  journal       = {CoRR},
  volume        = {abs/2102.00894},
  year          = {2021},
  url           = {https://arxiv.org/abs/2102.00894},
  archivePrefix = {arXiv},
  eprint        = {2102.00894},
  timestamp     = {Tue, 09 Feb 2021 13:35:56 +0100},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2102-00894.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org},
  note          = {to appear in EACL2021}
}
Thanks to @pdufter for adding this dataset.