Datasets: adithya7/xlel_wd
Multilinguality: multilingual
Size: 1M<n<10M
Language creators: found
Annotations creators: found
Source datasets: original
ArXiv: arxiv:2204.06535
License: cc-by-4.0

XLEL-WD is a multilingual event linking dataset. This dataset repository contains mentions in multilingual Wikipedia/Wikinews articles that link to event items in Wikidata.
The descriptions for Wikidata event items were collected from the corresponding Wikipedia articles. Download the event dictionary from adithya7/xlel_wd_dictionary.
This dataset can be used for the task of event linking. There are two variants of the task, multilingual and crosslingual.
This dataset contains text from 44 languages. The language names and their ISO 639-1 codes are listed below. For details on the dataset distribution for each language, refer to the original paper.
Language | Code | Language | Code | Language | Code | Language | Code |
---|---|---|---|---|---|---|---|
Afrikaans | af | Arabic | ar | Belarusian | be | Bulgarian | bg |
Bengali | bn | Catalan | ca | Czech | cs | Danish | da |
German | de | Greek | el | English | en | Spanish | es |
Persian | fa | Finnish | fi | French | fr | Hebrew | he |
Hindi | hi | Hungarian | hu | Indonesian | id | Italian | it |
Japanese | ja | Korean | ko | Malayalam | ml | Marathi | mr |
Malay | ms | Dutch | nl | Norwegian | no | Polish | pl |
Portuguese | pt | Romanian | ro | Russian | ru | Sinhala | si |
Slovak | sk | Slovene | sl | Serbian | sr | Swedish | sv |
Swahili | sw | Tamil | ta | Telugu | te | Thai | th |
Turkish | tr | Ukrainian | uk | Vietnamese | vi | Chinese | zh |
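As a minimal sketch, the codes in the table above can be used to sanity-check the `context_lang` field of loaded instances and to count mentions per language. The helper name `mentions_per_language` and the toy records are illustrative assumptions, not part of the dataset tooling.

```python
from collections import Counter
from typing import Iterable

# The 44 ISO 639-1 codes listed in the language table.
LANGUAGE_CODES = frozenset(
    "af ar be bg bn ca cs da de el en es fa fi fr he hi hu id it ja ko "
    "ml mr ms nl no pl pt ro ru si sk sl sr sv sw ta te th tr uk vi zh".split()
)


def mentions_per_language(instances: Iterable[dict]) -> Counter:
    """Count instances per `context_lang`, rejecting codes outside the table."""
    counts = Counter(inst["context_lang"] for inst in instances)
    unknown = set(counts) - LANGUAGE_CODES
    if unknown:
        raise ValueError(f"unexpected language codes: {sorted(unknown)}")
    return counts


# Toy records (hypothetical); real instances carry more fields.
toy_instances = [
    {"context_lang": "en"},
    {"context_lang": "en"},
    {"context_lang": "hi"},
]
counts = mentions_per_language(toy_instances)
print(counts.most_common())
```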
Each instance in the train.jsonl, dev.jsonl and test.jsonl files follows the template below.
{ "context_left": "Minibaev's first major international medal came in the men's synchronized 10 metre platform event at the ", "mention": "2010 European Championships", "context_right": ".", "context_lang": "en", "label_id": "830917" }
Field | Meaning |
---|---|
mention | text span of the mention |
context_left | left paragraph context from the document |
context_right | right paragraph context from the document |
context_lang | language of the context (and mention) |
context_title | document title of the mention (only Wikinews subset) |
context_date | document publication date of the mention (only Wikinews subset) |
label_id | Wikidata label ID for the event. E.g. 830917 refers to Q830917 from Wikidata. |
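The following sketch shows one way to read instances in this format, rebuild the paragraph around a mention, and turn `label_id` into a Wikidata item ID. The helper names (`read_instances`, `full_context`, `wikidata_qid`) and the bracket markers around the mention are illustrative assumptions. The sketch also assumes `label_id` is stored as a string, as in the sample instance.

```python
import json
from typing import Iterable, Iterator


def read_instances(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def full_context(instance: dict, open_tag: str = "[", close_tag: str = "]") -> str:
    """Rebuild the paragraph, marking the mention span with the given tags."""
    return (
        instance["context_left"]
        + open_tag + instance["mention"] + close_tag
        + instance["context_right"]
    )


def wikidata_qid(instance: dict) -> str:
    """Map a label_id such as "830917" to the Wikidata item ID "Q830917"."""
    return "Q" + instance["label_id"]


# The sample instance from the template, serialized as one JSON line.
sample_line = json.dumps({
    "context_left": "Minibaev's first major international medal came in the "
                    "men's synchronized 10 metre platform event at the ",
    "mention": "2010 European Championships",
    "context_right": ".",
    "context_lang": "en",
    "label_id": "830917",
})

instance = next(read_instances([sample_line]))
print(full_context(instance))
print(wikidata_qid(instance))
```

Because `read_instances` accepts any iterable of strings, an open file handle such as `open("train.jsonl", encoding="utf-8")` can be passed in place of the list.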
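The Wikipedia-based corpus has three splits: train, dev and test. This is a zero-shot evaluation setup: the events in the dev and test splits do not appear in the train split, so the per-split event counts below add up to the total.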
Statistic | Train | Dev | Test | Total |
---|---|---|---|---|
Events | 8653 | 1090 | 1204 | 10947 |
Event Sequences | 6758 | 844 | 846 | 8448 |
Mentions | 1.44M | 165K | 190K | 1.8M |
Languages | 44 | 44 | 44 | 44 |
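The per-split event counts in the table add up to the total, which is consistent with the splits having disjoint event sets. As a hedged sketch, assuming that zero-shot here means no event is shared between splits, the check below reports any `label_id` that appears in more than one split. The function name `overlapping_events` and the toy IDs are made up for illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def event_ids(instances: List[dict]) -> Set[str]:
    """Collect the distinct label_id values of a split."""
    return {inst["label_id"] for inst in instances}


def overlapping_events(
    splits: Dict[str, List[dict]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return the events shared by each pair of splits (empty if disjoint)."""
    ids = {name: event_ids(insts) for name, insts in splits.items()}
    return {
        (a, b): ids[a] & ids[b]
        for a, b in combinations(ids, 2)
        if ids[a] & ids[b]
    }


# Toy splits with hypothetical label IDs; real splits come from the JSONL files.
toy_splits = {
    "train": [{"label_id": "1001"}, {"label_id": "1002"}],
    "dev": [{"label_id": "2001"}],
    "test": [{"label_id": "3001"}],
}
overlaps = overlapping_events(toy_splits)
print(overlaps)

# Event counts reported in the table above.
reported = {"train": 8653, "dev": 1090, "test": 1204}
assert sum(reported.values()) == 10947
```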
The Wikinews-based evaluation set has two variants, one for cross-domain evaluation and another for zero-shot evaluation.
Statistic | (Cross-domain) Test | (Zero-shot) Test |
---|---|---|
Events | 802 | 149 |
Mentions | 2562 | 437 |
Languages | 27 | 21 |
This dataset helps address the task of event linking. Knowledge base (KB) linking has been studied extensively for entities, but it is unclear whether the same methodologies extend to linking mentions to events in a KB. We use Wikidata as our KB, as it allows for linking mentions from multilingual Wikipedia and Wikinews articles.
First, we utilize spatial & temporal properties from Wikidata to identify event items. Second, we identify corresponding multilingual Wikipedia pages for each Wikidata event item. Third, we pool hyperlinks from multilingual Wikipedia & Wikinews articles to these event items.
Who are the source language producers?
The documents in XLEL-WD are written by Wikipedia and Wikinews contributors in their respective languages.
This dataset was originally collected automatically from Wikipedia, Wikinews and Wikidata. It was post-processed to improve data quality.
Who are the annotators?
The annotations in XLEL-WD (hyperlinks from Wikipedia/Wikinews to Wikidata) were added by the original Wiki contributors.
XLEL-WD v1.0.0 mostly caters to eventive nouns from Wikidata. It does not include any links to other event items from Wikidata such as disease outbreak (Q3241045), military offensive (Q2001676) and war (Q198).
The dataset was curated by Adithya Pratapa, Rishubh Gupta and Teruko Mitamura. The code for collecting the dataset is available on GitHub: xlel-wd.
The XLEL-WD dataset is released under the CC-BY-4.0 license.
@article{pratapa-etal-2022-multilingual,
  title = {Multilingual Event Linking to Wikidata},
  author = {Pratapa, Adithya and Gupta, Rishubh and Mitamura, Teruko},
  publisher = {arXiv},
  year = {2022},
  url = {https://arxiv.org/abs/2204.06535},
}
Thanks to @adithya7 for adding this dataset.