Dataset:
gcaillaut/frwiki_good_pages_el
This dataset contains featured and good articles from the French Wikipedia. Pages are downloaded as HTML files from the French Wikipedia website.
It is intended to be used to train Entity Linking (EL) systems. Links in articles are used to detect named entities.
{
  "title": "Title of the page",
  "qid": "QID of the corresponding Wikidata entity",
  "words": ["tokens"],
  "wikipedia": ["Wikipedia description of each entity"],
  "wikidata": ["Wikidata description of each entity"],
  "labels": ["NER labels"],
  "titles": ["Wikipedia title of each entity"],
  "qids": ["QID of each entity"]
}
The words field contains the article's text split on whitespace. The other fields are lists with the same length as words and contain data only when the respective token in words is the start of an entity. For instance, if the i-th token in words is the start of an entity, then the i-th element of wikipedia contains a description of this entity extracted from Wikipedia. The same applies to the other fields. If the entity spans multiple words, then only the index of the first word contains data.
The only exception is the labels field, which is used to delimit entities. It uses the IOB encoding: if the token is not part of an entity, the label is "O"; if it is the first word of a multi-word entity, the label is "B"; otherwise the label is "I".
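
The alignment between words, labels and qids described above can be illustrated with a short Python sketch. It is not part of the dataset's tooling: the function name extract_entities and the sample record are hypothetical, and the sample assumes that positions carrying no data hold an empty string, which the description above does not specify. As a defensive choice, an "I" label with no open entity is treated as the start of a new entity.

```python
from typing import Dict, List, Tuple


def extract_entities(record: Dict[str, list]) -> List[Tuple[str, str]]:
    """Return (surface text, QID) pairs decoded from the IOB labels.

    Hypothetical helper: only the field names come from the dataset card.
    """
    words, labels, qids = record["words"], record["labels"], record["qids"]
    spans = []    # (start, end) token indices, end exclusive
    start = None  # index of the first token of the entity being read

    for i, label in enumerate(labels):
        if label == "I" and start is not None:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i))  # the current entity ends before token i
            start = None
        if label != "O":
            start = i  # "B", or a stray "I" with no open entity (assumption)

    if start is not None:
        spans.append((start, len(labels)))  # entity running to the last token

    # Entity data is stored only at the index of the entity's first token.
    return [(" ".join(words[s:e]), qids[s]) for s, e in spans]


# Hand-written sample record; empty strings for non-entity positions are an
# assumption, not something confirmed by the dataset description.
sample = {
    "words": ["La", "Tour", "Eiffel", "est", "à", "Paris", "."],
    "labels": ["O", "B", "I", "O", "O", "B", "O"],
    "qids": ["", "Q243", "", "", "", "Q90", ""],
}

entities = extract_entities(sample)
```

The same indexing applies to the wikipedia, wikidata and titles fields: reading them at the start index of each span gives the descriptions and page title for that entity.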