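Dataset: wikiann
Tasks: Token Classification
Multilinguality: multilingual
Size: n<1K
Language creators: crowdsourced
Annotations creators: machine-generated
Source datasets: original
ArXiv: arxiv:1902.00193
License: license:unknown

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated in the IOB2 format with LOC (location), PER (person), and ORG (organisation) tags. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which support 176 of the 282 languages in the original WikiANN corpus.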
The dataset covers 176 languages, one per config subset. The corresponding BCP 47 language tags are:
Subset | Language tag |
---|---|
ace | ace |
af | af |
als | als |
am | am |
an | an |
ang | ang |
ar | ar |
arc | arc |
arz | arz |
as | as |
ast | ast |
ay | ay |
az | az |
ba | ba |
bar | bar |
be | be |
bg | bg |
bh | bh |
bn | bn |
bo | bo |
br | br |
bs | bs |
ca | ca |
cdo | cdo |
ce | ce |
ceb | ceb |
ckb | ckb |
co | co |
crh | crh |
cs | cs |
csb | csb |
cv | cv |
cy | cy |
da | da |
de | de |
diq | diq |
dv | dv |
el | el |
en | en |
eo | eo |
es | es |
et | et |
eu | eu |
ext | ext |
fa | fa |
fi | fi |
fo | fo |
fr | fr |
frr | frr |
fur | fur |
fy | fy |
ga | ga |
gan | gan |
gd | gd |
gl | gl |
gn | gn |
gu | gu |
hak | hak |
he | he |
hi | hi |
hr | hr |
hsb | hsb |
hu | hu |
hy | hy |
ia | ia |
id | id |
ig | ig |
ilo | ilo |
io | io |
is | is |
it | it |
ja | ja |
jbo | jbo |
jv | jv |
ka | ka |
kk | kk |
km | km |
kn | kn |
ko | ko |
ksh | ksh |
ku | ku |
ky | ky |
la | la |
lb | lb |
li | li |
lij | lij |
lmo | lmo |
ln | ln |
lt | lt |
lv | lv |
mg | mg |
mhr | mhr |
mi | mi |
min | min |
mk | mk |
ml | ml |
mn | mn |
mr | mr |
ms | ms |
mt | mt |
mwl | mwl |
my | my |
mzn | mzn |
nap | nap |
nds | nds |
ne | ne |
nl | nl |
nn | nn |
no | no |
nov | nov |
oc | oc |
or | or |
os | os |
other-bat-smg | sgs |
other-be-x-old | be-tarask |
other-cbk-zam | cbk |
other-eml | eml |
other-fiu-vro | vro |
other-map-bms | jv-x-bms |
other-simple | en-basiceng |
other-zh-classical | lzh |
other-zh-min-nan | nan |
other-zh-yue | yue |
pa | pa |
pdc | pdc |
pl | pl |
pms | pms |
pnb | pnb |
ps | ps |
pt | pt |
qu | qu |
rm | rm |
ro | ro |
ru | ru |
rw | rw |
sa | sa |
sah | sah |
scn | scn |
sco | sco |
sd | sd |
sh | sh |
si | si |
sk | sk |
sl | sl |
so | so |
sq | sq |
sr | sr |
su | su |
sv | sv |
sw | sw |
szl | szl |
ta | ta |
te | te |
tg | tg |
th | th |
tk | tk |
tl | tl |
tr | tr |
tt | tt |
ug | ug |
uk | uk |
ur | ur |
uz | uz |
vec | vec |
vep | vep |
vi | vi |
vls | vls |
vo | vo |
wa | wa |
war | war |
wuu | wuu |
xmf | xmf |
yi | yi |
yo | yo |
zea | zea |
zh | zh |
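For most subsets the BCP 47 tag is identical to the subset name, so only the exceptions need an explicit lookup. The following is a minimal sketch of such a lookup, built from the rows of the table above. The names `NON_IDENTITY_TAGS` and `to_bcp47` are hypothetical helpers and not part of the dataset or any library, and the handling of the `other-` prefix is an assumption based on how those rows are written in the table.

```python
# Subsets whose BCP 47 tag differs from the subset name (copied from the table above).
# "eml" is omitted because its tag is identical to its name.
NON_IDENTITY_TAGS = {
    "bat-smg": "sgs",
    "be-x-old": "be-tarask",
    "cbk-zam": "cbk",
    "fiu-vro": "vro",
    "map-bms": "jv-x-bms",
    "simple": "en-basiceng",
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "zh-yue": "yue",
}


def to_bcp47(subset: str) -> str:
    """Return the BCP 47 tag for a subset name, with or without the 'other-' prefix."""
    prefix = "other-"
    if subset.startswith(prefix):
        subset = subset[len(prefix):]
    # Every other subset in the table maps to itself.
    return NON_IDENTITY_TAGS.get(subset, subset)


print(to_bcp47("other-zh-yue"))  # yue
print(to_bcp47("af"))            # af
```

The helper does not check that a name is one of the 176 subsets; unknown names are returned unchanged.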
This is an example from the "train" split of the "af" (Afrikaans) config subset:
{ 'tokens': ['Sy', 'ander', 'seun', ',', 'Swjatopolk', ',', 'was', 'die', 'resultaat', 'van', '’n', 'buite-egtelike', 'verhouding', '.'], 'ner_tags': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'langs': ['af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af'], 'spans': ['PER: Swjatopolk'] }
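The integer `ner_tags` in the example can be decoded back into the entries of `spans`. The sketch below assumes the label order `O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC`. That order is consistent with the example above (tag `1` on a PER mention) but is not stated in this card, so it is worth checking against the dataset's own feature metadata. `LABELS` and `extract_spans` are hypothetical names introduced for illustration.

```python
# Assumed label order; verify against the dataset's feature metadata before relying on it.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def extract_spans(tokens, ner_tags, labels=LABELS):
    """Collect IOB2 entity spans as 'TYPE: text' strings."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Emit the span collected so far, if any.
        if current_tokens:
            spans.append(f"{current_type}: {' '.join(current_tokens)}")

    for token, tag_id in zip(tokens, ner_tags):
        label = labels[tag_id]
        if label.startswith("B-"):
            # A B- tag always starts a new span.
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag continues a span of the same type.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open span, closes the current span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


example = {
    "tokens": ["Sy", "ander", "seun", ",", "Swjatopolk", ",", "was", "die",
               "resultaat", "van", "’n", "buite-egtelike", "verhouding", "."],
    "ner_tags": [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
print(extract_spans(example["tokens"], example["ner_tags"]))  # ['PER: Swjatopolk']
```

Joining tokens with a single space is a simplification. For languages written without spaces, the joined text may differ from the `spans` field.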
For each config subset, the data is split into "train", "validation", and "test" sets, each containing the following number of examples:
Subset | Train | Validation | Test |
---|---|---|---|
ace | 100 | 100 | 100 |
af | 5000 | 1000 | 1000 |
als | 100 | 100 | 100 |
am | 100 | 100 | 100 |
an | 1000 | 1000 | 1000 |
ang | 100 | 100 | 100 |
ar | 20000 | 10000 | 10000 |
arc | 100 | 100 | 100 |
arz | 100 | 100 | 100 |
as | 100 | 100 | 100 |
ast | 1000 | 1000 | 1000 |
ay | 100 | 100 | 100 |
az | 10000 | 1000 | 1000 |
ba | 100 | 100 | 100 |
bar | 100 | 100 | 100 |
bat-smg | 100 | 100 | 100 |
be | 15000 | 1000 | 1000 |
be-x-old | 5000 | 1000 | 1000 |
bg | 20000 | 10000 | 10000 |
bh | 100 | 100 | 100 |
bn | 10000 | 1000 | 1000 |
bo | 100 | 100 | 100 |
br | 1000 | 1000 | 1000 |
bs | 15000 | 1000 | 1000 |
ca | 20000 | 10000 | 10000 |
cbk-zam | 100 | 100 | 100 |
cdo | 100 | 100 | 100 |
ce | 100 | 100 | 100 |
ceb | 100 | 100 | 100 |
ckb | 1000 | 1000 | 1000 |
co | 100 | 100 | 100 |
crh | 100 | 100 | 100 |
cs | 20000 | 10000 | 10000 |
csb | 100 | 100 | 100 |
cv | 100 | 100 | 100 |
cy | 10000 | 1000 | 1000 |
da | 20000 | 10000 | 10000 |
de | 20000 | 10000 | 10000 |
diq | 100 | 100 | 100 |
dv | 100 | 100 | 100 |
el | 20000 | 10000 | 10000 |
eml | 100 | 100 | 100 |
en | 20000 | 10000 | 10000 |
eo | 15000 | 10000 | 10000 |
es | 20000 | 10000 | 10000 |
et | 15000 | 10000 | 10000 |
eu | 10000 | 10000 | 10000 |
ext | 100 | 100 | 100 |
fa | 20000 | 10000 | 10000 |
fi | 20000 | 10000 | 10000 |
fiu-vro | 100 | 100 | 100 |
fo | 100 | 100 | 100 |
fr | 20000 | 10000 | 10000 |
frr | 100 | 100 | 100 |
fur | 100 | 100 | 100 |
fy | 1000 | 1000 | 1000 |
ga | 1000 | 1000 | 1000 |
gan | 100 | 100 | 100 |
gd | 100 | 100 | 100 |
gl | 15000 | 10000 | 10000 |
gn | 100 | 100 | 100 |
gu | 100 | 100 | 100 |
hak | 100 | 100 | 100 |
he | 20000 | 10000 | 10000 |
hi | 5000 | 1000 | 1000 |
hr | 20000 | 10000 | 10000 |
hsb | 100 | 100 | 100 |
hu | 20000 | 10000 | 10000 |
hy | 15000 | 1000 | 1000 |
ia | 100 | 100 | 100 |
id | 20000 | 10000 | 10000 |
ig | 100 | 100 | 100 |
ilo | 100 | 100 | 100 |
io | 100 | 100 | 100 |
is | 1000 | 1000 | 1000 |
it | 20000 | 10000 | 10000 |
ja | 20000 | 10000 | 10000 |
jbo | 100 | 100 | 100 |
jv | 100 | 100 | 100 |
ka | 10000 | 10000 | 10000 |
kk | 1000 | 1000 | 1000 |
km | 100 | 100 | 100 |
kn | 100 | 100 | 100 |
ko | 20000 | 10000 | 10000 |
ksh | 100 | 100 | 100 |
ku | 100 | 100 | 100 |
ky | 100 | 100 | 100 |
la | 5000 | 1000 | 1000 |
lb | 5000 | 1000 | 1000 |
li | 100 | 100 | 100 |
lij | 100 | 100 | 100 |
lmo | 100 | 100 | 100 |
ln | 100 | 100 | 100 |
lt | 10000 | 10000 | 10000 |
lv | 10000 | 10000 | 10000 |
map-bms | 100 | 100 | 100 |
mg | 100 | 100 | 100 |
mhr | 100 | 100 | 100 |
mi | 100 | 100 | 100 |
min | 100 | 100 | 100 |
mk | 10000 | 1000 | 1000 |
ml | 10000 | 1000 | 1000 |
mn | 100 | 100 | 100 |
mr | 5000 | 1000 | 1000 |
ms | 20000 | 1000 | 1000 |
mt | 100 | 100 | 100 |
mwl | 100 | 100 | 100 |
my | 100 | 100 | 100 |
mzn | 100 | 100 | 100 |
nap | 100 | 100 | 100 |
nds | 100 | 100 | 100 |
ne | 100 | 100 | 100 |
nl | 20000 | 10000 | 10000 |
nn | 20000 | 1000 | 1000 |
no | 20000 | 10000 | 10000 |
nov | 100 | 100 | 100 |
oc | 100 | 100 | 100 |
or | 100 | 100 | 100 |
os | 100 | 100 | 100 |
pa | 100 | 100 | 100 |
pdc | 100 | 100 | 100 |
pl | 20000 | 10000 | 10000 |
pms | 100 | 100 | 100 |
pnb | 100 | 100 | 100 |
ps | 100 | 100 | 100 |
pt | 20000 | 10000 | 10000 |
qu | 100 | 100 | 100 |
rm | 100 | 100 | 100 |
ro | 20000 | 10000 | 10000 |
ru | 20000 | 10000 | 10000 |
rw | 100 | 100 | 100 |
sa | 100 | 100 | 100 |
sah | 100 | 100 | 100 |
scn | 100 | 100 | 100 |
sco | 100 | 100 | 100 |
sd | 100 | 100 | 100 |
sh | 20000 | 10000 | 10000 |
si | 100 | 100 | 100 |
simple | 20000 | 1000 | 1000 |
sk | 20000 | 10000 | 10000 |
sl | 15000 | 10000 | 10000 |
so | 100 | 100 | 100 |
sq | 5000 | 1000 | 1000 |
sr | 20000 | 10000 | 10000 |
su | 100 | 100 | 100 |
sv | 20000 | 10000 | 10000 |
sw | 1000 | 1000 | 1000 |
szl | 100 | 100 | 100 |
ta | 15000 | 1000 | 1000 |
te | 1000 | 1000 | 1000 |
tg | 100 | 100 | 100 |
th | 20000 | 10000 | 10000 |
tk | 100 | 100 | 100 |
tl | 10000 | 1000 | 1000 |
tr | 20000 | 10000 | 10000 |
tt | 1000 | 1000 | 1000 |
ug | 100 | 100 | 100 |
uk | 20000 | 10000 | 10000 |
ur | 20000 | 1000 | 1000 |
uz | 1000 | 1000 | 1000 |
vec | 100 | 100 | 100 |
vep | 100 | 100 | 100 |
vi | 20000 | 10000 | 10000 |
vls | 100 | 100 | 100 |
vo | 100 | 100 | 100 |
wa | 100 | 100 | 100 |
war | 100 | 100 | 100 |
wuu | 100 | 100 | 100 |
xmf | 100 | 100 | 100 |
yi | 100 | 100 | 100 |
yo | 100 | 100 | 100 |
zea | 100 | 100 | 100 |
zh | 20000 | 10000 | 10000 |
zh-classical | 100 | 100 | 100 |
zh-min-nan | 100 | 100 | 100 |
zh-yue | 20000 | 10000 | 10000 |
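The split sizes in the table can be checked against a downloaded subset. The sketch below assumes the Hugging Face `datasets` library is installed and that the dataset is reachable under the identifier `wikiann` given in the metadata above. The identifier may differ depending on library version and hosting. `load_subset` and `split_sizes` are hypothetical helper names.

```python
def load_subset(config: str):
    """Load one language subset; requires network access and the `datasets` library."""
    # Imported lazily so that split_sizes() below works without the library installed.
    from datasets import load_dataset

    # "wikiann" is the identifier from this card's metadata (assumed to still resolve).
    return load_dataset("wikiann", config)


def split_sizes(dataset_dict) -> dict:
    """Map each split name to its number of examples."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Usage (not run here because it downloads data):
#   sizes = split_sizes(load_subset("af"))
# The result can then be compared with the "af" row of the table above.
```

`split_sizes` only relies on the `items()` and `len()` interfaces, so it also works on a plain dictionary of lists.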
The original dataset covering 282 languages is associated with the following article:
@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman and Zhang, Boliang and May, Jonathan and Nothman, Joel and Knight, Kevin and Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and on-Wikipedia data.",
}
The 176 languages supported in this version are associated with the following article:
@inproceedings{rahimi-etal-2019-massively,
    title = "Massively Multilingual Transfer for {NER}",
    author = "Rahimi, Afshin and Li, Yuan and Cohn, Trevor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1015",
    pages = "151--164",
}