数据集:

wikiann

任务:

标记分类

子任务:

named-entity-recognition

语言:

language:ace

language:als

计算机处理:

multilingual

大小:

n<1K

语言创建人:

crowdsourced

批注创建人:

machine-generated

源数据集:

original

预印本库:

arxiv:1902.00193

许可:

license:unknown

数据集介绍文件清单

中文

Dataset Card for WikiANN

Dataset Summary

WikiANN (sometimes called PAN-X) is a multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format. This version corresponds to the balanced train, dev, and test splits of Rahimi et al. (2019), which supports 176 of the 282 languages from the original WikiANN corpus.

Supported Tasks and Leaderboards

named-entity-recognition : The dataset can be used to train a model for named entity recognition in many languages, or evaluate the zero-shot cross-lingual capabilities of multilingual models.

Languages

The dataset contains 176 languages, one in each of the configuration subsets. The corresponding BCP 47 language tags are:

Language tag
ace	ace
af	af
als	als
am	am
an	an
ang	ang
ar	ar
arc	arc
arz	arz
as	as
ast	ast
ay	ay
az	az
ba	ba
bar	bar
be	be
bg	bg
bh	bh
bn	bn
bo	bo
br	br
bs	bs
ca	ca
cdo	cdo
ce	ce
ceb	ceb
ckb	ckb
co	co
crh	crh
cs	cs
csb	csb
cv	cv
cy	cy
da	da
de	de
diq	diq
dv	dv
el	el
en	en
eo	eo
es	es
et	et
eu	eu
ext	ext
fa	fa
fi	fi
fo	fo
fr	fr
frr	frr
fur	fur
fy	fy
ga	ga
gan	gan
gd	gd
gl	gl
gn	gn
gu	gu
hak	hak
he	he
hi	hi
hr	hr
hsb	hsb
hu	hu
hy	hy
ia	ia
id	id
ig	ig
ilo	ilo
io	io
is	is
it	it
ja	ja
jbo	jbo
jv	jv
ka	ka
kk	kk
km	km
kn	kn
ko	ko
ksh	ksh
ku	ku
ky	ky
la	la
lb	lb
li	li
lij	lij
lmo	lmo
ln	ln
lt	lt
lv	lv
mg	mg
mhr	mhr
mi	mi
min	min
mk	mk
ml	ml
mn	mn
mr	mr
ms	ms
mt	mt
mwl	mwl
my	my
mzn	mzn
nap	nap
nds	nds
ne	ne
nl	nl
nn	nn
no	no
nov	nov
oc	oc
or	or
os	os
other-bat-smg	sgs
other-be-x-old	be-tarask
other-cbk-zam	cbk
other-eml	eml
other-fiu-vro	vro
other-map-bms	jv-x-bms
other-simple	en-basiceng
other-zh-classical	lzh
other-zh-min-nan	nan
other-zh-yue	yue
pa	pa
pdc	pdc
pl	pl
pms	pms
pnb	pnb
ps	ps
pt	pt
qu	qu
rm	rm
ro	ro
ru	ru
rw	rw
sa	sa
sah	sah
scn	scn
sco	sco
sd	sd
sh	sh
si	si
sk	sk
sl	sl
so	so
sq	sq
sr	sr
su	su
sv	sv
sw	sw
szl	szl
ta	ta
te	te
tg	tg
th	th
tk	tk
tl	tl
tr	tr
tt	tt
ug	ug
uk	uk
ur	ur
uz	uz
vec	vec
vep	vep
vi	vi
vls	vls
vo	vo
wa	wa
war	war
wuu	wuu
xmf	xmf
yi	yi
yo	yo
zea	zea
zh	zh

Dataset Structure

Data Instances

This is an example in the "train" split of the "af" (Afrikaans language) configuration subset:

{
  'tokens': ['Sy', 'ander', 'seun', ',', 'Swjatopolk', ',', 'was', 'die', 'resultaat', 'van', '’n', 'buite-egtelike', 'verhouding', '.'],
  'ner_tags': [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'langs': ['af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af', 'af'],
  'spans': ['PER: Swjatopolk']
}

Data Fields

tokens : a list of string features.
langs : a list of string features that correspond to the language of each token.
ner_tags : a list of classification labels, with possible values including O (0), B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6).
spans : a list of string features, that is the list of named entities in the input text formatted as <TAG>: <mention>

Data Splits

For each configuration subset, the data is split into "train", "validation" and "test" sets, each containing the following number of examples:

Train	Validation	Test
ace	100	100	100
af	5000	1000	1000
als	100	100	100
am	100	100	100
an	1000	1000	1000
ang	100	100	100
ar	20000	10000	10000
arc	100	100	100
arz	100	100	100
as	100	100	100
ast	1000	1000	1000
ay	100	100	100
az	10000	1000	1000
ba	100	100	100
bar	100	100	100
bat-smg	100	100	100
be	15000	1000	1000
be-x-old	5000	1000	1000
bg	20000	10000	10000
bh	100	100	100
bn	10000	1000	1000
bo	100	100	100
br	1000	1000	1000
bs	15000	1000	1000
ca	20000	10000	10000
cbk-zam	100	100	100
cdo	100	100	100
ce	100	100	100
ceb	100	100	100
ckb	1000	1000	1000
co	100	100	100
crh	100	100	100
cs	20000	10000	10000
csb	100	100	100
cv	100	100	100
cy	10000	1000	1000
da	20000	10000	10000
de	20000	10000	10000
diq	100	100	100
dv	100	100	100
el	20000	10000	10000
eml	100	100	100
en	20000	10000	10000
eo	15000	10000	10000
es	20000	10000	10000
et	15000	10000	10000
eu	10000	10000	10000
ext	100	100	100
fa	20000	10000	10000
fi	20000	10000	10000
fiu-vro	100	100	100
fo	100	100	100
fr	20000	10000	10000
frr	100	100	100
fur	100	100	100
fy	1000	1000	1000
ga	1000	1000	1000
gan	100	100	100
gd	100	100	100
gl	15000	10000	10000
gn	100	100	100
gu	100	100	100
hak	100	100	100
he	20000	10000	10000
hi	5000	1000	1000
hr	20000	10000	10000
hsb	100	100	100
hu	20000	10000	10000
hy	15000	1000	1000
ia	100	100	100
id	20000	10000	10000
ig	100	100	100
ilo	100	100	100
io	100	100	100
is	1000	1000	1000
it	20000	10000	10000
ja	20000	10000	10000
jbo	100	100	100
jv	100	100	100
ka	10000	10000	10000
kk	1000	1000	1000
km	100	100	100
kn	100	100	100
ko	20000	10000	10000
ksh	100	100	100
ku	100	100	100
ky	100	100	100
la	5000	1000	1000
lb	5000	1000	1000
li	100	100	100
lij	100	100	100
lmo	100	100	100
ln	100	100	100
lt	10000	10000	10000
lv	10000	10000	10000
map-bms	100	100	100
mg	100	100	100
mhr	100	100	100
mi	100	100	100
min	100	100	100
mk	10000	1000	1000
ml	10000	1000	1000
mn	100	100	100
mr	5000	1000	1000
ms	20000	1000	1000
mt	100	100	100
mwl	100	100	100
my	100	100	100
mzn	100	100	100
nap	100	100	100
nds	100	100	100
ne	100	100	100
nl	20000	10000	10000
nn	20000	1000	1000
no	20000	10000	10000
nov	100	100	100
oc	100	100	100
or	100	100	100
os	100	100	100
pa	100	100	100
pdc	100	100	100
pl	20000	10000	10000
pms	100	100	100
pnb	100	100	100
ps	100	100	100
pt	20000	10000	10000
qu	100	100	100
rm	100	100	100
ro	20000	10000	10000
ru	20000	10000	10000
rw	100	100	100
sa	100	100	100
sah	100	100	100
scn	100	100	100
sco	100	100	100
sd	100	100	100
sh	20000	10000	10000
si	100	100	100
simple	20000	1000	1000
sk	20000	10000	10000
sl	15000	10000	10000
so	100	100	100
sq	5000	1000	1000
sr	20000	10000	10000
su	100	100	100
sv	20000	10000	10000
sw	1000	1000	1000
szl	100	100	100
ta	15000	1000	1000
te	1000	1000	1000
tg	100	100	100
th	20000	10000	10000
tk	100	100	100
tl	10000	1000	1000
tr	20000	10000	10000
tt	1000	1000	1000
ug	100	100	100
uk	20000	10000	10000
ur	20000	1000	1000
uz	1000	1000	1000
vec	100	100	100
vep	100	100	100
vi	20000	10000	10000
vls	100	100	100
vo	100	100	100
wa	100	100	100
war	100	100	100
wuu	100	100	100
xmf	100	100	100
yi	100	100	100
yo	100	100	100
zea	100	100	100
zh	20000	10000	10000
zh-classical	100	100	100
zh-min-nan	100	100	100
zh-yue	20000	10000	10000

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

[More Information Needed]

Citation Information

The original 282 datasets are associated with this article

@inproceedings{pan-etal-2017-cross,
    title = "Cross-lingual Name Tagging and Linking for 282 Languages",
    author = "Pan, Xiaoman  and
      Zhang, Boliang  and
      May, Jonathan  and
      Nothman, Joel  and
      Knight, Kevin  and
      Ji, Heng",
    booktitle = "Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2017",
    address = "Vancouver, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P17-1178",
    doi = "10.18653/v1/P17-1178",
    pages = "1946--1958",
    abstract = "The ambitious goal of this work is to develop a cross-lingual name tagging and linking framework for 282 languages that exist in Wikipedia. Given a document in any of these languages, our framework is able to identify name mentions, assign a coarse-grained or fine-grained type to each mention, and link it to an English Knowledge Base (KB) if it is linkable. We achieve this goal by performing a series of new KB mining methods: generating {``}silver-standard{''} annotations by transferring annotations from English to other languages through cross-lingual links and KB properties, refining annotations through self-training and topic selection, deriving language-specific morphology features from anchor links, and mining word translation pairs from cross-lingual links. Both name tagging and linking results for 282 languages are promising on Wikipedia data and on-Wikipedia data.",
}

while the 176 languages supported in this version are associated with the following article

@inproceedings{rahimi-etal-2019-massively,
    title = "Massively Multilingual Transfer for {NER}",
    author = "Rahimi, Afshin  and
      Li, Yuan  and
      Cohn, Trevor",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1015",
    pages = "151--164",
}

Contributions

Thanks to @lewtun and @rabeehk for adding this dataset.

作者:

佚名

数据集大小:

742.79 KB