Dataset:

arabic_pos_dialect

Preprint:

arxiv:1708.05891

License:

apache-2.0

Annotations creators:

expert-generated

Source datasets:

extended

Task:

token-classification

Sub-task:

part-of-speech

Language:

Multilinguality:

multilingual

Size:

n<1K

Language creators:

found

Dataset Card for Arabic POS Dialect

Dataset Summary

This dataset was created to support part of speech (POS) tagging in dialects of Arabic. It contains sets of 350 manually segmented and POS tagged tweets for each of four dialects: Egyptian, Levantine, Gulf, and Maghrebi.

Supported Tasks and Leaderboards

The dataset can be used to train a model for Arabic token segmentation and part-of-speech tagging in Arabic dialects. Success on this task is typically measured by accuracy on a held-out set. Darwish et al. (2018) train a CRF model across all four dialects and achieve an average accuracy of 89.3%.
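As an illustration of the accuracy metric mentioned above, here is a minimal sketch that scores predicted tags against gold tags. It assumes scoring at the word level, where a word counts as correct only if its full `+`-joined tag string matches; the exact scoring granularity used by Darwish et al. is not stated in this card. The function name `token_accuracy` is hypothetical.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of words whose predicted tag string equals the gold one."""
    correct = 0
    total = 0
    for gold_tags, pred_tags in zip(gold, pred):
        if len(gold_tags) != len(pred_tags):
            raise ValueError("gold and predicted sequences must have the same length")
        for gold_tag, pred_tag in zip(gold_tags, pred_tags):
            correct += int(gold_tag == pred_tag)
            total += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return correct / total


# One tweet with six words; the last prediction misses the pronoun suffix.
gold = [["PART", "PART", "V", "NOUN", "PREP", "NOUN+PRON"]]
pred = [["PART", "PART", "V", "NOUN", "PREP", "NOUN"]]
accuracy = token_accuracy(gold, pred)  # 5 of 6 words match
```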

Languages

The BCP-47 code is ar-Arab. The dataset covers four dialects of Arabic written in Arabic script: Egyptian (EGY), Levantine (LEV), Gulf (GLF), and Maghrebi (MGR).

Dataset Structure

Data Instances

Below is a partial example from the Egyptian set:

- `Fold`: 4
- `SubFold`: A
- `Word`: [ليه, لما, تحب, حد, من, قلبك, ...]
- `Segmentation`: [ليه, لما, تحب, حد, من, قلب+ك, ...]
- `POS`: [PART, PART, V, NOUN, PREP, NOUN+PRON, ...]

Data Fields

The `fold` and `subfold` fields refer to the cross-validation splits used by Darwish et al., which can be generated using this script.

- `fold`: An int32 indicating which cross-validation fold the instance was in
- `subfold`: A string, either 'A' or 'B', indicating which subfold the instance was in
- `words`: A sequence of strings, one per unsegmented token
- `segments`: A sequence of strings, one per token, with the token's segments separated by '+' if there is more than one segment
- `pos_tags`: A sequence of strings, one per token, with the part-of-speech tags of the token's segments separated by '+' if there is more than one segment
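To make the `+` convention concrete, the following sketch pairs each segment of a token with its tag. It assumes an instance is a plain dict keyed by the field names listed above (`words`, `segments`, `pos_tags`); the helper name `align_segments` is hypothetical, and the values are taken from the partial Egyptian example. A token that is itself a literal `+` would need special handling that this sketch does not attempt.

```python
from typing import Dict, List, Tuple


def align_segments(instance: Dict[str, List[str]]) -> List[List[Tuple[str, str]]]:
    """Pair every segment of every token with its part-of-speech tag."""
    aligned = []
    for word, segmentation, tags in zip(
        instance["words"], instance["segments"], instance["pos_tags"]
    ):
        pieces = segmentation.split("+")
        labels = tags.split("+")
        if len(pieces) != len(labels):
            raise ValueError(f"segment/tag count mismatch for token {word!r}")
        aligned.append(list(zip(pieces, labels)))
    return aligned


# Values copied from the partial Egyptian example; key names are assumed.
instance = {
    "words": ["ليه", "لما", "تحب", "حد", "من", "قلبك"],
    "segments": ["ليه", "لما", "تحب", "حد", "من", "قلب+ك"],
    "pos_tags": ["PART", "PART", "V", "NOUN", "PREP", "NOUN+PRON"],
}
aligned = align_segments(instance)
# The last token splits into a noun stem and a pronoun suffix.
```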

The POS tags consist of a set developed by Darwish et al. (2017) for Modern Standard Arabic (MSA) plus an additional 6 tags (2 dialect-specific tags and 4 tweet-specific tags).

Tag	Purpose	Description
ADV	MSA	Adverb
ADJ	MSA	Adjective
CONJ	MSA	Conjunction
DET	MSA	Determiner
NOUN	MSA	Noun
NSUFF	MSA	Noun suffix
NUM	MSA	Number
PART	MSA	Particle
PREP	MSA	Preposition
PRON	MSA	Pronoun
PUNC	MSA	Punctuation
V	MSA	Verb
ABBREV	MSA	Abbreviation
CASE	MSA	Alef of tanween fatha
JUS	MSA	Jussification attached to verbs
VSUFF	MSA	Verb Suffix
FOREIGN	MSA	Non-Arabic as well as non-MSA words
FUT_PART	MSA	Future particle "s" prefix and "swf"
PROG_PART	Dialect	Progressive particle
NEG_PART	Dialect	Negation particle
HASH	Tweet	Hashtag
EMOT	Tweet	Emoticon/Emoji
MENTION	Tweet	Mention
URL	Tweet	URL
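The Purpose column can be turned into a small lookup, for example to measure how much of a tweet is dialect-specific or tweet-specific material. The sketch below lists only the 2 dialect tags and 4 tweet tags explicitly and, as a simplifying assumption, treats every other tag as belonging to the MSA set. The names `tag_purpose` and `purpose_counts` are hypothetical, and the tag sequence at the end is invented for illustration rather than drawn from the data.

```python
from collections import Counter
from typing import Iterable

DIALECT_TAGS = frozenset({"PROG_PART", "NEG_PART"})
TWEET_TAGS = frozenset({"HASH", "EMOT", "MENTION", "URL"})


def tag_purpose(tag: str) -> str:
    """Map a single tag to its Purpose value from the table."""
    if tag in DIALECT_TAGS:
        return "Dialect"
    if tag in TWEET_TAGS:
        return "Tweet"
    return "MSA"  # assumption: anything not listed above is an MSA tag


def purpose_counts(pos_tags: Iterable[str]) -> Counter:
    """Count segment tags by purpose, splitting '+'-joined tag strings."""
    counts: Counter = Counter()
    for token_tags in pos_tags:
        for tag in token_tags.split("+"):
            counts[tag_purpose(tag)] += 1
    return counts


# Invented tag sequence, for illustration only.
counts = purpose_counts(["MENTION", "NEG_PART+V", "NOUN+PRON", "URL"])
```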

Data Splits

The dataset is split by dialect.

Dialect	Tweets	Words
Egyptian (EGY)	350	7481
Levantine (LEV)	350	7221
Gulf (GLF)	350	6767
Maghrebi (MGR)	350	6400
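Within one dialect, a cross-validation partition can be rebuilt from the `fold` field described under Data Fields. This is a minimal sketch, assuming each instance is a dict with an integer `fold` key. How Darwish et al. use `subfold` is not described in this card, so the sketch ignores it. The helper name `split_by_fold` and the toy instances are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Instance = Dict[str, object]


def split_by_fold(
    instances: Iterable[Instance], test_fold: int
) -> Tuple[List[Instance], List[Instance]]:
    """Hold out every instance whose `fold` equals `test_fold`; train on the rest."""
    train: List[Instance] = []
    test: List[Instance] = []
    for instance in instances:
        if instance["fold"] == test_fold:
            test.append(instance)
        else:
            train.append(instance)
    return train, test


# Toy instances with made-up fold assignments, for illustration only.
toy = [
    {"fold": 1, "subfold": "A", "words": ["w1"]},
    {"fold": 2, "subfold": "B", "words": ["w2"]},
    {"fold": 4, "subfold": "A", "words": ["w3"]},
    {"fold": 4, "subfold": "B", "words": ["w4"]},
]
train, test = split_by_fold(toy, test_fold=4)
```

Iterating `test_fold` over the distinct fold values present in the data yields one train/test pair per fold.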

Dataset Creation

Curation Rationale

This dataset was created to address the lack of computational resources available for dialects of Arabic. These dialects are typically used in speech, while written forms of the language are typically in Modern Standard Arabic. Social media, however, has provided a venue for people to use dialects in written format.

Source Data

This dataset builds on the work of Eldesouki et al. (2017) and Samih et al. (2017b), who originally collected the tweets.

Initial Data Collection and Normalization

They started with 175 million Arabic tweets returned by the Twitter API using the query "lang:ar" in March 2014. They then filtered this set using author-identified locations and tokens that are unique to each dialect. Finally, they had native speakers of each dialect select 350 tweets that were heavily accented.

Who are the source language producers?

The source language producers are people who posted on Twitter in Arabic using dialectal words from countries where the dialects of interest are spoken, as identified in Mubarak and Darwish (2014).

Annotations

Annotation process

The segmentation guidelines are available online (https://alt.qcri.org/resources1/da_resources/seg-guidelines.pdf). The tagging guidelines are not provided, but Darwish et al. note that there were multiple rounds of quality control and revision.

Who are the annotators?

The POS tags were annotated by native speakers of each dialect. Further information is not known.

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

Darwish et al. find that accuracy on the Maghrebi dataset suffered the most when the training set came from another dialect, and conversely that training on Maghrebi yielded the worst results for all the other dialects. They suggest that Egyptian, Levantine, and Gulf may be more similar to each other, with Maghrebi the most dissimilar to all of them. They also find that training on Modern Standard Arabic (MSA) and testing on dialects yielded significantly lower results than training on dialects and testing on MSA. This suggests that dialectal variation should be a significant consideration for future work in Arabic NLP applications, particularly when working with social media text.

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

This dataset was curated by Kareem Darwish, Hamdy Mubarak, Mohamed Eldesouki and Ahmed Abdelali with the Qatar Computing Research Institute (QCRI), Younes Samih and Laura Kallmeyer with the University of Dusseldorf, Randah Alharbi and Walid Magdy with the University of Edinburgh, and Mohammed Attia with Google. No funding information was included.

Licensing Information

This dataset is licensed under the Apache License, Version 2.0.

Citation Information

Kareem Darwish, Hamdy Mubarak, Ahmed Abdelali, Mohamed Eldesouki, Younes Samih, Randah Alharbi, Mohammed Attia, Walid Magdy and Laura Kallmeyer (2018) Multi-Dialect Arabic POS Tagging: A CRF Approach. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), May 7-12, 2018. Miyazaki, Japan.

@InProceedings{DARWISH18.562,
  author = {Kareem Darwish and Hamdy Mubarak and Ahmed Abdelali and Mohamed Eldesouki and Younes Samih and Randah Alharbi and Mohammed Attia and Walid Magdy and Laura Kallmeyer},
  title = {Multi-Dialect Arabic POS Tagging: A CRF Approach},
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {may},
  date = {7-12},
  location = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {979-10-95546-00-9},
  language = {english}
  }

Contributions

Thanks to @mcmillanmajora for adding this dataset.

Dataset size:

30.22 KB