Dataset: norec
License: cc-by-nc-4.0
Source datasets: original
Annotations creators: expert-generated
Language creators: found
Size: 100K<n<1M
Multilinguality: monolingual
Task: token classification

This dataset contains the Norwegian Review Corpus (NoReC), created for training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources. They cover a range of domains, including literature, movies, video games, restaurants, music and theater, as well as product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, taken from the rating given by the original author.
[More Information Needed]
The sentences in the dataset are in Norwegian (nb, nn, no).
A sample from the training set is provided below:
{'deprel': ['det', 'amod', 'cc', 'conj', 'nsubj', 'case', 'nmod', 'cop', 'case', 'case', 'root', 'flat:name', 'flat:name', 'punct'],
 'deps': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None'],
 'feats': ["{'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Dem'}", "{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}", 'None', "{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}", "{'Definite': 'Def', 'Gender': 'Masc', 'Number': 'Sing'}", 'None', 'None', "{'Mood': 'Ind', 'Tense': 'Pres', 'VerbForm': 'Fin'}", 'None', 'None', 'None', 'None', 'None', 'None'],
 'head': ['5', '5', '4', '2', '11', '7', '5', '11', '11', '11', '0', '11', '11', '11'],
 'idx': '000000-02-01',
 'lemmas': ['den', 'andre', 'og', 'sist', 'sesong', 'av', 'Rome', 'være', 'ute', 'på', 'DVD', 'i', 'Norge', '$.'],
 'misc': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', "{'SpaceAfter': 'No'}", 'None'],
 'pos_tags': [5, 0, 4, 0, 7, 1, 11, 3, 1, 1, 11, 1, 11, 12],
 'text': 'Den andre og siste sesongen av Rome er ute på DVD i Norge.',
 'tokens': ['Den', 'andre', 'og', 'siste', 'sesongen', 'av', 'Rome', 'er', 'ute', 'på', 'DVD', 'i', 'Norge', '.'],
 'xpos_tags': ['None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None', 'None']}
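The sample stores the dependency tree as parallel lists and the morphological features as stringified dictionaries. The sketch below shows one way to turn them into more convenient structures. It assumes that head values are 1-based token positions with '0' marking the root, as in the CoNLL-U convention (the sample is consistent with this, but the card does not state it). The helper names dependency_edges and parse_feats are hypothetical and not part of any library.

```python
import ast

# Values copied from the training sample above.
tokens = ['Den', 'andre', 'og', 'siste', 'sesongen', 'av', 'Rome',
          'er', 'ute', 'på', 'DVD', 'i', 'Norge', '.']
head = ['5', '5', '4', '2', '11', '7', '5', '11', '11', '11', '0',
        '11', '11', '11']
deprel = ['det', 'amod', 'cc', 'conj', 'nsubj', 'case', 'nmod', 'cop',
          'case', 'case', 'root', 'flat:name', 'flat:name', 'punct']
first_feats = "{'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Dem'}"


def dependency_edges(tokens, head, deprel):
    """Return (dependent, relation, governor) triples.

    Assumption: head indices are 1-based and '0' denotes the root.
    """
    edges = []
    for token, head_index, relation in zip(tokens, head, deprel):
        position = int(head_index)
        governor = "ROOT" if position == 0 else tokens[position - 1]
        edges.append((token, relation, governor))
    return edges


def parse_feats(raw):
    """Parse a stringified feature dict; the string 'None' becomes None."""
    return ast.literal_eval(raw)


edges = dependency_edges(tokens, head, deprel)
feats = parse_feats(first_feats)
```

Using ast.literal_eval rather than eval keeps the parsing restricted to Python literals, which is enough for the dictionaries and 'None' strings seen in the sample.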
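The data instances have the following fields, as seen in the sample above: deprel, deps, feats, head, idx, lemmas, misc, pos_tags, text, tokens, and xpos_tags.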
The part-of-speech tags correspond to these labels: "ADJ" (0), "ADP" (1), "ADV" (2), "AUX" (3), "CCONJ" (4), "DET" (5), "INTJ" (6), "NOUN" (7), "NUM" (8), "PART" (9), "PRON" (10), "PROPN" (11), "PUNCT" (12), "SCONJ" (13), "SYM" (14), "VERB" (15), "X" (16).
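Because pos_tags holds integer IDs, a lookup list ordered by ID is enough to recover the label names. The following is a minimal sketch that uses the mapping listed above; the names UPOS_LABELS and decode_pos_tags are illustrative, not part of the dataset or of any library.

```python
# Labels ordered by their integer ID, as listed in this card.
UPOS_LABELS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]


def decode_pos_tags(tag_ids):
    """Map a list of integer tag IDs to their label strings."""
    return [UPOS_LABELS[tag_id] for tag_id in tag_ids]


# pos_tags of the training sample shown earlier in this card.
sample_pos_tags = [5, 0, 4, 0, 7, 1, 11, 3, 1, 1, 11, 1, 11, 12]
decoded = decode_pos_tags(sample_pos_tags)
```

For the sample sentence, the first token "Den" (ID 5) decodes to "DET" and the final "." (ID 12) to "PUNCT".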
The training, validation, and test sets contain 680792, 101106, and 101594 sentences, respectively.
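As a quick check of how the data is divided, the share of each split can be computed from the sentence counts above. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Sentence counts per split, as reported in this card.
split_sizes = {"train": 680792, "validation": 101106, "test": 101594}

total = sum(split_sizes.values())

# Fraction of all sentences that falls into each split.
shares = {name: size / total for name, size in split_sizes.items()}

for name, share in shares.items():
    print(f"{name}: {split_sizes[name]} sentences ({share:.2%})")
```

The counts add up to 883492 sentences, with roughly 77% in the training set and about 11.5% each in the validation and test sets.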
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization: [More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Annotation process: [More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@InProceedings{VelOvrBer18,
  author    = {Erik Velldal and Lilja {\O}vrelid and Eivind Alexander Bergem and Cathrine Stadsnes and Samia Touileb and Fredrik J{\o}rgensen},
  title     = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
  booktitle = {Proceedings of the 11th edition of the Language Resources and Evaluation Conference},
  year      = {2018},
  address   = {Miyazaki, Japan},
  pages     = {4186--4191}
}
Thanks to @abhishekkrthakur for adding this dataset.