数据集:
norec
许可:
源数据集:
original批注创建人:
expert-generated语言创建人:
found大小:
100K<n<1M计算机处理:
monolingual任务:
This dataset contains Norwegian Review Corpus (NoReC), created for the purpose of training and evaluating models for document-level sentiment analysis. More than 43,000 full-text reviews have been collected from major Norwegian news sources and cover a range of different domains, including literature, movies, video games, restaurants, music and theater, in addition to product reviews across a range of categories. Each review is labeled with a manually assigned score of 1–6, as provided by the rating of the original author.
[More Information Needed]
The sentences in the dataset are in Norwegian (nb, nn, no).
A sample from training set is provided below:
{'deprel': ['det',
'amod',
'cc',
'conj',
'nsubj',
'case',
'nmod',
'cop',
'case',
'case',
'root',
'flat:name',
'flat:name',
'punct'],
'deps': ['None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None'],
'feats': ["{'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Dem'}",
"{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}",
'None',
"{'Definite': 'Def', 'Degree': 'Pos', 'Number': 'Sing'}",
"{'Definite': 'Def', 'Gender': 'Masc', 'Number': 'Sing'}",
'None',
'None',
"{'Mood': 'Ind', 'Tense': 'Pres', 'VerbForm': 'Fin'}",
'None',
'None',
'None',
'None',
'None',
'None'],
'head': ['5',
'5',
'4',
'2',
'11',
'7',
'5',
'11',
'11',
'11',
'0',
'11',
'11',
'11'],
'idx': '000000-02-01',
'lemmas': ['den',
'andre',
'og',
'sist',
'sesong',
'av',
'Rome',
'være',
'ute',
'på',
'DVD',
'i',
'Norge',
'$.'],
'misc': ['None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
"{'SpaceAfter': 'No'}",
'None'],
'pos_tags': [5, 0, 4, 0, 7, 1, 11, 3, 1, 1, 11, 1, 11, 12],
'text': 'Den andre og siste sesongen av Rome er ute på DVD i Norge.',
'tokens': ['Den',
'andre',
'og',
'siste',
'sesongen',
'av',
'Rome',
'er',
'ute',
'på',
'DVD',
'i',
'Norge',
'.'],
'xpos_tags': ['None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None',
'None']}
The data instances have the following fields:
The part of speech taggs correspond to these labels: "ADJ" (0), "ADP" (1), "ADV" (2), "AUX" (3), "CCONJ" (4), "DET" (5), "INTJ" (6), "NOUN" (7), "NUM" (8), "PART" (9), "PRON" (10), "PROPN" (11), "PUNCT" (12), "SCONJ" (13), "SYM" (14), "VERB" (15), "X" (16),
The training, validation, and test set contain 680792 , 101106 , and 101594 sentences respectively.
[More Information Needed]
[More Information Needed]
Initial Data Collection and Normalization[More Information Needed]
Who are the source language producers?[More Information Needed]
[More Information Needed]
Annotation process[More Information Needed]
Who are the annotators?[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
@InProceedings{VelOvrBer18,
author = {Erik Velldal and Lilja {\O}vrelid and
Eivind Alexander Bergem and Cathrine Stadsnes and
Samia Touileb and Fredrik J{\o}rgensen},
title = {{NoReC}: The {N}orwegian {R}eview {C}orpus},
booktitle = {Proceedings of the 11th edition of the
Language Resources and Evaluation Conference},
year = {2018},
address = {Miyazaki, Japan},
pages = {4186--4191}
}
Thanks to @abhishekkrthakur for adding this dataset.