Dataset: financial_phrasebank
Task: text classification
Language: en
Multilinguality: monolingual
Size: 1K<n<10K
Language creators: found
Annotation creators: expert-generated
Source datasets: original
Preprint: arxiv:1307.5336
License: cc-by-nc-sa-3.0
Polar sentiment dataset of sentences from financial news. The dataset consists of 4840 sentences from English language financial news categorised by sentiment. The dataset is divided by agreement rate of 5-8 annotators.
Sentiment Classification
English
{ "sentence": "Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing .", "label": "negative" }
There's no train/validation/test split.
However, the dataset is available in four configurations, depending on the percentage of annotator agreement:
sentences_50agree: Number of instances with >=50% annotator agreement: 4846
sentences_66agree: Number of instances with >=66% annotator agreement: 4217
sentences_75agree: Number of instances with >=75% annotator agreement: 3453
sentences_allagree: Number of instances with 100% annotator agreement: 2264
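Because no train/validation/test split ships with the data, users generally have to create their own. The following is a minimal sketch of one way to do that, a label-stratified split in plain Python. The function name `stratified_split`, its `test_fraction` and `seed` parameters, and the toy records are all hypothetical and only mirror the `sentence`/`label` shape of the instance shown above; this is not a split defined by the dataset authors.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split examples into train/test while keeping label proportions similar."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group the examples by their sentiment label.
    by_label = defaultdict(list)
    for example in examples:
        by_label[example["label"]].append(example)

    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]  # copy so the caller's data is not reordered
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy records in the dataset's instance shape (the sentences are placeholders).
examples = (
    [{"sentence": f"neutral {i}", "label": "neutral"} for i in range(10)]
    + [{"sentence": f"positive {i}", "label": "positive"} for i in range(5)]
    + [{"sentence": f"negative {i}", "label": "negative"} for i in range(5)]
)

train, test = stratified_split(examples, test_fraction=0.2, seed=0)
print(len(train), len(test))
```

Stratifying by label is a design choice rather than a requirement. It keeps each sentiment class represented in both parts of the split, which matters when one class is much more frequent than the others.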
The key arguments for the low utilization of statistical techniques in financial sentiment analysis have been the difficulty of implementation for practical applications and the lack of high quality training data for building such models. Especially in the case of finance and economic texts, annotated collections are a scarce resource and many are reserved for proprietary use only. To resolve the missing training data problem, we present a collection of ∼ 5000 sentences to establish human-annotated standards for benchmarking alternative modeling techniques.
The objective of the phrase level annotation task was to classify each example sentence into a positive, negative or neutral category by considering only the information explicitly available in the given sentence. Since the study is focused only on financial and economic domains, the annotators were asked to consider the sentences from the view point of an investor only; i.e. whether the news may have positive, negative or neutral influence on the stock price. As a result, sentences which have a sentiment that is not relevant from an economic or financial perspective are considered neutral.
The corpus used in this paper is made out of English news on all listed companies in OMX Helsinki. The news has been downloaded from the LexisNexis database using an automated web scraper. Out of this news database, a random subset of 10,000 articles was selected to obtain good coverage across small and large companies, companies in different industries, as well as different news sources. Following the approach taken by Maks and Vossen (2010), we excluded all sentences which did not contain any of the lexicon entities. This reduced the overall sample to 53,400 sentences, each of which contains at least one recognized lexicon entity. The sentences were then classified according to the types of entity sequences detected. Finally, a random sample of ∼5000 sentences was chosen to represent the overall news database.
Who are the source language producers?
The source data was written by various financial journalists.
This release of the financial phrase bank covers a collection of 4840 sentences. The selected collection of phrases was annotated by 16 people with adequate background knowledge on financial markets.
Given the large number of overlapping annotations (5 to 8 annotations per sentence), there are several ways to define a majority vote based gold standard. To provide an objective comparison, we have formed 4 alternative reference datasets based on the strength of majority agreement: at least 50%, at least 66%, at least 75%, and 100% annotator agreement.
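The agreement-based grouping can be illustrated with a short sketch. It assumes that agreement is measured as the share of annotators who chose the most frequent label, and that ties are broken by first occurrence; neither detail is stated in the source. The function names and the example votes are hypothetical, while the thresholds come from the four configuration names.

```python
from collections import Counter

# Agreement thresholds (in percent) taken from the four configuration names.
THRESHOLDS = {
    "sentences_50agree": 50,
    "sentences_66agree": 66,
    "sentences_75agree": 75,
    "sentences_allagree": 100,
}


def majority_label(votes):
    """Return (label, count) for the most frequent vote.

    Ties are broken by first occurrence, which is an assumption of this sketch.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count


def configurations_for(votes):
    """List the configurations whose agreement threshold the votes meet."""
    _, count = majority_label(votes)
    total = len(votes)
    # Integer arithmetic avoids floating-point issues at exact boundaries.
    return [name for name, pct in THRESHOLDS.items() if count * 100 >= pct * total]


# Hypothetical votes from six annotators for a single sentence.
votes = ["negative", "negative", "negative", "neutral", "negative", "negative"]
label, count = majority_label(votes)
configs = configurations_for(votes)
print(label, count, configs)
```

Under these assumptions the configurations are nested: a sentence that meets a stricter threshold also meets every looser one, which is consistent with the instance counts shrinking as the required agreement rises.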
Who are the annotators?
Three of the annotators were researchers and the remaining 13 annotators were master's students at Aalto University School of Business with majors primarily in finance, accounting, and economics.
All annotators were from the same institution, so inter-annotator agreement should be interpreted with this in mind.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ .
If you are interested in commercial use of the data, please contact the following authors for an appropriate license:
@article{Malo2014GoodDO,
  title={Good debt or bad debt: Detecting semantic orientations in economic texts},
  author={P. Malo and A. Sinha and P. Korhonen and J. Wallenius and P. Takala},
  journal={Journal of the Association for Information Science and Technology},
  year={2014},
  volume={65}
}
Thanks to @frankier for adding this dataset.