Dataset: google_wellformed_query
License: cc-by-sa-4.0
Preprint: arxiv:1808.09419
Source datasets: extended
Annotations creators: crowdsourced
Language creators: found
Size: 10K<n<100K
Multilinguality: monolingual
Language: en
Sub-tasks: text-scoring
Tasks: Text Classification

Google's query wellformedness dataset was created by crowdsourcing well-formedness annotations for 25,100 queries from the Paralex corpus. Every query was annotated by five raters, each giving a 1/0 rating of whether or not the query is well-formed.
[More Information Needed]
English
{'rating': 0.2, 'content': 'The European Union includes how many ?'}
| | Train | Valid | Test |
|---|---|---|---|
| Input Sentences | 17500 | 3750 | 3850 |
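
As a quick sanity check on the split sizes, the three counts can be summed and compared with the 25,100 annotated queries mentioned in the description. This is a minimal sketch; the names `split_sizes` and `shares` are illustrative assumptions, not part of the dataset.

```python
# Split sizes copied from the table of input sentences.
split_sizes = {"train": 17500, "valid": 3750, "test": 3850}

# Total number of queries across all splits.
total = sum(split_sizes.values())

# Share of each split as a percentage of all queries, rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in split_sizes.items()}

print(total)  # 25100
print(shares)
```

The total matches the number of annotated queries, so the three splits appear to partition the full set.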
Understanding search queries is a hard problem because it involves dealing with the “word salad” text that users commonly issue. However, if a query resembles a well-formed question, a natural language processing pipeline can interpret it more accurately, thus reducing downstream compounding errors. Hence, identifying whether or not a query is well-formed can enhance query understanding. This dataset introduces a new task: identifying well-formed natural language questions.
The dataset uses the Paralex corpus (Fader et al., 2013), which contains pairs of noisy paraphrase questions. These questions were issued by users of WikiAnswers (a question-answer forum) and consist of both web-search-query-like constructs (“5 parts of chloroplast?”) and well-formed questions (“What is the punishment for grand theft?”).
Initial Data Collection and Normalization
25,100 queries were selected from the unique list of queries extracted from the corpus such that no two queries in the selected set are paraphrases.
Who are the source language producers?
[More Information Needed]
A query is annotated as a well-formed question if it satisfies the following, and as non-well-formed otherwise:
Every query was labeled by five different crowdworkers with a binary label indicating whether the query is well-formed or not. The average of the five annotators' ratings is reported as the probability of the query being well-formed.
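
The averaging of the five binary labels can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `wellformedness_rating` and the sample votes are hypothetical, chosen so that the result matches the 0.2 rating in the example instance.

```python
from typing import Sequence


def wellformedness_rating(votes: Sequence[int]) -> float:
    """Average binary (1/0) annotator votes into a single rating."""
    if not votes:
        raise ValueError("at least one vote is required")
    if any(v not in (0, 1) for v in votes):
        raise ValueError("votes must be 0 or 1")
    return sum(votes) / len(votes)


# Hypothetical votes from five annotators: only one judged the query well-formed.
votes = [0, 1, 0, 0, 0]
rating = wellformedness_rating(votes)
print(rating)  # 0.2
```

With five annotators, the rating can only take the values 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.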
[More Information Needed]
Query-wellformedness dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided “As Is” without any warranty, express or implied.
@InProceedings{FaruquiDas2018,
  title = {{Identifying Well-formed Natural Language Questions}},
  author = {Faruqui, Manaal and Das, Dipanjan},
  booktitle = {Proc. of EMNLP},
  year = {2018}
}
Thanks to @vasudevgupta7 for adding this dataset.