Dataset:
LLukas22/cqadupstack
This is a preprocessed version of CQADupStack, made easy to consume via Hugging Face. The original dataset can be found here.
CQADupStack is a benchmark dataset for community question-answering (cQA) research. It contains threads from twelve StackExchange subforums, annotated with duplicate-question information, and comes with predefined training, development, and test splits for both retrieval and classification experiments.
An example of 'train' looks as follows.
{
  "question": "Very often, when some unknown company is calling me, in couple of seconds I see its name and logo on standard ...",
  "answer": "You didn't explicitely mention it, but from the context I assume you're using a device with Android 4.4 (Kitkat). With that ...",
  "title": "Why Dialer shows contact name and image, when contact is not in my address book?",
  "forum_tag": "android"
}
The data fields are the same among all splits.
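Because every split shares the same fields, a small check can confirm that a record has the expected shape before it is used. The following is a minimal sketch rather than part of the dataset's tooling: the field names come from the 'train' example above, while the helper `missing_fields` and the shortened `sample` record are hypothetical and exist only for illustration.

```python
from typing import Dict, List

# Field names taken from the 'train' example in this card.
EXPECTED_FIELDS = ("question", "answer", "title", "forum_tag")


def missing_fields(record: Dict[str, str]) -> List[str]:
    """Return the expected field names that are absent from a record."""
    return [name for name in EXPECTED_FIELDS if name not in record]


# Shortened, hypothetical record shaped like the example in this card.
sample = {
    "question": "Very often, when some unknown company is calling me ...",
    "answer": "You didn't explicitely mention it ...",
    "title": "Why Dialer shows contact name and image, when contact is not in my address book?",
    "forum_tag": "android",
}

print(missing_fields(sample))  # prints []
```

With the Hugging Face `datasets` library installed, records could presumably be obtained with `load_dataset("LLukas22/cqadupstack", split="train")` and passed to the same check. Any required configuration name is not stated in this card and would need to be confirmed on the dataset page.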
This dataset is distributed under the Apache 2.0 licence.