Dataset:
TurkuNLP/turku_paraphrase_corpus
The project gathered a large dataset of Finnish paraphrase pairs (over 100,000). The paraphrases are selected and classified manually, so as to minimize lexical overlap, and provide examples that are maximally structurally and lexically different. The objective is to create a dataset which is challenging and better tests the capabilities of natural language understanding. An important feature of the data is that most paraphrase pairs are distributed in their document context. The primary application for the dataset is the development and evaluation of deep language models, and representation learning in general.
Usage:
from datasets import load_dataset
dataset = load_dataset('TurkuNLP/turku_paraphrase_corpus', name="plain")
where name is one of the supported loading options: plain, plain-context, classification, classification-context, or generation. See Data Fields for more information.
Finnish
The dataset consists of pairs of text passages. A typical passage is about one sentence long, but a passage may also be longer or shorter than a sentence. Each example includes two text passages (string), a manually annotated label indicating the paraphrase type (string), and additional metadata. The dataset has three base configurations: plain, classification, and generation. The plain configuration loads the original data without any additional preprocessing or transformation. The classification configuration builds the data in a form suitable for training a paraphrase classifier: each example appears twice, once in each direction, (text1, text2, label) --> (text2, text1, label), and the label is flipped where needed (paraphrases with the directionality flag < or >). The generation configuration preprocesses the examples for the paraphrase generation task. Paraphrases not suitable for generation (negative and highly context-dependent paraphrases) are discarded, and directional paraphrases are provided so that generation goes from the more detailed passage to the more general one, which prevents model hallucination (i.e. the model learning to introduce new information). The remaining paraphrases are provided in both directions, (text1, text2, label) --> (text2, text1, label).
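The direction doubling described for the classification configuration can be illustrated with a minimal sketch. This is not the loader's actual code: the helper names `flip_label` and `double_example` are hypothetical, and the sketch assumes the directionality flag appears as a literal `<` or `>` character inside the label string.

```python
def flip_label(label: str) -> str:
    """Swap the directionality flags '<' and '>' in a label string.

    Assumption: the flag is a literal character in the label; labels
    without a flag are returned unchanged.
    """
    swap = {"<": ">", ">": "<"}
    return "".join(swap.get(ch, ch) for ch in label)


def double_example(example: dict) -> list:
    """Return the example in both directions, flipping the label if needed."""
    reversed_example = dict(
        example,
        text1=example["text2"],
        text2=example["text1"],
        label=flip_label(example["label"]),
    )
    return [example, reversed_example]
```

Applied to a plain-style example, `double_example` yields the original pair followed by its reversed counterpart; all other fields are copied unchanged.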
Each pair in the plain and classification configurations includes the following fields:
id: Identifier of the paraphrase pair (string)
gem_id: Identifier of the paraphrase pair in the GEM dataset (string)
goeswith: Identifier of the document from which the paraphrase was extracted; set to "not available" when the paraphrase does not come from document-structured data. All examples with the same goeswith value (other than "not available") should be kept together in any train/dev/test split. Most users won't need this (string)
fold: 0-99; the data is split into 100 parts respecting document boundaries. Because all paraphrases from one document are in the same fold, you can use this e.g. to implement cross-validation safely. Most users won't need this (int)
text1: First paraphrase passage (string)
text2: Second paraphrase passage (string)
label: Manually annotated label (string)
binary_label: Label converted to binary, with the values positive (paraphrase) and negative (not-paraphrase) (string)
is_rewrite: Indicates whether the example is a human-produced rewrite or a naturally occurring paraphrase (bool)
Each pair in the generation configuration includes the same fields, except that text1 and text2 are renamed to input and output to indicate the generation direction. Thus the fields are: id, gem_id, goeswith, fold, input, output, label, binary_label, and is_rewrite.
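As an illustration of how the fold field supports document-safe splits, here is a minimal sketch. The function name `split_by_fold` and the choice of which folds go to dev and test are assumptions made for this example; the only thing taken from the description above is that all paraphrases from one document share the same fold.

```python
def split_by_fold(examples, dev_folds, test_folds):
    """Partition examples into train/dev/test by their 'fold' value.

    Because every paraphrase from one document shares a fold, assigning
    whole folds to a split keeps each document inside a single split.
    """
    dev_folds, test_folds = set(dev_folds), set(test_folds)
    if dev_folds & test_folds:
        raise ValueError("dev_folds and test_folds must not overlap")
    splits = {"train": [], "dev": [], "test": []}
    for example in examples:
        if example["fold"] in test_folds:
            splits["test"].append(example)
        elif example["fold"] in dev_folds:
            splits["dev"].append(example)
        else:
            splits["train"].append(example)
    return splits
```

For k-fold cross-validation, the same function can be called once per round with a different group of folds held out, for example folds 0-9 in the first round and 10-19 in the second.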
Context: Most (but not all) of the paraphrase pairs are identified in their document context. By default, these contexts are not included, to conserve memory, but they can be accessed using the configurations plain-context and classification-context. These are exactly like plain and classification, with these additional fields:
context1: a dictionary with the fields doctext (string), begin (int), and end (int). The paraphrase in text1 was extracted from doctext[begin:end]. In most cases, doctext[begin:end] and text1 are exactly the same string, but occasionally they differ, e.g. when intervening punctuation or other unrelated text was "cleaned" from text1 during annotation. If the context is not available, doctext is an empty string and begin == end == 0.
context2: same as context1, but for text2
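The relation between a passage and its context can be sketched as follows. The helper names `extract_span` and `span_matches_text` are hypothetical, and the example dictionary is made up for illustration rather than taken from the corpus; only the field names doctext, begin, and end come from the description above.

```python
def extract_span(context: dict):
    """Return doctext[begin:end], or None when no context is available.

    An unavailable context is signalled by an empty doctext
    (with begin == end == 0).
    """
    if not context["doctext"]:
        return None
    return context["doctext"][context["begin"]:context["end"]]


def span_matches_text(text: str, context: dict) -> bool:
    """Check whether the passage is exactly the span found in its document.

    This can be False for valid examples, since passages were
    occasionally cleaned during annotation.
    """
    return extract_span(context) == text


# Made-up example in the shape of the context1 field.
context1 = {"doctext": "Hei. Mitä kuuluu? Hyvää.", "begin": 5, "end": 17}
span = extract_span(context1)
```

A mismatch reported by `span_matches_text` is not necessarily an error in the data, so it is better used for inspection than for filtering.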