Dataset:

GroNLP/ik-nlp-22_pestyle

Tasks:

Translation

Languages:

Multilinguality:

translation

Size:

1K<n<10K

Language creators:

found

Annotation creators:

machine-generated expert-generated

Source datasets:

original

License:

other


Dataset Card for IK-NLP-22 Project 1: A Study in Post-Editing Stylometry

Dataset Summary

This dataset contains a sample of sentences taken from the FLORES-101 dataset that were either translated from scratch or post-edited from an existing automatic translation by three human translators. Translations were performed for the English-Italian language pair, and translators' behavioral data (keystrokes, pauses, editing times) were collected using the PET platform.

This dataset is made available for final projects of the 2022 edition of the Natural Language Processing course at the Information Science Master's Degree at the University of Groningen, taught by Arianna Bisazza and Gabriele Sarti with the assistance of Anjali Nair.

Disclaimer: This repository is provided without direct data access due to currently unpublished results. For this reason, it is strictly forbidden to share or publish any of the data associated with this repository. Students will be provided with a compressed folder containing the data upon choosing a project based on this dataset. To load the dataset using 🤗 Datasets, download and unzip the provided folder and pass it to the load_dataset method as: datasets.load_dataset('GroNLP/ik-nlp-22_pestyle', 'full', data_dir='path/to/unzipped/folder')

Languages

The language data is in English (BCP-47 en) and Italian (BCP-47 it).

Dataset Structure

Data Instances

The dataset contains four configurations: full, test_mask_subject, test_mask_modality and test_mask_time. full contains the main train split, in which all fields are available. The other three (test_mask_subject, test_mask_modality, test_mask_time) each contain a test split with different fields removed to avoid information leaking during evaluation. See more details in the Data Splits section.
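The configurations above can be loaded with a small helper. This is a minimal sketch, not an official loader: the helper name load_pestyle and the CONFIG_SPLITS mapping are illustrative, the split names "train" and "test" are taken from the description above, and data_dir must point to the unzipped folder provided to students. Depending on the installed 🤗 Datasets version, extra options may be needed for script-based datasets.

```python
# Configuration names and the split each one exposes, as described in the card.
CONFIG_SPLITS = {
    "full": "train",
    "test_mask_subject": "test",
    "test_mask_modality": "test",
    "test_mask_time": "test",
}


def load_pestyle(config, data_dir):
    """Load one configuration from the locally unzipped data folder.

    `load_pestyle` is a hypothetical helper name, not part of the dataset.
    """
    if config not in CONFIG_SPLITS:
        raise ValueError(f"Unknown configuration: {config!r}")
    # Imported lazily so the mapping above is usable without the library.
    import datasets

    return datasets.load_dataset(
        "GroNLP/ik-nlp-22_pestyle",
        config,
        data_dir=data_dir,
        split=CONFIG_SPLITS[config],
    )
```

For example, load_pestyle("full", "path/to/unzipped/folder") would return the train split, while the three test_mask_* configurations each return their test split.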

Data Fields

The following fields are contained in the training set:

Field	Description
item_id	The sentence identifier. The first digits of the number represent the document containing the sentence, while the last digit of the number represents the sentence position inside the document. Documents can contain from 3 to 5 semantically-related sentences each.
subject_id	The identifier for the translator performing the translation from scratch or post-editing task. Values: t1, t2 or t3.
modality	The modality of the translation task. Values: ht (translation from scratch), pe1 (post-editing Google Translate translations), pe2 (post-editing mBART translations).
src_text	The original source sentence extracted from Wikinews, Wikibooks or Wikivoyage.
mt_text	Missing if modality is ht. Otherwise, contains the automatically translated sentence before post-editing.
tgt_text	Final sentence produced by the translator (either via translation from scratch of src_text or post-editing of mt_text).
edit_time	Total editing time for the translation in seconds.
k_total	Total number of keystrokes for the translation.
k_letter	Total number of letter keystrokes for the translation.
k_digit	Total number of digit keystrokes for the translation.
k_white	Total number of whitespace keystrokes for the translation.
k_symbol	Total number of symbol (punctuation, etc.) keystrokes for the translation.
k_nav	Total number of navigation keystrokes (left-right arrows, mouse clicks) for the translation.
k_erase	Total number of erase keystrokes (backspace, cancel) for the translation.
k_copy	Total number of copy (Ctrl + C) actions during the translation.
k_cut	Total number of cut (Ctrl + X) actions during the translation.
k_paste	Total number of paste (Ctrl + V) actions during the translation.
n_pause_geq_300	Number of pauses of 300ms or more during the translation.
len_pause_geq_300	Total duration of pauses of 300ms or more, in milliseconds.
n_pause_geq_1000	Number of pauses of 1s or more during the translation.
len_pause_geq_1000	Total duration of pauses of 1000ms or more, in milliseconds.
num_annotations	Number of times the translator focused the textbox for performing the translation of the sentence during the translation session. E.g. 1 means the translation was performed once and never revised.
n_insert	Number of post-editing insertions (empty for modality ht) computed using the tercom library.
n_delete	Number of post-editing deletions (empty for modality ht) computed using the tercom library.
n_substitute	Number of post-editing substitutions (empty for modality ht) computed using the tercom library.
n_shift	Number of post-editing shifts (empty for modality ht) computed using the tercom library.
bleu	Sentence-level BLEU score between MT and post-edited fields (empty for modality ht) computed using the SacreBLEU library with default parameters.
chrf	Sentence-level chrF score between MT and post-edited fields (empty for modality ht) computed using the SacreBLEU library with default parameters.
ter	Sentence-level TER score between MT and post-edited fields (empty for modality ht) computed using the tercom library.
aligned_edit	Aligned visual representation of REF (mt_text), HYP (tgt_text) and edit operations (I = Insertion, D = Deletion, S = Substitution) performed on the field. Replace \\n with \n to show the three aligned rows.
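Two of the fields above need a small amount of decoding before use. The sketch below is illustrative only: the helper names are hypothetical, item_id is assumed to be an integer whose last digit is the sentence position, and aligned_edit is assumed to store a literal backslash followed by n where the rows should break.

```python
def split_item_id(item_id):
    """Return (document_id, sentence_position) for an integer item_id.

    The last digit is the sentence position, the leading digits the document.
    """
    return divmod(item_id, 10)


def show_aligned_edit(aligned_edit):
    """Replace escaped newlines with real ones so the three rows line up."""
    return aligned_edit.replace("\\n", "\n")


# Toy value with the same REF / HYP / EVAL layout as the real field.
toy = "REF:  a b\\nHYP:  a c\\nEVAL:   S"
print(split_item_id(1072))
print(show_aligned_edit(toy))
```

Printing the result of show_aligned_edit in a monospaced font keeps the edit markers in the EVAL row under the tokens they refer to.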

Data Splits

config	train	test
main	1170	120

Train Split

The train split contains a total of 1170 triplets (or pairs, when translation from scratch is performed) annotated with behavioral data produced during the translation. The following is an example of the subject t3 post-editing a machine translation produced by system 2 (modality pe2) taken from the train split. The field aligned_edit is shown over three lines to provide a visual understanding of its contents.

{
    "item_id": 1072,
    "subject_id": "t3",
    "modality": "pe2",
    "src_text": "At the beginning dress was heavily influenced by the Byzantine culture in the east.",
    "mt_text": "All'inizio il vestito era fortemente influenzato dalla cultura bizantina dell'est.",
    "tgt_text": "Inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.",
    "edit_time": 45.687,
    "k_total": 51,
    "k_letter": 31,
    "k_digit": 0,
    "k_white": 2,
    "k_symbol": 3,
    "k_nav": 7,
    "k_erase": 3,
    "k_copy": 0,
    "k_cut": 0,
    "k_paste": 0,
    "n_pause_geq_300": 9,
    "len_pause_geq_300": 40032,
    "n_pause_geq_1000": 5,
    "len_pause_geq_1000": 38392,
    "num_annotations": 1,
    "n_insert": 0.0,
    "n_delete": 1.0,
    "n_substitute": 3.0,
    "n_shift": 0.0,
    "bleu": 47.99,
    "chrf": 62.05,
    "ter": 40.0,
    "aligned_edit": "REF:  all'inizio il            vestito         era fortemente influenzato dalla cultura bizantina dell'est.\\n
                    HYP:  ********** inizialmente, l'abbigliamento era fortemente influenzato dalla cultura bizantina orientale.\\n 
                    EVAL: D          S             S                                                                  S"
}

The text is provided as-is, without further preprocessing or tokenization.
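As a sanity check on the example above, the ter value can be related to the edit counts. The sketch below assumes the usual TER definition (total edits divided by the number of reference tokens, as a percentage), whitespace tokenization, and mt_text as the reference, matching the REF label in aligned_edit. The function name is hypothetical and this is not the tercom implementation itself.

```python
def ter_from_counts(n_insert, n_delete, n_substitute, n_shift, reference):
    """TER as a percentage: total edits over the number of reference tokens."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one token")
    total_edits = n_insert + n_delete + n_substitute + n_shift
    return 100.0 * total_edits / ref_len


# Values copied from the example record above.
mt_text = (
    "All'inizio il vestito era fortemente influenzato "
    "dalla cultura bizantina dell'est."
)
score = ter_from_counts(0.0, 1.0, 3.0, 0.0, mt_text)
print(score)  # 40.0
```

Under these assumptions, one deletion and three substitutions over ten reference tokens give 40.0, which agrees with the ter field of the example.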

Test splits

The three test splits (one per configuration) contain the same 120 entries each, following the same structure as train. Each test split omits some of the fields to prevent leakage of information:

In test_mask_subject the subject_id is absent, for the main task of post-editor stylometry.
In test_mask_modality the following fields are absent for the modality prediction extra task: modality, mt_text, n_insert, n_delete, n_substitute, n_shift, ter, bleu, chrf, aligned_edit.
In test_mask_time the following fields are absent for the time and pause prediction extra task: edit_time, n_pause_geq_300, len_pause_geq_300, n_pause_geq_1000, and len_pause_geq_1000.
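When holding out part of the train split for local validation, it can help to hide the same fields that the test configurations hide. The following is a minimal sketch under that assumption: MASKED_FIELDS and mask_record are hypothetical names, and records are assumed to be plain dictionaries keyed by the field names listed in Data Fields.

```python
# Fields hidden in each test configuration, as listed above.
MASKED_FIELDS = {
    "test_mask_subject": {"subject_id"},
    "test_mask_modality": {
        "modality", "mt_text", "n_insert", "n_delete", "n_substitute",
        "n_shift", "ter", "bleu", "chrf", "aligned_edit",
    },
    "test_mask_time": {
        "edit_time", "n_pause_geq_300", "len_pause_geq_300",
        "n_pause_geq_1000", "len_pause_geq_1000",
    },
}


def mask_record(record, config):
    """Return a copy of `record` without the fields hidden in `config`."""
    hidden = MASKED_FIELDS[config]
    return {key: value for key, value in record.items() if key not in hidden}


# Toy record with a subset of the real fields.
record = {"item_id": 1072, "subject_id": "t3", "modality": "pe2", "edit_time": 45.687}
print(mask_record(record, "test_mask_subject"))
```

Training and validating on records masked this way keeps a model from relying on fields that will be missing at test time.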

Dataset Creation

The dataset was parsed from PET XML files into CSV format using a script adapted from the one by Antonio Toral found at the following link: https://github.com/antot/postediting_novel_frontiers

Additional Information

Dataset Curators

For problems related to this 🤗 Datasets version, please contact us at ik-nlp-course@rug.nl.

Licensing Information

It is forbidden to share or publish the data associated with this 🤗 Dataset version.

Citation Information

No citation information is provided for this dataset.

Author:

GroNLP

Dataset size:

16.95 KB