Dataset Card for ChatGPT Dutch Simplification

Dataset Summary

Created for the master's thesis of Charlotte Van de Velde, written as part of the Master of Science in Artificial Intelligence at KU Leuven. Charlotte is supervised by Vincent Vandeghinste and Bram Vanroy. The dataset contains Dutch source sentences and aligned simplified sentences, both generated with ChatGPT. All splits combined, the dataset consists of 1267 entries.

Charlotte used gpt-3.5-turbo with the following prompt:

Schrijf een moeilijke zin, en daarna een simpele versie ervan. De simpele versie moet makkelijker zijn om te lezen en te begrijpen. Schrijf "Moeilijke zin: " aan het begin van de moeilijke zin, en "Simpele versie: " aan het begin van de simpele versie.

Parameters:

temperature=0.9
max_tokens=1000
top_p=1
frequency_penalty=0.1
presence_penalty=0
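
Because the prompt asks for the fixed markers "Moeilijke zin: " and "Simpele versie: ", each response can in principle be split into a source/target pair. The following is a minimal sketch of such a parser, assuming one pair per response. The function name `parse_pair` and the regular expression are illustrative assumptions and are not taken from the thesis code.

```python
import re

# Hypothetical pattern: the two markers come from the prompt above,
# everything else about the response layout is an assumption.
PAIR_PATTERN = re.compile(
    r"Moeilijke zin:\s*(?P<source>.+?)\s*Simpele versie:\s*(?P<target>.+)",
    re.DOTALL,
)


def parse_pair(response_text):
    """Return {"source": ..., "target": ...}, or None if a marker is missing."""
    match = PAIR_PATTERN.search(response_text)
    if match is None:
        return None
    return {
        "source": match.group("source").strip(),
        "target": match.group("target").strip(),
    }


# Example response, built from the data instance shown in this card.
response = (
    "Moeilijke zin: Het fenomeen van acquisitie van taalkennis vindt plaats "
    "door middel van het opdoen van ervaringen met de taal in diverse contexten.\n"
    "Simpele versie: Je leert een taal door de taal te gebruiken in "
    "verschillende situaties."
)
pair = parse_pair(response)
```

Responses that lack one of the markers return `None`, so they can be filtered out or inspected by hand rather than silently producing a broken pair.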

Bram Vanroy was not involved in the data collection; he only generated the data splits and provides the dataset as-is on this online platform. The splits were generated with a script.

Supported Tasks and Leaderboards

Intended for text2text generation, specifically text simplification.

Languages

Dutch

Dataset Structure

Data Instances

{
    "source": "Het fenomeen van acquisitie van taalkennis vindt plaats door middel van het opdoen van ervaringen met de taal in diverse contexten.",
    "target": "Je leert een taal door de taal te gebruiken in verschillende situaties."
}

Data Fields

source: the "more difficult" Dutch sentence
target: the simplified Dutch sentence

Data Splits

train: 1013
validation: 126
test: 128
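
The three split sizes add up to the 1267 entries mentioned in the summary and correspond to roughly 80/10/10 percent. The sketch below shows one way to cut a shuffled list into parts of exactly these sizes. It is not the script that was actually used: the helper `split_by_sizes` and the seed value are hypothetical, and the integers stand in for the real sentence pairs.

```python
import random


def split_by_sizes(items, sizes, seed=42):
    """Shuffle a copy of items and cut it into consecutive named parts."""
    items = list(items)
    if sum(sizes.values()) != len(items):
        raise ValueError("split sizes must add up to the number of items")
    random.Random(seed).shuffle(items)
    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = items[start:start + size]
        start += size
    return splits


# Sizes as reported in this card.
SPLIT_SIZES = {"train": 1013, "validation": 126, "test": 128}
total = sum(SPLIT_SIZES.values())

# Placeholder entries: indices 0..1266 instead of actual sentence pairs.
splits = split_by_sizes(range(total), SPLIT_SIZES)

# Share of each split as a percentage of all entries.
shares = {name: 100 * len(part) / total for name, part in splits.items()}
```

Using a fixed seed keeps the split reproducible, and slicing a single shuffled list guarantees that no entry ends up in more than one split.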

Disclaimer about data usage

This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

If you use this dataset, you must also follow OpenAI's Sharing and Usage policies.

As clearly stated in OpenAI's Terms of Use, specifically section 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". This means that you cannot use this dataset to build models that are intended to compete commercially with OpenAI. As far as I am aware, this is a specific restriction that should serve as an addendum to the current license.

Author:

BramVanroy

Dataset size:

325.2 KB