Dataset:
BramVanroy/chatgpt-dutch-simplification
Created in the context of a master's thesis by Charlotte Van de Velde as part of the Master of Science in Artificial Intelligence at KU Leuven. Charlotte is supervised by Vincent Vandeghinste and Bram Vanroy. The dataset contains Dutch source sentences and aligned simplified sentences, generated with ChatGPT. Across all splits, the dataset consists of 1267 entries.
Charlotte used gpt-3.5-turbo with the following prompt:
Schrijf een moeilijke zin, en daarna een simpele versie ervan. De simpele versie moet makkelijker zijn om te lezen en te begrijpen. Schrijf "Moeilijke zin: " aan het begin van de moeilijke zin, en "Simpele versie: " aan het begin van de simpele versie.
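Because the prompt asks the model to label both sentences, a response can be split back into a source/target pair by looking for those two labels. The following is only a minimal sketch of such a parser, not the script used for the thesis; the function name `parse_response` and the choice to return `None` on malformed output are assumptions made for illustration.

```python
import re

# Labels as requested in the prompt above.
HARD_LABEL = "Moeilijke zin:"
SIMPLE_LABEL = "Simpele versie:"

# Capture the text after each label; DOTALL lets a sentence span line breaks.
_PATTERN = re.compile(
    rf"{re.escape(HARD_LABEL)}\s*(?P<source>.+?)\s*"
    rf"{re.escape(SIMPLE_LABEL)}\s*(?P<target>.+)",
    flags=re.DOTALL,
)


def parse_response(text):
    """Hypothetical helper: split one model response into source and target.

    Returns None when the expected labels are missing or out of order.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None
    return {
        "source": match.group("source").strip(),
        "target": match.group("target").strip(),
    }


example = "Moeilijke zin: Dit is een complexe zin.\nSimpele versie: Dit is een zin."
pair = parse_response(example)
```

Responses that do not follow the requested format yield `None`, so they can be filtered out or inspected manually rather than silently producing a misaligned pair.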
Parameters:
Bram Vanroy was not involved in the data collection; he only generated the data splits and provides the dataset as-is on this online platform. Splits were generated with the following script.
Intended for text2text generation, specifically text simplification.
Dutch
{ "source": "Het fenomeen van acquisitie van taalkennis vindt plaats door middel van het opdoen van ervaringen met de taal in diverse contexten.", "target": "Je leert een taal door de taal te gebruiken in verschillende situaties." }
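For text2text training, each record's `source` field typically becomes the model input and `target` the expected output. The sketch below illustrates that mapping on the example record; the helper `to_text2text`, the output keys `input_text` and `target_text`, and the `"simplify: "` task prefix are all hypothetical choices, not part of the dataset.

```python
import json

# The example record shown above, as a JSON string.
record_json = (
    '{"source": "Het fenomeen van acquisitie van taalkennis vindt plaats door '
    'middel van het opdoen van ervaringen met de taal in diverse contexten.", '
    '"target": "Je leert een taal door de taal te gebruiken in verschillende situaties."}'
)
record = json.loads(record_json)


def to_text2text(record, prefix="simplify: "):
    """Hypothetical helper: map a dataset record to an input/output pair.

    The prefix is an assumed task marker; pass prefix="" to disable it.
    """
    return {
        "input_text": prefix + record["source"],
        "target_text": record["target"],
    }


pair = to_text2text(record)
```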
This text was generated (either in part or in full) with GPT-3 (gpt-3.5-turbo), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.
If you use this dataset, you must also follow the Sharing and Usage policies.
As clearly stated in their Terms of Use, specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware, that is a specific restriction that should serve as an addendum to the current license.