数据集:

BramVanroy/alpaca-cleaned-dutch

中文

Dataset Card for Alpaca Cleaned Dutch

Dataset Summary

This dataset contains 51,712 conversations between een AI assistant and a (fake) "Human" (generated) in Dutch. They are translations of Alpaca Cleaned Dataset .

Want to help me out? Translating the data with the OpenAI API, and prompt testing, cost me ?$57.99?. If you like this dataset, please consider buying me a coffee to offset a portion of this cost, I appreciate it a lot! ☕

Languages

  • Dutch

Dataset Structure

Data Instances

{
    'id': 7,
    'instruction': 'Leg uit waarom de volgende breuk gelijk is aan 1/4',
    'input': '4/16',
    'output': 'De breuk 4/16 is gelijk aan 1/4 omdat zowel de teller als de '
              'noemer deelbaar zijn door 4. Door zowel de teller als de noemer '
              'door 4 te delen, krijgen we de breuk 1/4.'
}

Data Fields

  • id : the ID of the item. The following ID is not included because they could not be translated: [23019]
  • instruction : the given instruction input : optional input to accompany the instruction. Can be empty.
  • output : the "answer" to the instruction

Dataset Creation

The instructions, inputs and outputs were translated with OpenAI's API for gpt-3.5-turbo . max_tokens=1024, temperature=0 as parameters.

The prompt template to translate is (where src_lang is English and tgt_lang is Dutch):

TRANSLATION_PROMPT = """You are asked to translate a task's instruction, optional input to the task, and the output of the task, from {src_lang} into {tgt_lang}.

Here are the requirements that you should adhere to:
1. maintain the format: the task consists of a task instruction (marked `instruction: `), optional input to the task (marked `input: `) and output for the task marked with `output: `;
2. do not translate the identifiers `instruction: `, `input: `, and `output: ` but instead copy them to your output;
3. make sure that text is fluent to read and does not contain grammatical errors. Use standard {tgt_lang} without regional bias;
4. translate the instruction and input text using informal, but standard, language;
5. make sure to avoid biases (such as gender bias, grammatical bias, social bias);
6. if the instruction is to correct grammar mistakes or spelling mistakes then you have to generate a similar mistake in the input in {tgt_lang}, and then also generate a corrected output version in the output in {tgt_lang};
7. if the instruction is to translate text from one language to another, then you do not translate the text that needs to be translated in the instruction or the input, nor the translation in the output (just copy them as-is);
8. do not translate code fragments but copy them to your output. If there are English examples, variable names or definitions in code fragments, keep them in English.

Now translate the following task with the requirements set out above. Do not provide an explanation and do not add anything else.\n\n"""

This prompt is concatenated with the instruction, optionally the input, and the output. In code, that last part looks like this:

text = f'instruction: "{instruction}"\n\n'
if inputstr:
    text += f'input: "{inputstr}"\n\n'
text += f'output: "{outputstr}"'

The system message was:

You are a helpful assistant that translates English to Dutch to the requirements that are given to you.

Note that 1 item (0.0001%) was not successfully translated. The translation was missing the input, instruction, or output keywords where those were expected. The ID for the missing item is [23019] .

Source Data

Initial Data Collection and Normalization

Initial data creation by Tatsu lab and cleaned by Yahma .

Who are the source language producers?

The original dataset was generated with OpenAI's text-davinci-003 .

Considerations for Using the Data

Note that the translations in this new dataset have not been verified by humans.

Discussion of Biases

As with any machine-generated texts, users should be aware of potential biases that are included in this dataset. Although the prompt specifically includes make sure to avoid biases (such as gender bias, grammatical bias, social bias) , of course the impact of such command is not known. It is likely that biases remain in the dataset so use with caution.

Other Known Limitations

The translation quality has not been verified. Use at your own risk!

Licensing Information

As per OpenAI's terms of use, this dataset cannot be used to build a commercial system that competes with OpenAI's services . Similar to the original Alpaca dataset, this dataset is released under CC NC 4.0.

This text was generated (either in part or in full) with GPT-3 ( gpt-3.5-turbo ), OpenAI’s large-scale language-generation model. Upon generating draft language, the author reviewed, edited, and revised the language to their own liking and takes ultimate responsibility for the content of this publication.

If you use this dataset, you must also follow the Sharing and Usage policies.

As clearly stated in their Terms of Use , specifically 2c.iii, "[you may not] use output from the Services to develop models that compete with OpenAI". That means that you cannot use this dataset to build models that are intended to commercially compete with OpenAI. As far as I am aware , that is a specific restriction that should serve as an addendum to the current license.

Citation Information

If you use this data set, please cite :

Vanroy, B. (2023). Alpaca Cleaned Dutch [Data set]. Hugging Face. https://doi.org/10.57967/HF/0530

@misc{https://doi.org/10.57967/hf/0530,
  doi = {10.57967/HF/0530},
  url = {https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch},
  author = {Vanroy, Bram},
  title = {{A}lpaca {C}leaned {D}utch},
  publisher = {Hugging Face},
  year = {2023}
}

Contributions

Thanks to Tatsu lab for the initial machine-generated dataset and yahma for cleaning it .