Dataset:
argilla/databricks-dolly-15k-curated-multilingual
License:
cc-by-sa-3.0
Size:
10K<n<100K
A curated and multilingual version of the Databricks Dolly instructions dataset. It includes a programmatically and manually corrected version of the original English (en) dataset. See below.
STATUS:
Currently, the original Dolly v2 English version has been curated by combining automatic processing with collaborative human curation using Argilla (~400 records have been manually edited and fixed). The following graph shows a summary of the number of edited fields.
This dataset collection is a curated and machine-translated version of the databricks-dolly-15k dataset originally created by Databricks, Inc. in 2023.
The goal is to give practitioners a starting point for training open-source instruction-following models with better-quality English data and translated data beyond English. However, as the translation quality will not be perfect, we highly recommend dedicating time to curating and fixing translation issues. Below we explain how to load the datasets into Argilla for data curation and fixing. Additionally, we'll be improving the datasets made available here with the help of different communities.
The main issues (many more likely remain) are the following:
We programmatically identified records with these potential issues and ran a campaign to fix them. As a result, more than 400 records have been adapted. See below for statistics:
As a result of this curation process, the content of the fields has been reduced, as measured in number of tokens, especially for the responses:
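A reduction like the one described above can be measured by comparing average token counts per field before and after curation. The following is a minimal sketch, not the procedure actually used for this card: it assumes the standard Dolly field names (`instruction`, `context`, `response`), uses whitespace splitting as a stand-in for the unspecified tokenizer, and runs on two made-up records.

```python
from statistics import mean

# Assumed field names, taken from the original Dolly schema.
FIELDS = ("instruction", "context", "response")


def count_tokens(text):
    """Whitespace token count; a stand-in for whichever tokenizer was really used."""
    return len(text.split())


def mean_tokens_per_field(records, fields=FIELDS):
    """Average token count of each field across a list of record dicts."""
    return {
        field: mean(count_tokens(record.get(field) or "") for record in records)
        for field in fields
    }


# Hypothetical records, invented only to illustrate the comparison.
original = [{
    "instruction": "Name three colors . ",
    "context": "",
    "response": "Red , green and blue are three colors .",
}]
curated = [{
    "instruction": "Name three colors.",
    "context": "",
    "response": "Red, green and blue.",
}]

before = mean_tokens_per_field(original)
after = mean_tokens_per_field(curated)

# Positive values mean the curated field is shorter on average.
reduction = {field: before[field] - after[field] for field in FIELDS}
print(reduction)
```

On real data, `original` and `curated` would be the corresponding splits of the two datasets, and a model-specific tokenizer would give counts closer to what a training run sees.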
If you want to browse and curate your dataset with Argilla, you can:
There's one split per language:

    from datasets import load_dataset

    # load all splits
    load_dataset("argilla/databricks-dolly-15k-curated-multilingual")

    # load only the Spanish split
    load_dataset("argilla/databricks-dolly-15k-curated-multilingual", split="es")
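Once a split is loaded, each record can be turned into a single training string for an instruction-following model. The sketch below is illustrative only: the field names (`instruction`, `context`, `response`) are assumed from the original Dolly schema, and the `###` prompt template is a hypothetical choice, not something prescribed by this dataset.

```python
def format_prompt(record):
    """Build one training string from a Dolly-style record.

    Assumes the keys "instruction", "context" and "response";
    the section markers are a hypothetical template.
    """
    parts = ["### Instruction:\n" + record["instruction"].strip()]

    # Context is optional in Dolly-style data, so skip the section when empty.
    context = (record.get("context") or "").strip()
    if context:
        parts.append("### Context:\n" + context)

    parts.append("### Response:\n" + record["response"].strip())
    return "\n\n".join(parts)


# Hypothetical record, invented for illustration.
example = {
    "instruction": "Summarize the text.",
    "context": "Argilla is a data curation tool.",
    "response": "Argilla helps curate data.",
}
print(format_prompt(example))
```

With the `datasets` library, the same function could be applied to every row of a loaded split through its `map` method.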
As described in the README of the original dataset, this dataset can be used for:
Currently: es, fr, de, en
Join the Argilla Slack community if you want to help us include other languages.
There's one split per language:

    from datasets import load_dataset

    # load all splits
    load_dataset("argilla/databricks-dolly-15k-multilingual")

    # load only the Spanish split
    load_dataset("argilla/databricks-dolly-15k-multilingual", split="es")
These datasets were translated from the original English dataset using the DeepL API between the 13th and 14th of April.
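A field-by-field translation pass of this kind can be sketched as follows. This is not the script used to produce the datasets: the field names are assumed from the original Dolly schema, and `translate` is a hypothetical callable injected by the caller so that the example runs without network access or an API key.

```python
# Assumed text fields, taken from the original Dolly schema.
TEXT_FIELDS = ("instruction", "context", "response")


def translate_record(record, translate, target_lang):
    """Return a copy of `record` with its text fields translated.

    `translate` is any callable (text, target_lang) -> str. Empty fields
    are left untouched so that no requests are spent on blank strings.
    """
    translated = dict(record)
    for field in TEXT_FIELDS:
        text = record.get(field) or ""
        if text.strip():
            translated[field] = translate(text, target_lang)
    return translated


def fake_translate(text, target_lang):
    """Stand-in translator for offline use; it only tags the text."""
    return f"[{target_lang}] {text}"


# Hypothetical record, invented for illustration.
record = {
    "instruction": "Name a color.",
    "context": "",
    "response": "Blue.",
    "category": "open_qa",
}
print(translate_record(record, fake_translate, "ES"))
```

To call DeepL for real, `fake_translate` could be swapped for a thin wrapper around the official `deepl` Python package, for example `deepl.Translator(auth_key).translate_text(text, target_lang=target_lang).text`, assuming a valid API key. Non-text fields such as `category` pass through unchanged.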
Refer to the original dataset for more information.
Who are the source language producers?
[More Information Needed]
Annotations are planned but not performed yet.
Annotation process
[More Information Needed]
Who are the annotators?
[More Information Needed]
This dataset can be used for any purpose, whether academic or commercial, under the terms of the Creative Commons Attribution-ShareAlike 3.0 Unported License.
Original dataset Owner: Databricks, Inc.