Dataset:
LennardZuendorf/interpretor
This is an edited version of the original work from this paper, which I have uploaded to Hugging Face here. It is not my original work; I only edited it. The data is used in the similarly named Interpretor-Model, which is part of my bachelor's thesis.
Refer to the Hugging Face or GitHub repo for more information.
This dataset contains dynamically generated hate speech, already split into training (90%) and testing (10%) sets. I intended it to be used for classification tasks like this model.
The only language represented is English.
Each entry, in both the train and test splits, looks like this:
{ 'id': ..., 'text': , '' }
Please cite this repository and the original authors when using it.
I removed some data fields and created a new split with the Hugging Face Datasets library.