I-Dataset Summary

english_historical_quotes is a dataset of historical quotes, each stored with its author and a set of topic tags. It can be used for multi-label text classification and text generation. All quotes are in English.

II-Supported Tasks and Leaderboards

Multi-label text classification : The dataset can be used to train a text-classification model that classifies quotes by author as well as by topic (using tags). Success on this task is typically measured by accuracy, where higher is better.
Text generation : The dataset can be used to train a model to generate quotes by fine-tuning an existing pretrained model on the corpus of all quotes (or on the quotes of a single author).
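For the multi-label task, each quote's tags usually have to be turned into a fixed-length 0/1 vector before training. The following is a minimal sketch of that encoding using only the standard library. The function names and the example tag lists are hypothetical and are not part of the dataset or of any library.

```python
from typing import Dict, List


def build_tag_index(tag_lists: List[List[str]]) -> Dict[str, int]:
    """Map every distinct tag to a column index, in sorted order."""
    vocabulary = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: position for position, tag in enumerate(vocabulary)}


def multi_hot(tags: List[str], tag_index: Dict[str, int]) -> List[int]:
    """Encode one quote's tags as a 0/1 vector over the tag vocabulary."""
    vector = [0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:  # skip tags that were not seen when indexing
            vector[tag_index[tag]] = 1
    return vector


# Hypothetical tag lists for illustration, not taken from the dataset.
example_tags = [["business", "money"], ["money"], ["life", "business"]]
tag_index = build_tag_index(example_tags)
label_matrix = [multi_hot(tags, tag_index) for tags in example_tags]
```

Each row of label_matrix can then serve as the target of a multi-label classifier, with one column per tag.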

III-Languages

The texts in the dataset are in English (en).

IV-Dataset Structure

Data Instances

A JSON-formatted example of a typical instance in the dataset:

{"quote":"Almost anyone can be an author the business is to collect money and fame from this state of being.", "author":"A. A. Milne", "categories": "['business', 'money']" }
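In the instance above, the topics are stored as a string that looks like a Python list rather than as a JSON array. Assuming all records follow this shape (which this card does not confirm), a sketch like the following could convert that string back into a list. The helper name parse_categories is hypothetical.

```python
import ast
import json
from typing import List


def parse_categories(raw: str) -> List[str]:
    """Safely turn a string such as "['business', 'money']" into a list of strings."""
    value = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        raise ValueError(f"expected a list of strings, got: {raw!r}")
    return value


# The example instance from this card, as one line of JSON text.
line = """{"quote":"Almost anyone can be an author the business is to collect money and fame from this state of being.", "author":"A. A. Milne", "categories": "['business', 'money']" }"""

record = json.loads(line)
topics = parse_categories(record["categories"])
```

ast.literal_eval is used instead of eval so that a malformed or malicious string cannot execute code.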

Data Fields

author : The author of the quote.
quote : The text of the quote.
tags : The topics associated with the quote (shown under the `categories` key in the example instance above).

Data Splits

The dataset is provided as a single block with no predefined splits, so it can be divided as needed with Hugging Face datasets functions such as the `.train_test_split()` method.
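A minimal sketch of creating a train/test split with the Hugging Face datasets library is shown below. The repository id "A-Roucher/english_historical_quotes" is an assumption pieced together from the dataset name and the author handle on this card, and the split ratio and seed are arbitrary illustrative values.

```python
def load_and_split(
    repo_id: str = "A-Roucher/english_historical_quotes",  # assumed repository id
    test_size: float = 0.2,  # arbitrary choice: hold out 20% of the quotes
    seed: int = 42,  # fixed seed so the split is reproducible
):
    """Download the dataset and return a DatasetDict with "train" and "test" splits."""
    # Imported here so that defining this function does not require the library.
    from datasets import load_dataset

    # A dataset without predefined splits is exposed under the "train" split.
    quotes = load_dataset(repo_id, split="train")
    return quotes.train_test_split(test_size=test_size, seed=seed)
```

Calling load_and_split() requires network access and the datasets package. The returned object can be indexed as splits["train"] and splits["test"].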

V-Dataset Creation

Curation Rationale

The goal is to share good datasets with the Hugging Face community so that they can be used in NLP tasks and help advance artificial intelligence.

Source Data

The data was aggregated from various open-access internet archives. I then refined it manually, removing duplicates and false quotes.

It is the backbone of my website dixit.app, which lets users find historical quotes through semantic search.

VI-Additional Information

Dataset Curators

Aymeric Roucher

Licensing Information

This work is licensed under the MIT License.

Author:

A-Roucher

Dataset size:

4.24 MB