Dataset:

Abirate/french_book_reviews

Task:

Text classification

Subtask:

multi-label-classification

Language:

fr

Multilinguality:

monolingual

Language creators:

expert-generated crowdsourced

Annotation creators:

expert-generated

Source datasets:

original

Dataset Card for French book reviews

I-Dataset Summary

The majority of review datasets are in English. There are datasets in other languages, but not many. Through this work, I would like to enrich the datasets available in French (my mother tongue, along with Arabic). The data was retrieved from two French websites: Babelio and Critiques Libres. Like Wikipedia, these two sites are made possible by the contributions of volunteers who use the Internet to share their knowledge and reading experiences. French book reviews is a large collection of reader reviews of French books, and it will be constantly updated over time.

II-Supported Tasks and Leaderboards

Multi-label text classification: The dataset can be used to train a text-classification model that assigns each review its label value (positive, neutral, or negative). Success on this task is typically measured by accuracy, where higher is better.
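
As a minimal sketch of the metric, accuracy is the fraction of reviews whose predicted label matches the reference label. The function name accuracy and the sample values below are illustrative assumptions and are not taken from the dataset.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions that equal their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Illustrative labels only: 1 = positive, 0 = neutral, -1 = negative.
y_true = [1, 0, -1, 1]
y_pred = [1, 0, 1, 1]
print(accuracy(y_pred, y_true))  # 3 of 4 predictions match, so 0.75
```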

III-Languages

The texts in the dataset are in French (fr).

IV-Dataset Structure

Data Instances

A JSON-formatted example of a typical instance in the dataset:

{
    "book_title": "La belle histoire des maths",
    "author": "Michel Rousselet",
    "reader_review": "C’est un livre impressionnant, qui inspire le respect par la qualité de sa reliure et son contenu. Je le feuillette et je découvre à chaque tour de page un thème distinct magnifiquement illustré. Très beau livre !",
    "rating": 4.0,
    "label": 1
}

Data Fields

book_title: The title of the book that received the reader's review.
author: The author of the book that received the reader's review.
reader_review: The text of the reader's review.
rating: The reader's rating of the book on a five-star scale.
label: A post-processed field, derived from the rating field, indicating whether the review is positive (1), neutral (0), or negative (-1). For more details, see the Notebook of the Dataset creation.
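
Some training libraries expect class ids that start at 0, while the label field uses -1, 0 and 1. The sketch below shows one way to attach readable names to the labels and shift them to non-negative ids. The names LABEL_NAMES and to_class_id are hypothetical helpers, not part of the dataset, and the rule that derives label from rating is not reproduced here because it is only documented in the notebook.

```python
# Label values as described in the Data Fields list.
LABEL_NAMES = {1: "positive", 0: "neutral", -1: "negative"}


def to_class_id(label):
    """Shift a label from {-1, 0, 1} to a class id in {0, 1, 2}."""
    if label not in LABEL_NAMES:
        raise ValueError(f"unexpected label: {label!r}")
    return label + 1


# Fields copied from the Data Instances example (review text omitted).
instance = {"book_title": "La belle histoire des maths", "rating": 4.0, "label": 1}
print(LABEL_NAMES[instance["label"]], to_class_id(instance["label"]))  # positive 2
```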

Data Splits

I kept the dataset as a single split (train), so users can shuffle and split it later with methods from the Hugging Face datasets library, such as the .train_test_split() method.
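
A minimal sketch of such a split, assuming the datasets library is installed and the Hugging Face Hub is reachable. load_dataset and train_test_split are existing calls in that library; the helper names split_reviews and load_and_split, the 80/20 ratio and the seed are illustrative choices, not part of the dataset.

```python
def split_reviews(train_dataset, test_size=0.2, seed=42):
    """Shuffle a datasets.Dataset and split it into train and test parts."""
    # test_size and seed are assumed values; adjust them to your needs.
    return train_dataset.train_test_split(test_size=test_size, seed=seed)


def load_and_split():
    """Download the single 'train' split from the Hub and split it 80/20."""
    # Imported here so the module loads even without `pip install datasets`.
    from datasets import load_dataset

    reviews = load_dataset("Abirate/french_book_reviews", split="train")
    return split_reviews(reviews)
```

Calling load_and_split() returns a DatasetDict with train and test keys. Fixing the seed keeps the split reproducible across runs.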

V-Dataset Creation

Curation Rationale

The majority of review datasets are in English. There are datasets in other languages, but not many. Through this work, I would like to enrich the datasets available in French (my mother tongue, along with Arabic) and make a small contribution to advancing data science and AI, not only for English NLP tasks but also for other languages around the world.

French is an international language and it is gaining ground. In addition, it is the language of love. The richness of the French language, so appreciated around the world, is largely related to the richness of its culture. The most telling example is French literature, which has many world-famous writers, such as Gustave Flaubert, Albert Camus, Victor Hugo, Molière, Simone de Beauvoir and Antoine de Saint-Exupéry, the author of "Le Petit Prince" (The Little Prince), which is still among the most translated books in literary history. One of the world-famous quotes from this book is: "Voici mon secret. Il est très simple: on ne voit bien qu'avec le coeur. L'essentiel est invisible pour les yeux."

Source Data

The data comes from two French websites: Babelio and Critiques Libres.

Initial Data Collection and Normalization

The data was collected by web scraping (with the Scrapy framework) and then subjected to additional data processing. For more details, see the notebook that documents the dataset creation process: Notebook of the Dataset creation.

Note: This dataset will be constantly updated to include the most recent reviews of French books. Each update is aggregated with the earlier data, so the dataset grows over time.

Who are the source data producers?

I created the dataset by web scraping: I built a spider and a crawler to scrape the two French websites Babelio and Critiques Libres.

Annotations

Annotations are part of the initial data collection (see the dataset creation notebook referenced above).

VI-Additional Information

Dataset Curators

Abir ELTAIEF

Licensing Information

This work is licensed under CC0: Public Domain

Contributions

Thanks to @Abirate for creating and adding this dataset.

Author:

Abirate

Dataset size:

4.02 MB