Dataset:
Abirate/english_quotes
english_quotes is a dataset of all the quotes retrieved from Goodreads Quotes. This dataset can be used for multi-label text classification and text generation. The content of each quote is in English, and the dataset is intended for NLP tasks and beyond.
The texts in the dataset are in English (en).
A JSON-formatted example of a typical instance in the dataset:
```json
{
  "author": "Ralph Waldo Emerson",
  "quote": "“To be yourself in a world that is constantly trying to make you something else is the greatest accomplishment.”",
  "tags": ["accomplishment", "be-yourself", "conformity", "individuality"]
}
```

Data Fields
- author: the author of the quote (string)
- quote: the text of the quote (string)
- tags: a list of tags categorizing the topics of the quote (list of strings)
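Since each quote carries a list of tags, the tags field can serve directly as targets for the multi-label classification use case mentioned above. A minimal sketch, assuming scikit-learn is available (the choice of MultiLabelBinarizer is mine for illustration, not part of this card):

```python
from datasets import load_dataset
from sklearn.preprocessing import MultiLabelBinarizer

# Load the dataset from the Hugging Face Hub (it ships as a single "train" split).
dataset = load_dataset("Abirate/english_quotes")["train"]

# Turn each quote's list of tags into a binary label vector
# suitable for multi-label text classification.
mlb = MultiLabelBinarizer()
labels = mlb.fit_transform(dataset["tags"])

print(labels.shape)      # (num_quotes, num_unique_tags)
print(mlb.classes_[:5])  # first few tag names, e.g. 'accomplishment'
```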
Data Splits
I kept the dataset as one block (train), so users can shuffle and split it later with methods from the Hugging Face `datasets` library, such as the `.train_test_split()` method.
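For instance, a split can be created on the fly (the 10% test size and the seed are arbitrary values chosen for illustration):

```python
from datasets import load_dataset

# The dataset ships as a single "train" split.
dataset = load_dataset("Abirate/english_quotes")

# Create a train/test split on the fly; adjust test_size and seed as needed.
splits = dataset["train"].train_test_split(test_size=0.1, seed=42)
print(splits)  # DatasetDict with "train" and "test" splits
```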
Curation Rationale
I want to share my datasets (created by web scraping and additional cleaning treatments) with the Hugging Face community so that they can be used in NLP tasks to advance artificial intelligence.
Source Data
The data comes from the Goodreads site, specifically the Goodreads Quotes pages (https://www.goodreads.com/quotes).
Initial Data Collection and Normalization
The data was collected by web scraping with the BeautifulSoup and Requests libraries. It was then lightly cleaned after scraping: all quotes with "None" tags were removed, and the tag "attributed-no-source" was stripped from all tag lists because it adds no information about the topic of a quote.
Who are the source data producers?
The data is machine-generated (via web scraping) and then subjected to additional human curation.
Below is the script I created to scrape the data (including my additional cleaning steps):
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd
from collections import OrderedDict

# Retrieve the information about a single HTML quote block as a dictionary.
def extract_data_quote(quote_html):
    quote = quote_html.find('div', {'class': 'quoteText'}).get_text().strip().split('\n')[0]
    author = quote_html.find('span', {'class': 'authorOrTitle'}).get_text().strip()
    tags_div = quote_html.find('div', {'class': 'greyText smallText left'})
    if tags_div is not None:
        tags_list = [tag.get_text() for tag in tags_div.find_all('a')]
        # Deduplicate while preserving order.
        tags = list(OrderedDict.fromkeys(tags_list))
        # This tag adds no value to the topic of the quote, so drop it.
        if 'attributed-no-source' in tags:
            tags.remove('attributed-no-source')
    else:
        tags = None
    return {'quote': quote, 'author': author, 'tags': tags}

# Retrieve all the quotes on a single page.
def get_quotes_data(page_url):
    page = requests.get(page_url)
    if page.status_code == 200:
        page_parsed = BeautifulSoup(page.content, 'html5lib')
        quotes_html_page = page_parsed.find_all('div', {'class': 'quoteDetails'})
        return [extract_data_quote(quote_html) for quote_html in quotes_html_page]
    return None

# Retrieve data from the first page (fall back to an empty list if the request fails).
data = get_quotes_data('https://www.goodreads.com/quotes') or []

# Retrieve data from the remaining pages.
for i in range(2, 101):
    print(i)
    url = f'https://www.goodreads.com/quotes?page={i}'
    data_current_page = get_quotes_data(url)
    if data_current_page is None:
        continue
    data = data + data_current_page

data_df = pd.DataFrame.from_dict(data)

# Remove all quotes with "None" tags.
for i, row in data_df.iterrows():
    if row['tags'] is None:
        data_df = data_df.drop(i)

# Write the data out as JSON Lines.
data_df.to_json('C:/Users/Abir/Desktop/quotes.jsonl', orient='records', lines=True, force_ascii=False)
# Then I used the familiar process to push it to the Hugging Face hub.
```

Annotations
Annotations are part of the initial data collection (see the script above).
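As a quick sanity check, the JSON Lines file produced by the script above can be loaded back, for example with the `datasets` JSON loader (the local path here is an assumption; adjust it to wherever the script wrote the file):

```python
from datasets import load_dataset

# Load the JSON Lines file written by the scraping script above.
# The path is hypothetical; point it at your own copy of quotes.jsonl.
quotes = load_dataset("json", data_files="quotes.jsonl")["train"]
print(quotes[0])  # {'quote': ..., 'author': ..., 'tags': [...]}
```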
Dataset Curators
Abir ELTAIEF
Licensing Information
This work is licensed under a Creative Commons Attribution 4.0 International License (all software and libraries used for web scraping are made available under this Creative Commons Attribution license).
Contributions
Thanks to @Abirate for adding this dataset.