Dataset:
blbooksgenre
This dataset consists of metadata relating to books digitised by the British Library in partnership with Microsoft. Some of this metadata was exported from the British Library catalogue, whilst the rest was generated as part of a crowdsourcing project. The text of these books and other metadata can be found on the data.bl.uk website.
The majority of the books in this collection were published in the 18th and 19th Century but the collection also includes a smaller number of books from earlier periods. Items within this collection cover a wide range of subject areas including geography, philosophy, history, poetry and literature and are published in a variety of languages.
For the subset of the data which contains additional crowdsourced annotations, the breakdown by date of publication is as follows:
Date of publication | Count |
---|---|
1630 | 8 |
1690 | 4 |
1760 | 10 |
1770 | 5 |
1780 | 5 |
1790 | 18 |
1800 | 45 |
1810 | 96 |
1820 | 152 |
1830 | 182 |
1840 | 259 |
1850 | 400 |
1860 | 377 |
1870 | 548 |
1880 | 776 |
1890 | 1484 |
1900 | 17 |
1910 | 1 |
1970 | 1 |
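The table above appears to bin publication dates by decade. As a minimal sketch under that assumption, the snippet below shows how a year maps to its decade and how the decade counts roll up by century. The helper names `decade_of` and `counts_by_century` are hypothetical, and the counts are copied from the table rather than recomputed from the dataset.

```python
from collections import Counter

# Counts per decade, copied from the table above.
DECADE_COUNTS = {
    1630: 8, 1690: 4, 1760: 10, 1770: 5, 1780: 5, 1790: 18,
    1800: 45, 1810: 96, 1820: 152, 1830: 182, 1840: 259,
    1850: 400, 1860: 377, 1870: 548, 1880: 776, 1890: 1484,
    1900: 17, 1910: 1, 1970: 1,
}


def decade_of(year: int) -> int:
    """Hypothetical helper: map a year such as 1879 to its decade (1870)."""
    return year // 10 * 10


def counts_by_century(decade_counts: dict) -> Counter:
    """Aggregate decade counts into totals keyed by century start (e.g. 1800)."""
    totals = Counter()
    for decade, count in decade_counts.items():
        totals[decade // 100 * 100] += count
    return totals


century_totals = counts_by_century(DECADE_COUNTS)
total = sum(DECADE_COUNTS.values())

# Print each century (1800 means 1800-1899) with its share of annotated records.
for century, count in sorted(century_totals.items()):
    print(f"{century}s: {count} ({count / total:.1%})")
```

Summing the table this way makes the skew towards the 19th century explicit, which is worth keeping in mind when reading the remarks on representativeness later in this card.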
The digitised books collection which this dataset describes has been used in a variety of digital history and humanities projects since being published.
This dataset is suitable for a variety of unsupervised tasks and for a 'genre classification task'.
Supervised tasks
The main use case for this dataset is to develop and evaluate 'genre classification' models. The dataset includes human-generated labels for whether a book is 'fiction' or 'non-fiction'. These labels have been used to train genre classification models which predict whether a book is 'fiction' or 'non-fiction' based on its title.
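Since the classification described above works from the title alone, one possible first step is to turn a title into word counts. The sketch below is only an illustration and is not the method used to train the models mentioned above; the function name `title_features` is hypothetical, and the example title is taken from the data instance shown in this card.

```python
import re
from collections import Counter


def title_features(title: str) -> Counter:
    """Hypothetical helper: lowercase a title and count its word tokens.

    The pattern \\w+ matches runs of Unicode word characters, so titles in
    languages other than English are tokenised as well.
    """
    tokens = re.findall(r"\w+", title.lower())
    return Counter(tokens)


EXAMPLE_TITLE = (
    "The Canadian farmer. A missionary incident "
    "[Signed: W. J. H. Y, i.e. William J. H. Yates.]"
)

features = title_features(EXAMPLE_TITLE)
print(features.most_common(5))
```

Counts like these could be fed to any standard text classifier. Note that catalogue conventions such as the bracketed "[Signed: ...]" statement end up in the features, so a model may learn cataloguing habits as well as genre cues.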
The dataset currently has three configurations, intended to support the range of tasks for which it could be used. Example instances from two of them follow.
An example data instance from the title_genre_classifiction config:
{'BL record ID': '014603046', 'title': 'The Canadian farmer. A missionary incident [Signed: W. J. H. Y, i.e. William J. H. Yates.]', 'label': 0}
An example data instance from the annotated_raw config:
{'BL record ID': '014603046', 'Name': 'Yates, William Joseph H.', 'Dates associated with name': '', 'Type of name': 'person', 'Role': '', 'All names': ['Yates, William Joseph H. [person] ', ' Y, W. J. H. [person]'], 'Title': 'The Canadian farmer. A missionary incident [Signed: W. J. H. Y, i.e. William J. H. Yates.]', 'Variant titles': '', 'Series title': '', 'Number within series': '', 'Country of publication': ['England'], 'Place of publication': ['London'], 'Publisher': '', 'Date of publication': '1879', 'Edition': '', 'Physical description': 'pages not numbered, 21 cm', 'Dewey classification': '', 'BL shelfmark': 'Digital Store 11601.f.36. (1.)', 'Topics': '', 'Genre': '', 'Languages': ['English'], 'Notes': 'In verse', 'BL record ID for physical resource': '004079262', 'classification_id': '267476823.0', 'user_id': '15.0', 'subject_ids': '44369003.0', 'annotator_date_pub': '1879', 'annotator_normalised_date_pub': '1879', 'annotator_edition_statement': 'NONE', 'annotator_FAST_genre_terms': '655 7 ‡aPoetry‡2fast‡0(OCoLC)fst01423828', 'annotator_FAST_subject_terms': '60007 ‡aAlice,‡cGrand Duchess, consort of Ludwig IV, Grand Duke of Hesse-Darmstadt,‡d1843-1878‡2fast‡0(OCoLC)fst00093827', 'annotator_comments': '', 'annotator_main_language': '', 'annotator_other_languages_summaries': 'No', 'annotator_summaries_language': '', 'annotator_translation': 'No', 'annotator_original_language': '', 'annotator_publisher': 'NONE', 'annotator_place_pub': 'London', 'annotator_country': 'enk', 'annotator_title': 'The Canadian farmer. A missionary incident [Signed: W. J. H. Y, i.e. William J. H. Yates.]', 'Link to digitised book': 'http://access.bl.uk/item/viewer/ark:/81055/vdc_00000002842E', 'annotated': True, 'Type of resource': 0, 'created_at': datetime.datetime(2020, 8, 11, 14, 30, 33), 'annotator_genre': 0}
The data fields differ slightly between configs. All possible fields for the annotated_raw config are listed below. For the raw version of the dataset, datatypes are usually string, to avoid errors when processing missing values.
The following fields are all generated via the crowdsourcing task (discussed in more detail below)
Finally, the label field of the title_genre_classifiction configuration is a class label with values 0 (Fiction) or 1 (Non-fiction).
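Because most raw fields are strings and missing values appear as empty strings, some conversion is usually needed before analysis. The following is a minimal sketch of that step together with decoding the class label. The helper names `parse_year` and `label_name` are hypothetical, and the record is a small subset of the example instances shown above.

```python
from typing import Optional

# Label values as described above for the title_genre_classifiction config.
LABEL_NAMES = {0: "Fiction", 1: "Non-fiction"}


def parse_year(value: str) -> Optional[int]:
    """Hypothetical helper: return the year as an int, or None if the string
    is empty or not a plain integer (missing values are empty strings)."""
    try:
        return int(value.strip())
    except ValueError:
        return None


def label_name(label: int) -> str:
    """Hypothetical helper: map a class label (0 or 1) to its name."""
    return LABEL_NAMES[label]


# A few fields copied from the example instances above.
record = {
    "BL record ID": "014603046",
    "Date of publication": "1879",
    "Edition": "",
    "label": 0,
}

year = parse_year(record["Date of publication"])   # 1879
edition_year = parse_year(record["Edition"])       # None, the field is empty
genre = label_name(record["label"])                # "Fiction"
print(record["BL record ID"], year, edition_year, genre)
```

Identifiers such as 'BL record ID' are best left as strings: the example value starts with a zero, which a conversion to int would drop.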
This dataset contains a single split, train.
Note this section is a work in progress.
The books in this collection were digitised as part of a partnership project between the British Library and Microsoft. Mass digitisation, i.e. projects whose goal is to digitise large volumes of material quickly, shapes the selection of materials in a number of ways. Some considerations which are often involved in deciding whether to include items for digitisation include (but are not limited to):
These criteria can have knock-on effects on the makeup of a collection. For example, systematically excluding large books may result in some types of book content not being digitised. The size of a volume is likely to be correlated with its content to at least some extent, so excluding large volumes from digitisation will mean that this material is under-represented. Similarly, copyright status is often (but not only) determined by publication date. This can lead to a rapid fall in the number of items in a collection after a certain cut-off date.
All of the above is largely to make clear that this collection was not curated with the aim of creating a representative sample of the British Library's holdings. Some material will be over-represented and other material under-represented. Similarly, the collection should not be considered a representative sample of what was published across the time period covered by the dataset, nor should the relative proportions of the data for each time period be taken as a proportional sample of publications from that period.
The original source data (physical items) includes a variety of resources (predominantly monographs) held by the British Library. The British Library is a Legal Deposit library. "Legal deposit requires publishers to provide a copy of every work they publish in the UK to the British Library. It's existed in English law since 1662." (source)
Initial Data Collection and Normalization
This version of the dataset was created partially from data exported from British Library catalogue records and partially via data generated from a crowdsourcing task involving British Library staff.
Who are the source language producers?
[More Information Needed]
The data includes metadata associated with the books; this metadata was produced by British Library staff. The additional annotations were carried out during 2020 as part of an internal crowdsourcing task.
Annotation process
New annotations were produced via a crowdsourcing task. Annotators had the option to pick titles from a particular language subset of the broader digitised 19th century books collection. As a result, the annotations are not a random sample and over-represent some languages.
Who are the annotators?
Staff working at the British Library. Most of these staff work with metadata as part of their jobs and so could be considered expert annotators.
There are a range of considerations around using the data. These include the representativeness of the dataset and its bias towards particular languages.
It is also important to note that library metadata is not static. The metadata held in library catalogues is updated and changed over time for a variety of reasons.
The way in which different institutions catalogue items also varies. As a result, it is important to evaluate the performance of any models trained on this data before applying them to a new collection.
The text in this collection is derived from historic text. As a result, the text will reflect the social beliefs and attitudes of its time period. The titles of the books give some sense of their content. Examples of book titles which appear in the data (these are randomly sampled from all titles):
Whilst using titles alone is obviously insufficient to interrogate bias in this collection, it gives some insight into the topics covered by books in the corpus. Looking further into the titles highlights some particular types of bias we might find in the collection. This should in no way be considered an exhaustive list.
Colonialism
We can see examples of colonial attitudes even in the above random sample of titles. We can try to interrogate this further by searching for the names of countries which were part of the British Empire at the time many of these books were published.
Searching for the string India in the titles and randomly sampling 10 titles returns:
Searching for the string Africa in the titles and randomly sampling 10 titles returns:
Whilst this dataset doesn't include the underlying text, it is important to consider the potential attitudes represented in the titles of the books, or in the full text if you are using this dataset in conjunction with it.
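The title searches described above (filtering titles by a string and randomly sampling up to 10 matches) could be reproduced along the following lines. This is a sketch under assumptions: the function name `sample_titles_containing` and its parameters are hypothetical, and the single-element `titles` list stands in for the full title column of the dataset.

```python
import random
from typing import List, Optional


def sample_titles_containing(
    titles: List[str], term: str, k: int = 10, seed: Optional[int] = None
) -> List[str]:
    """Hypothetical helper: return up to k randomly chosen titles that
    contain `term` as a case-sensitive substring."""
    matches = [title for title in titles if term in title]
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(matches, min(k, len(matches)))


# Stand-in for the title column; in practice this would hold every title.
titles = [
    "The Canadian farmer. A missionary incident "
    "[Signed: W. J. H. Y, i.e. William J. H. Yates.]",
]

canadian_titles = sample_titles_containing(titles, "Canadian", seed=0)
india_titles = sample_titles_containing(titles, "India", seed=0)
print(canadian_titles)
print(india_titles)
```

A plain substring match is deliberately broad: searching for India also returns titles containing Indian or Indiana, so the sampled titles should be read with that in mind.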
The books are licensed under the CC Public Domain Mark 1.0 license.
@misc{british library_genre,
  title={19th Century Books - metadata with additional crowdsourced annotations},
  url={https://doi.org/10.23636/BKHQ-0312},
  author={{British Library} and Morris, Victoria and van Strien, Daniel and Tolfo, Giorgia and Afric, Lora and Robertson, Stewart and Tiney, Patricia and Dogterom, Annelies and Wollner, Ildi},
  year={2021}
}
Thanks to @davanstrien for adding this dataset.