Dataset:
biglam/cultural_heritage_metadata_accuracy
Tasks:
Text Classification
Languages:
it
Multilinguality:
monolingual
Size:
10K<n<100K
Language creators:
expert-generated
Source datasets:
original
License:
cc-by-4.0
The dataset contains more than 100K textual descriptions of cultural items from Cultura Italia, the Italian National Cultural aggregator. Each description is labeled either HIGH or LOW quality, according to its adherence to the standard cataloguing guidelines provided by the Istituto Centrale per il Catalogo e la Documentazione (ICCD). More precisely, a description is labeled HIGH quality if both the object and the subject of the item it describes are presented according to the ICCD guidelines, and LOW quality in all other cases. Most of the dataset was manually annotated, with ~30K descriptions automatically labeled as LOW quality because of their length (fewer than 3 tokens) or their provenance from old (pre-2012), not curated collections. The dataset was developed to support the training and testing of ML text classification approaches for automatically assessing the quality of textual descriptions in digital Cultural Heritage repositories.
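The automatic LOW-quality rule described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `auto_label_low`, the whitespace tokenization, and the `collection_year` and `curated` arguments are assumptions introduced here, since the description does not say how tokens were counted or how collection age was recorded.

```python
from typing import Optional


def auto_label_low(description: str,
                   collection_year: Optional[int] = None,
                   curated: bool = True) -> bool:
    """Return True if a description would be labeled LOW without manual review."""
    # Assumption: tokens are whitespace-separated words.
    too_short = len(description.split()) < 3
    # Assumption: "old, not curated" means a pre-2012 collection that was not curated.
    old_uncurated = (
        collection_year is not None
        and collection_year < 2012
        and not curated
    )
    return too_short or old_uncurated
```

Descriptions that do not trigger either condition would go on to manual annotation against the ICCD guidelines.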
This dataset can be used for text classification tasks. The paper introducing the dataset reports an F1 score of 0.783 on the task of classifying whether a metadata record is of low or high quality. Please see the results table for a full overview of the results reported in the paper.
The dataset consists of Italian metadata records. The labels are in English.
The dataset has only one configuration.
An example instance from the dataset:
{'metadata_text': 'Figure:putto.Oggetti:ghirlanda di fiori', 'label': 0, 'source': 'OpereArteVisiva'}
The data fields are metadata_text, label, and source, as shown in the example instance above.
The paper used ten-fold cross-validation and does not report specific train, validation, and test splits.
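Because no fixed splits are provided, users need to build their own folds. The following is a minimal sketch of plain shuffled k-fold splitting using only the standard library. The function name `k_fold_indices` and the seed are hypothetical, and whether the paper shuffled or stratified its folds is not stated here, so this should not be read as a reproduction of the paper's setup.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each record appears in exactly one test fold, which matters here because duplicates were removed precisely to keep the same description out of both training and test data.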
The dataset was generated using records from Cultura Italia. From the paper introducing the dataset:
By using the textual description encoded by the dc:description element from the Dublin Core metadata schema, we collect a dataset of 100,821 descriptions, after duplicate removal. These records include mainly data from “Musei d’Italia” and “Regione Marche” datasets, which have been chosen because they contain a high number of non-empty dc:description elements. p.221
From the paper:
Duplicates were removed for two reasons: this reduced annotation effort in the subsequent manual annotation, and avoided that the same example appear both in the training and in the test set, a situation that could make classification biased and lead to inaccurate evaluation in supervised settings. Duplicated descriptions were mainly short and of low-quality, reporting few generic words to describe an item (e.g. “Mensola.”, “Dipinto.”). p.221
The dataset was generated using records from Cultura Italia. This repository is accessible via an OAI-PMH handler or via a SPARQL endpoint.
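As a rough illustration of how the dc:description text could be pulled out of a harvested record, the sketch below parses a Dublin Core (`oai_dc`) record with Python's standard library. The sample record is made up for illustration and is not taken from Cultura Italia, and `extract_descriptions` is a hypothetical helper; only the two XML namespaces are standard.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical record, invented for this example.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Putto con ghirlanda</dc:title>
  <dc:description>Figure: putto. Oggetti: ghirlanda di fiori</dc:description>
  <dc:description>   </dc:description>
</oai_dc:dc>"""


def extract_descriptions(record_xml: str) -> List[str]:
    """Return the non-empty dc:description texts found in one record."""
    root = ET.fromstring(record_xml)
    texts = []
    for element in root.iter(f"{{{DC_NS}}}description"):
        text = (element.text or "").strip()
        if text:  # skip empty or whitespace-only descriptions
            texts.append(text)
    return texts
```

Skipping empty elements mirrors the paper's remark that source collections were chosen for their high number of non-empty dc:description elements.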
As discussed above, duplicates were removed from the dataset.
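A minimal sketch of order-preserving duplicate removal is shown below. It assumes duplicates are exact matches after trimming surrounding whitespace; how the authors actually detected duplicates is not stated, and `remove_duplicates` is a hypothetical name.

```python
from typing import Iterable, List


def remove_duplicates(descriptions: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each description, preserving input order."""
    seen = set()
    unique = []
    for text in descriptions:
        # Assumption: two descriptions are duplicates if they match
        # exactly once leading and trailing whitespace is removed.
        key = text.strip()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

For example, `remove_duplicates(["Mensola.", "Dipinto.", "Mensola."])` keeps one copy each of "Mensola." and "Dipinto.".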
Who are the source language producers?
The metadata producers are staff working in Italian cultural heritage institutions.
From the paper:
"Most of the dataset was manually annotated, with ~30K descriptions automatically labeled as LOW quality due to their length (less than 3 tokens) or their provenance from old (pre-2012), not curated, collections."
To determine the quality of the collected descriptions, the authors of the paper used guidelines from the Istituto Centrale per il Catalogo e la Documentazione (ICCD).
From the paper:
"More precisely, a specific section of the guidelines addresses how to describe any cultural item, clarifying that both the object and the subject of the item must be presented in the description as follows:
Object: the object typology and shape must be described. To describe the object, the cataloguer must refer to the vocabularies provided by ICCD, using specific terminology (e.g. the technique used for paintings and drawings, or the material for the archaeological items);
Subject: the cataloguer must report the iconographic and decorative settings of the item, such as the characters of the depicted scene in a painting and their attribution. Other aspects (e.g. the history behind the painting or the painter) should not be included." p.221
Who are the annotators?
"The annotation is carried out by an expert in cultural heritage who collaborated in the past with Cultura Italia and has therefore in-depth knowledge of the data characteristics and of the ICCD guidelines." p.222
No personal or sensitive information is described in the paper.
@article{Lorenzini2020,
  author = "Matteo Lorenzini and Marco Rospocher and Sara Tonelli",
  title = "{Annotated dataset to assess the accuracy of the textual description of cultural heritage records}",
  year = "2020",
  month = "12",
  url = "https://figshare.com/articles/dataset/Annotated_dataset_to_assess_the_accuracy_of_the_textual_description_of_cultural_heritage_records/13359104",
  doi = "10.6084/m9.figshare.13359104.v1"
}
Thanks to @davanstrien for adding this dataset.