Dataset:
chenghao/scielo_books
Subtasks:
language-modeling
Multilinguality:
multilingual
Size:
n<1K
Language creators:
found
Annotation creators:
no-annotation
Source datasets:
original
License:
cc-by-nc-sa-3.0
This dataset contains all text from the open-access PDFs on scielo.org. As of Dec. 5 2021, 962 books are available, though some of them are not native PDFs (e.g. scanned images).
As of Dec. 5 2021, there are 902 books in Portuguese, 55 in Spanish, and 5 in English.
A typical instance in the dataset, in JSON format:
{
  "sbid": "23pcw",
  "id": "23pcw",
  "shortname": "",
  "title": "Educa\u00e7\u00e3o, sa\u00fade e esporte: novos\tdesafios \u00e0 Educa\u00e7\u00e3o F\u00edsica",
  "eisbn": "9788574554907",
  "isbn": "9788574554273",
  "author": "Farias, Gelcemar Oliveira; Nascimento, Juarez Vieira do",
  "corporate_authors": "",
  "translators": "",
  "coordinators": "",
  "editors": "",
  "others": "",
  "organizers": "",
  "collaborators": "",
  "publisher": "Editus",
  "language": "pt",
  "year": 2016,
  "synopsis": "\"A colet\u00e2nea contempla cap\u00edtulos que discutem a Educa\u00e7\u00e3o F\u00edsica a partir dos pressupostos da Educa\u00e7\u00e3o, da Sa\u00fade e do Esporte, enquanto importante desafio do momento atual e diante dos avan\u00e7os e das mudan\u00e7as que se consolidaram na forma\u00e7\u00e3o inicial em Educa\u00e7\u00e3o F\u00edsica. A obra convida a todos para a realiza\u00e7\u00e3o de futuras investiga\u00e7\u00f5es, no sentido de concentrar esfor\u00e7os para o fortalecimento de n\u00facleos de estudos e a sistematiza\u00e7\u00e3o de linhas de pesquisa.\"",
  "format": "",
  "type": "book",
  "is_public": "true",
  "is_comercial": "false",
  "publication_date": "2018-11-07",
  "_version_": "1718206093473087488",
  "pdf_url": "http://books.scielo.org//id/23pcw/pdf/farias-9788574554907.pdf",
  "pdf_filename": "farias-9788574554907.pdf",
  "metadata_filename": "farias-9788574554907.json",
  "text": "..."
}
All fields are strings except year, which is a number.
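The type rule can be checked mechanically. The following is a minimal sketch, not part of the original dataset tooling: `check_field_types` is a hypothetical helper, and `raw` is a trimmed subset of the example record, kept short for illustration.

```python
import json

# Trimmed subset of the example record (illustration only, not a full record).
raw = '{"sbid": "23pcw", "language": "pt", "year": 2016, "is_public": "true", "text": "..."}'


def check_field_types(record: dict) -> list:
    """Return the names of fields that break the rule: str everywhere, int for year."""
    bad = []
    for name, value in record.items():
        expected = int if name == "year" else str
        if not isinstance(value, expected):
            bad.append(name)
    return bad


record = json.loads(raw)
print(check_field_types(record))  # prints []
```

Note that flags such as is_public and is_comercial are stored as the strings "true" and "false", so they need a string comparison rather than a truthiness test.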
All records are in the default train split.
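Since there is only one split, loading reduces to a single call. The sketch below assumes the Hugging Face `datasets` package is installed and that the dataset is still reachable under the identifier chenghao/scielo_books; `load_scielo_books` and `count_by_language` are hypothetical helper names introduced here.

```python
from collections import Counter


def load_scielo_books():
    """Load the train split (needs the `datasets` package and network access)."""
    # Imported lazily so the pure helper below works without `datasets` installed.
    from datasets import load_dataset
    return load_dataset("chenghao/scielo_books", split="train")


def count_by_language(records):
    """Count records per value of the `language` field."""
    return Counter(r["language"] for r in records)


# Usage (downloads the data on first call):
# books = load_scielo_books()
# print(count_by_language(books))
```

The per-language counts should be comparable to the figures reported for Dec. 5 2021, although the exact language codes other than "pt" are not shown in the card.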
This dataset was created as part of the BigScience effort to build language modeling datasets.
All PDFs were downloaded directly from the website, and the text was extracted with the pdftotext library.
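The extraction step might look roughly like the sketch below. It assumes the "pdftotext lib" is the `pdftotext` Python package (which wraps poppler); the card does not say which options were used. `extract_text` and `looks_scanned` are hypothetical names, and the length threshold is an arbitrary assumption for flagging books that are not native PDFs.

```python
def extract_text(pdf_path: str) -> str:
    """Extract the text of every page using the `pdftotext` Python package."""
    # Third-party dependency, imported lazily; it requires poppler to be installed.
    import pdftotext
    with open(pdf_path, "rb") as f:
        pages = pdftotext.PDF(f)
    # Each element of `pages` is the text of one page.
    return "\n\n".join(pages)


def looks_scanned(text: str, min_chars: int = 200) -> bool:
    """Heuristic (an assumption, not from the source): almost no text suggests an image-only PDF."""
    return len(text.strip()) < min_chars


# Usage with the file name from the example record:
# text = extract_text("farias-9788574554907.pdf")
# print(looks_scanned(text))
```

A check like `looks_scanned` is one way to separate scanned books from native PDFs before using the text field for language modeling.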
Who are the source language producers? NA
No annotation is available.
Annotation process: NA
Who are the annotators? NA
License: cc-by-nc-sa-3.0 (from the dataset metadata); no link to the license page is provided.