Dataset:

HuggingFaceM4/OBELISC

License:

cc-by-4.0

Preprint:

arxiv:2306.16527

Size:

100M<n<1B

Language:

English

Dataset Card for OBELISC

Dataset Summary

OBELISC is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.

This dataset can be used to train large multimodal models, significantly improving their reasoning abilities compared to models trained solely on image/text pairs. Please refer to our paper for further details about the construction of the dataset, quantitative and qualitative analyses of OBELISC, and the experiments we conducted.

Languages

English

Data Fields

There are 4 fields: images, texts, metadata and general_metadata.

For each example, the columns images and texts hold two lists of the same size; at each index, exactly one of the two elements is not None.

For example, for the web document <image_1>text<image_2>, images contains [image_1, None, image_2] and texts contains [None, text, None].
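The parallel-list layout described above can be turned back into a single ordered document. The following is a minimal sketch, not part of the dataset tooling: the helper name interleave and the example.com URLs are hypothetical and exist only for illustration.

```python
from typing import List, Optional, Tuple


def interleave(
    images: List[Optional[str]], texts: List[Optional[str]]
) -> List[Tuple[str, str]]:
    """Merge the two parallel lists into one ordered list of (kind, value) pairs.

    Hypothetical helper: it is not shipped with the dataset.
    """
    if len(images) != len(texts):
        raise ValueError("images and texts must have the same length")
    merged = []
    for image, text in zip(images, texts):
        # Exactly one of the two entries must be set at each index.
        if (image is None) == (text is None):
            raise ValueError("each index needs exactly one non-None element")
        merged.append(("image", image) if image is not None else ("text", text))
    return merged


# Toy example mirroring <image_1>text<image_2>; the URLs are placeholders.
images = ["https://example.com/image_1.jpg", None, "https://example.com/image_2.jpg"]
texts = [None, "text", None]
document = interleave(images, texts)
```

Keeping the check for exactly one non-None element per index makes malformed rows fail loudly instead of silently dropping content.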

The images are replaced by their URLs, and users have to download them themselves, for example with the library img2dataset.
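Before downloading, the image URLs have to be pulled out of the examples. Here is a minimal sketch that gathers the unique URLs into newline-separated text, a format that img2dataset is understood to accept as a URL list (check its documentation before relying on this). The helper name collect_image_urls and the sample rows are hypothetical.

```python
from typing import Iterable, List


def collect_image_urls(examples: Iterable[dict]) -> List[str]:
    """Return the unique image URLs of the examples, in first-seen order."""
    urls = []
    seen = set()
    for example in examples:
        for url in example["images"]:
            # Text positions hold None in the images column; skip them.
            if url is not None and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls


# Two toy rows with placeholder URLs; real rows come from the train split.
examples = [
    {
        "images": ["https://example.com/a.jpg", None, "https://example.com/b.jpg"],
        "texts": [None, "text", None],
    },
    {
        "images": [None, "https://example.com/b.jpg"],
        "texts": ["more text", None],
    },
]
urls = collect_image_urls(examples)

# One URL per line; this string can be saved to a file and handed to a downloader.
url_list = "\n".join(urls) + "\n"
```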

In metadata, there is a string that can be transformed into a list with json.loads(example["metadata"]). This list has the same size as the lists of images and texts, with a dictionary at each index where there is an image and a None value where there is a text. The dictionary contains the metadata of the image (original source document, unformatted source, alt-text if present, ...).

Finally, in general_metadata, there is a string that can be transformed into a dictionary, containing the URL of the document and information about its location in the Common Crawl data.
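The two metadata fields can be decoded as sketched below. The example row is built by hand, and the dictionary keys (document_url, unformatted_src, alt_text, url, warc_filename) are hypothetical stand-ins; inspect a real example for the actual key names. Decoding general_metadata with json.loads is also an assumption, by analogy with metadata.

```python
import json

# Hand-built toy row; all keys inside the JSON strings are hypothetical.
example = {
    "images": ["https://example.com/a.jpg", None, "https://example.com/b.jpg"],
    "texts": [None, "text", None],
    "metadata": json.dumps(
        [
            {
                "document_url": "https://example.com/page",
                "unformatted_src": "/a.jpg",
                "alt_text": "A cat",
            },
            None,
            {"document_url": "https://example.com/page", "unformatted_src": "/b.jpg"},
        ]
    ),
    "general_metadata": json.dumps(
        {"url": "https://example.com/page", "warc_filename": "placeholder.warc.gz"}
    ),
}

# Per-position metadata: a dict for each image, None for each text.
metadata = json.loads(example["metadata"])
if len(metadata) != len(example["images"]):
    raise ValueError("metadata must align with images and texts")

# Document-level metadata, assumed here to be JSON-encoded as well.
general_metadata = json.loads(example["general_metadata"])

# Alt-text is optional, so use .get() and keep None for text positions.
alt_texts = [m.get("alt_text") if m is not None else None for m in metadata]
```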

Data Splits

There is only one split, train, that contains 141,047,697 examples.

Size

OBELISC, with images replaced by their URLs, takes up 666.6 GB in Arrow format and 377 GB in the uploaded Parquet format.

Terms of Use

By using the dataset, you agree to comply with the original licenses of the source content as well as the dataset license (CC-BY-4.0). Additionally, if you use this dataset to train a Machine Learning model, you agree to disclose your use of the dataset when releasing the model or an ML application using the model.

Licensing Information

License CC-BY-4.0.

Citation Information

If you are using this dataset, please cite:

@inproceedings{
lauren{\c{c}}on2023obe,
title={OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Lauren{\c{c}}on and Lucile Saulnier and L{\'e}o Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023}
}

Author:

HuggingFaceM4

Dataset size:

12.45 GB