数据集:

HuggingFaceM4/cm4-synthetic-testing-with-embeddings

中文

This dataset is designed to be used in testing multimodal text/image models. It's derived from cm4-10k dataset.

The current splits are: ['100.unique', '100.repeat', '300.unique', '300.repeat', '1k.unique', '1k.repeat', '10k.unique', '10k.repeat'] .

The unique ones ensure uniqueness across text entries.

The repeat ones are repeating the same 10 unique records: - these are useful for memory leaks debugging as the records are always the same and thus remove the record variation from the equation.

The default split is 100.unique .

The full process of this dataset creation is documented inside cm4-synthetic-testing.py .