数据集:
HuggingFaceM4/cm4-synthetic-testing-with-embeddings
This dataset is designed to be used in testing multimodal text/image models. It's derived from cm4-10k dataset.
The current splits are: ['100.unique', '100.repeat', '300.unique', '300.repeat', '1k.unique', '1k.repeat', '10k.unique', '10k.repeat'] .
The unique ones ensure uniqueness across text entries.
The repeat ones are repeating the same 10 unique records: - these are useful for memory leaks debugging as the records are always the same and thus remove the record variation from the equation.
The default split is 100.unique .
The full process of this dataset creation is documented inside cm4-synthetic-testing.py .