数据集:
Bingsu/laion2B-multi-korean-subset
任务:
特征提取语言:
ko计算机处理:
monolingual大小:
10M<n<100M语言创建人:
crowdsourced批注创建人:
crowdsourced许可:
cc-by-4.0a subset data of laion/laion2B-multi , including only korean
CC-BY-4.0
>>> from datasets import load_dataset >>> dataset = load_dataset("Bingsu/laion2B-multi-korean-subset") >>> dataset DatasetDict({ train: Dataset({ features: ['SAMPLE_ID', 'URL', 'TEXT', 'HEIGHT', 'WIDTH', 'LICENSE', 'LANGUAGE', 'NSFW', 'similarity'], num_rows: 11376263 }) })
>>> dataset["train"].features {'SAMPLE_ID': Value(dtype='int64', id=None), 'URL': Value(dtype='string', id=None), 'TEXT': Value(dtype='string', id=None), 'HEIGHT': Value(dtype='int32', id=None), 'WIDTH': Value(dtype='int32', id=None), 'LICENSE': Value(dtype='string', id=None), 'LANGUAGE': Value(dtype='string', id=None), 'NSFW': Value(dtype='string', id=None), 'similarity': Value(dtype='float32', id=None)}
download: 1.56 GiB generated: 2.37 GiB total: 3.93 GiB
train | |
---|---|
# of data | 11376263 |
이미지의 가로가 HEIGHT 로, 세로가 WIDTH 로 되어있는 것 같습니다.
>>> dataset["train"][98] {'SAMPLE_ID': 2937471001780, 'URL': 'https://image.ajunews.com/content/image/2019/04/12/20190412175643597949.png', 'TEXT': '인천시교육청, 인천 시군구발전협의회 임원진과의 간담회 개최', 'HEIGHT': 640, 'WIDTH': 321, 'LICENSE': '?', 'LANGUAGE': 'ko', 'NSFW': 'UNLIKELY', 'similarity': 0.33347243070602417}
# pip install zstandard import pandas as pd from huggingface_hub import hf_hub_url url = hf_hub_url("Bingsu/laion2B-multi-korean-subset", filename="laion2B-multi-korean-subset.csv.zst", repo_type="dataset") # url = "https://huggingface.co/datasets/Bingsu/laion2B-multi-korean-subset/resolve/main/laion2B-multi-korean-subset.csv.zst" df = pd.read_csv(url)
778 MB
import csv import re from datasets import load_dataset from tqdm import tqdm pattern = re.compile(r"[가-힣]") def quote(s: str) -> str: s = s.replace('"""', "") return s def filter_func(example) -> bool: lang = example.get("LANGUAGE") text = example.get("TEXT") if not isinstance(lang, str) or not isinstance(text, str): return False return lang == "ko" or pattern.search(text) is not None file = open("./laion2B-mulit_korean_subset.csv", "w", encoding="utf-8", newline="") ds = load_dataset("laion/laion2B-multi", split="train", streaming=True) dsf = ds.filter(filter_func) header = [ "SAMPLE_ID", "URL", "TEXT", "HEIGHT", "WIDTH", "LICENSE", "LANGUAGE", "NSFW", "similarity", ] writer = csv.DictWriter(file, fieldnames=header) writer.writeheader() try: for data in tqdm(dsf): # total=11378843 data["TEXT"] = quote(data.get("TEXT", "")) if data["TEXT"]: writer.writerow(data) finally: file.close() print("Done!")
실행에 약 8시간이 소요되었습니다. 이후에 HEIGHT 나 WIDTH 가 None인 데이터를 제거하고 업로드하였습니다.
img2dataset 을 사용하여 URL로된 이미지들을 데이터셋 형태로 만들 수 있습니다.