Dataset:

MaCoCu/parallel_data

Tasks:

translation

Multilinguality:

translation

Size:

10M<n<100M

Language creators:

found

Annotation creators:

no-annotation

Source datasets:

original

License:

cc0-1.0


Dataset Summary

The MaCoCu parallel corpora were obtained by crawling and consist of pairs of source and target segments (one or several sentences each) along with additional metadata. The following metadata fields are included:

  • "src_url" and "trg_url": source and target document URL;
  • "src_text" and "trg_text": text in the non-English language and in English;
  • "bleualign_score": similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
  • "src_deferred_hash" and "trg_deferred_hash": hash identifier for the corresponding segment;
  • "src_paragraph_id" and "trg_paragraph_id": identifier of the paragraph where the segment appears in the original document;
  • "src_doc_title" and "trg_doc_title": title of the documents from which the segments were obtained;
  • "src_crawl_date" and "trg_crawl_date": date and time when the source and target documents were downloaded;
  • "src_file_type" and "trg_file_type": type of the original documents (usually HTML format);
  • "src_boilerplate" and "trg_boilerplate": whether the source or target segment is boilerplate;
  • "bifixer_hash": hash identifier for the segment pair;
  • "bifixer_score": score that indicates how likely the segments are to be correct in their corresponding language;
  • "bicleaner_ai_score": score that indicates how likely the segments are to be parallel;
  • "biroamer_entities_detected": whether any of the segments contains personal information;
  • "dsi": Digital Service Infrastructure (DSI) class, indicating whether the segment is connected to any of the DSI classes (e.g., cybersecurity, e-health, e-justice, open-data-portal) defined by the Connecting Europe Facility ( https://github.com/RikVN/DSI );
  • "translation_direction": translation direction and machine-translation identification: the source segment in each segment pair was identified using a probabilistic model ( https://github.com/RikVN/TranslationDirection ), which also determines whether the translation was produced by a machine translation system;
  • "en_document_level_variant": the variant of English (British or American) at the document level, identified using a lexicon-based English variety classifier ( https://pypi.org/project/abclf/ ) that is applied at both document and domain level;
  • "domain_en": name of the web domain for the English document;
  • "en_domain_level_variant": language variant for English at the level of the web domain.
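
The score fields above lend themselves to simple quality filtering. The following is a minimal sketch, not part of the dataset tooling: it assumes records are plain dictionaries keyed by the field names listed above, and that "bicleaner_ai_score" is numeric like "bleualign_score". The function name `filter_pairs`, the thresholds, and the toy records are all hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, List

Record = Dict[str, Any]


def filter_pairs(
    records: Iterable[Record],
    min_bicleaner: float = 0.5,   # hypothetical threshold, not a recommended value
    min_bleualign: float = 0.0,   # hypothetical threshold, not a recommended value
) -> List[Record]:
    """Keep segment pairs whose alignment and parallelism scores meet the thresholds.

    Records missing either score are skipped rather than guessed at.
    """
    kept = []
    for rec in records:
        bicleaner = rec.get("bicleaner_ai_score")
        bleualign = rec.get("bleualign_score")
        if bicleaner is None or bleualign is None:
            continue
        if float(bicleaner) >= min_bicleaner and float(bleualign) >= min_bleualign:
            kept.append(rec)
    return kept


# Toy records made up for this example; they are not taken from the corpus.
toy_records = [
    {"src_text": "Góðan daginn.", "trg_text": "Good day.",
     "bleualign_score": 0.9, "bicleaner_ai_score": 0.95},
    {"src_text": "Heim", "trg_text": "Contact us",
     "bleualign_score": 0.2, "bicleaner_ai_score": 0.1},
]

clean = filter_pairs(toy_records, min_bicleaner=0.5, min_bleualign=0.3)
print(len(clean), "of", len(toy_records), "toy pairs kept")
```

Suitable thresholds depend on the language pair and the downstream use, so they are best chosen after inspecting the score distributions in the actual data.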

To load a language pair, pass the dataset name and the language pair, with English first:

from datasets import load_dataset
dataset = load_dataset("MaCoCu/parallel_data", "en-is")
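
The "English first" naming rule can be wrapped in a small helper so the configuration name is built consistently. This is a sketch under assumptions: `pair_config` and `load_pair` are hypothetical names, and only the "en-is" pair is confirmed by the text above, so other language codes may or may not exist as configurations.

```python
def pair_config(lang: str) -> str:
    """Build the configuration name for a language pair, with English first."""
    lang = lang.strip().lower()
    if not lang or lang == "en":
        raise ValueError("expected the code of the non-English language, e.g. 'is'")
    return f"en-{lang}"


def load_pair(lang: str, **kwargs):
    """Load one language pair; extra keyword arguments go to load_dataset."""
    # Imported lazily so pair_config can be used without the datasets library installed.
    from datasets import load_dataset

    return load_dataset("MaCoCu/parallel_data", pair_config(lang), **kwargs)


print(pair_config("is"))
```

Calling `load_pair("is")` is then equivalent to the `load_dataset` call shown above; it downloads data, so it needs network access and the `datasets` package.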