larryvrh/WikiMatrix-v1-Ja_Zh-filtered

Dataset:

larryvrh/WikiMatrix-v1-Ja_Zh-filtered

A filtered and modified version of the Japanese/Chinese language-pair data from WikiMatrix v1.

Processing steps:

Basic regex-based filtering and length checking to remove abnormal pairs.

Semantic similarity filtering with a threshold of 0.6, using sentence-transformers/LaBSE.

Conversion of all Traditional Chinese sentences to Simplified Chinese with zhconv.
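The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the author's actual script: the regex pattern, the length limits, and the names `looks_normal`, `cosine`, and `filter_pairs` are hypothetical, and the card does not say whether a pair scoring exactly 0.6 is kept (this sketch keeps it). The embedding and conversion functions are passed in as parameters so that the sketch runs with the standard library alone.

```python
import math
import re
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Pair = Tuple[str, str]  # (Japanese sentence, Chinese sentence)

# Hypothetical rules: the card does not list its actual patterns or limits.
NOISE = re.compile(r"https?://|<[^>]+>|\[\d+\]")
MIN_CHARS, MAX_CHARS, MAX_LEN_RATIO = 5, 200, 3.0


def looks_normal(ja: str, zh: str) -> bool:
    """Step 1: regex and length checks (illustrative thresholds only)."""
    if NOISE.search(ja) or NOISE.search(zh):
        return False
    if not (MIN_CHARS <= len(ja) <= MAX_CHARS and MIN_CHARS <= len(zh) <= MAX_CHARS):
        return False
    # Reject pairs whose lengths differ too much to be plausible translations.
    ratio = max(len(ja), len(zh)) / min(len(ja), len(zh))
    return ratio <= MAX_LEN_RATIO


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(
    pairs: Iterable[Pair],
    embed: Callable[[str], Sequence[float]],
    to_simplified: Callable[[str], str],
    threshold: float = 0.6,
) -> Iterator[Pair]:
    """Apply the three steps in order and yield the surviving pairs."""
    for ja, zh in pairs:
        ja, zh = ja.strip(), zh.strip()
        if not looks_normal(ja, zh):                    # step 1: regex / length
            continue
        if cosine(embed(ja), embed(zh)) < threshold:    # step 2: similarity
            continue
        yield ja, to_simplified(zh)                     # step 3: Traditional to Simplified
```

With the real libraries installed, `embed` might wrap `SentenceTransformer("sentence-transformers/LaBSE").encode` and `to_simplified` might be `lambda s: zhconv.convert(s, "zh-hans")`. The locale code here is an assumption, since the card does not state which one was used.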

Author:

larryvrh

Dataset size:

110.51 MB