Dataset:
shunk031/JGLUE
From the official README.md:
JGLUE, Japanese General Language Understanding Evaluation, is built to measure the general NLU ability in Japanese. JGLUE has been constructed from scratch without translation. We hope that JGLUE will facilitate NLU research in Japanese.
JGLUE has been constructed by a joint research project of Yahoo Japan Corporation and Kawahara Lab at Waseda University.
From the official README.md:
JGLUE consists of the tasks of text classification, sentence pair classification, and QA. Each task consists of multiple datasets.
Supported Tasks
MARC-ja
From the official README.md:
MARC-ja is a dataset of the text classification task. This dataset is based on the Japanese portion of Multilingual Amazon Reviews Corpus (MARC) (Keung+, 2020).
JSTS
From the official README.md:
JSTS is a Japanese version of the STS (Semantic Textual Similarity) dataset. STS is a task to estimate the semantic similarity of a sentence pair. The sentences in JSTS and JNLI (described below) are extracted from the Japanese version of the MS COCO Caption Dataset, the YJ Captions Dataset (Miyazaki and Shimizu, 2016).
JNLI
From the official README.md:
JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. NLI is a task to recognize the inference relation that a premise sentence has to a hypothesis sentence. The inference relations are entailment, contradiction, and neutral.
JSQuAD
From the official README.md:
JSQuAD is a Japanese version of SQuAD (Rajpurkar+, 2018), one of the datasets of reading comprehension. Each instance in the dataset consists of a question regarding a given context (Wikipedia article) and its answer. JSQuAD is based on SQuAD 1.1 (there are no unanswerable questions). We used the Japanese Wikipedia dump as of 20211101.
JCommonsenseQA
From the official README.md:
JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor+, 2019), which is a multiple-choice question answering dataset that requires commonsense reasoning ability. It is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.
Leaderboard
From the official README.md:
A leaderboard will be made public soon. The test set will be released at that time.
The language data in JGLUE is in Japanese (BCP-47 ja-JP).
To load a specific configuration, pass its name through the `name` argument of `load_dataset`:
MARC-ja

    from datasets import load_dataset

    dataset = load_dataset("shunk031/JGLUE", name="MARC-ja")

    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['sentence', 'label', 'review_id'],
    #         num_rows: 187528
    #     })
    #     validation: Dataset({
    #         features: ['sentence', 'label', 'review_id'],
    #         num_rows: 5654
    #     })
    # })

JSTS

    from datasets import load_dataset

    dataset = load_dataset("shunk031/JGLUE", name="JSTS")

    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['sentence_pair_id', 'yjcaptions_id', 'sentence1', 'sentence2', 'label'],
    #         num_rows: 12451
    #     })
    #     validation: Dataset({
    #         features: ['sentence_pair_id', 'yjcaptions_id', 'sentence1', 'sentence2', 'label'],
    #         num_rows: 1457
    #     })
    # })

An example of the JSTS dataset looks as follows:

    {
        "sentence_pair_id": "691",
        "yjcaptions_id": "127202-129817-129818",
        "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)",
        "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)",
        "label": 4.4
    }

JNLI

    from datasets import load_dataset

    dataset = load_dataset("shunk031/JGLUE", name="JNLI")

    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['sentence_pair_id', 'yjcaptions_id', 'sentence1', 'sentence2', 'label'],
    #         num_rows: 20073
    #     })
    #     validation: Dataset({
    #         features: ['sentence_pair_id', 'yjcaptions_id', 'sentence1', 'sentence2', 'label'],
    #         num_rows: 2434
    #     })
    # })

An example of the JNLI dataset looks as follows:

    {
        "sentence_pair_id": "1157",
        "yjcaptions_id": "127202-129817-129818",
        "sentence1": "街中の道路を大きなバスが走っています。 (A big bus is running on the road in the city.)",
        "sentence2": "道路を大きなバスが走っています。 (There is a big bus running on the road.)",
        "label": "entailment"
    }

JSQuAD

    from datasets import load_dataset

    dataset = load_dataset("shunk031/JGLUE", name="JSQuAD")

    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
    #         num_rows: 62859
    #     })
    #     validation: Dataset({
    #         features: ['id', 'title', 'context', 'question', 'answers', 'is_impossible'],
    #         num_rows: 4442
    #     })
    # })

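The `context` and `answers` features listed above can be sanity-checked against each other. The following is a minimal sketch, assuming `answers` follows the SQuAD convention of parallel `text` and `answer_start` lists with character offsets into `context`. The helper name `answer_spans_match` and the sample record are hypothetical and are not part of the dataset or its loader.

```python
def answer_spans_match(example):
    """Return True if every answer text sits at its stated character offset."""
    context = example["context"]
    answers = example["answers"]
    return all(
        context[start:start + len(text)] == text
        for text, start in zip(answers["text"], answers["answer_start"])
    )


# Hypothetical record shaped like a JSQuAD instance (not taken from the dataset).
context = "東海道新幹線 [SEP] 最高速度 285 km/h で運行されている。"
sample = {
    "context": context,
    "answers": {
        "text": ["285 km/h"],
        # The offset is computed here so that it is correct by construction.
        "answer_start": [context.index("285 km/h")],
    },
}

print(answer_spans_match(sample))
```

A check like this is useful after any preprocessing that changes whitespace, since such changes shift character offsets.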
An example of the JSQuAD dataset looks as follows:

    {
        "id": "a1531320p0q0",
        "title": "東海道新幹線",
        "context": "東海道新幹線 [SEP] 1987 年(昭和 62 年)4 月 1 日の国鉄分割民営化により、JR 東海が運営を継承した。西日本旅客鉄道(JR 西日本)が継承した山陽新幹線とは相互乗り入れが行われており、東海道新幹線区間のみで運転される列車にも JR 西日本所有の車両が使用されることがある。2020 年(令和 2 年)3 月現在、東京駅 - 新大阪駅間の所要時間は最速 2 時間 21 分、最高速度 285 km/h で運行されている。",
        "question": "2020 年(令和 2 年)3 月現在、東京駅 - 新大阪駅間の最高速度はどのくらいか。",
        "answers": {
            "text": ["285 km/h"],
            "answer_start": [182]
        },
        "is_impossible": false
    }

JCommonsenseQA

    from datasets import load_dataset

    dataset = load_dataset("shunk031/JGLUE", name="JCommonsenseQA")

    print(dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['q_id', 'question', 'choice0', 'choice1', 'choice2', 'choice3', 'choice4', 'label'],
    #         num_rows: 8939
    #     })
    #     validation: Dataset({
    #         features: ['q_id', 'question', 'choice0', 'choice1', 'choice2', 'choice3', 'choice4', 'label'],
    #         num_rows: 1119
    #     })
    # })

An example of the JCommonsenseQA dataset looks as follows:

    {
        "q_id": 3016,
        "question": "会社の最高責任者を何というか? (What do you call the chief executive officer of a company?)",
        "choice0": "社長 (president)",
        "choice1": "教師 (teacher)",
        "choice2": "部長 (manager)",
        "choice3": "バイト (part-time worker)",
        "choice4": "部下 (subordinate)",
        "label": 0
    }

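In the example above, `label` is an integer index into the `choice0` to `choice4` fields. A minimal sketch of resolving it to the answer string follows. The helper name `answer_text` is hypothetical, and the sample copies the example above with the English glosses dropped.

```python
def answer_text(example):
    """Look up the choice string that the integer label points at."""
    return example[f"choice{example['label']}"]


sample = {
    "q_id": 3016,
    "question": "会社の最高責任者を何というか?",
    "choice0": "社長",
    "choice1": "教師",
    "choice2": "部長",
    "choice3": "バイト",
    "choice4": "部下",
    "label": 0,
}

print(answer_text(sample))  # 社長
```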
From the official README.md, there are the following two cases:
From the official README.md:
Only train/dev sets are available now, and the test set will be available after the leaderboard is made public.
Task | Dataset | Train | Dev | Test |
---|---|---|---|---|
Text Classification | MARC-ja | 187,528 | 5,654 | 5,639 |
Text Classification | JCoLA† | - | - | - |
Sentence Pair Classification | JSTS | 12,451 | 1,457 | 1,589 |
Sentence Pair Classification | JNLI | 20,073 | 2,434 | 2,508 |
Question Answering | JSQuAD | 62,859 | 4,442 | 4,420 |
Question Answering | JCommonsenseQA | 8,939 | 1,119 | 1,118 |
†JCoLA will be added soon.
From the original paper:
JGLUE is designed to cover a wide range of GLUE and SuperGLUE tasks and consists of three kinds of tasks: text classification, sentence pair classification, and question answering.
Who are the source language producers?
From the original paper:
As one of the text classification datasets, we build a dataset based on the Multilingual Amazon Reviews Corpus (MARC) (Keung et al., 2020). MARC is a multilingual corpus of product reviews with 5-level star ratings (1-5) on the Amazon shopping site. This corpus covers six languages, including English and Japanese. For JGLUE, we use the Japanese part of MARC and to make it easy for both humans and computers to judge a class label, we cast the text classification task as a binary classification task, where 1- and 2-star ratings are converted to “negative”, and 4 and 5 are converted to “positive”. We do not use reviews with a 3-star rating.
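The star-to-label conversion described above can be sketched as follows. This is an illustration of the stated rule only, not the authors' preprocessing code, and the function name `star_to_label` is hypothetical.

```python
from typing import Optional


def star_to_label(stars: int) -> Optional[str]:
    """Map a 1-5 star rating to a binary label; 3-star reviews are dropped."""
    if stars in (1, 2):
        return "negative"
    if stars in (4, 5):
        return "positive"
    if stars == 3:
        return None  # 3-star reviews are not used
    raise ValueError(f"star rating must be between 1 and 5, got {stars}")


ratings = [1, 2, 3, 4, 5]
print([star_to_label(r) for r in ratings])
# ['negative', 'negative', None, 'positive', 'positive']
```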
One of the problems with MARC is that it sometimes contains data where the rating diverges from the review text. This happens, for example, when a review with positive content is given a rating of 1 or 2. These data degrade the quality of our dataset. To improve the quality of the dev/test instances used for evaluation, we crowdsource a positive/negative judgment task for approximately 12,000 reviews. We adopt only reviews with the same votes from 7 or more out of 10 workers and assign a label of the maximum votes to these reviews. We divide the resulting reviews into dev/test data.
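The agreement filter in this paragraph (keep a review only when at least 7 of the 10 workers cast the same vote, then use that vote as the label) can be sketched as below. The function name `agreed_label` and the `min_agreement` parameter are assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def agreed_label(votes: Sequence[str], min_agreement: int = 7) -> Optional[str]:
    """Return the majority label if enough workers agree, otherwise None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


votes = ["positive"] * 8 + ["negative"] * 2
print(agreed_label(votes))  # positive
```

With ten votes and a threshold of seven, a tie can never pass the filter, so the accepted label is always unambiguous.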
We obtained 5,654 and 5,639 instances for the dev and test data, respectively, through the above procedure. For the training data, we extracted 187,528 instances directly from MARC without performing the cleaning procedure because of the large number of training instances. The statistics of MARC-ja are listed in Table 2. For the evaluation metric for MARC-ja, we use accuracy because it is a binary classification task of texts.
JSTS and JNLI
From the original paper:
For the sentence pair classification datasets, we construct a semantic textual similarity (STS) dataset, JSTS, and a natural language inference (NLI) dataset, JNLI.
STS is a task of estimating the semantic similarity of a sentence pair. Gold similarity is usually assigned as an average of the integer values 0 (completely different meaning) to 5 (equivalent meaning) assigned by multiple workers through crowdsourcing.
NLI is a task of recognizing the inference relation that a premise sentence has to a hypothesis sentence. Inference relations are generally defined by three labels: “entailment”, “contradiction”, and “neutral”. Gold inference relations are often assigned by majority voting after collecting answers from multiple workers through crowdsourcing.
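The two aggregation schemes just described, averaging integer ratings for STS and majority voting for NLI, can be sketched as follows. The function names are hypothetical, and returning `None` on a tied vote is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter
from statistics import mean
from typing import Optional, Sequence


def gold_similarity(ratings: Sequence[int]) -> float:
    """Average the workers' integer ratings, each between 0 and 5."""
    if any(r < 0 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 0 and 5")
    return float(mean(ratings))


def gold_relation(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None when the top count is tied."""
    ranked = Counter(labels).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: a tied vote yields no gold label
    return ranked[0][0]


print(gold_similarity([4, 5, 4, 5, 4]))
print(gold_relation(["entailment", "entailment", "neutral"]))  # entailment
```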
For the STS and NLI tasks, STS-B (Cer et al., 2017) and MultiNLI (Williams et al., 2018) are included in GLUE, respectively. As Japanese datasets, JSNLI (Yoshikoshi et al., 2020) is a machine translated dataset of the NLI dataset SNLI (Stanford NLI), and JSICK (Yanaka and Mineshima, 2021) is a human translated dataset of the STS/NLI dataset SICK (Marelli et al., 2014). As mentioned in Section 1, these have problems originating from automatic/manual translations. To solve this problem, we construct STS/NLI datasets in Japanese from scratch. We basically extract sentence pairs in JSTS and JNLI from the Japanese version of the MS COCO Caption Dataset (Chen et al., 2015), the YJ Captions Dataset (Miyazaki and Shimizu, 2016). Most of the sentence pairs in JSTS and JNLI overlap, allowing us to analyze the relationship between similarities and inference relations for the same sentence pairs like SICK and JSICK.
The similarity value in JSTS is assigned a real number from 0 to 5 as in STS-B. The inference relation in JNLI is assigned from the above three labels as in SNLI and MultiNLI. The definitions of the inference relations are also based on SNLI.
Our construction flow for JSTS and JNLI is shown in Figure 1. Basically, two captions for the same image of YJ Captions are used as sentence pairs. For these sentence pairs, similarities and NLI relations of entailment and neutral are obtained by crowdsourcing. However, it is difficult to collect sentence pairs with low similarity and contradiction relations from captions for the same image. To solve this problem, we collect sentence pairs with low similarity from captions for different images. We collect contradiction relations by asking workers to write contradictory sentences for a given caption.
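The two sources of candidate pairs in this flow, captions of the same image and captions of different images, can be illustrated with a small sketch. It enumerates every possible pair, whereas how the authors actually sampled pairs is not described here. The function names and the toy input are hypothetical.

```python
from itertools import combinations


def same_image_pairs(captions_by_image):
    """Pair up captions that describe the same image."""
    pairs = []
    for image_id, captions in captions_by_image.items():
        for first, second in combinations(captions, 2):
            pairs.append((image_id, first, second))
    return pairs


def cross_image_pairs(captions_by_image):
    """Pair up captions of different images (candidates for low similarity)."""
    pairs = []
    for (image_a, captions_a), (image_b, captions_b) in combinations(
        captions_by_image.items(), 2
    ):
        for first in captions_a:
            for second in captions_b:
                pairs.append((image_a, image_b, first, second))
    return pairs


# Toy input: image id -> list of captions (placeholder strings).
captions = {"img1": ["a", "b", "c"], "img2": ["d", "e"]}
print(len(same_image_pairs(captions)))   # 4
print(len(cross_image_pairs(captions)))  # 6
```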
The detailed construction procedure for JSTS and JNLI is described below.
JSQuAD
From the original paper:
As QA datasets, we build a Japanese version of SQuAD (Rajpurkar et al., 2016), one of the datasets of reading comprehension, and a Japanese version of CommonsenseQA, which is explained in the next section.
Reading comprehension is the task of reading a document and answering questions about it. Many reading comprehension evaluation sets have been built in English, followed by those in other languages or multilingual ones.
In Japanese, reading comprehension datasets for quizzes (Suzuki et al., 2018) and those in the driving domain (Takahashi et al., 2019) have been built, but none are in the general domain. We use Wikipedia to build a dataset for the general domain. The construction process is basically based on SQuAD 1.1 (Rajpurkar et al., 2016).
First, to extract high-quality articles from Wikipedia, we use Nayuki, which estimates the quality of articles on the basis of hyperlinks in Wikipedia. We randomly chose 822 articles from the top-ranked 10,000 articles. For example, the articles include “熊本県 (Kumamoto Prefecture)” and “フランス料理 (French cuisine)”. Next, we divide an article into paragraphs, present each paragraph to crowdworkers, and ask them to write questions and answers that can be answered if one understands the paragraph. Figure 2 shows an example of JSQuAD. We ask workers to write two additional answers for the dev and test sets to make the system evaluation robust.
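The article-selection step at the start of this paragraph (rank articles by estimated quality, keep the top 10,000, then draw 822 at random) can be sketched as follows. The quality scores are assumed to come from an external estimator. The function name, the `seed` parameter, and the toy scores are hypothetical.

```python
import random


def sample_articles(quality_scores, top_k=10_000, sample_size=822, seed=0):
    """Draw a random sample from the top_k highest-scoring article titles."""
    ranked = sorted(quality_scores, key=quality_scores.get, reverse=True)
    top = ranked[:top_k]
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(top, min(sample_size, len(top)))


# Toy scores: title -> quality estimate (made-up values).
scores = {f"article{i}": i for i in range(100)}
chosen = sample_articles(scores, top_k=10, sample_size=3)
print(len(chosen))  # 3
```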
JCommonsenseQA
From the original paper:
JCommonsenseQA is a Japanese version of CommonsenseQA (Talmor et al., 2019), which consists of five choice QA to evaluate commonsense reasoning ability. Figure 3 shows examples of JCommonsenseQA. In the same way as CommonsenseQA, JCommonsenseQA is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet (Speer et al., 2017). ConceptNet is a multilingual knowledge base that consists of triplets of two concepts and their relation. The triplets are directional and represented as (source concept, relation, target concept), for example (bullet train, AtLocation, station).
The construction flow for JCommonsenseQA is shown in Figure 4. First, we collect question sets (QSs) from ConceptNet, each of which consists of a source concept and three target concepts that have the same relation to the source concept. Next, for each QS, we crowdsource a task of writing a question with only one target concept as the answer and a task of adding two distractors. We describe the detailed construction procedure for JCommonsenseQA below, showing how it differs from CommonsenseQA.
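The question-set step can be sketched by grouping directional triplets by source concept and relation and keeping groups with at least three targets. This is an illustration under assumptions rather than the authors' pipeline: taking the first three targets is an arbitrary choice, and every triplet below except (bullet train, AtLocation, station) is made up.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    source: str
    relation: str
    target: str


def collect_question_sets(triplets, targets_per_set=3):
    """Group targets by (source, relation) and keep groups that are large enough."""
    grouped = defaultdict(list)
    for triplet in triplets:
        grouped[(triplet.source, triplet.relation)].append(triplet.target)
    question_sets = []
    for (source, relation), targets in grouped.items():
        if len(targets) >= targets_per_set:
            # Assumption: take the first targets_per_set targets as the set.
            question_sets.append((source, relation, tuple(targets[:targets_per_set])))
    return question_sets


triplets = [
    Triplet("bullet train", "AtLocation", "station"),
    Triplet("bullet train", "AtLocation", "railway"),  # made-up triplet
    Triplet("bullet train", "AtLocation", "Japan"),    # made-up triplet
    Triplet("bullet train", "IsA", "train"),           # made-up triplet
]
print(collect_question_sets(triplets))
# [('bullet train', 'AtLocation', ('station', 'railway', 'Japan'))]
```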
From the official README.md:
We use Yahoo! Crowdsourcing for all crowdsourcing tasks in constructing the datasets.
From the original paper:
We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.
The authors curated the original data for JSQuAD from the Japanese Wikipedia dump.
JCommonsenseQA
In the same way as CommonsenseQA, JCommonsenseQA is built using crowdsourcing with seeds extracted from the knowledge base ConceptNet.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    @inproceedings{kurihara-etal-2022-jglue,
        title = "{JGLUE}: {J}apanese General Language Understanding Evaluation",
        author = "Kurihara, Kentaro and Kawahara, Daisuke and Shibata, Tomohide",
        booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
        month = jun,
        year = "2022",
        address = "Marseille, France",
        publisher = "European Language Resources Association",
        url = "https://aclanthology.org/2022.lrec-1.317",
        pages = "2957--2966",
        abstract = "To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. While the English NLU benchmark, GLUE, has been the forerunner, benchmarks are now being released for languages other than English, such as CLUE for Chinese and FLUE for French; but there is no such benchmark for Japanese. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.",
    }


    @InProceedings{Kurihara_nlp2022,
        author = "栗原健太郎 and 河原大輔 and 柴田知秀",
        title = "JGLUE: 日本語言語理解ベンチマーク",
        booktitle = "言語処理学会第28回年次大会",
        year = "2022",
        url = "https://www.anlp.jp/proceedings/annual_meeting/2022/pdf_dir/E8-4.pdf",
        note = "in Japanese"
    }

Thanks to Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata for creating this dataset.