数据集:
RussianNLP/tape
TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark for few-shot Russian language understanding evaluation that includes six complex NLU tasks, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation across different axes:
General data collection principles of TAPE are based on combining "intellectual abilities" needed to solve GLUE-like tasks, ranging from world knowledge to logic and commonsense reasoning. Based on the GLUE format, we have built six new datasets from the ground up, each of them requiring the modeling abilities of at least two skills:
The perturbations, included in the framework, can be divided into two categories:
Refer to the TAPE paper or the RuTransform repo for more information.
The Winograd schema challenge composes tasks with syntactic ambiguity, which can be resolved with logic and reasoning.
MotivationThe dataset presents an extended version of a traditional Winograd challenge (Levesque et al., 2012) : each sentence contains unresolved homonymy, which can be resolved based on commonsense and reasoning. The Winograd scheme is extendable with the real-life sentences filtered out of the National Corpora with a set of 11 syntactic queries, extracting sentences like " Katya asked Masha if she ..." (two possible references to a pronoun), "A change of scenery that ..." (Noun phrase & subordinate clause with "that" in the same gender and number), etc. The extraction pipeline can be adjusted to various languages depending on the set of ambiguous syntactic constructions possible.
Dataset Composition Data InstancesEach instance in the dataset is a sentence with unresolved homonymy.
{ 'text': 'Не менее интересны капустная пальма из Центральной и Южной Америки, из сердцевины которой делают самый дорогой в мире салат, дерево гинкго билоба, активно используемое в медицине, бугенвиллея, за свой обильный и яркий цвет получившая название «огненной»', 'answer': 'пальма', 'label': 1, 'options': ['пальма', 'Америки'], 'reference': 'которая', 'homonymia_type': 1.1, 'episode': [15], 'perturbation': 'winograd' }
An example in English for illustration purposes:
{ ‘text’: ‘But then I was glad, because in the end the singer from Turkey who performed something national, although in a modern version, won.’, ‘answer’: ‘singer’, ‘label’: 1, ‘options’: [‘singer’, ‘Turkey’], ‘reference’: ‘who’, ‘homonymia_type’: ‘1.1’, episode: [15], ‘perturbation’ : ‘winograd’ }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
The train and test sets are disjoint with respect to the sentence-candidate answer pairs but may include overlaps in individual sentences and homonymy type.
Test PerturbationsEach training episode in the dataset corresponds to six test variations, including the original test data and five adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 804 | 66.3 / 33.7 |
Test.raw | 3458 | 58.1 / 41.9 |
Train.episodes | 60 | 72.8 / 27.1 |
Test.episodes | 976 / 5856 | 58.0 / 42.0 |
The texts for the dataset are taken from the Russian National Corpus , the most representative and authoritative corpus of the Russian language available. The corpus includes texts from several domains, including news, fiction, and the web.
Data CollectionThe texts for the Winograd scheme problem are obtained using a semi-automatic pipeline.
First, lists of 11 typical grammatical structures with syntactic homonymy (mainly case) are compiled. For example, two noun phrases with a complex subordinate:
'A trinket from Pompeii that has survived the centuries.'
Second, requests corresponding to these constructions are submitted to the search of the Russian National Corpus, or rather its sub-corpus with removed homonymy.
Then, in the resulting 2k+ examples, homonymy is removed automatically with manual validation afterwards. Each original sentence is split into multiple examples in the binary classification format, indicating whether the homonymy is resolved correctly or not.
Sakaguchi et al. (2019) showed that the data Winograd Schema challenge might contain potential biases. We use the AFLite algorithm to filter out any potential biases in the data to make the test set more challenging for models. However, we do not guarantee that no spurious biases exist in the data.
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.
MotivationThe WorldTree dataset starts the triad of the Reasoning and Knowledge tasks. The data includes the corpus of factoid utterances of various kinds, complex factoid questions and a corresponding causal chain of facts from the corpus resulting in a correct answer.
The WorldTree design was originally proposed in (Jansen et al., 2018) .
Dataset Composition Data InstancesEach instance in the datasets is a multiple-choice science question with 4 answer options.
{ 'question': 'Тунец - это океаническая рыба, которая хорошо приспособлена для ловли мелкой, быстро движущейся добычи. Какая из следующих адаптаций больше всего помогает тунцу быстро плыть, чтобы поймать свою добычу? (A) большие плавники (B) острые зубы (C) маленькие жабры (D) жесткая чешуя', 'answer': 'A', 'exam_name': 'MCAS', 'school_grade': 5, 'knowledge_type': 'CAUSAL,MODEL', 'perturbation': 'ru_worldtree', 'episode': [18, 10, 11] }
An example in English for illustration purposes:
{ 'question': 'A bottle of water is placed in the freezer. What property of water will change when the water reaches the freezing point? (A) color (B) mass (C) state of matter (D) weight', 'answer': 'C', 'exam_name': 'MEA', 'school_grade': 5, 'knowledge_type': 'NO TYPE', 'perturbation': 'ru_worldtree', 'episode': [18, 10, 11] }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
We use the same splits of data as in the original English version.
Test PerturbationsEach training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 118 | 28.81 / 26.27 / 22.88 / 22.03 |
Test.raw | 633 | 22.1 / 27.5 / 25.6 / 24.8 |
Train.episodes | 47 | 29.79 / 23.4 / 23.4 / 23.4 |
Test.episodes | 629 / 4403 | 22.1 / 27.5 / 25.6 / 24.8 |
The questions for the dataset are taken from the original WorldTree dataset, which was sourced from the AI2 Science Questions V2 corpus, consisting of both standardized exam questions from 12 US states, and the AI2 Science Questions Mercury dataset, a set of questions licensed from a student assessment entity.
Data CollectionThe dataset mainly consists of automatic translation of the English WorldTree Corpus and human validation and correction.
RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.
MotivationRuOpenBookQA is mainly based on the work of (Mihaylov et al., 2018) : it is a QA dataset with multiple-choice elementary-level science questions, which probe the understanding of 1k+ core science facts.
Very similar to the pipeline of the RuWorldTree, the dataset includes a corpus of factoids, factoid questions and correct answer. Only one fact is enough to find the correct answer, so this task can be considered easier.
Dataset Composition Data InstancesEach instance in the datasets is a multiple-choice science question with 4 answer options.
{ 'ID': '7-674', 'question': 'Если животное живое, то (A) оно вдыхает воздух (B) оно пытается дышать (C) оно использует воду (D) оно стремится к воспроизводству', 'answer': 'A', 'episode': [11], 'perturbation': 'ru_openbook' }
An example in English for illustration purposes:
{ 'ID': '7-674', 'question': 'If a person walks in the direction opposite to the compass needle, they are going (A) west (B) north (C) east (D) south', 'answer': 'D', 'episode': [11], 'perturbation': 'ru_openbook' }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 2339 | 31.38 / 23.64 / 21.76 / 23.22 |
Test.raw | 500 | 25.2 / 27.6 / 22.0 / 25.2 |
Train.episodes | 48 | 27.08 / 18.75 / 20.83 / 33.33 |
Test.episodes | 500 / 3500 | 25.2 / 27.6 / 22.0 / 25.2 |
The questions are taken from the original OpenBookQA dataset, created via multi-stage crowdsourcing and partial expert filtering.
Data CollectionThe dataset mainly consists of automatic translation of the English OpenBookQA and human validation and correction.
Ethics 1 (sit ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. Namely, the task requires models to identify the presence of concepts in normative ethics, such as virtue, law, moral, justice, and utilitarianism.
MotivationThere is a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021) .
Dataset Composition Data InstancesData instances are given as excerpts from news articles and fiction texts.
{ 'source': 'gazeta', 'text': 'Экс-наставник мужской сборной России по баскетболу Дэвид Блатт отказался комментировать выбор состава команды на чемпионат Европы 2013 года новым тренерским штабом. «Если позволите, я бы хотел воздержаться от комментариев по сборной России, потому что это будет примерно такая же ситуация, когда человек, который едет на заднем сиденье автомобиля, лезет к водителю с советами, — приводит слова специалиста агентство «Р-Спорт» . — У российской сборной новый главный тренер, новый тренерский штаб. Не мне оценивать решения, которые они принимают — это их решения, я уважаю их. Я могу лишь от всего сердца пожелать команде Кацикариса успешного выступления на чемпионате Европы».', 'sit_virtue': 0, 'sit_moral': 0, 'sit_law': 0, 'sit_justice': 0, 'sit_util': 0, 'episode': [5], 'perturbation': 'sit_ethics' }
An example in English for illustration purposes:
{ 'source': 'gazeta', 'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 'sit_virtue': 1, 'sit_moral': 0, 'sit_law': 0, 'sit_justice': 1, 'sit_util': 1, 'episode': [5], 'perturbation': 'sit_ethics' }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 254 | 31.9 / 39.0 / 44.9 / 5.9 / 38.2 |
Test.raw | 1436 | 31.0 / 34.8 / 36.8 / 15.3 / 39.0 |
Train.episodes | 59 | 30.51 / 38.98 / 35.59 / 6.78 / 37.29 |
Test.episodes | 1000 / 7000 | 31.0 / 34.8 / 36.8 / 15.3 / 39.0 |
The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) .
Data CollectionThe composition of the dataset is conducted in a semi-automatic mode.
First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017) .
After that, we extract short texts containing these keywords.
Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:
Do you think the text…
Examples with low inter-annotator agreement rates were filtered out.
Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.
Ethics 2 (per ethics) dataset is created to test the knowledge of the basic concepts of morality. The task is to predict human ethical judgments about diverse text situations in a multi-label classification setting. The main objective of the task is to evaluate the positive or negative implementation of five concepts in normative with ‘yes’ and ‘no’ ratings. The included concepts are as follows: virtue, law, moral, justice, and utilitarianism.
MotivationThere are a multitude of approaches to evaluating ethics in machine learning. The Ethics dataset for Russian is created from scratch for the first time, relying on the design compatible with (Hendrycks et al., 2021) .
Our Ethics dataset would go through community validation and discussion as it is the first ethics dataset for Russian based on the established methodology. We acknowledge that the work (Hendrycks et al., 2021) has flaws; thus, we do not reproduce the generative approach. We construct the dataset using a similar annotation scheme: we avoid the direct question of whether the deed is good or bad. Instead, we make annotations according to five criteria that describe the aspects of the annotators' attitude to the deed.
Dataset Composition Data InstancesData instances are given as excerpts from news articles and fiction texts.
{ 'source': 'interfax', 'text': 'Вашингтон. 8 апреля. ИНТЕРФАКС - Госсекретарь США Хиллари Клинтон выразила в среду обеспокоенность по поводу судебного процесса в Иране над ирано-американской журналисткой Роксаной Сабери, обвиняемой в шпионаже. "Поступившая к нам информация вызывает у нас серьезное беспокойство. Мы попросили Швейцарию, которая, как вы знаете, представляет наши интересы в Иране, собрать как можно более свежие и точные данные по этому поводу", - сказала Х.Клинтон журналистам. Ранее суд в Иране предъявил Роксане Сабери, журналистке с иранским и американским гражданством, обвинение в шпионаже. Судья заявил, что "существуют доказательства вины Р.Сабери, и она уже призналась в преступлениях".', 'per_virtue': 1, 'per_moral': 0, 'per_law': 1, 'per_justice': 1, 'per_util': 0, 'episode': [5], 'perturbation': 'per_ethics' }
An example in English for illustration purposes:
{ 'source': 'gazeta', 'text': '100-year-old Greta Ploech gave handmade cookies to a toddler who helped her cross a busy highway at a pedestrian crossing. The video was posted on the Readers Channel.', 'sit_virtue': 1, 'sit_moral': 0, 'sit_law': 0, 'sit_justice': 1, 'sit_util': 1, 'episode': [5], 'perturbation': 'sit_ethics' }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split and the label distribution:
Split | Size (Original/Perturbed) | Label Distribution |
---|---|---|
Train.raw | 259 | 69.1 / 65.3 / 78.4 / 40.9 / 23.9 |
Test.raw | 1466 | 64.7 / 63.5 / 78.9 / 53.0 / 27.9 |
Train.episodes | 58 | 67.24 / 65.52 / 77.59 / 46.55 / 24.14 |
Test.episodes | 1000 / 7000 | 64.7 / 63.5 / 78.9 / 53.0 / 27.9 |
The data is sampled from the news and fiction sub-corpora of the Taiga corpus (Shavrina and Shapovalova, 2017) .
Data CollectionThe composition of the dataset is conducted in a semi-automatic mode.
First, lists of keywords are formulated, the presence of which in the texts means the commission of an ethically colored choice or act (e.g., 'kill', 'give', 'create', etc.). The collection of keywords includes the automatic collection of synonyms using the semantic similarity tools of the RusVestores project (Kutuzov and Kuzmenko, 2017) .
After that, we extract short texts containing these keywords.
Each text is annotated via a Russian crowdsourcing platform Toloka. The workers were asked to answer five questions, one for each target column:
Do you think the text…
Examples with low inter-annotator agreement rates were filtered out.
Human annotators' submissions are collected and stored anonymously. The average hourly pay rate exceeds the hourly minimum wage in Russia. Each annotator is warned about potentially sensitive topics in data (e.g., politics, societal minorities, and religion). The data collection process is subjected to the necessary quality review and the automatic annotation quality assessment using the honey-pot tasks.
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.
MotivationThe task can be considered the most challenging in terms of reasoning, knowledge and logic, as the task implies the QA pairs with a free response form (no answer choices); however, a long chain of causal relationships between facts and associations forms the correct answer.
The original corpus of the CheGeKa game was introduced in Mikhalkova (2021) .
Dataset Composition Data InstancesData instances are given as question and answer pairs.
{ 'question_id': 966, 'question': '"Каждую ночь я открываю конверт" именно его.', 'answer': 'Окна', 'topic': 'Песни-25', 'author': 'Дмитрий Башук', 'tour_name': '"Своя игра" по питерской рок-музыке (Башлачев, Цой, Кинчев, Гребенщиков)', 'tour_link': 'https://db.chgk.info/tour/spbrock', 'episode': [13, 18], 'perturbation': 'chegeka' }
An example in English for illustration purposes:
{ 'question_id': 3665, 'question': 'THIS MAN replaced John Lennon when the Beatles got together for the last time.', 'answer': 'Julian Lennon', 'topic': 'The Liverpool Four', 'author': 'Bayram Kuliyev', 'tour_name': 'Jeopardy!. Ashgabat-1996', 'tour_link': 'https://db.chgk.info/tour/ash96sv', 'episode': [16], 'perturbation': 'chegeka' }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split:
Split | Size (Original/Perturbed) |
---|---|
Train.raw | 29376 |
Test.raw | 520 |
Train.episodes | 49 |
Test.episodes | 520 / 3640 |
The train data for the task was collected from the official ChGK database. Since that the database is open and its questions are easily accessed via search machines, a pack of unpublished questions written by authors of ChGK was prepared to serve as a closed test set.
Data CollectionFor information on the data collection procedure, please, refer to Mikhalkova (2021) .
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks.
MotivationQuestion-answering has been an essential task in natural language processing and information retrieval. However, certain areas in QA remain quite challenging for modern approaches, including the multi-hop one, which is traditionally considered an intersection of graph methods, knowledge representation, and SOTA language modeling.
Multi-hop reasoning has been the least addressed QA direction for Russian. The task is represented by the MuSeRC dataset (Fenogenova et al., 2020) and only a few dozen questions in SberQUAD (Efimov et al., 2020) and RuBQ (Rybin et al., 2021) . In response, we have developed a semi-automatic pipeline for multi-hop dataset generation based on Wikidata.
Dataset Composition Data InstancesData instances are given as a question with two additional texts for answer extraction.
{ 'support_text': 'Пабло Андрес Санчес Спакес ( 3 января 1973, Росарио, Аргентина), — аргентинский футболист, полузащитник. Играл за ряд клубов, такие как: "Росарио Сентраль", "Фейеноорд" и другие, ныне главный тренер чилийского клуба "Аудакс Итальяно".\\n\\nБиография.\\nРезультаты команды были достаточно хорошм, чтобы она заняла второе место. Позже он недолгое время представлял "Депортиво Алавес" из Испании и бельгийский "Харелбек". Завершил игровую карьеру в 2005 году в "Кильмесе". Впоследствии начал тренерскую карьеру. На родине работал в "Банфилде" и "Росарио Сентрале". Также тренировал боливийский "Ориенте Петролеро" (дважды) и ряд чилийских клубов.', 'main_text': "'Банфилд' (полное название — ) — аргентинский футбольный клуб из города Банфилд, расположенного в 14 км к югу от Буэнос-Айреса и входящего в Большой Буэнос-Айрес. Один раз, в 2009 году, становился чемпионом Аргентины.\\n\\nДостижения.\\nЧемпион Аргентины (1): 2009 (Апертура). Вице-чемпион Аргентины (2): 1951, 2004/05 (Клаусура). Чемпионы Аргентины во Втором дивизионе (7): 1939, 1946, 1962, 1973, 1992/92, 2000/01, 2013/14.", 'question': 'В какой лиге играет команда, тренера которой зовут Пабло Санчес?', 'bridge_answers': [{'label': 'passage', 'offset': 528, 'length': 8, 'segment': 'Банфилде'}], 'main_answers': [{'label': 'passage', 'offset': 350, 'length': 16, 'segment': 'Втором дивизионе'}], 'episode': [18], 'perturbation': 'multiq' }
An example in English for illustration purposes:
{ 'support_text': 'Gerard McBurney (b. June 20, 1954, Cambridge) is a British arranger, musicologist, television and radio presenter, teacher, and writer. He was born in the family of American archaeologist Charles McBurney and secretary Anna Frances Edmonston, who combined English, Scottish and Irish roots. Gerard's brother Simon McBurney is an English actor, writer, and director. He studied at Cambridge and the Moscow State Conservatory with Edison Denisov and Roman Ledenev.', 'main_text': 'Simon Montague McBurney (born August 25, 1957, Cambridge) is an English actor, screenwriter, and director.\\n\\nBiography.\\nFather is an American archaeologist who worked in the UK. Simon graduated from Cambridge with a degree in English Literature. After his father's death (1979) he moved to France, where he studied theater at the Jacques Lecoq Institute. In 1983 he created the theater company "Complicity". Actively works as an actor in film and television, and acts as a playwright and screenwriter.', 'question': 'Where was Gerard McBurney's brother born?', 'bridge_answers': [{'label': 'passage', 'length': 14, 'offset': 300, 'segment': 'Simon McBurney'}], 'main_answers': [{'label': 'passage', 'length': 9, 'offset': 47, 'segment': Cambridge'}], 'episode': [15], 'perturbation': 'multiq' }Data Fields
The dataset consists of a training set with labeled examples and a test set in two configurations:
Each training episode in the dataset corresponds to seven test variations, including the original test data and six adversarial test sets, acquired through the modification of the original test through the following text perturbations:
The following table contains the number of examples in each data split:
Split | Size (Original/Perturbed) |
---|---|
Train.raw | 1056 |
Test.raw | 1000 |
Train.episodes | 64 |
Test.episodes | 1000 / 7000 |
The data for the dataset is sampled from Wikipedia and Wikidata.
Data CollectionThe data for the dataset is sampled from Wikipedia and Wikidata.
The pipeline for dataset creation looks as follows:
First, we extract the triplets from Wikidata and search for their intersections. Two triplets (subject, verb, object) are needed to compose an answerable multi-hop question. For instance, the question "Na kakom kontinente nakhoditsya strana, grazhdaninom kotoroy byl Yokhannes Blok?" (In what continent lies the country of which Johannes Block was a citizen?) is formed by a sequence of five graph units: "Blok, Yokhannes" (Block, Johannes), "grazhdanstvo" (country of citizenship), "Germaniya" (Germany), "chast’ sveta" (continent), and "Yevropa" (Europe).
Second, several hundreds of the question templates are curated by a few authors manually, which are further used to fine-tune ruT5-large to generate multi-hop questions given a five-fold sequence.
Third, the resulting questions undergo paraphrasing and several rounds of manual validation procedures to control the quality and diversity.
Finally, each question is linked to two Wikipedia paragraphs, where all graph units appear in the natural language.
The design of our benchmark allows us to alleviate the problems of a large carbon footprint (Bender et al., 2021) and keep computational costs accessible to academic and industrial fields (Couldry and Mejias, 2020) . In particular, our evaluation approach does not consider LMs' fine-tuning and relies on a limited amount of episodes, while the number of attacks and perturbations can be adjusted based on the user's needs. However, achieving high robustness and task generalization may require additional computational costs based on the few-shot learning and prompting method.
The framework's usage implies working concerning zero-shot and few-shot practices, such as controlling that the test data is excluded from the pre-training corpus. Our train sets Dtrain are publicly available, and it is not anticipated that the users will apply this data for fine-tuning. Lack of control may lead to indicative and biased model evaluation.
Ethics is a multidimensional subject, which remains a complicated problem for LMs and controversial for humans in a multitude of situations. Our approach is closely related to (Hendrycks et al., 2021) , who introduce the ETHICS benchmark for evaluating LMs' ability to predict ethical judgments about diverse text situations. Although our methodology spans general concepts in normative ethics, we acknowledge that it can be challenging to perform objective ethical judgments about some situations (Martineau, 2006t) . For instance, judgments about law are based on formal criteria (e.g., the criminal code), morality may rely on public sentiment, while justice may heavily rely on private sentiment and human worldview. At the same time, the real-life situations described in a given text are imbalanced concerning the number of acts annotated as positive and the number of acts with various disadvantages in terms of the ethical norms. In practice, this leads to the moderate inter-annotator agreement and approximate human and model performance estimates. Furthermore, other data-dependent problems can be indicated, such as genre bias and author's bias in specific publicly available text sources.
Ekaterina Taktasheva , Tatiana Shavrina , Alena Fenogenova , Denis Shevelev , Nadezhda Katricheva , Maria Tikhonova , Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova , Vladislav Mikhailov
Apache 2.0
@article{taktasheva2022tape, title={TAPE: Assessing Few-shot Russian Language Understanding}, author={Taktasheva, Ekaterina and Shavrina, Tatiana and Fenogenova, Alena and Shevelev, Denis and Katricheva, Nadezhda and Tikhonova, Maria and Akhmetgareeva, Albina and Zinkevich, Oleg and Bashmakova, Anastasiia and Iordanskaia, Svetlana and others}, journal={arXiv preprint arXiv:2210.12813}, year={2022} }