Dataset:
clarin-pl/2021-punctuation-restoration
Restore punctuation marks from the output of an ASR system.
Speech transcripts generated by Automatic Speech Recognition (ASR) systems typically contain no punctuation or capitalization. In longer stretches of automatically recognized speech, the lack of punctuation affects the general clarity of the output text [1]. The primary purpose of punctuation restoration (PR) and capitalization restoration (CR) as a distinct natural language processing (NLP) task is to improve the legibility of ASR-generated text, and possibly of other types of text that lack punctuation. Aside from their intrinsic value, PR and CR may improve the performance of downstream NLP tasks such as Named Entity Recognition (NER), part-of-speech (POS) tagging, semantic parsing, or spoken dialog segmentation [2, 3].

As useful as it seems, PR is hard to evaluate systematically on transcripts of conversational language, mainly because punctuation rules can be ambiguous even for originally written texts, and the very nature of naturally occurring spoken language makes it difficult to identify clear phrase and sentence boundaries [4, 5]. Given these requirements and limitations, a PR task based on a redistributable corpus of read speech was proposed. The 1200 texts included in this collection (totaling over 240,000 words) were selected from two distinct sources: WikiNews and WikiTalks. Punctuation found in these sources should be treated with some reservation when used for evaluation: they are original texts and may contain user-induced errors and bias. The texts were read out by over a hundred different speakers. The original texts with punctuation were force-aligned with the recordings and used as the ideal ASR output.

The goal of the task is to provide a solution for restoring punctuation in the test set collated for this task. The test set consists of time-aligned ASR transcriptions of read texts from the two sources. Participants are encouraged to use both text-based and speech-derived features to identify punctuation symbols (e.g. in a multimodal framework [6]).
In addition, the train set is accompanied by reference text corpora of WikiNews and WikiTalks data that can be used in training and fine-tuning punctuation models.
The purpose of this task is to restore punctuation in the ASR recognition of texts read out loud.
Input ('tokens' column): sequence of tokens
Output ('tags' column): sequence of tags
Measurement: F1-score (seqeval)
Example:
Input: ['selekcjoner', 'szosowej', 'kadry', 'elity', 'mężczyzn', 'piotr', 'wadecki', 'ogłosił', '27', 'marca', '2008', 'r', 'szeroki', 'skład', 'zawodników', 'którzy', 'będą', 'rywalizować', 'o', 'miejsce', 'w', 'reprezentacji', 'na', 'tour', 'de', 'pologne', 'lista', 'liczy', '22', 'nazwiska', 'zawodników', 'zarówno', 'z', 'zagranicznych', 'jaki', 'i', 'polskich', 'ekip', 'spośród', '22', 'wybrańców', 'selekcjonera', 'do', 'składu', 'dostanie', 'się', 'tylko', 'ośmiu', 'kolarzy', 'którzy', 'we', 'wrześniu', 'będą', 'rywalizować', 'z', 'najlepszymi', 'grupami', 'kolarskimi', 'na', 'świecie', 'w', 'kręgu', 'zainteresowania', 'wadeckiego', 'znajduje', 'się', 'także', 'pięciu', 'innych', 'zawodników', 'ale', 'oni', 'prawdopodobnie', 'wystartują', 'w', 'polskim', 'tourze', 'w', 'szeregach', 'swoich', 'ekip', 'szeroka', 'kadra', 'na', 'tour', 'de', 'pologne', 'dariusz', 'baranowski', 'łukasz', 'bodnar', 'bartosz', 'huzarski', 'błażej', 'janiaczyk', 'tomasz', 'kiendyś', 'mateusz', 'komar', 'tomasz', 'lisowicz', 'piotr', 'mazur', 'jacek', 'morajko', 'przemysław', 'niemiec', 'marek', 'rutkiewicz', 'krzysztof', 'szczawiński', 'mateusz', 'taciak', 'adam', 'wadecki', 'mariusz', 'witecki', 'piotr', 'zaradny', 'piotr', 'zieliński', 'mateusz', 'mróz', 'marek', 'wesoły', 'jarosław', 'rębiewski', 'robert', 'radosz', 'jarosław', 'dąbrowski']
Input (translated by DeepL): the selector of the men's elite road cycling team piotr wadecki announced on march 27, 2008 a wide line-up of riders who will compete for a place in the national team for the tour de pologne the list includes 22 names of riders both from foreign and Polish teams out of the 22 selected by the selector only eight riders will get into the line-up who in September will compete with the best cycling groups in the world wadecki's circle of interest also includes five other cyclists, but they will probably compete in the Polish tour in the ranks of their teams wide cadre for the tour de pologne dariusz baranowski łukasz bodnar bartosz huzarski błażej janiaczyk tomasz kiendyś mateusz komar tomasz lisowicz piotr mazur jacek morajko przemysław niemiec marek rutkiewicz krzysztof szczawiński mateusz taciak adam wadecki mariusz witecki piotr zaradny piotr zieliński mateusz mróz marek wesoły jarosław rębiewski robert radosz jarosław dąbrowski
Output: ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-.', 'O', 'O', 'B-,', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-.', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-,', 'O', 'O', 'O', 'B-.', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-,', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-.', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-,', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-.', 'O', 'O', 'O', 'O', 'O', 'B-:', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
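In the example above, each tag marks the punctuation symbol that appears immediately after its token ('B-.' on 'r' yields "2008 r. szeroki"). A minimal sketch of how the tag sequence can be mapped back to punctuated text; the tag-to-symbol mapping is an assumption based on the class distribution table below:

```python
# Hypothetical mapping from B-* tags to punctuation marks,
# inferred from the class distribution listed in this card.
PUNCT_TAGS = {"B-.": ".", "B-,": ",", "B--": "-", "B-:": ":",
              "B-?": "?", "B-!": "!", "B-;": ";", "B-...": "..."}

def restore(tokens, tags):
    """Attach the punctuation mark encoded by each B-* tag
    directly after its token; 'O' means no mark follows."""
    out = []
    for tok, tag in zip(tokens, tags):
        out.append(tok + PUNCT_TAGS.get(tag, ""))
    return " ".join(out)

# restore(['2008', 'r', 'szeroki'], ['O', 'B-.', 'O'])
# → '2008 r. szeroki'
```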
WikiPunct is a crowdsourced text and audio data set of Polish Wikipedia pages read out loud by Polish lectors. The dataset is divided into two parts: conversational (WikiTalks) and informational (WikiNews). Over a hundred people were involved in the production of the audio component. The total length of the audio data reaches almost thirty-six hours, including the test set. Steps were taken to balance the male-to-female ratio.
WikiPunct has over thirty-two thousand texts and 1200 audio files: one thousand in the training set and two hundred in the test set. For each text there is a transcript of automatically recognized speech and a force-aligned text. The details behind the data format and evaluation metrics are presented below in the respective sections.
Statistics:
Data splits
Subset | Cardinality (texts) |
---|---|
train | 800 |
dev | 0 |
test | 200 |
Class distribution (without "O")
Class | train | validation | test |
---|---|---|---|
B-. | 0.419 | - | 0.416 |
B-, | 0.406 | - | 0.403 |
B-- | 0.097 | - | 0.099 |
B-: | 0.037 | - | 0.052 |
B-? | 0.032 | - | 0.024 |
B-! | 0.005 | - | 0.004 |
B-; | 0.004 | - | 0.002 |
Punctuation for raw text:
punctuation | symbol | mean | median | max | sum | included |
---|---|---|---|---|---|---|
fullstop | . | 12.44 | 7.0 | 1129.0 | 404 378 | yes |
comma | , | 10.97 | 5.0 | 1283.0 | 356 678 | yes |
question_mark | ? | 0.83 | 0.0 | 130.0 | 26 879 | yes |
exclamation_mark | ! | 0.22 | 0.0 | 55.0 | 7 164 | yes |
hyphen | - | 2.64 | 1.0 | 363.0 | 81 190 | yes |
colon | : | 1.49 | 0.0 | 202.0 | 44 995 | yes |
ellipsis | ... | 0.27 | 0.0 | 60.0 | 8 882 | yes |
semicolon | ; | 0.13 | 0.0 | 51.0 | 4 270 | no |
quote | " | 3.64 | 0.0 | 346.0 | 116 874 | no |
words | | 169.50 | 89.0 | 17252.0 | 5 452 032 | - |
The dataset is divided into two parts: conversational (WikiTalks) and informational (WikiNews).
Part 1. WikiTalks
Data scraped from Polish Wikipedia Talk pages. Talk pages, also known as discussion pages, are administration pages with editorial details and discussions for Wikipedia articles. Talk pages were scraped from the web using a list of article titles shared alongside Wikipedia dump archives.
Wikipedia Talk pages serve as conversational data. Here, users communicate with each other by writing comments. Vocabulary and punctuation errors are expected. This data set covers 20% of the spoken data.
Example:
Part 2. WikiNews
Wikinews is a free-content news wiki and a project of the Wikimedia Foundation. The site works through collaborative journalism. The data was scraped directly from the Wikinews dump archive. The overall text quality is high, but vocabulary and punctuation errors may occur. This data set covers 80% of the spoken data.
Example:
Input is a TSV file with two columns:
The output should have the same number of lines as the input file, in each line the text with punctuation marks should be given.
We use force-aligned transcriptions of the original texts to approximate ASR output. Files in the .clntmstmp format contain the forced alignment of the original text with the audio file read out by a group of volunteers. The files may contain errors resulting from incorrect reading of the text (skipping fragments, adding words missing from the original text) and alignment errors resulting from the configuration of the alignment tool for text and audio files. The configuration targeted Polish; names from foreign languages may be poorly recognized, with a word duration equal to zero (start and end timestamps are equal). Data is given in the following format:
(timestamp_start,timestamp_end) word
...
</s>
where </s> is a symbol of the end of recognition.
Example:
(990,1200) Rosja
(1230,1500) zaczyna
(1590,1950) powracać
(1980,2040) do
(2070,2400) praktyk
(2430,2490) z
(2520,2760) czasów
(2820,3090) zimnej
(3180,3180) wojny.
(3960,4290) Rosjanie
(4380,4770) wznowili
(4860,5070) bowiem
(5100,5160) na
(5220,5430) stałe
(5520,5670) loty
(5760,6030) swoich
(6120,6600) bombowców
(6630,7230) strategicznych
(7350,7530) poza
(7590,7890) granice
(8010,8010) kraju.
(8880,9300) Prezydent
(9360,9810) Władimir
(9930,10200) Putin
(10650,10650) wyjaśnił,
(10830,10920) iż
(10980,11130) jest
(11160,11190) to
(11220,11520) odpowiedź
(11550,11640) na
(11670,12120) zagrożenie
(12240,12300) ze
(12330,12570) strony
(12660,12870) innych
(13140,13140) państw.
</s>
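A minimal parser for this format might look as follows. It is a sketch under the assumptions stated in this card: timestamps appear to be in milliseconds, and zero-duration entries (start equal to end) signal alignment failures such as foreign names.

```python
import re

# One line of a .clntmstmp file: "(start,end) word"
LINE_RE = re.compile(r"\((\d+),(\d+)\)\s+(\S+)")

def parse_clntmstmp(text):
    """Parse '(start,end) word' lines; the trailing </s> marks
    the end of recognition and is skipped."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "</s>":
            continue
        m = LINE_RE.match(line)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            words.append({
                "start": start,
                "end": end,
                "word": m.group(3),
                # start == end usually means the aligner failed on this word
                "aligned": end > start,
            })
    return words
```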
Baseline results will be provided in the final evaluation.
During the task the following punctuation marks will be evaluated:
Punctuation mark | symbol |
---|---|
fullstop | . |
comma | , |
question mark | ? |
exclamation mark | ! |
hyphen | - |
colon | : |
ellipsis | ... |
blank (no punctuation) | |
Note that the semicolon ( ; ) is disregarded here.
The output to be evaluated is just the text with punctuation marks added.
Final results are evaluated in terms of precision, recall, and F1 scores for predicting each punctuation mark separately. Submissions are compared with respect to the weighted average of F1 scores for each punctuation mark.
Scores are computed at three levels: a per-document score for each punctuation mark p, a global score per punctuation mark p aggregated over all documents, and the final scoring metric, calculated as the weighted average of the global scores per punctuation mark.
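The scoring described above can be sketched in a few lines. This is an illustrative implementation only; it assumes the weighted average uses gold-tag support as weights, which the official scorer may compute differently:

```python
from collections import Counter

def per_mark_f1(gold, pred, marks):
    """Precision/recall/F1 per punctuation mark over aligned tag
    sequences, plus a support-weighted average F1 (assumed weighting)."""
    scores = {}
    support = Counter(t for t in gold if t in marks)
    for m in marks:
        tp = sum(g == m and p == m for g, p in zip(gold, pred))
        fp = sum(p == m and g != m for g, p in zip(gold, pred))
        fn = sum(g == m and p != m for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[m] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    total = sum(support.values())
    weighted = sum(scores[m] * support[m] for m in marks) / total if total else 0.0
    return scores, weighted
```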
We would like to invite participants to a discussion about evaluation metrics, taking into account factors such as:
Data has been published in the following repository: https://github.com/poleval/2021-punctuation-restoration
Training data is provided in train/*.tsv. Additional data can be downloaded from Google Drive. Below is a list of file names along with a description of what they contain.
The competition took place in September 2021. The challenge is now in the after-competition stage. You can still submit solutions, but they will be marked with a different color.
Creative Commons - Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)