Dataset:
KTH/waxholm
The Waxholm corpus was collected in 1993-1994 at the Department of Speech, Hearing and Music (TMH), KTH. It is described in several publications, two of which are included in this archive. Published work that uses the Waxholm corpus should cite either of these. More information on the Waxholm project can be found on the web page http://www.speech.kth.se/waxholm/waxholm2.html
The .smp files contain the speech signal. The identity of the speaker is coded by the two digits after 'fp20' in the file name. The smp file format was developed by TMH. Recording information is stored in a header as a 1024 byte text string. The speech signal in the Waxholm corpus is quantised into 16 bits, 2 bytes/sample and the byte order is big-endian (most significant byte first). The sampling frequency is 16 kHz. Here is an example of a file header:
    >head -9 fp2001.1.01.smp
    file=samp ; file type is sampled signal
    msb=first ; byte order
    sftot=16000 ; sampling frequency in Hz
    nchans=1 ; number of channels
    preemph=no ; no signal preemphasis during recording
    view=-10,10
    born=/o/libhex/ad_da.h25
    range=-12303,11168 ; amplitude range
    =
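The format described above can be read with a few lines of standard-library Python. This is a minimal sketch, not a reference reader: the function names `parse_smp_header`, `read_smp` and `speaker_id` are made up for illustration, the header is assumed to consist of `key=value ; comment` lines as in the example, and the byte order is taken to be big-endian as stated above rather than read from the `msb` field.

```python
import re
import struct
from pathlib import Path

HEADER_BYTES = 1024  # header length given in the description above


def parse_smp_header(raw: bytes) -> dict:
    """Collect 'key=value' pairs from the text header, ignoring '; comment' parts."""
    text = raw[:HEADER_BYTES].decode("latin-1")
    fields = {}
    for line in text.splitlines():
        entry = line.split(";", 1)[0]          # drop the trailing comment, if any
        key, sep, value = entry.partition("=")
        if sep and key.strip():                # skips the bare '=' line and padding
            fields[key.strip()] = value.strip()
    return fields


def read_smp(path):
    """Return (header fields, list of signed 16-bit samples) for one .smp file."""
    data = Path(path).read_bytes()
    header = parse_smp_header(data)
    body = data[HEADER_BYTES:]
    count = len(body) // 2                     # 2 bytes per sample
    # '>' selects big-endian byte order, 'h' a signed 16-bit integer.
    samples = list(struct.unpack(f">{count}h", body[: 2 * count]))
    return header, samples


def speaker_id(filename: str):
    """Return the two digits that follow 'fp20' in the file name, or None."""
    match = re.search(r"fp20(\d{2})", filename)
    return match.group(1) if match else None
```

With this sketch, `read_smp("fp2001.1.01.smp")` would return the header fields as strings (for example `header["sftot"]`) together with the samples, and `speaker_id("fp2001.1.01.smp")` would return `"01"`.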
Normally, each sample file has a label file, which was produced in four steps. First, the orthographic text was entered manually by listening. From this text, a sequence of phonemes was produced by a rule-based text-to-phoneme module. The endpoint time positions of the phonemes were then computed by an automatic alignment program, followed by manual correction. Some of the speech files have no label file because of various problems in this process. These files should not be used for training or testing.
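One way to honour the rule about unlabelled files is to keep only speech files that have a matching label file. The sketch below assumes that the label file name is the speech file name with `.mix` appended (as in the example name `fp2001.1.01.smp.mix`) and that both files sit in the same directory; the helper name `labelled_pairs` and the recursive directory search are illustrative choices, not part of the corpus documentation.

```python
from pathlib import Path


def labelled_pairs(root):
    """Yield (smp_path, mix_path) for every speech file that has a label file."""
    for smp in sorted(Path(root).rglob("*.smp")):
        mix = smp.with_name(smp.name + ".mix")   # e.g. fp2001.1.01.smp.mix
        if mix.is_file():
            yield smp, mix
```

Speech files without a label file are skipped silently, so the resulting list can be used directly to build training and test sets.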
The labels are stored in .mix files. Below is an example of the beginning of a mix file.
    >head -20 fp2001.1.01.smp.mix
    CORRECTED: OK jesper Jesper Hogberg Thu Jun 22 13:26:26 EET 1995
    AUTOLABEL: tony A. de Serpa-Leitao Mon Nov 15 13:44:30 MET 1993
    Waxholm dialog. /u/wax/data/scenes/fp2001/fp2001.1.01.smp
    TEXT:
    jag vill }ka h{rifr}n .
    J'A:+ V'IL+ "]:K'A H'[3RIFR]N.
    CT 1
    Labels: J'A: V'IL "]:KkA H'[3RIFR]N .
    FR 11219 #J >pm #J >w jag 0.701 sec
    FR 12565 $'A: >pm $'A:+ 0.785 sec
    FR 13189 #V >pm #V >w vill 0.824 sec
    FR 13895 $'I >pm $'I 0.868 sec
    FR 14700 $L >pm $L+ 0.919 sec
The orthographic text follows the label 'TEXT:'. CT is the frame length in sample points; it is always 1 in Waxholm mix files. Each line starting with 'FR' contains up to three labels, at the phonetic, phonemic and word levels. FR is immediately followed by the frame number at which the segment starts. Since CT = 1, this frame number is the sample index in the file. If a segment has a duration of 0 frames, its label was generated by the text-to-phoneme or automatic alignment module, but the manual labeller judged the segment as not pronounced and deleted it.
Column 3 of an FR line is the phonetic label. An initial '#' indicates word-initial position; '$' indicates any other position. The optional label '>pm' precedes the phonemic label, which was generated by the text-to-phoneme rules. The phonemic and phonetic labels are often identical. The optional '>w' is followed by the word that begins at this frame.
The phoneme symbol inventory is mainly STA, used by the KTH/TMH RULSYS system. It is specified in the included file 'sampa_latex_se.pdf'.
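The FR line layout described above can be parsed token by token. The following is a rough sketch under the assumption that labels contain no whitespace and that each FR line ends with a time in seconds, as in the example; the `Segment` class and the function names are hypothetical and not part of any KTH tool.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Segment:
    frame: int                 # start of the segment (sample index when CT = 1)
    phonetic: str              # phonetic label without the '#' or '$' marker
    word_initial: bool         # True when the phonetic label starts with '#'
    phonemic: Optional[str]    # raw label following '>pm', if present
    word: Optional[str]        # word following '>w', if present
    seconds: Optional[float]   # time printed at the end of the line, if present


def parse_fr_line(line: str) -> Segment:
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "FR":
        raise ValueError(f"not an FR line: {line!r}")
    frame = int(tokens[1])
    rest = tokens[2:]
    seconds = None
    if len(rest) >= 2 and rest[-1] == "sec":
        seconds = float(rest[-2])
        rest = rest[:-2]
    # Column 3 is the phonetic label unless the line jumps straight to a marker.
    raw = rest[0] if rest and not rest[0].startswith(">") else ""
    phonemic = word = None
    for marker, value in zip(rest, rest[1:]):
        if marker == ">pm":
            phonemic = value
        elif marker == ">w":
            word = value
    label = raw[1:] if raw[:1] in ("#", "$") else raw
    return Segment(frame, label, raw.startswith("#"), phonemic, word, seconds)


def parse_mix_lines(lines: Iterable[str]) -> List[Segment]:
    """Parse every line whose first token is 'FR' followed by a frame number."""
    segments = []
    for line in lines:
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == "FR" and tokens[1].isdigit():
            segments.append(parse_fr_line(line))
    return segments


def durations(segments: List[Segment]) -> List[int]:
    """Frames from each segment start to the next one (last segment excluded)."""
    return [b.frame - a.frame for a, b in zip(segments, segments[1:])]
```

A duration of 0 in the list returned by `durations` marks a segment that the manual labeller deleted. With CT = 1 and a 16 kHz sampling frequency, `frame / 16000` gives the start time in seconds, which agrees with the time printed at the end of each FR line in the example.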
Some extra labels at the phonetic level have been defined. The most common ones are:
| Label | Meaning |
|---|---|
| sm | lip or tongue opening |
| p: | silent interval |
| pa | aspirative sound from breathing |
| kl | click sound |
| v | short vocalic segment between consonants |
| upper case of stops | occlusion |
| lower case of stops | burst |
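The extra labels can be looked up with a small helper. In the sketch below, the dictionary mirrors the list above, while the stop inventory `STOPS` is an assumption made for illustration (plain stops plus the retroflex 2T and 2D); the actual inventory should be checked against 'sampa_latex_se.pdf'. The helper expects a phonetic label with the '#' or '$' position marker already removed.

```python
EXTRA_LABELS = {
    "sm": "lip or tongue opening",
    "p:": "silent interval",
    "pa": "aspirative sound from breathing",
    "kl": "click sound",
    "v": "short vocalic segment between consonants",
}

# Assumed stop symbols (upper-case form); not taken from the corpus documentation.
STOPS = {"P", "T", "K", "B", "D", "G", "2T", "2D"}


def describe_label(label: str):
    """Return a description for an extra label or a stop phase, else None."""
    if label in EXTRA_LABELS:
        return EXTRA_LABELS[label]
    if label in STOPS:
        return "occlusion"                 # upper case of a stop
    if label != label.upper() and label.upper() in STOPS:
        return "burst"                     # lower case of a stop
    return None
```

Under these assumptions, `describe_label("K")` gives "occlusion" and `describe_label("k")` gives "burst", matching the K and k pair in the 'Labels:' line of the example mix file.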
The 'Labels:' line that precedes the FR lines is a text string assembled from the FR labels.
The mix files in this archive correspond to those with the name extension .mix.new in the original corpus. Besides a few other corrections, the main difference is that burst segments after retroflex stops were not labelled as retroflex in the original .mix files (d and t after 2D and 2T have been changed to 2d and 2t).
Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., Nord, L., de Serpa-Leitao, A., & Ström, N. (1995). Spoken dialogue data collected in the WAXHOLM project. STL-QPSR 1/1995, KTH/TMH, Stockholm.
Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., de Serpa-Leitao, A., Nord, L., & Ström, N. (1995). The Waxholm application data-base. In Pardo, J.M. (Ed.), Proceedings Eurospeech 1995 (pp. 833-836). Madrid.
Comments and error reports are welcome. Please send them to Mats Blomberg (matsb@speech.kth.se) or Kjell Elenius (kjell@speech.kth.se).