THE WAXHOLM CORPUS

The Waxholm corpus was collected in 1993-1994 at the Department of Speech, Hearing and Music (TMH), KTH. It is described in several publications, two of which are included in this archive and listed under REFERENCES. Publications of work using the Waxholm corpus should cite either of them. More information on the Waxholm project can be found on the web page http://www.speech.kth.se/waxholm/waxholm2.html

FILE INFORMATION

SAMPLED FILES

The .smp files contain the speech signal. The identity of the speaker is coded by the two digits after 'fp20' in the file name (for example, '01' in fp2001.1.01.smp). The smp file format was developed by TMH. Recording information is stored as a text string in a 1024-byte header. The speech signal in the Waxholm corpus is quantised to 16 bits (2 bytes per sample), the byte order is big-endian (most significant byte first), and the sampling frequency is 16 kHz. Here is an example of a file header:

>head -9 fp2001.1.01.smp
file=samp               ; file type is sampled signal
msb=first               ; byte order
sftot=16000             ; sampling frequency in Hz
nchans=1                ; number of channels
preemph=no              ; no signal preemphasis during recording
view=-10,10
born=/o/libhex/ad_da.h25
range=-12303,11168      ; amplitude range
=
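
As an illustration of the format described above, the following is a minimal Python sketch for reading an .smp file. It is not part of the original corpus tools. The function names parse_smp_header and read_smp are hypothetical, and the sketch assumes that header lines have the form 'key=value ; comment', that a bare '=' line ends the header text (as the example above suggests), and that the samples start immediately after the 1024-byte header.

```python
import struct

HEADER_BYTES = 1024  # header size stated in this README


def parse_smp_header(raw):
    """Parse 'key=value ; comment' header lines into a dict of strings."""
    fields = {}
    text = raw.decode("latin-1")  # assumption: any single-byte encoding works here
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop the trailing comment
        if line == "=":  # assumption: a bare '=' ends the header text
            break
        key, sep, value = line.partition("=")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def read_smp(path):
    """Return (header dict, tuple of 16-bit signed samples)."""
    with open(path, "rb") as f:
        header = parse_smp_header(f.read(HEADER_BYTES))
        data = f.read()
    n = len(data) // 2  # 2 bytes per sample
    # msb=first means big-endian; anything else is treated as little-endian here.
    order = ">" if header.get("msb", "first") == "first" else "<"
    samples = struct.unpack(f"{order}{n}h", data[: 2 * n])
    return header, samples
```

A call such as read_smp("fp2001.1.01.smp") would return the header fields as strings (for instance header["sftot"] for the sampling frequency) together with the samples as a tuple of integers.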

LABEL FILES

Normally, each sample file has a label file, which was produced in four steps. First, the orthographic text was entered manually by listening. Second, a rule-based text-to-phoneme module produced a sequence of phonemes from this text. Third, an automatic alignment program computed the endpoint time positions of the phonemes. Fourth, these positions were corrected manually. Some of the speech files have no label file because of problems in this process. These files should not be used for training or testing.
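
Since unlabelled files should be excluded, a small helper for selecting only the labelled recordings may be useful. The following Python sketch is an illustration only. The names speaker_id and labelled_pairs are hypothetical, the directory layout is not specified in this document (so the search is recursive), and the label file is assumed to be named by appending '.mix' to the .smp file name, as in fp2001.1.01.smp.mix.

```python
import re
from pathlib import Path

# The speaker is coded by the two digits after 'fp20' in the file name.
SPEAKER_RE = re.compile(r"fp20(\d{2})")


def speaker_id(filename):
    """Return the two-digit speaker code, or None if the name does not match."""
    m = SPEAKER_RE.match(Path(filename).name)
    return m.group(1) if m else None


def labelled_pairs(root):
    """Return (smp_path, mix_path) pairs for .smp files that have a label file."""
    pairs = []
    for smp in sorted(Path(root).rglob("*.smp")):
        mix = smp.with_name(smp.name + ".mix")  # e.g. fp2001.1.01.smp.mix
        if mix.is_file():
            pairs.append((smp, mix))
    return pairs
```

Files returned by labelled_pairs can then be grouped by speaker_id(smp.name), for example to keep speakers disjoint between training and test sets.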

The labels are stored in .mix files. Below is an example of the beginning of a mix file.

>head -20 fp2001.1.01.smp.mix
CORRECTED: OK jesper    Jesper Hogberg Thu Jun 22 13:26:26 EET 1995
AUTOLABEL: tony       A. de Serpa-Leitao Mon Nov 15 13:44:30 MET 1993
Waxholm dialog. /u/wax/data/scenes/fp2001/fp2001.1.01.smp
TEXT:
jag vill }ka h{rifr}n .
J'A:+  V'IL+ "]:K'A  H'[3RIFR]N.


CT 1
Labels:  J'A: V'IL "]:KkA H'[3RIFR]N .
FR      11219    #J     >pm #J  >w jag   0.701 sec
FR      12565    $'A:   >pm $'A:+        0.785 sec
FR      13189    #V     >pm #V  >w vill  0.824 sec
FR      13895    $'I    >pm $'I  0.868 sec
FR      14700    $L     >pm $L+  0.919 sec

The orthographic text follows the label 'TEXT:'. CT is the frame length in number of sample points (always 1 in Waxholm mix files).

Each line starting with 'FR' contains up to three labels, at the phonetic, phonemic and word levels. FR is immediately followed by the frame number of the start of the segment. Since CT = 1, this frame number is the sample index in the file. If the duration of a segment is 0, the label was judged to be a non-pronounced segment and deleted by the manual labeller, although it was generated by the text-to-phoneme or the automatic alignment module.

Column 3 in an FR line is the phonetic label. An initial '#' indicates word-initial position; '$' indicates other positions. The optional label '>pm' precedes the phonemic label, which was generated by the text-to-phoneme rules. Often, the phonemic and the phonetic labels are identical. The optional '>w' is followed by the identity of the word beginning at this frame.

The phoneme symbol inventory is mainly STA, used by the KTH/TMH RULSYS system. It is specified in the included file 'sampa_latex_se.pdf'.
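
The FR line layout can be illustrated with a minimal Python parser. This is a sketch under stated assumptions, not the tool used to create the corpus. The names parse_fr_line and parse_mix are hypothetical, fields are assumed to be separated by whitespace, the trailing '<time> sec' pair is ignored and the start time is recomputed from the frame number, and the duration of a segment is assumed to be the distance to the start frame of the next FR line.

```python
SAMPLE_RATE = 16000  # Hz, the sampling frequency stated for the .smp files


def parse_fr_line(line):
    """Parse one FR line into a dict, or return None for other lines."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "FR" or not tokens[1].isdigit():
        return None
    rest = tokens[2:]
    if len(rest) >= 2 and rest[-1] == "sec":
        rest = rest[:-2]  # drop the trailing "<time> sec"
    seg = {"frame": int(tokens[1]), "phonetic": None,
           "phonemic": None, "word": None}
    i = 0
    while i < len(rest):
        if rest[i] in (">pm", ">w") and i + 1 < len(rest):
            # '>pm' precedes the phonemic label, '>w' precedes the word
            seg["phonemic" if rest[i] == ">pm" else "word"] = rest[i + 1]
            i += 2
        else:
            if seg["phonetic"] is None:
                seg["phonetic"] = rest[i]  # column 3, when present
            i += 1
    return seg


def parse_mix(lines, sample_rate=SAMPLE_RATE):
    """Collect FR segments and add start time and duration (in samples)."""
    ct = 1  # frame length in samples; always 1 in Waxholm mix files
    segments = []
    for line in lines:
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] == "CT" and tokens[1].isdigit():
            ct = int(tokens[1])
            continue
        seg = parse_fr_line(line)
        if seg is not None:
            segments.append(seg)
    for seg, nxt in zip(segments, segments[1:]):
        seg["duration"] = (nxt["frame"] - seg["frame"]) * ct
    if segments:
        segments[-1]["duration"] = None  # no following FR line to measure against
    for seg in segments:
        seg["start_sec"] = seg["frame"] * ct / sample_rate
    return segments
```

With the example above, the first FR line has frame 11219, which corresponds to 11219 / 16000 = 0.701 s, matching the time printed at the end of that line. Segments whose duration is 0 can be dropped to skip labels deleted by the manual labeller. When reading from disk, opening the file with a single-byte encoding such as latin-1 is an assumption that avoids decoding errors.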

Some extra labels at the phonetic level have been defined. The most common ones are:

sm	lip or tongue opening
p:	silent interval
pa	aspirative sound from breathing
kl	click sound
v	short vocalic segment between consonants
upper case of stops	occlusion
lower case of stops	burst

The label 'Labels:' before the FR lines is followed by a text string assembled from the FR labels.

The mix files in this archive correspond to those with the name extension .mix.new in the original corpus. Besides a few other corrections, the main difference is that burst segments after retroflex stops were not labelled as retroflex in the original .mix files (d and t after 2D and 2T have been changed to 2d and 2t).

REFERENCES

Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., Nord, L., de Serpa-Leitao, A., & Ström, N. (1995). Spoken dialogue data collected in the WAXHOLM project. STL-QPSR 1/1995, KTH/TMH, Stockholm.

Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., de Serpa-Leitao, A., Nord, L., & Ström, N. (1995). The Waxholm application data-base. In Pardo, J.M. (Ed.), Proceedings Eurospeech 1995 (pp. 833-836). Madrid.

Comments and error reports are welcome. These should be sent to: Mats Blomberg matsb@speech.kth.se or Kjell Elenius kjell@speech.kth.se

Author: KTH

Dataset size: 190.22 MB