Pilot corpus - SEGmentation of oral CORpora

The data come from databases built by the ICAR, IDS and LLL research teams.

Corpus selection

To study spoken language in different interactional contexts, we compiled excerpts from ten types of settings in each language, more or less spontaneous and with varying numbers of speakers. We found comparable interactional situations in the German FOLK corpus and in the French CLAPI and ESLO2 corpora:

Table talk
Preparing a meal together
Meeting in a social institution
Phone call
Service encounter
Panel discussion/debate
School lesson
Media interview
Sociolinguistic interview
Expert/academic talk
Reading to a child

Access to data

The pilot corpus is sampled from 10 minutes of recordings of each interaction type, totalling about 1 h 40 min of transcribed interaction for each of the two languages, German and French. This allows us to build robust guidelines that take different types of turns into account and to identify a set of critical cases.

Working collectively on data from different databases requires homogenising the data. This includes converting them to a common format (TEI in its ISO/DIN specification), consistently (re-)encoding pauses and other segmentation cues, separating out existing (and potentially inconsistent) segmentations, mapping corpus-specific part-of-speech tags to a common superset, producing a word-to-speech alignment, and segmenting syllables for the prosodic annotation.
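Two of these homogenisation steps can be illustrated with a minimal sketch. It is not the project's actual pipeline. The corpus names, the tag tables and the common tags are hypothetical placeholders, and the sketch assumes that timed pauses appear in the source transcripts as a number of seconds in parentheses, for example "(0.5)", and that they are re-encoded as a TEI-style pause element with a duration attribute.

```python
import re

# Hypothetical lookup tables: the corpus names and tag names below are
# illustrative placeholders, not the tagsets actually used in the project.
TAG_MAP = {
    "corpus_a": {"NN": "NOUN", "VVFIN": "VERB"},
    "corpus_b": {"NOM": "NOUN", "VER": "VERB"},
}

# Assumed source notation: a timed pause written as "(0.5)" (seconds).
PAUSE = re.compile(r"\((\d+(?:\.\d+)?)\)")


def map_tag(corpus: str, tag: str, unknown: str = "X") -> str:
    """Return the common tag for a corpus-specific tag, or `unknown`."""
    return TAG_MAP.get(corpus, {}).get(tag, unknown)


def encode_pauses(text: str) -> str:
    """Rewrite timed pauses such as "(0.5)" as TEI-style pause elements."""
    return PAUSE.sub(lambda m: f'<pause dur="PT{m.group(1)}S"/>', text)
```

Returning a fallback tag for unmapped input, rather than raising an error, keeps unknown tags visible in the output so that they can be reviewed by hand. That is a design choice of this sketch only.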

The pilot corpus was used for the iterative manual segmentation of the data during the development of the guidelines at all levels, for machine learning development and automation tests at the chunking level, and for highlighting common critical cases from a contrastive perspective.

In addition, an extended test corpus was defined for German in order to improve results on a larger data set for some annotation levels. The interactional level was excluded, because manual segmentation at that level is not feasible on a large data set. The extended test corpus comprises a larger number of transcript excerpts and increases the variation in interaction and speaker types. To validate results drawn from the annotations of the pilot corpus with respect to interaction type, the extension contains, where possible, different transcript excerpts of the same interaction types, and it also adds further interaction types. For German, the extended test corpus was used for iterative validation and final development of the guidelines. In its annotated version (the GOLD standard), it served as training data for the development of automatic segmentation methods.

Both the pilot corpus and the extended test corpus are distributed in their annotated versions via the download pages of the Database for Spoken German.

Corpus extracts:

From CLAPI:

From ESLO2:

From FOLK: