Goal - SEGmentation of oral CORpora

A great variety of segmentation principles for oral language have been proposed since the beginning of research on talk-in-interaction. However, we still lack a segmentation system that is both theoretically well-founded and practically operationalizable for large and diverse corpora of spoken interaction, and this impairs the use of such corpora for linguistic analysis, for language teaching, for contrastive studies and for the development of language technology.

The project therefore aims to develop a method of segmentation that is adequate for the analysis of data from talk-in-interaction at different levels and for various communities of researchers. It evaluates and further develops approaches to segmentation put forward in the literature on conversation analysis, interactional linguistics, pragmatics and corpus linguistics by applying them to samples from three large collections of French and German audio and video recordings of various interaction types (the French databases CLAPI and ESLO and the German database FOLK). The project will result in a systematic segmentation guideline applicable across different interaction types and to French as well as German data.

The project is the first approach to segmentation that both rests on a comprehensive treatment of a sufficiently large and diverse empirical basis and takes the cross-linguistic dimension into account. The results will improve the usability of the three databases, contribute to best practices for working with oral corpora more generally, and enhance our understanding of structures of talk-in-interaction. The project will thus address current needs in conversation analysis, corpus-based language teaching, the contrastive analysis of spoken German and French, and the development of language technology for interaction data.

Methodologically, the project is based on two different perspectives: 1) a qualitative, multidimensional approach, which takes into account segmentation indices, problems and criteria and leads to tested and improved segmentation guidelines, and 2) a quantitative, unidimensional approach based on selected criteria, in which possible boundaries are identified automatically and then classified by human annotators according to their relevance for segmentation. Both approaches initially use a pilot test corpus of 10 excerpts of around 10 minutes each per language, chosen to represent the overall diversity of the data in terms of situation types. Over the course of the project, the corpus will be extended to 5 hours per language, taking into account findings from the initial phase. Contrastive aspects will receive particular attention from the beginning of the project.