Current situation concerning segmentation of units
Segmentation of units in existing corpora is handled quite differently, often inconsistently even within a single corpus, and rarely based on solid theoretical underpinnings. The current segmentation of turns at talk in the corpora on which this project is based, CLAPI, ESLO and FOLK (as well as in other corpora such as the spoken data in ORFEO or conversational data in Talkbank), varies depending on several factors:
- there are different preferences concerning the length of segments, the relevance of pauses and pause length for segmentation and the practices of (non-)subsegmenting turns at talk;
- segmentation depends on the types of data which prevail in the individual corpus, i.e. highly interactive encounters (FOLK, CLAPI), which often involve several participants and the handling of objects, vs. dyadic biographic interviews with long responding turns (ESLO);
- variation also arises from differences in individual transcription styles, which are not sufficiently constrained in the absence of explicit, reliable guidelines for segmentation in transcripts.
The lack of a principle-based segmentation solidly rooted in knowledge about the production of turns at talk leads to several problems for the usability and exploitation of existing corpora:
- systematic searches concerning the internal organization of turns and the positions of linguistic items and structures with respect to turn-structure are not possible, or, if implemented as in CLAPI, not sufficiently reliable;
- prosodic information is annotated to different degrees;
- inter-corpus (and therefore also inter-language) comparison with respect to these issues is not possible;
- readability and analysability of transcripts are impaired, because transcripts do not sufficiently represent how participants in talk-in-interaction produce units and caesuras of talk, which are relevant for the interpretation and organization of discourse;
- automatic processing of the data with existing NLP tools (taggers, parsers etc.) is made more difficult in the absence of homogeneous principles of segmentation.
SegCor contributions
This problematic situation is the motivation for the SegCor project. Based on a large variety of different kinds of data from talk-in-interaction, we will analyze the theoretical options that can serve as a basis for establishing robust segmentation principles. This implies testing the above-mentioned models for segmentation in order to propose an implementation which is solidly grounded in theory and practically efficient for all kinds of spoken interaction data in French and German.
The project will approach the question of segmentation from two different angles:
1) A qualitative, multidimensional approach: based on existing literature and in collaboration with a pool of experts, an inventory of segmentation indices, problems and criteria on different linguistic levels will be compiled. These criteria will be operationalized as segmentation guidelines, which will then be tested and improved by applying them to contiguous excerpts from a test corpus.
2) A quantitative, unidimensional approach: selected single criteria, together with their corresponding segmentation rules in the guidelines, will serve as the basis for annotation experiments in which candidate boundaries (e.g. pauses or lexical cues) will be automatically identified in a test corpus and the resulting isolated excerpts classified by human annotators according to their relevance for segmentation.
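The candidate-identification step of the quantitative approach can be illustrated with a minimal sketch. The pause threshold, the cue list and the token representation below are purely hypothetical placeholders, not the project's actual segmentation rules:

```python
# Hypothetical parameters for illustration only: the real thresholds and
# cue inventories would come from the SegCor segmentation guidelines.
PAUSE_THRESHOLD = 0.3  # seconds of silence taken as a boundary candidate
LEXICAL_CUES = {"voilà", "donc", "also", "ja"}  # illustrative French/German cues

def candidate_boundaries(tokens):
    """Return token indices after which a segment boundary is hypothesized.

    `tokens` is a list of (form, pause_after) pairs, where pause_after is
    the silence in seconds following the token (0.0 if none).
    """
    boundaries = []
    for i, (form, pause_after) in enumerate(tokens):
        if pause_after >= PAUSE_THRESHOLD or form.lower() in LEXICAL_CUES:
            boundaries.append(i)
    return boundaries

# Toy turn: each candidate position would be excerpted and shown to a
# human annotator, who classifies it as segmentation-relevant or not.
turn = [("bon", 0.0), ("donc", 0.1), ("on", 0.0),
        ("commence", 0.5), ("voilà", 0.2)]
print(candidate_boundaries(turn))  # -> [1, 3, 4]
```

The automatic step only proposes candidates; the classification of each excerpt remains a human annotation task, which is what makes the experiment tractable on a larger corpus.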
Whereas the first approach ensures that the problem of segmentation is treated in its full complexity on different linguistic levels and that interactional structure is accounted for, the second approach evaluates the usability and applicability of the segmentation guidelines on a larger corpus and may eventually enable a (partial) automation of the segmentation process. The two approaches are complementary and converge towards a single solution.
Starting from an inventory of segmentation problems and solutions discussed in the literature, the segmentation guidelines will be developed in an iterative process with several operationalization cycles. Each cycle begins with an exhaustive corpus analysis of a “training” data set, a procedure which has never before been used in studying segments and boundaries in talk. On this basis, a set of rules for the consistent segmentation of these data is formulated.
The qualitative, multidimensional approach aims at identifying convergence (i.e. cases where segmentations on different levels coincide), simple divergence (i.e. cases where disagreement between segmentations on different levels is recurrent and systematic), and complex divergence (i.e. cases where segmentation criteria on different levels collide in a non-trivial manner). Criteria are drawn from pragmatics, prosody/phonetics, syntax, and lexis (e.g. discourse markers, interjections). The analysis aims at an emic understanding of the units of talk-in-interaction which respects speakers’ own orientations to the completion, expansion or cut-off of turns under way. This requires a dynamic on-line approach, which allows for different degrees of completion of segments and which takes phenomena of re-negotiation and re-completion of segments into account. To do this, the segmentation procedure has to account systematically for the contribution of different linguistic levels to the constitution of units in talk, also considering cases in which units are expanded by another speaker or by the same speaker after an intervention by another speaker. The applicants are well aware that the dynamic and open-ended character of the construction of turns poses problems for an automatic approach to segmentation. A major task consists in modelling the flexibility of boundaries and their revision over time.
This analysis makes it possible to identify sufficiently robust segmentation rules for the quantitative, unidimensional approach on which the annotation experiments in WP 3 will focus. The project tests which segmentation criteria lend themselves to an operationalized definition allowing for an automatic procedure. Such criteria are to be distinguished from more interpretive criteria, or from criteria resting on non-transcribed parameters of talk, which may only be attributed semi-automatically (i.e. based on non-deterministic candidate realizations whose situated relevance has to be checked) or which have to be assigned manually. Multiply annotated data sets and the corresponding inter-annotator agreement measures provide a method for judging the quality of the guidelines.
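One standard inter-annotator agreement measure that could serve this purpose is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic implementation on hypothetical toy labels, not the project's actual evaluation setup:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Toy example: 1 = boundary, 0 = no boundary, for ten candidate positions
a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(round(cohens_kappa(a, b), 2))  # -> 0.62
```

Conventionally, kappa values above roughly 0.6 are read as substantial agreement, so repeated rounds of such measurements can track whether successive guideline revisions actually improve annotator consistency.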