Main issues - SEGmentation of oral CORpora

In this project, we placed particular emphasis on reusing existing models to propose new guidelines that are documented and illustrated. We devoted part of the budget to having researchers and students with different levels of expertise annotate the data, in order to test the robustness of the guidelines and resolve points of divergence.

The main challenge was to find a compromise between the theoretical choices governing syntax (microsyntax and macrosyntax) and the reproducibility of the models on interactive data. The contribution of this work thus lies in adaptations of pre-existing models that optimize these resources for interactive spoken data.

Creating a model for the interactional units required numerous work sessions and extensive manual annotation by a team of expert annotators in order to decide on the relevant units and their granularity. This task built on previous work in Conversation Analysis on Turn Constructional Units (Sacks, Schegloff & Jefferson 1974; Selting 2000; Ford, Fox & Thompson 2013).

Data processing was automated wherever possible. We proposed automatic segmentation into chunks and units based on the intonational period. In order to reuse the existing data, we tested several automatic tools to pre-segment the data without losing the text-sound alignment (PoS annotators: Treetagger, SEM; token parsers: Easyalign, Jtrans, DisMo). As far as prosody is concerned, prominences were annotated both automatically with ANALOR and manually by three annotators following the Rhapsody coding protocol. The comparison shows a significant discrepancy between the tool's results and the manual annotations, with the tool detecting more prominences. The manual annotation was therefore retained and used, in particular for the exploratory development of a segmentation of speech into units based on prosodic data.
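As an illustration of the kind of comparison described above, the following minimal sketch contrasts automatic and manual prominence labels on a per-syllable basis. It is not the procedure used in the project, whose comparison metric is not specified here: the majority vote over three annotators, the use of Cohen's kappa, and all data values are assumptions made for the example, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Merge several annotators' binary labels (1 = prominent) syllable by syllable."""
    return [1 if sum(labels) * 2 > len(labels) else 0 for labels in zip(*annotations)]


def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)


# Invented toy data: one row per manual annotator, one column per syllable.
manual = [
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
]
# Invented output of an automatic detector on the same eight syllables.
automatic = [1, 1, 0, 1, 0, 1, 1, 0]

reference = majority_vote(manual)
print("manual prominences:", sum(reference))
print("automatic prominences:", sum(automatic))
print("kappa:", round(cohen_kappa(reference, automatic), 3))
```

In this toy case the automatic labels contain more prominences than the manual reference, which mirrors the direction of the discrepancy reported above; the figures themselves carry no empirical meaning.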

From a contrastive point of view, the fruitful collaboration between the French and German teams made it possible to work on both languages: on the one hand by drawing up a list of comparable critical cases in macrosyntax in both languages, and on the other hand by proposing an innovative exploratory model for segmentation into interactional units. Interestingly, whereas two different models had to be adopted in macrosyntax for French and German, we noted a convergence in interactional segmentation in both languages.

The SegCor project makes the following deliverables available to the scientific community:

a pilot corpus of 10 different types of interactions for each language, enriched with multi-level annotation that can be reused for future linguistic analyses, for teaching French and German as foreign languages, or as a reference corpus for Automatic Language Processing;
several guidelines adapted to speech-in-interaction, documented and enriched with numerous examples attested in different contexts;
a stand-alone automatic segmenter for French (PoS and chunks) and its tutorial, which can be used on any other corpus of language data transcriptions;
an automatic pause-based segmenter developed on German data;
an innovative exploratory model of segmentation into interactional units and a joint reference article on French and German;
an identification of common critical cases of oral interaction in French and German, with proposals for segmentation solutions in macrosyntax;
numerous papers at national and international conferences on linguistics and several articles in different journals;
a special issue on the segmentation of oral corpora, currently under consideration, which will give an account of this multidisciplinary approach.
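
The pause-based segmenter listed among the deliverables is not described in detail here. As a rough sketch of the general idea only, and not of the SegCor tool itself, the following example splits a time-aligned transcription wherever the silence between two tokens reaches a threshold. The token format, the 0.3-second threshold, the function name, and the sample data are all assumptions made for illustration.

```python
def segment_on_pauses(tokens, min_pause=0.3):
    """Split time-aligned tokens at silences of at least min_pause seconds.

    tokens: list of (text, start, end) tuples sorted by start time.
    Returns a list of segments, each a list of the original tuples, so the
    text-sound alignment is preserved.
    """
    segments = []
    current = []
    previous_end = None
    for token in tokens:
        _, start, end = token
        # Close the running segment when the gap since the last token is long enough.
        if current and start - previous_end >= min_pause:
            segments.append(current)
            current = []
        current.append(token)
        previous_end = end
    if current:
        segments.append(current)
    return segments


# Invented sample: (word, start time in seconds, end time in seconds).
tokens = [
    ("ja", 0.00, 0.20),
    ("das", 0.25, 0.40),
    ("stimmt", 0.40, 0.80),
    ("aber", 1.50, 1.75),
    ("nicht", 1.80, 2.10),
    ("immer", 2.60, 3.00),
]

segments = segment_on_pauses(tokens)
texts = [" ".join(word for word, _, _ in segment) for segment in segments]
print(texts)
```

Because each segment keeps the original tuples, the timestamps stay attached to the words, which is the property the project required when pre-segmenting existing data.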