Task automation - SEGmentation of oral CORpora

Methodological approach to automatic tasks

The SegCor project plans to propose different segmentations corresponding to different domains of linguistics: prosody, syntax and interaction. We had to decide which segmentations could be processed automatically, so a first result of the SegCor project is our shared reflection on what makes sense when automating the segmentation of interaction corpora. Indeed, delivering an automatic but approximate segmentation into overly broad categories is not scientifically satisfactory, and it is not efficient to undertake a complex automation task that yields poor results (one error per 5 annotations, for example).

Based on the expertise of our research teams, we decided to restrict automatic annotation to syntax for German, and to the annotation of chunks and of periods in Fribourg macrosyntax for French.

For German, we provided:

An automatic tool for segmentation into syntactic segments, which takes FOLKER transcripts (FLN-XML format) as input and automatically calculates the boundaries of syntactic segments, merging and splitting the respective contributions as needed, and adjusting the alignment through an appropriate interpolation (cf. Deliverables).

Developers: Ines Rehbein & Josef Ruppenhofer
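The alignment adjustment mentioned above can be pictured with a small sketch. The source does not say which interpolation the tool uses, so the example below assumes plain linear interpolation, with time distributed over tokens in proportion to their character length. The function name `interpolate_boundary` and the sample contribution are hypothetical and only serve as an illustration.

```python
def interpolate_boundary(start, end, tokens, split_index):
    """Estimate the time of a boundary placed before tokens[split_index].

    Assumption (not taken from the tool itself): time is spread over the
    tokens of a contribution in proportion to their character length.
    """
    if not 0 <= split_index <= len(tokens):
        raise ValueError("split_index out of range")
    total = sum(len(t) for t in tokens)
    if total == 0:
        # No textual material to distribute time over
        return start
    before = sum(len(t) for t in tokens[:split_index])
    return start + (end - start) * before / total


# Hypothetical contribution aligned from 10.0 s to 14.0 s,
# split into two syntactic segments before the third token.
tokens = ["ja", "genau", "und", "dann", "kam", "er"]
boundary = interpolate_boundary(10.0, 14.0, tokens, 2)
print(round(boundary, 3))
```

When a contribution is split, the first segment would keep the original start time and end at the interpolated boundary, and the second would start there and keep the original end time.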

For French, we provided:

A stand-alone, turnkey chunk segmentation tool named CHOUCAS, with a tutorial available in French: this tool aims to automatically segment French-language corpora (with or without sound signal) into chunks and to tag them with parts of speech (POS).
Chunks are defined as continuous, non-recursive constituents [Abney, 1991] that identify the superficial syntactic structure of a statement [Eshkol-Taravella et al. 2020] [Rossi-Gensane et al. 2020]. Automatic chunking is based on morphosyntactic labeling.
Developer: Flora Badin. Contributors: François Delafontaine, Iris Eshkol-Taravella, Mariame Maarouf, Marie Skrovec
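To illustrate how chunking can build on morphosyntactic labels, here is a minimal sketch. It is not the CHOUCAS implementation: the rule table `RULES`, the use of Universal Dependencies POS tags and the function `chunk` are assumptions made for this example. Each chunk is opened by one tag and extended over the following tokens, so the chunks are continuous and non-recursive.

```python
# Hypothetical, simplified rules:
# chunk type -> (tags that may open it, tags that may continue it)
RULES = {
    "PP": ({"ADP"}, {"DET", "ADJ", "NOUN", "PROPN"}),
    "NP": ({"DET", "ADJ", "NOUN", "PROPN", "PRON"}, {"ADJ", "NOUN", "PROPN"}),
    "VP": ({"AUX", "VERB"}, {"AUX", "VERB"}),
}


def chunk(tagged):
    """Group (word, POS) pairs into flat, continuous chunks."""
    chunks = []
    i = 0
    while i < len(tagged):
        tag = tagged[i][1]
        # First rule whose opener set contains the tag; "O" means outside any chunk
        label = next((name for name, (openers, _) in RULES.items() if tag in openers), "O")
        j = i + 1
        if label != "O":
            continuers = RULES[label][1]
            while j < len(tagged) and tagged[j][1] in continuers:
                j += 1
        chunks.append((label, [word for word, _ in tagged[i:j]]))
        i = j
    return chunks


# Hypothetical POS-tagged utterance
tagged = [("le", "DET"), ("chat", "NOUN"), ("dort", "VERB"),
          ("sur", "ADP"), ("le", "DET"), ("canapé", "NOUN")]
result = chunk(tagged)
print(result)
```

Tokens whose tag opens no chunk, such as interjections or hesitation markers, are kept as single-token `"O"` items in this sketch, so no transcribed material is lost.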

An exploratory study to design a tool for automatic segmentation into macro-syntactic periods using CRF (conditional random field) models [Kalashnikova et al. 2020]. Different CRF models were trained using morpho-syntactic and prosodic features. The CRF models outperform Analor, a semi-automatic tool that detects prosodic periods, on interactional data involving several speakers in interactive settings.

Contributors: François Delafontaine, Iris Eshkol-Taravella, Loïc Grobol, Natalia Kalashnikova
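As a rough illustration of what combining morpho-syntactic and prosodic features can look like for a CRF, the sketch below builds one feature dictionary per token. The feature names, the token keys `pos` and `pause_after`, and the 0.3 s pause threshold are all assumptions made for this example. They are not the feature set of the study.

```python
LONG_PAUSE = 0.3  # seconds; arbitrary threshold chosen for this sketch


def token_features(sequence, i):
    """Feature dictionary for token i of a sequence of token dicts."""
    token = sequence[i]
    features = {
        "bias": 1.0,
        "pos": token["pos"],                              # morpho-syntactic
        "pause_after": token["pause_after"],              # prosodic, in seconds
        "long_pause": token["pause_after"] >= LONG_PAUSE,
    }
    # Context window of one token on each side
    if i > 0:
        features["prev_pos"] = sequence[i - 1]["pos"]
    else:
        features["BOS"] = True
    if i < len(sequence) - 1:
        features["next_pos"] = sequence[i + 1]["pos"]
    else:
        features["EOS"] = True
    return features


def sequence_features(sequence):
    return [token_features(sequence, i) for i in range(len(sequence))]


# Hypothetical tokens with POS tags and following pause durations
sequence = [
    {"word": "oui", "pos": "INTJ", "pause_after": 0.45},
    {"word": "je", "pos": "PRON", "pause_after": 0.0},
    {"word": "viens", "pos": "VERB", "pause_after": 0.1},
]
features = sequence_features(sequence)
print(features[0])
```

Lists of such dictionaries, paired with one label per token (for instance a label marking whether a token ends a period), are the kind of input that CRF toolkits such as sklearn-crfsuite are designed to take for training.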