Guideline design

For each annotation level, 5-6 lines including the principles (state of the art, existing projects/tools), our method, information on whether segmentation is automatic or manual, whether it is an adaptation or a new proposal, and the permanent and non-permanent annotators

For German and French

German and French Interaction segmentation

According to interactional linguistics, interaction and interactional units are emergent. In other words, they are not pre-defined but co-constructed on-line (Auer), each step depending on what happened before and projecting a next step. Segmentation is thus a task which is continuously achieved by the participants throughout the unfolding of the interaction. Nevertheless, interactional linguistics draws on interactional units, such as the Turn Constructional Unit (TCU), to describe different structures, principles and practices, although there is no shared conception of these units. Our work therefore consisted in exploring different possibilities for segmenting interaction and delimiting possible boundaries.


In a first step we identified possible segmentation units and discussed the nature and status of TCUs from a product-oriented and from a process-oriented perspective. This comparison led us to choose a product-oriented perspective for segmentation. In this perspective, one of the main criteria for identifying TCUs was the Transition Relevance Place (TRP). We then contrasted the segmentation according to TRPs with an action-based segmentation.

In a second step, expert annotators segmented different excerpts of the pilot corpus (German and French data) according to TRPs and actions. We then compared and discussed the results for each category and language and proposed pre-guidelines.

In a third step, we decided to test the best criteria for defining segmentation boundaries on our pilot corpus with two non-expert annotators (master students). This procedure aimed to identify potential problems with regard to either the pre-guidelines or the categories we used.

In German and French, actions seem to be more efficient than TRPs for identifying boundaries in our pilot corpus.


  • German expert annotators: Arnulf Deppermann and Henrike Helmer
  • French expert annotators: Heike Baldauf-Quilliatre, Véronique Traverso and Biagio Ursi
  • French student annotators: Lydia Heiden and Laurène Smykowski

Segmentation : manual, automatic process impossible

Guidelines : being finalised

For German

Pauses / Gaps


Potentially, every completed word in spoken language is a candidate for a segment boundary, but not all candidates are equally likely to actually be boundaries. Intuitively, words immediately followed by a gap – that is, an interval where the speaker's speech is interrupted for a short but noticeable amount of time – have an increased likelihood of constituting a segment boundary. Gaps can either be pauses (that is, « empty » silences), or they can be « filled » by another speaker's speech.

The study conducted in Segcor aimed at finding statistical evidence for a correlation between certain properties of such gaps and whether they lie between two syntactic segments or within a syntactic segment.

The following hypotheses were tested:

  • The length of a gap can indicate whether there is a syntactic boundary or not,
  • the type of a gap can indicate whether there is a syntactic boundary or not, and
  • the parts of speech surrounding the gap can indicate whether there is a syntactic boundary or not.


The sample used for the experiment was drawn from a total of 259 interactions in version 2.8 of the FOLK corpus, which cover a large variety of interaction types. In a first step, we randomly selected 200 of the 259 interactions and, from each, randomly extracted 5 pairs of contributions C1 and C2 with the following properties:

  • C1 and C2 have the same speaker.
  • The time interval (i.e. the « gap ») between the end of C1 and the start of C2 is at most 2.0 seconds long.
  • Neither C1 nor C2 contains speech marked as incomprehensible by the transcriber.

This resulted in a random sample of 1,000 such pairs of contributions.
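The sampling procedure above can be sketched as follows. The sketch assumes contributions are plain dictionaries with a speaker, start/end times in seconds and an incomprehensibility flag; the actual FOLK/FOLKER data model is richer than this.

```python
import random

# Sketch of the pair sampling described above. A contribution is modelled
# as a dict with "speaker", "start", "end" (seconds) and "incomprehensible";
# this is an illustrative simplification, not the FOLK data model.

def eligible_pairs(contributions, max_gap=2.0):
    """Pairs (C1, C2) of consecutive same-speaker contributions whose gap
    is at most max_gap seconds and which contain no incomprehensible speech."""
    by_speaker = {}
    for c in sorted(contributions, key=lambda c: c["start"]):
        by_speaker.setdefault(c["speaker"], []).append(c)
    pairs = []
    for contribs in by_speaker.values():
        for c1, c2 in zip(contribs, contribs[1:]):
            gap = c2["start"] - c1["end"]
            if 0 <= gap <= max_gap and not (c1["incomprehensible"] or c2["incomprehensible"]):
                pairs.append((c1, c2))
    return pairs

def sample_pairs(interactions, n_interactions=200, per_interaction=5, seed=1):
    """Randomly pick interactions, then up to `per_interaction` eligible
    pairs from each (200 interactions x 5 pairs = 1,000 pairs)."""
    rng = random.Random(seed)
    chosen = rng.sample(interactions, min(n_interactions, len(interactions)))
    sample = []
    for contributions in chosen:
        candidates = eligible_pairs(contributions)
        sample.extend(rng.sample(candidates, min(per_interaction, len(candidates))))
    return sample
```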

Gaps were classified according to two criteria. The first criterion is their duration, i.e. the length of the interval between the end of the first and the start of the second contribution. The second criterion is their type: we differentiate between gaps which are silences and gaps which are filled by another speaker's speech.
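The two criteria can be implemented directly on time-aligned contributions. The sketch below is an illustration, assuming the same dictionary representation of contributions as above, not the project's actual tooling.

```python
# Sketch of the two classification criteria: gap duration and gap type
# (silent pause vs. filled by another speaker's speech). Contributions are
# modelled as dicts with "speaker", "start" and "end" in seconds.

def classify_gap(c1, c2, all_contributions):
    """Duration and type of the gap between two same-speaker contributions."""
    gap_start, gap_end = c1["end"], c2["start"]
    filled = any(
        c["speaker"] != c1["speaker"]
        and c["start"] < gap_end
        and c["end"] > gap_start      # i.e. c overlaps the gap interval
        for c in all_contributions
    )
    return {"duration": gap_end - gap_start,
            "type": "filled" if filled else "silent"}
```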

In an online annotation environment, users were asked to classify a random selection of the 1,000 pairs according to whether they constitute a segment boundary or not. Evaluation of the experiment showed a clear correlation between « boundariness » and gap length/type.
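One standard way to test such a correlation for the categorical variables (gap type vs. boundary judgement) is a chi-square test of independence on a 2x2 contingency table. The sketch below computes the statistic by hand; the counts used in the example are invented for illustration and are not the study's results.

```python
# Chi-square test of independence on a 2x2 table, e.g. rows = gap type
# (silent / filled), columns = boundary judgement (yes / no).
# Computed by hand to show the arithmetic; in practice one would use
# a statistics library.

def chi_square_2x2(table):
    """table = [[a, b], [c, d]]; returns the chi-square statistic."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

With a balanced table the statistic is 0; the further the observed counts deviate from the expected ones, the larger it gets.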

Annotators / contributors :

  • Experiment setup: Thomas Schmidt and Swantje Westpfahl
  • Annotators: various participants in the online experiment



Syntactic segmentation

The first segmentation phase started with experiments comparing different approaches to segmentation. Using the pilot test corpus, annotators were instructed to apply existing annotation schemes based on prosodic criteria (GAT), on pragmatic criteria, on syntactic criteria and on the hybrid approach of « macro-syntaxe ». The experiments clearly showed that corpus-wide segmentation is best approached via syntactic criteria, for reasons of robustness, inter-rater reliability and efficiency.


Consequently, the first version of the guidelines (Westpfahl et al. 2018c) was developed as a detailed syntactic annotation scheme based on the theory of topological fields. Categories underlying the segmentation process – fields, clauses and maximal syntactic units – were made explicit in these guidelines in order to make annotators’ decisions maximally transparent and, on that basis, iteratively refine the guidelines by identifying and clarifying difficult cases. The complete pilot corpus was annotated according to these guidelines and evaluated for inter-annotator agreement. Results of this work package were published as Westpfahl/Gorisch (2018b).

With the first version of the guidelines sufficiently stabilized and validated, and the extended test corpus ready, the final version of the guidelines (Westpfahl et al. 2019b) was developed as a simplified version of Westpfahl et al. (2018c). It concentrates on maximal syntactic units and their classification into classes of sentential vs. non-sentential, simple vs. complex and completed vs. aborted units (with a remainder category « uninterpretable »), formulates concrete practical instructions for use with the FOLKER annotation tool, and thus optimizes the efficiency of the segmentation process. The final version of the guidelines was applied to the entire extended test corpus and validated again for inter-annotator agreement. The resulting segmented data was analysed quantitatively with respect to dependencies between segment and interaction types (Westpfahl/Gorisch/Schmidt 2018e), and a study on syntactic disruptions (Strub/Westpfahl 2018) was carried out.
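The classification scheme can be pictured as three binary dimensions plus a remainder category. The encoding below is a hypothetical sketch of that scheme, not the actual label set used with FOLKER.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Hypothetical encoding of the unit classes described in the final
# guidelines (Westpfahl et al. 2019b); the actual FOLKER label names
# may differ.

class Form(Enum):
    SENTENTIAL = "sentential"
    NON_SENTENTIAL = "non-sentential"

class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"

class Completion(Enum):
    COMPLETED = "completed"
    ABORTED = "aborted"

@dataclass
class MaximalSyntacticUnit:
    text: str
    form: Optional[Form] = None
    complexity: Optional[Complexity] = None
    completion: Optional[Completion] = None
    uninterpretable: bool = False   # remainder category

    def label(self):
        if self.uninterpretable:
            return "uninterpretable"
        return "/".join(d.value for d in (self.form, self.complexity, self.completion))

unit = MaximalSyntacticUnit("ich komme morgen", Form.SENTENTIAL,
                            Complexity.SIMPLE, Completion.COMPLETED)
```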

Annotators / contributors :

  • Guideline development: Swantje Westpfahl
  • Annotators (Student Assistants): Isabell Neise, Melanie Hobich, Julia Larbig, Anton Borlinghaus, Hanna Strub, Arthur Bergs


For French


Chunking

In microsyntax, it was relevant to work on chunks, units designating continuous and non-recursive constituents (Abney 1991). The process of segmentation into chunks, or chunking, identifies the shallow syntactic structure of an utterance and can be performed automatically on the basis of a previously performed automatic morphosyntactic labelling.

In order to develop a segmentation tool, we chose the method of supervised machine learning with CRFs, which previous research has shown to be successful (Sha and Pereira 2003; Tellier et al. 2012, 2014; Tsuruoka et al. 2009). For this purpose, we needed a manually annotated reference corpus. We chose to work on spoken data of different natures, selecting from the pilot corpus 1) a monologue prepared during a lecture and 2) a spontaneous discussion between three people during a meal.
The two pre-processed corpora were first annotated by two researchers (Iris Eshkol-Taravella and Marie Skrovec) according to an established typology based on previous work (Tellier et al. 2014), with some adjustments for spoken language, such as the addition of two categories (articulators, and core forms according to Benzitoun et al. 2012). Based on a systematic listening of the spoken data, the annotation was done in two stages: each researcher first annotated the data separately, before both versions were compared in order to produce a compromise version.
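A chunking setup of this kind has two recurring ingredients: per-token features over the POS-tagged input for the CRF, and decoding of the CRF's BIO labels back into chunks. The sketch below illustrates both; the feature template, tagset and chunk labels are invented for illustration (training itself would use a CRF implementation such as sklearn-crfsuite or Wapiti), not the project's actual configuration.

```python
# Illustrative CRF chunking ingredients: token features and BIO decoding.
# Tokens are (word, pos) pairs; the tagset and chunk types are examples only.

def token_features(tokens, i):
    """Features for token i: typical CRF chunking features are the word,
    its POS tag, and the POS tags of its neighbours."""
    word, pos = tokens[i]
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": tokens[i - 1][1] if i > 0 else "BOS",
        "next_pos": tokens[i + 1][1] if i < len(tokens) - 1 else "EOS",
    }

def bio_to_chunks(tokens, labels):
    """Decode BIO labels (e.g. B-NP, I-NP, O) into (type, text) chunks."""
    chunks, current = [], None
    for (word, _), label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                chunks.append(current)
            current = (label[2:], [word])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(word)
        else:                       # "O" or an inconsistent I- label
            if current:
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(t, " ".join(ws)) for t, ws in chunks]
```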

Contributors : Flora Badin, François Delafontaine, Iris Eshkol-Taravella, Mariame Maarouf, Marie Skrovec

Segmentation : automatic



Prosody

Acoustic properties impact several linguistic levels and constitute their own structure of units delimited by accents (Garde 1968) or, more recently, by the concept of prominence (Lacheret & Victorri 2002). Since both the macrosyntactic and the interactional levels rely on prominences, their detection is of particular importance.

The annotation of prominences is based on previously developed guidelines on prosody (Rhapsodie project, Lacheret et al. 2014), using the ANALOR tool (Avanzi, Lacheret & Victorri 2008) for the automatic annotation and three annotators for the manual one. The results led to suggestions for improving the Rhapsodie guidelines.
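The underlying idea of automatic prominence detection can be illustrated with a toy detector over per-syllable mean F0 values: a syllable is marked prominent when its pitch exceeds its local context by some semitone threshold. This is not the ANALOR algorithm (which also uses duration and pauses, among other cues); the window and threshold values below are invented for illustration.

```python
import math

# Toy prominence detector: a syllable is "prominent" if its mean F0 (Hz)
# exceeds the mean of the preceding `window` syllables by at least
# `threshold` semitones. Illustrative only, NOT the ANALOR algorithm.

def semitones(f, ref):
    """Pitch interval between two frequencies, in semitones."""
    return 12 * math.log2(f / ref)

def prominent_syllables(f0_values, window=2, threshold=3.0):
    """Return the indices of syllables detected as prominent."""
    result = []
    for i, f in enumerate(f0_values):
        context = f0_values[max(0, i - window):i]
        if context:
            ref = sum(context) / len(context)
            if semitones(f, ref) >= threshold:
                result.append(i)
    return result
```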

Annotators / contributors : François Delafontaine, Biagio Ursi, Luisa Acosta, Mathieu Avanzi

Segmentation : manual and automatic


Orfeo's Microsyntax

Micro-syntax is useful when macrosyntactic segmentation is not able to fix boundaries. First, we studied whether the micro-syntax developed in the Orfeo ANR project matches our needs for the interactional data of our pilot corpus.

We needed to redefine the microsyntactic guidelines of Orfeo for some cases that are common in interaction:

  • Pivots as central units
  • Dependent elements
  • Actualizers…
  • Propositional units, with further segmentation (P, D…)
  • Disfluencies

Then a new manual segmentation based on syntactic dependency relations was defined and adapted to these interactional disfluencies.

Annotators : Nathalie Rossi-Gensane, Biagio Ursi, Margot Lambert

Segmentation : manual, an automatic process could be studied later on

Guidelines : in translation

Aix’s Macrosyntax

We studied the two different solutions developed in the Rhapsodie and Orfeo ANR projects before making a decision.

The new manual segmentation represents a compromise between the Orfeo categories, which are suitable for an automated tool, and the Rhapsodie categories, which are too complex for non-expert users; it was tested and adapted to the common critical cases.

Annotators : Nathalie Rossi-Gensane, Biagio Ursi, Luisa Fernanda Acosta Cordoba

Segmentation : manual; an automatic process may be possible, but its error rate on some constructions that are common in interaction would need to be checked

Fribourg’s Macrosyntax

The Fribourg macrosyntactic approach starts where microsyntactic relations are exhausted, producing new units that increment the discourse memory in an ostensive-inferential framework (Groupe de Fribourg 2012).

A manual annotation by a single annotator sought to establish simplified guidelines for corpus annotation by non-experts. The resulting experimental guidelines define continuous minimal units delimited by rection breaks and maximal units delimited by prosody and turn breaks.

Annotators / contributors : François Delafontaine, Marie Skrovec, Gilles Corminboeuf, Marie-José Béguelin, Alain Berrendonner

Segmentation : manual