ICAR - CLAPI - SEGmentation of oral CORpora

The ICAR research laboratory is locally and internationally renowned for its theoretical approach to interaction, which takes into account both linguistic resources (lexicon, syntax, semantics) and multimodal resources (gesture, gaze, body movement and orientation). It is also known for its original methodology (sophisticated audio and video recordings, collection of artifacts, documents and computer traces) for collecting naturally occurring language productions in a variety of face-to-face and technologically mediated socio-cultural situations. It has developed the CLAPI Workbench (http://clapi.icar.cnrs.fr), which combines a databank of spoken French with a set of search tools.
The databank gathers corpora of naturally occurring interactions recorded in a wide range of situations (professional, private, medical, commercial, institutional, etc.), fully described by around 70 metadata sets. The project was launched in the 1990s with a heritage-preservation purpose and has evolved into a workbench dedicated to talk-in-interaction. It includes older corpora, which are of interest from a legacy perspective and for a diachronic approach to interactional data, as well as new video-recorded corpora, produced using state-of-the-art recording practices and transcribed, aligned, and tagged according to contemporary standards.
The set of tools operates on the most relevant metadata, on the lexicon and on interactional phenomena (pauses, overlapped and overlapping segments, place in the turn, size of the turn, etc.). Queries can target concordances, co-occurrences, self-repeats, other-repeats and frequencies, enabling users to explore transcripts and highlight recurrences either automatically or with a multi-criteria search tool. All results are displayed with the aligned transcript extract, the audio or video extract (by streaming or download) and the metadata.
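As an illustration of what a concordance query returns, here is a minimal keyword-in-context sketch in Python. It is not the CLAPI implementation: the function name `concordance`, the `window` parameter and the toy utterance are all assumptions made for this example, and real transcripts would also carry alignment and overlap annotations that are ignored here.

```python
from typing import List, Tuple


def concordance(tokens: List[str], keyword: str, window: int = 3) -> List[Tuple[str, str, str]]:
    """Return (left context, hit, right context) for each occurrence of keyword.

    Hypothetical helper for illustration only; matching is case-insensitive
    and works on whole tokens.
    """
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Take up to `window` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


# Invented toy utterance, not taken from the CLAPI databank.
tokens = "oui alors euh je voulais dire que oui c'est ça".split()

for left, hit, right in concordance(tokens, "oui", window=2):
    print(f"{left:>12} [{hit}] {right}")
```

Each hit is shown with a fixed number of tokens on either side, which is the basic layout a concordance tool presents before linking back to the aligned recording and its metadata.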
As of February 2019, CLAPI contained 60 corpora and 360 recordings (audio and video), totalling approximately 200 hours. Of these, 46 hours are freely downloadable and 63 hours are browsable without access rights (when connected as a guest). Metadata and transcripts are available in standardized formats (Dublin Core and TEI) so that they can be reused by other platforms or by the NLP (natural language processing) community.
CLAPI benefits from ongoing research projects, both for new data (20 corpora are currently being integrated) and for new tools (currently a quantitative report of lexical variety and part-of-speech annotations from the ORFEO ANR Project).
Research on unit segmentation has been carried out at different levels: the articulation of gestural and verbal components, non-verbal vocal productions, markers and particles, as well as issues of oral language segmentation. An ongoing strand of research deals with tool-supported interactional research.

The following papers give more information on different aspects of CLAPI:

BALDAUF-QUILLIATRE H., COLON DE CARVAJAL I., ETIENNE C., JOUIN-CHARDON E., TESTON-BONNARD S., TRAVERSO V. (2016), «CLAPI, une base de données multimodale pour la parole en interaction : apports et dilemmes», Cahiers Corpus n°15, Corpus de français parlés et français parlés des corpus.
BERT M., BRUXELLES S., ETIENNE C., MONDADA L., TRAVERSO V. (2010), «Grands corpus et linguistique outillée pour l’étude du français en interaction (plateforme CLAPI et corpus CIEL)», Pratiques – Interactions et corpus oraux, 17-34.

Team

Véronique TRAVERSO, French Coordinator, Research Director
Heike BALDAUF-QUILLIATRE, Senior Lecturer
Nathalie ROSSI-GENSANE, Professor
Carole ETIENNE, Research Engineer
Biagio URSI, Post-doc
Luisa Fernanda ACOSTA CORDOBA, PhD Student
Margot LAMBERT, Trainee
Lydia HEIDEN, Master Student
Laurène SMYKOWSKI, Master Student