LLL - ESLO - SEGmentation of oral CORpora

Laboratoire Ligérien de Linguistique (LLL)

The ESLO corpus (Enquêtes sociolinguistiques à Orléans) is a project of the Laboratoire Ligérien de Linguistique. Developed at the University of Orléans, the corpus is available online and can also be consulted on several platforms (CoCoON, Ortolang, Isidore). It is indexed and stored at the Bibliothèque Nationale de France for heritage preservation.

The ESLO corpus is one of the largest databases of spoken French. It gathers spoken data from the Orléans area across two periods.

British scholars carried out a first collection campaign between 1968 and 1971, originally for didactic purposes (teaching French as a foreign language in public-sector schools in the UK). This first data set (ESLO1) is now closed and available online. It contains about 200 interviews linked with sociolinguistic and situational metadata, amounting to about 300 hours of text-sound aligned spoken data that also include several other types of recordings (phone conversations, commercial interactions, public meetings, dinner conversations, medical interviews, etc.).

The second collection period (ESLO2) started in 2008 and aims to build a comparable and representative data set 40 years later, making a microdiachronic approach to spoken French possible. The ESLO2 corpus is an open data set covering a wide variety of activity types and recording contexts: semi-directive interviews (with a range of speakers chosen for sociological diversity, including for instance young adults and local political figures), recordings at school, in bakeries and in shops, street interviews in which passers-by are asked for directions, conferences, general assemblies, official speeches, dinner parties, child-parent interactions during book reading, etc.

The whole corpus contains 654 hours of recordings and 478 hours of text-sound aligned data with metadata, amounting to about 7.5 million words. Of these, 296 hours of text-sound aligned data with metadata (about 5 million words) are available online. An enriched version of the public part of the corpus is now available on the Ortolang online platform in several formats, including TRS and TEI for transcription data and CMDI and DC-OLAC for metadata.

Previous work by members of the LLL deals with the process of corpus constitution, enrichment and annotation in the special case of spoken interaction, in accordance with ethical, legal and scientific good practices. The team's current research mainly aims at understanding variation in language and the structure of speech. It deals in particular with the (semi-)automatic processing of speech and, more generally, with the dissemination of spoken corpus data and tools to the research community and other users (for instance for didactic purposes).

The following papers give more information on different aspects of ESLO:

ESHKOL-TARAVELLA, I., BAUDE, O., MAUREL, D., HRIBA, L., DUGUA, C. & TELLIER, I. (2012), « Un grand corpus oral « disponible » : le corpus d’Orléans 1968-2012 », Ressources linguistiques libres, TAL, Volume 52 – n° 3/2011, 17-46.
ABOUDA, L. & BAUDE, O. (2007), « Constituer et exploiter un grand corpus oral : choix et enjeux théoriques. Le cas des Eslo », in F. Rastier et M. Ballabriga (dir.), Corpus en Lettres et Sciences sociales. Des documents numériques à l’interprétation, Actes du XXVIIe Colloque d’Albi, 161-168.

Team

Marie SKROVEC, Senior Lecturer, Orléans Research Coordinator
Iris ESHKOL, Professor
Layal KANAAN-CAILLOL, Senior Lecturer
Flora BADIN, Studies Engineer
François DELAFONTAINE, PhD Student
Mariame MAAROUF, Trainee
Natalia KALASHNIKOVA, Trainee