Lexicon and software used - SEGmentation of oral CORpora

Lexicon

LEFFF

The LExique des Formes Fléchies du Français, distributed under the « LGPLLR » licence (Lesser General Public License For Linguistic Resources), provides part-of-speech (POS) and lemma information. It was used to segment words and multiword expressions such as « aujourd’hui », « ciné-club » and « par exemple ».

http://www.labri.fr/perso/clement/lefff/
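
As an illustration of lexicon-based segmentation of multiword expressions, here is a minimal Python sketch. It is not the project's implementation: the function name `segment`, the toy lexicon and the greedy longest-match strategy are all assumptions made for the example, and the real LEFFF file format is not reproduced here.

```python
def segment(tokens, multiword_lexicon, max_len=4):
    """Group tokens into multiword expressions by greedy longest match.

    `multiword_lexicon` is a set of token tuples (hypothetical toy data,
    standing in for multiword entries taken from a lexicon such as LEFFF).
    """
    segments = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, shrinking down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in multiword_lexicon:
                segments.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword entry starts here: keep the single token.
            segments.append(tokens[i])
            i += 1
    return segments


toy_lexicon = {("par", "exemple")}  # assumed entry, for illustration only
result = segment(["il", "vient", "par", "exemple"], toy_lexicon)
```

With this toy lexicon, « par exemple » is kept as one segment while the other tokens stay separate.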

Software used

Transcriber

Before the project, this software was used in both ESLO and CLAPI to transcribe and align corpora easily. The resulting files (trs format) were then imported into EXMARaLDA.

http://perso.ens-lyon.fr/matthieu.quignard/Transcriber/

EXMARaLDA

The EXMARaLDA tools (FOLKER, Partitur-Editor) are used for most of the segmentation and annotation tasks in this project. Our results will be based on this software.

https://exmaralda.org/en/

Praat

This software was used on the French pilot corpus to annotate the signal precisely in order to identify prominences and disfluencies, as well as for interactions involving more than 3 speakers, which were difficult to process in EXMARaLDA.

http://www.fon.hum.uva.nl/praat/

ELAN

This software is an annotation tool. The results of the CHOUCAS tool (one of the automatic tools created) can be viewed in it.

https://archive.mpi.nl/tla/elan

To create our automatic tools (cf. Automatic tools), some pre-existing software was used:

Wapiti

Wapiti was used for segmenting and labeling sequences with discriminative models (maxent models, maximum entropy Markov models and linear-chain CRF) for the chunker.

https://wapiti.limsi.fr/

TreeTagger

TreeTagger was used to annotate the transcriptions of the French corpus with part-of-speech and lemma information, and also for the chunker (machine learning).

https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
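
To show what part-of-speech and lemma annotation looks like downstream, here is a minimal Python sketch that reads TreeTagger-style output. It assumes the tagger was run so that each line holds a token, a POS tag and a lemma separated by tabs; the names `TaggedToken` and `parse_treetagger` are hypothetical, and the sample lines are hand-written for illustration, not real output from the project.

```python
from typing import NamedTuple


class TaggedToken(NamedTuple):
    token: str
    pos: str
    lemma: str


def parse_treetagger(output):
    """Parse tab-separated token / POS / lemma lines into TaggedToken rows."""
    rows = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            # Skip blank or malformed lines rather than guessing their content.
            continue
        rows.append(TaggedToken(*parts))
    return rows


# Hand-written sample in the assumed three-column layout.
sample = "le\tDET:ART\tle\nchat\tNOM\tchat\ndort\tVER:pres\tdormir\n"
tagged = parse_treetagger(sample)
lemmas = [t.lemma for t in tagged]
```

Keeping the three columns together in one record makes it straightforward to pass the POS and lemma on as features to a later step such as a chunker.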

DisMo

This software was used on the French pilot corpus to obtain word-to-speech alignment, which is useful for all the other annotations created.

https://sourceforge.net/projects/dismo/

JTrans

JTrans was used for its automatic word-to-speech alignment, for the chunker.

https://github.com/synalp/jtrans

Teicorpo

This software is a conversion tool that converts ELAN, CLAN, Transcriber and Praat files to TEI files and back. It was used for the chunker.

http://ct3.ortolang.fr/tei-corpo/