
Aspects of NLP in language documentation: the case of the DoBeS Kyanga/Shanga project (Niger-Congo, Eastern-Mande)
The DoBeS projects (Volkswagen Foundation) have focused on endangered languages that are no longer regularly transmitted in their speech communities and are at risk of becoming extinct within the next few years or a decade. The purpose of language documentation is not merely linguistic description but the creation of comprehensive records of languages that are mostly still undocumented. The annotated corpora of audio-visual ethnographic materials function as a multimodal resource for the speech communities and their future generations. These repositories are also a resource for linguistic research, in particular for data-oriented approaches to linguistic typology, which derive features from annotation rather than from grammatical description.
It seems obvious that the starting conditions of a documentation project (more exactly: a tight time frame for field research, small to medium-sized teams of trained consultants and researchers, and the task of annotating a large digital corpus) demand effective corpus tools and semi-automatic methods for annotation and linguistic analysis on different levels. However, current approaches in corpus linguistics are largely based on statistical techniques that require an existing annotation model and, furthermore, large corpora of annotated data for training. This situation confronts the documentationist with the sparse data paradox: a large annotated corpus is needed in order to process large corpora. Furthermore, the models produced by current approaches in corpus linguistics are rarely rule-based. Their output is hardly readable for a human and is thus of little use for linguistic analysis and annotation.
We will present the data processing workflow of our DoBeS research project on the undocumented Kyanga language (Eastern Mande, Niger-Volta branch) and set this approach against a more general discussion of how NLP methods and general typological knowledge about the “grammar codec” can be used in language documentation to produce large audio-visual ethnographic corpora efficiently. The presentation will focus mainly on the relation between tokenization and word formation and on approaches to unorthodox cross-linguistic POS tagging.












