Compilation, Annotation and Analysis of Written Text Corpora. Introduction to Methods and Tools
Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for linguists is to explore linguistic questions in corpora:
- Which forms of the verb *to be* occur in a given text?
- Is it common for number words to be preceded by articles?
- What types of phrases occur as direct objects in philosophical texts?
- What kinds of speech acts occur in modern conversations?
- What words can occur as hesitation markers in spoken modern German?
- Are texts by women longer (or shorter) than texts by men?
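As a rough illustration of the first question above, the following minimal Python sketch counts forms of *to be* in a short English text. The list of forms and the sample sentence are assumptions made for this example; contracted forms such as *'s* or *'re* are deliberately ignored, and a real study would rely on proper tokenization and lemmatization.

```python
import re
from collections import Counter

# Assumption: a hand-picked list of English forms of "to be".
# Contractions ("'s", "'re", "'m") are not handled in this sketch.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def be_forms(text: str) -> Counter:
    """Count how often each listed form of 'to be' occurs in the text."""
    # Naive tokenization: lower-case the text and keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token in BE_FORMS)


# Invented sample text, used for illustration only.
sample = "To be or not to be, that is the question. We were there, and it was late."
counts = be_forms(sample)
print(counts.most_common())
```

Simple string matching of this kind only works for closed, easily listed sets of forms; most of the other questions require annotated data.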
To answer such questions, it is necessary to select and prepare data. We will discuss different approaches to compilation and annotation of corpora.
Linguistic questions can only be approached if an adequate selection of texts is available; for instance, one will not find much evidence about conversational language in parliamentary speeches or mathematical papers. Hence, we will first be concerned with criteria and methods for compiling corpora: selecting texts based on extra- and intralinguistic criteria, including property rights.
Linguistic data must be described by metadata, so that one can find e.g. utterances by female native speakers of Southern German in the second half of the 20th century about political developments in an informal setting. Approaches to metadata will be explored.
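The selection described above can be sketched as a filter over metadata records. The field names, category values and toy records below are hypothetical and chosen only for this example; they do not follow any particular metadata standard.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One utterance plus the metadata needed to find it (hypothetical fields)."""
    text: str
    speaker_gender: str
    native_variety: str
    year: int
    topic: str
    setting: str


# Invented toy records; the texts are placeholders, not real corpus data.
corpus = [
    Utterance("utterance 1", "female", "Southern German", 1968, "politics", "informal"),
    Utterance("utterance 2", "male", "Southern German", 1972, "politics", "informal"),
    Utterance("utterance 3", "female", "Northern German", 1985, "politics", "formal"),
    Utterance("utterance 4", "female", "Southern German", 1931, "politics", "informal"),
]


def select(utterances: list[Utterance]) -> list[Utterance]:
    """Keep informal political utterances by female native speakers of
    Southern German from the second half of the 20th century."""
    return [
        u for u in utterances
        if u.speaker_gender == "female"
        and u.native_variety == "Southern German"
        and 1951 <= u.year <= 2000
        and u.topic == "politics"
        and u.setting == "informal"
    ]


hits = select(corpus)
print([u.text for u in hits])
```

A query like this is only possible if every text in the corpus has been described consistently with the same metadata categories.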
It is often useful to annotate linguistic data with respect to:
- pragmatic structure such as speech acts or rhetorical relations;
- semantic elements, e.g. named entities or points in time;
- syntactic structures, e.g. dependencies or phrases;
- or morphological relations, e.g. by assigning parts of speech (*verb*, *noun*) or lemmatizing (reducing *lemmatizing* to *lemmatize*).
Some of these annotations can be carried out (partly) automatically. We will discuss which tools are available for this purpose.
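To make the idea of morphological annotation concrete, here is a minimal sketch in which each token carries a part-of-speech tag and a lemma, and a query runs on these annotation layers rather than on the surface forms. The sentence, the hand-assigned tags (loosely modelled on common coarse tag labels) and the function name are assumptions for illustration; no automatic tagger is involved.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """A surface form with two annotation layers."""
    form: str
    pos: str    # part of speech, assigned by hand in this sketch
    lemma: str  # dictionary form, assigned by hand in this sketch


# Hand-annotated example sentence (invented for illustration).
sentence = [
    Token("She", "PRON", "she"),
    Token("was", "AUX", "be"),
    Token("lemmatizing", "VERB", "lemmatize"),
    Token("texts", "NOUN", "text"),
]


def lemmas_with_pos(tokens: list[Token], pos: str) -> list[str]:
    """Return the lemmas of all tokens carrying the given part-of-speech tag."""
    return [token.lemma for token in tokens if token.pos == pos]


verbs = lemmas_with_pos(sentence, "VERB")
nouns = lemmas_with_pos(sentence, "NOUN")
print(verbs, nouns)
```

Searching on the lemma layer finds *lemmatizing*, *lemmatized* and *lemmatizes* with a single query for *lemmatize*, which is why such layers are worth the annotation effort.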
The course will be mainly concerned with textual corpora, but as searching on speech or multimodal corpora is generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.