Searching Linguistic Patterns in Text Corpora for Digital Humanities Research
Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for linguists is to find linguistic patterns in corpora:
- Which forms of the verb *to be* occur in a given text?
- Is it common for number words to be preceded by articles?
- What is the average length of a DP/NP in a modern English text, potentially compared to modern Dutch?
- Which types of phrase occur as direct objects in philosophical texts?
- What words can occur as hesitation markers in spoken modern German?
Depending on how a corpus has been processed and annotated (e.g., lemmatization), such questions are more or less difficult to answer, because finding and counting the relevant data can be anything from very easy to quite hard. This course will present fundamental techniques for searching in corpora, viz.
- searching for single word forms,
- searching with wildcards or distance operators,
- searching with regular expressions to find similar word forms,
- searching in hierarchical annotation to find syntactic or semantic configurations.
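To give a rough idea of the first three techniques, here is a minimal Python sketch. It is not tied to any particular corpus tool or query language: the toy corpus `TOKENS` and the helper names `find_forms`, `find_lemma`, and `find_within` are hypothetical and chosen for illustration only. The sketch assumes a corpus that is already tokenized, lemmatized, and tagged with parts of speech.

```python
import re

# Hypothetical toy corpus: each token is (word form, lemma, part of speech).
TOKENS = [
    ("The", "the", "DET"), ("two", "two", "NUM"), ("dogs", "dog", "NOUN"),
    ("were", "be", "VERB"), ("barking", "bark", "VERB"), (".", ".", "PUNCT"),
    ("A", "a", "DET"), ("cat", "cat", "NOUN"), ("is", "be", "VERB"),
    ("here", "here", "ADV"), (".", ".", "PUNCT"),
]


def find_forms(tokens, pattern):
    """Return positions of tokens whose word form fully matches a regex.

    Matching is case-insensitive; a pattern such as r".*ing" plays the
    role of a wildcard search.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [i for i, (form, _, _) in enumerate(tokens) if rx.fullmatch(form)]


def find_lemma(tokens, lemma):
    """Return the distinct lowercased word forms annotated with a lemma."""
    return sorted({form.lower() for form, lem, _ in tokens if lem == lemma})


def find_within(tokens, first_pos, second_pos, max_distance):
    """Return (i, j) pairs where a token tagged first_pos is followed by
    a token tagged second_pos at most max_distance tokens later."""
    hits = []
    for i, (_, _, pos) in enumerate(tokens):
        if pos != first_pos:
            continue
        stop = min(i + 1 + max_distance, len(tokens))
        for j in range(i + 1, stop):
            if tokens[j][2] == second_pos:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(find_forms(TOKENS, r"dogs?"))          # single word form, optional plural
    print(find_forms(TOKENS, r".*ing"))          # wildcard-style search
    print(find_lemma(TOKENS, "be"))              # forms of the verb "to be"
    print(find_within(TOKENS, "DET", "NUM", 1))  # article directly before a number word
```

With lemma annotation, the question about the forms of *to be* reduces to a single lookup; without it, every form would have to be listed by hand in a regular expression. Dedicated corpus query languages express the same ideas far more compactly.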
You will learn about different query languages used for searching in corpora.
The course will mainly be concerned with textual corpora, but since searching in speech or multimodal corpora is generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.