
"Culture & Technology" European Summer University in Digital Humanities
University of Leipzig

Corpus Linguistics for Digital Humanities. Introduction to Methods and Tools

The course can be attended without specific prerequisites.

Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for humanists is to explore questions related to language:

  • Which forms of the verb to be occur in a given text?
  • What are common ways to refer to certain entities, such as kings or VIPs?
  • Are number words commonly preceded by articles? Do elaborate number words (two thousand three hundred and four) or numbers (12449223) occur at all?
  • What types of phrases occur as direct objects in philosophical texts?
  • What kinds of ellipsis occur in section headings?
  • What kinds of speech acts occur in modern conversations?
  • Are texts by women longer (or shorter) than texts by men?

To answer such questions, it is necessary to select and prepare data. We will discuss different approaches to the compilation and annotation of corpora. The methodology stems from computational and corpus linguistics, but is used more widely for processing linguistic data in the digital humanities.

Questions such as those sketched above can only be approached if an adequate selection of texts is available; for instance, one will not find much evidence about conversational practices in parliamentary speeches or mathematical papers. Hence we will first be concerned with criteria and methods for compiling corpora: selecting texts based on extra- and intralinguistic criteria, including property rights.

Furthermore, linguistic data must be described by metadata, so that one can find e.g. utterances by female native speakers of Southern German in the second half of the 20th century about political developments in an informal setting. Approaches to metadata will be explored.
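As a minimal sketch of how such a metadata query could look, the following Python snippet filters a toy corpus by speaker and situation metadata. The field names (`speaker_sex`, `native_variety`, `year`, `topic`, `setting`), the placeholder texts, and the flat record layout are all assumptions made for illustration; real corpora typically use an established metadata scheme instead.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One utterance plus hypothetical metadata fields."""
    text: str
    speaker_sex: str
    native_variety: str
    year: int
    topic: str
    setting: str


# Toy data; the texts are placeholders, not real corpus material.
corpus = [
    Utterance("utterance A", "female", "Southern German", 1978, "politics", "informal"),
    Utterance("utterance B", "male", "Southern German", 1978, "politics", "informal"),
    Utterance("utterance C", "female", "Southern German", 1932, "politics", "informal"),
    Utterance("utterance D", "female", "Northern German", 1985, "politics", "formal"),
]

# Female native speakers of Southern German, second half of the
# 20th century, talking about politics in an informal setting.
hits = [
    u for u in corpus
    if u.speaker_sex == "female"
    and u.native_variety == "Southern German"
    and 1951 <= u.year <= 2000
    and u.topic == "politics"
    and u.setting == "informal"
]

for u in hits:
    print(u.year, u.text)
```

The query is only as precise as the metadata: if the setting or the speaker's variety was never recorded, no search can recover it afterwards.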

It is also often useful to annotate linguistic data with respect to: pragmatic structure, such as speech acts or rhetorical relations; semantic elements, such as named entities (e.g. all the kings and queens in Europe or their republican counterparts); and linguistic information, e.g. dependencies, parts of speech, or lemmas (reducing went to go). Moreover, annotating information on text structure or layout may be useful, e.g. text in headings, footers, footnotes, italics or boldface.
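To illustrate what such annotation layers make possible, here is a small sketch in Python. It represents one hand-annotated sentence as a list of token records and runs two queries on it: one over lemmas, one over dependency relations. The record layout and the sentence are invented for this example; the tag and relation labels are illustrative and only loosely modelled on Universal Dependencies.

```python
# Each token: position (1-based), word form, lemma, part of speech,
# position of its syntactic head (0 = root), and dependency relation.
# The annotation below was written by hand for illustration.
sentence = [
    {"id": 1, "form": "She",  "lemma": "she",  "pos": "PRON",  "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "went", "lemma": "go",   "pos": "VERB",  "head": 0, "deprel": "root"},
    {"id": 3, "form": "home", "lemma": "home", "pos": "ADV",   "head": 2, "deprel": "advmod"},
    {"id": 4, "form": "and",  "lemma": "and",  "pos": "CCONJ", "head": 5, "deprel": "cc"},
    {"id": 5, "form": "read", "lemma": "read", "pos": "VERB",  "head": 2, "deprel": "conj"},
    {"id": 6, "form": "a",    "lemma": "a",    "pos": "DET",   "head": 6 + 1, "deprel": "det"},
    {"id": 7, "form": "book", "lemma": "book", "pos": "NOUN",  "head": 5, "deprel": "obj"},
]

# Query 1: all word forms whose lemma is "go" (finds "went",
# which a plain string search for "go" would miss).
go_forms = [t["form"] for t in sentence if t["lemma"] == "go"]

# Query 2: direct objects together with their governing verb,
# found by following the head pointer of every "obj" token.
by_id = {t["id"]: t for t in sentence}
direct_objects = [
    (t["form"], t["pos"], by_id[t["head"]]["form"])
    for t in sentence
    if t["deprel"] == "obj"
]

print(go_forms)
print(direct_objects)
```

The second query is a very simple case of searching in hierarchical annotation: the condition refers not to the token itself but to its relation to another token.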

Some of these annotations can be carried out (partly) automatically. We will discuss which tools exist and are available.

Depending on the type of processing and annotation, questions such as those above are more or less difficult to answer, because finding and counting the relevant data can range from very easy to difficult. This course will present fundamental techniques for searching in corpora, viz.

  • searching for single word forms,
  • searching with wild cards or distance operators,
  • regular expressions to search for similar word forms,
  • searching in hierarchical annotation to find syntactic or semantic configurations.
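The first three techniques can be sketched with Python's standard `re` module on a toy text. This is an illustration under simplifying assumptions, not the query language of any particular corpus tool: the text is invented, tokenisation is a naive split into alphanumeric sequences, and the helper `within` is a hypothetical stand-in for a distance operator.

```python
import re

# Invented toy text; a real corpus would be read from files.
text = ("The king was here. The kings were there. "
        "She is a queen and they are queens.")
tokens = re.findall(r"\w+", text.lower())

# 1. Single word form: exact match on a token.
count_was = tokens.count("was")

# 2. Regular expression: several forms of "to be", listed explicitly.
be_pattern = r"am|is|are|was|were|be|been|being"
be_forms = [t for t in tokens if re.fullmatch(be_pattern, t)]

# 3. Wildcard search: "king*" written as a regular expression.
king_forms = [t for t in tokens if re.fullmatch(r"king\w*", t)]


def within(tokens, first, second, max_dist):
    """Return index pairs (i, j) where a token matching `first` is
    followed by a token matching `second` at most `max_dist` tokens later."""
    hits = []
    for i, tok in enumerate(tokens):
        if re.fullmatch(first, tok):
            stop = min(i + 1 + max_dist, len(tokens))
            for j in range(i + 1, stop):
                if re.fullmatch(second, tokens[j]):
                    hits.append((i, j))
    return hits


# 4. Distance operator: "the" followed by "king*" within two tokens.
the_king = within(tokens, r"the", r"king\w*", 2)

print(count_was, be_forms, king_forms, the_king)
```

Note the use of `re.fullmatch` rather than `re.search`: the pattern must cover the whole token, so `the` does not accidentally match `there` or `they`.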

You will learn about different query languages used for searching in corpora and, time permitting, also consider simple statistical evaluations of texts.
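A simple statistical evaluation can be as small as counting word frequencies and comparing average text lengths between groups, as in the question about texts by women and men above. The sketch below uses only the Python standard library; the texts and the group labels `"f"` and `"m"` are placeholder data invented for the example, and a real study would need far more data and a significance test.

```python
from collections import Counter
from statistics import mean

# Placeholder data: (author group, text). Not real corpus material.
texts = [
    ("f", "one two three four"),
    ("m", "one two"),
    ("f", "one two"),
    ("m", "one two three four five six"),
]

# Word frequencies over the whole toy corpus.
freq = Counter(word for _, text in texts for word in text.split())

# Text lengths in tokens, grouped by author group.
lengths = {}
for group, text in texts:
    lengths.setdefault(group, []).append(len(text.split()))

# Mean text length per group.
mean_length = {group: mean(values) for group, values in lengths.items()}

print(freq.most_common(2))
print(mean_length)
```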

The course will be mainly concerned with textual corpora, but as searching on speech or multimodal corpora is generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.

To sum up, we will approach:

  • corpus construction and annotation in the first week and
  • corpus search and evaluation in the second week.