
"Culture & Technology" European Summer University in Digital Humanities
University of Leipzig

Compilation, Annotation and Analysis of Written Text Corpora. Introduction to Methods and Tools

Corpora, i.e. collections of linguistic data (texts or conversations), are a fundamental asset of digital humanities research. A ubiquitous task for linguists is to explore linguistic questions in corpora:

  • Which forms of the verb *to be* occur in a given text?
  • Are number words commonly preceded by articles?
  • Which types of phrase occur as direct objects in philosophical texts?
  • What kinds of speech acts occur in modern conversations?
  • What words can occur as hesitation markers in spoken modern German?
  • Are texts by women longer (or shorter) than texts by men?

To answer such questions, it is necessary to select and prepare data. We will discuss different approaches to compilation and annotation of corpora.
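As a rough illustration, the first question in the list above can be approached with a simple frequency count. The sketch below is not part of the course material: the list of forms, the function name `count_be_forms` and the sample sentence are assumptions made for this example, and the naive tokenizer ignores contractions such as *I'm* or *isn't*.

```python
import re
from collections import Counter

# Assumption: a hand-written list of English forms of "to be".
# A real study would also need to handle contractions and archaic forms.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def count_be_forms(text: str) -> Counter:
    """Count how often each form of 'to be' occurs in a text."""
    # Naive tokenization: lowercase the text and keep runs of letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token in BE_FORMS)


# Invented sample text, used only to show the call.
sample = "I am here. You are there. It was late, and we were tired. To be is to be perceived."
be_counts = count_be_forms(sample)

for form, frequency in be_counts.most_common():
    print(form, frequency)
```

On a properly compiled corpus, the same count would be run per text, so that frequencies can be compared across authors, genres or periods.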

Linguistic questions can only be approached if an adequate selection of texts is available; for instance, one will not find much evidence of conversational language in parliamentary speeches or mathematical papers. Hence, we will first be concerned with criteria and methods for compiling corpora: selecting texts based on extra- and intralinguistic criteria, while taking property rights into account.

Linguistic data must be described by metadata, so that one can find e.g. utterances by female native speakers of Southern German in the second half of the 20th century about political developments in an informal setting. Approaches to metadata will be explored.
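The metadata query described above can be sketched as a filter over records. This is only a minimal illustration under stated assumptions: the `Utterance` class, its field names, the value labels and the three sample records are all invented here, and "second half of the 20th century" is read as the years 1951 to 2000. Real corpora usually follow an established metadata scheme instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One utterance with hypothetical metadata fields (names are assumptions)."""
    text: str
    speaker_gender: str
    native_speaker: bool
    variety: str
    year: int
    topic: str
    setting: str


# Invented sample records, used only to show the filter at work.
utterances = [
    Utterance("Das war damals eine schwierige Zeit.", "female", True,
              "Southern German", 1972, "politics", "informal"),
    Utterance("Der Antrag wird abgelehnt.", "male", True,
              "Northern German", 1985, "politics", "formal"),
    Utterance("Heute regnet es schon wieder.", "female", True,
              "Southern German", 1990, "weather", "informal"),
]


def matches_query(u: Utterance) -> bool:
    """Female native speakers of Southern German, 1951-2000, politics, informal."""
    return (
        u.speaker_gender == "female"
        and u.native_speaker
        and u.variety == "Southern German"
        and 1951 <= u.year <= 2000
        and u.topic == "politics"
        and u.setting == "informal"
    )


selected = [u for u in utterances if matches_query(u)]
for u in selected:
    print(u.year, u.text)
```

The point of the sketch is that such a query is only possible if every one of these properties was recorded consistently when the corpus was compiled.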

It is often useful to annotate linguistic data with respect to:

  • pragmatic structure such as speech acts or rhetorical relations;
  • semantic elements, e.g. named entities or points in time;
  • syntactic structures, e.g. dependencies or phrases;
  • or morphological relations, e.g. by assigning parts of speech (*verb*, *noun*) or lemmatizing (reducing *lemmatizing* to *lemmatize*).
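To make the morphological layer concrete, the following sketch annotates tokens with a lemma and a part of speech by looking them up in a tiny lexicon. It is a toy under stated assumptions: the `LEXICON` entries, the tag names and the `annotate` function are invented for this example, and actual taggers and lemmatizers are typically statistical or rule-based tools rather than plain lookups.

```python
# Toy lexicon mapping word forms to (lemma, part of speech).
# All entries are invented for illustration.
LEXICON = {
    "lemmatizing": ("lemmatize", "verb"),
    "corpora": ("corpus", "noun"),
    "is": ("be", "verb"),
    "useful": ("useful", "adjective"),
}


def annotate(tokens):
    """Attach a lemma and a part of speech to each token.

    Forms missing from the lexicon keep their lowercased form as lemma
    and receive the tag 'unknown'.
    """
    annotated = []
    for token in tokens:
        lemma, pos = LEXICON.get(token.lower(), (token.lower(), "unknown"))
        annotated.append({"form": token, "lemma": lemma, "pos": pos})
    return annotated


rows = annotate("Lemmatizing corpora is useful today".split())
for row in rows:
    # One token per line, with tab-separated annotation columns.
    print(row["form"], row["lemma"], row["pos"], sep="\t")
```

Storing each annotation as a separate column per token keeps the layers independent, so further layers (syntactic, semantic, pragmatic) could be added alongside without changing the existing ones.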

Some of these annotations can be carried out (partly) automatically. We will discuss which tools exist and are available.

The course will be mainly concerned with textual corpora, but as searches in speech or multimodal corpora are generally carried out on the transcription and annotation layers, it will also be useful to researchers dealing with such data.
