Query in Text Corpora 2014
The course will cover text and character encoding, regular expressions, simple text search and regular-expression search in unannotated corpora, search in annotated corpora with a corpus query language (for instance CQP), and search in XML documents using XQuery. The course will be delivered in English but focuses on German corpora. There are no prerequisites.
Block A (week 1): The first block will cover basic aspects of work with digital texts and construction of web corpora, such as:
- text encoding; the most popular standards of text encoding will be presented,
- regular expressions; the concept of a regular expression will be introduced, and students will have a chance to discover the utility of regular expressions and to learn how to formulate regular expressions corresponding to their queries,
- annotation; different kinds of annotation (structural mark-up, part-of-speech tagging, morphosyntactic annotation, parsing) and different annotation schemes will be presented and their utility for different kinds of research questions will be discussed.
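As a rough illustration of the first two topics, the following Python sketch shows how one string takes up a different number of bytes in two common encodings, and how a regular expression can match several forms of a German word. The sample sentence and the variable names are assumptions made for this example and are not course material.

```python
import re

# Hypothetical sample sentence; not taken from any course corpus.
text = "Die Studentin liest Bücher, der Student las ein Buch."

# Text encoding: "ü" needs two bytes in UTF-8 but only one in Latin-1,
# so the same string has different byte lengths in the two encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Regular expression: match "Buch" or "Bücher" as whole words.
pattern = re.compile(r"\bB(?:uch|ücher)\b")
matches = pattern.findall(text)

print(len(utf8_bytes), len(latin1_bytes))
print(matches)
```

The non-capturing group `(?:...)` keeps `findall` returning whole matches rather than only the part inside the parentheses.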
Block B (week 2): The second block will cover the exploration of language corpora. Students will be shown how to find the information they need in corpus data. This part of the course will present different types of metadata (editorial, analytic, descriptive…) and discuss their importance for various research questions. Moreover, the concept of a query language (QL) will be introduced and a variety of QLs will be presented. Students will be given an opportunity to retrieve information from corpora using specialized queries formulated in a QL.
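To give an idea of what a query over an annotated corpus does, here is a minimal Python sketch that searches a part-of-speech tagged token list for an adjective followed by a noun. It only imitates the kind of pattern a corpus query language expresses (in CQP this would look roughly like `[pos="ADJA"] [pos="NN"]`). The mini-corpus, the STTS-style tags assigned to it, and the helper `find_pos_sequence` are hypothetical and chosen for illustration.

```python
# Hypothetical mini-corpus: (word, part-of-speech) pairs with STTS-style tags.
corpus = [
    ("Das", "ART"), ("alte", "ADJA"), ("Haus", "NN"),
    ("steht", "VVFIN"), ("am", "APPRART"),
    ("großen", "ADJA"), ("See", "NN"), (".", "$."),
]

def find_pos_sequence(tokens, tags):
    """Return the word sequences whose tags match `tags` position by position."""
    n = len(tags)
    hits = []
    # Slide a window of length n over the token list.
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(pos == tag for (_, pos), tag in zip(window, tags)):
            hits.append([word for word, _ in window])
    return hits

# Attributive adjective directly followed by a common noun.
adjective_noun = find_pos_sequence(corpus, ["ADJA", "NN"])
print(adjective_noun)
```

A real corpus query engine works on indexed corpora and supports far richer patterns; the sketch only shows the basic idea of matching on annotation instead of on word forms.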
Block B builds on Block A, but the two blocks can be attended independently.