Text Mining with Canonical Text Services
A Canonical Text Service (CTS) provides text passages that are addressed by URN-like references. The protocol is specified in a way that allows a CTS URN to be created for any possible text passage in a document.
The data can be requested via HTTP GET requests. Each request must contain a "request" parameter that specifies the CTS function to use; function-specific parameters, such as the URN, are added as additional GET parameters.
For example, the following CTS request returns the text content of chapter 3 of the book of Genesis in the English King James Bible: http://cts.informatik.uni-leipzig.de/pbc/cts/?request=GetPassage&urn=urn:cts:pbc:bible.parallel.eng.kingjames:1.3
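The request pattern above can be sketched in a few lines of Python. This is only an illustration of how the "request" and "urn" GET parameters are assembled into a URL; the helper function name is hypothetical, and only the endpoint and URN from the example above are taken from the text.

```python
# Sketch: assembling a CTS GetPassage request URL.
# Endpoint and URN are the ones from the example above; the helper
# function cts_request_url is a hypothetical name, not part of CTS.
from urllib.parse import urlencode

CTS_ENDPOINT = "http://cts.informatik.uni-leipzig.de/pbc/cts/"

def cts_request_url(request, **params):
    """Build a CTS request URL from the function name and its parameters."""
    query = urlencode({"request": request, **params})
    return f"{CTS_ENDPOINT}?{query}"

url = cts_request_url(
    "GetPassage",
    urn="urn:cts:pbc:bible.parallel.eng.kingjames:1.3",
)
print(url)
```

Note that urlencode percent-encodes the colons in the URN; CTS servers accept both the encoded and the literal form.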
Further information about CTS can be found here.
The workshop aims to introduce the CTS protocol to new users and to provide the tools to set up individual CTS instances based on prepared data sets. By the end of the first two days, each participant can expect to have a running CTS instance available online.
Once the CTS instances are up and running, participants will learn how text data can be shared with other researchers and cloned between different instances of the system. Various tools and methods will be introduced, including two text-alignment tools, a comprehensive CTS text mining framework, and a workflow for citation analysis.
Programming skills are not required. Graphical management tools for working with CTS instances are available. Working with the text mining framework and the citation analysis workflow requires a basic understanding of command-line terminals (UNIX). Participants will work on pre-prepared virtual machines. Participants are expected to be familiar with TEI/XML markup for digital documents; teaching TEI/XML is not part of the workshop.
Participants may bring their own data sets to the workshop. These documents should be encoded as UTF-8 and use a generic "TEI/XML div-type notation" similar to this example. Other TEI/XML formats will probably also work; non-TEI/XML input is currently not supported. Every participant must make sure that online publication of the texts does not violate license agreements.
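A generic div-type notation of the kind described above typically nests numbered div elements for the structural levels of the text. The following fragment is a hedged sketch of such a layout, not the workshop's exact template; element names follow common TEI practice, and the sample content is invented for illustration.

```xml
<!-- Sketch of a generic TEI div-type layout (illustrative only;
     the workshop's own example may differ in detail). -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata about the document goes here -->
  </teiHeader>
  <text>
    <body>
      <div type="book" n="1">
        <div type="chapter" n="3">
          <div type="verse" n="1">Sample verse text.</div>
        </div>
      </div>
    </body>
  </text>
</TEI>
```

The type and n attributes on each div are what allow a CTS instance to map the document hierarchy onto URN passage references such as 1.3.1.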
Participants will get access to the programs and the freely available data sets that are part of Leipzig's CTS infrastructure, including documents from the Parallel Bible Corpus, the Deutsches Textarchiv, the TED Talk transcripts, and many more, and are invited to use them after the workshop.