"Culture & Technology" European Summer University in Digital Humanities
University of Leipzig

OCR4all – An Open Source Tool Providing a Full OCR Workflow For Creating Digital Corpus From Printed Sources

Why OCR?

A growing number of scholars in the humanities need digital versions of originally printed or handwritten text corpora that match the original with an accuracy of ≥ 99.95 %, or even 100 % in the case of historical-critical digital editions. Until very recently, the standard procedure for reaching this level was ‘double-keying’, in which two people independently transcribe the same text manually and the two versions are subsequently merged. However, a new generation of neural networks now makes optical character recognition (OCR) possible for early modern prints (such as incunabula) as well as for Arabic texts and South Indian scripts. With an accuracy of ≥ 99.97 %, the font-face-specific models deliver results that can also be used for historical-critical digital editions and require only a reasonable amount of post-correction. OCR therefore represents a cheaper alternative to the ‘double-keying’ used so far.
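To make these accuracy thresholds concrete, here is a minimal Python sketch that computes a character accuracy. It assumes accuracy is defined as one minus the edit distance divided by the length of the reference transcription. This is a common convention, not necessarily the metric used by OCR4all, and the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent, relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (1 - errors / len(ground_truth))
```

By this measure, a single wrong character in a passage of 2,000 characters already corresponds to 99.95 %.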

OCR4all

The workshop introduces students of all humanities disciplines to OCR4all, an open source OCR tool developed at the University of Würzburg. With it, individuals and small research groups with little technical experience can perform text recognition with very good to excellent recognition rates for a specific print and independently create digital full texts. The workflow implemented in OCR4all is easy to understand and can be applied independently. Like the workshop itself, it specifically addresses users with little to no IT background, and it combines different tools within a uniform user interface.

Results

The primary material in high-resolution TIFF format (200+ dpi according to DFG Practical Guidelines on Digitisation [12/16]) will remain unchanged. The final results will consist of

  1. correct full text,
  2. platform- and software-independent PAGE-XML files describing the positions of text regions and lines on all images, and
  3. OCR models, specific to a particular print and font face, that can be used for text recognition on other printed texts.
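As a rough illustration of what such PAGE-XML files contain, the following Python sketch reads line polygons and line text from a small, hand-written sample. The sample document, its file name and its ids are invented for this example and are not OCR4all output. The namespace shown is one published version of the PAGE schema, and the code ignores namespaces so that other versions are read as well.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only (not produced by OCR4all).
SAMPLE_PAGE_XML = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="0001.png" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>Erste Zeile</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="100,200 900,200 900,300 100,300"/>
        <TextEquiv><Unicode>Zweite Zeile</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def parse_points(points: str) -> list:
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_lines(page_xml: str) -> list:
    """Collect id, polygon and text of every TextLine element."""
    root = ET.fromstring(page_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        polygon, text = None, ""
        for child in element:  # direct children only
            name = local_name(child.tag)
            if name == "Coords":
                polygon = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        lines.append({"id": element.get("id"), "polygon": polygon, "text": text})
    return lines


for line in read_lines(SAMPLE_PAGE_XML):
    print(line["id"], line["polygon"][0], line["text"])
```

Because positions and text are stored as plain XML, they can be processed with any XML-capable tool, independently of the software that created them.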

After one week, participants will be able to independently create digital text versions from digital images and to evaluate trained OCR models with regard to their recognition rate. To this end, the basic work steps of OCR and their implementation in OCR4all are presented alternately. All contents comply with the requirements of the DFG Practical Guidelines on Digitisation. This conformity makes it easier for participants to apply later for funding of OCR projects in Germany.

Syllabus

Participants carry out the following work steps, independently and under lecturer guidance:

  1. image pre-processing,
  2. segmentation of regions,
  3. automatic segmentation of lines,
  4. manual creation of ground truth,
  5. text recognition, and
  6. text output.

Best preparation—highest possible benefit

This one-week workshop will explicitly support you in the practical implementation of your own digitization project. This works best if:

  1. you already have your own images at hand, ideally uncompressed, true-colour (i.e. RGB, 8 bit) TIFF files with a resolution of 200+ dpi and a maximum size of 50 MB per image,
  2. the text to be digitized comprises at least 200 lines of 20 characters each, and
  3. when working with a text set in neither Roman nor Gothic typeface, you have already manually transcribed 50 lines of 20 characters each into a text file.

Alternatively, digital images and ground truth will be made available.
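If you bring your own transcription, a quick self-check along the following lines may help. This Python sketch counts the lines of a transcription that reach a minimum length. It assumes that "50 lines of 20 characters each" is meant as a lower bound on both values, which is an interpretation rather than something the workshop description states. The function names are illustrative.

```python
def count_usable_lines(text: str, min_chars: int = 20) -> int:
    """Count lines that contain at least min_chars characters after trimming."""
    return sum(1 for line in text.splitlines() if len(line.strip()) >= min_chars)


def enough_ground_truth(text: str, min_lines: int = 50, min_chars: int = 20) -> bool:
    """True if the transcription has at least min_lines usable lines."""
    return count_usable_lines(text, min_chars) >= min_lines
```

For a transcription stored in a text file, read the file as UTF-8 and pass its contents to `enough_ground_truth`.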

Requirements

No previous knowledge is required. You will need your own laptop with at least 8 GB RAM, 20 GB of free hard disk or SSD space, and a quad-core processor, as well as a current browser. OCR4all runs on Windows, macOS and UNIX operating systems. Participants will be informed about the details of software installation before the event starts.

This one-week workshop will be repeated in the second week with identical content.
