From analog text to e-book

Pilot project “Architektur als Ideologie”

The text is a lecture my mother gave in Amsterdam in 1968.

1 — Scan the text

I cheated: I already had scans of the 15 typewritten pages, thanks to Het Nieuwe Instituut.

2 — Clean the scans

With Apple Preview I enhanced the contrast of the original scans. Any graphic editor will do the job here.

Original scan on the left. Enhanced text on the right
Leveling up the contrast for better readability for humans and bots

3 — Create a searchable PDF

My next step was to create a PDF containing all the pages with selectable text layered on top:

Selectable text on top of images in a PDF
# Create a searchable PDF from an image (writes 01.pdf)
$ tesseract 01.jpg 01 -l deu pdf
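To cover all 15 pages in one run, Tesseract can also read a plain text file that lists one image path per line and write a single multi-page PDF. The following is a minimal sketch, not the exact commands used for this project: the scans directory, pages.txt, the output name lecture and the helper make_page_list are assumptions for illustration.

```shell
# Hypothetical helper: list all .jpg pages of a directory, one path per line.
# The shell expands the glob in sorted order, so 01.jpg comes before 02.jpg.
make_page_list() {
  scan_dir=$1
  list_file=$2
  : > "$list_file"                      # start with an empty list
  for page in "$scan_dir"/*.jpg; do
    [ -e "$page" ] || continue          # skip the unexpanded glob if no pages exist
    printf '%s\n' "$page" >> "$list_file"
  done
}

# Assumed layout: scans/01.jpg ... scans/15.jpg
#   make_page_list scans pages.txt
#   tesseract pages.txt lecture -l deu pdf    # writes lecture.pdf
```

The two commented lines at the end show the intended usage; they are left as comments because they need the actual scans and an installed Tesseract with the German language data.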

4 — Create a Markdown source file

I ran Tesseract again, this time to create plain text files.

# Create text from an image (writes text.txt)
$ tesseract 01.jpg text -l deu
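For more than one page, the same call can be wrapped in a loop and the resulting page texts joined into a single Markdown source. This is a rough sketch under assumptions: the helper names ocr_pages and join_pages, the directory layout and the file name lecture.md are made up for illustration.

```shell
# Hypothetical helper: OCR every page to a .txt file next to its image.
ocr_pages() {
  for page in "$1"/*.jpg; do
    [ -e "$page" ] || continue
    tesseract "$page" "${page%.jpg}" -l deu   # 01.jpg -> 01.txt
  done
}

# Hypothetical helper: join the page texts, in sorted order, into one file.
# An empty line is written after each page so paragraphs do not run together.
join_pages() {
  text_dir=$1
  markdown_file=$2
  : > "$markdown_file"
  for text in "$text_dir"/*.txt; do
    [ -e "$text" ] || continue
    cat "$text" >> "$markdown_file"
    printf '\n' >> "$markdown_file"
  done
}

# Assumed usage:
#   ocr_pages scans
#   join_pages scans lecture.md
```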

5 — Clean up the OCR text

Now the manual labor began. OCR is never perfect, so there were a lot of small typos to correct by hand. I did this in my text editor, VS Code. A spell checker helped me catch most misspelled words. The most common issue was a “c” instead of an “e”.

Manual word un-breaking in progress
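Part of the word un-breaking could probably be scripted. The sketch below is not what was used here (the work was done by hand in the editor); it assumes that every line ending in a hyphen continues on the next line. That assumption also merges genuine hyphenated compounds that happen to break at the hyphen, so the result still needs proofreading. The function name unbreak_words is hypothetical.

```shell
# Hypothetical helper: join words that were hyphenated across a line break.
# Reads text on stdin and writes the joined text to stdout.
unbreak_words() {
  awk '
    /-$/ { sub(/-$/, ""); printf "%s", $0; next }  # drop hyphen, keep line open
    { print }                                        # normal line: print as is
  '
}

# Assumed usage:
#   unbreak_words < 01.txt > 01.joined.txt
```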

6 — Create the final output formats

After all the hard work, I was ready to create the final output formats. The popular Pandoc tool helped: it takes the Markdown text file as its source and creates an .epub file. I also created a more print-friendly .pdf file.
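The two conversions might look roughly like this. The exact options used for this project are not given, so the sketch sticks to Pandoc's basic input and -o output arguments. The file name lecture.md, the helpers run and build_book and the DRY_RUN switch are assumptions for illustration. Pandoc builds PDFs through a LaTeX engine by default, so one needs to be installed for the second command.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: build an EPUB and a PDF next to the Markdown source.
build_book() {
  src=$1
  base=${src%.md}                     # lecture.md -> lecture
  run pandoc "$src" -o "$base.epub"
  run pandoc "$src" -o "$base.pdf"
}

# Assumed usage:
#   build_book lecture.md
```

Setting DRY_RUN=1 before calling build_book only prints the two pandoc commands, which is handy for checking the file names first.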

7 — Publish on GitHub

The final step was to publish the text in its new format. I pushed the whole project, containing the scans, the Markdown and the e-book file, to GitHub. It is published under a Creative Commons license. Now the text is accessible for future thinkers and researchers.

Obstacles

Such a project is fun for the issues you encounter along the way:

Dealing with text from before the orthography reform

In 1996, new orthography rules were introduced for German. The spell checker only knows the new ones, so I fed it all the old spellings via a workspace dictionary. Hopefully there is a smarter way.

Setting typographic quotation marks

Quoted text in German begins with a „ and ends with a “ character. I saw no option in Pandoc to change that. It does translate dumb quotation marks to smart quotation marks, though. I did not follow up on this any further.
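One possible workaround, which I did not use for this project, is to set the German marks in the Markdown source before Pandoc sees it. The sketch below assumes that every quotation opens and closes on the same line with straight double quotes; anything else is left untouched and needs a manual pass. The function name german_quotes is hypothetical.

```shell
# Hypothetical helper: turn "..." into „...“ when both quotes are on one line.
# Reads text on stdin and writes the converted text to stdout.
german_quotes() {
  sed 's/"\([^"]*\)"/„\1“/g'
}

# Assumed usage:
#   german_quotes < lecture.md > lecture.quoted.md
```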

Using CSS to hide metadata output

I enriched the book with some front matter metadata attributes. Pandoc prints these by default, but without labels, so I decided to hide them with CSS.

Battling with UTF-16 Gremlins

My first .epub files were corrupted. I used Apple Books to check the result. It reported an issue and could not fully render the book, but the error message was not very clear to me. I then turned to validator.idpf.org, which uses EPUBCheck, and this time the error message was clearer: it pointed directly to an invalid character.
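A quick local check before building the e-book could catch this kind of problem earlier. The sketch below only tests whether a source file is valid UTF-8 by running it through iconv. It is an assumption that this would have flagged the character in question, and the function name is_valid_utf8 and the file name lecture.md are hypothetical.

```shell
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it hits an invalid byte sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Assumed usage:
#   is_valid_utf8 lecture.md || echo "lecture.md is not valid UTF-8"
```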

Next project “Die Natur der Stadt”

I have already started the next, much bigger project. My aim is to digitize a whole book of 240 pages following the same procedure. This book was scanned by the Google Books project, but it is not accessible.

Frank Lämmer
