From analog text to e-book
My mother, Heide Berndt, was a professor working in sociology and architecture. I set myself the task of digitizing some of her texts. This article documents my experience working with specific free and open source software.
Here are the steps I took to create an e-book from a physical source of text:
Pilot project “Architektur als Ideologie”
The text is a lecture my mother gave in Amsterdam in 1968.
1 — Scan the text
I cheated: I already had scans of 15 typewriter pages, thanks to Het Nieuwe Instituut.
2 — Clean the scans
With Apple Preview I enhanced the contrast of the original scans. Any graphics editor will do the job here.
3 — Create a searchable PDF
My next step was to create a PDF containing all the pages with selectable text layered on top:
tesseract is an Optical Character Recognition (OCR) command-line tool. I installed the base program and the German language pack (the text is in German) via Homebrew.
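On macOS the installation might look like this (a sketch on my side; `tesseract-lang` is the Homebrew formula that bundles the additional language data, German included):

```
# Install the OCR engine plus the extra language packs
$ brew install tesseract tesseract-lang

# Verify that German (“deu”) is available
$ tesseract --list-langs
```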
# Create a searchable PDF from an image (writes 01.pdf)
$ tesseract -l deu 01.jpg 01 pdf
In the next step I merged the individual PDF pages in Apple Preview. There are also command-line tools for this.
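One command-line option is pdfunite from the Poppler utilities (my choice of tool here is just one possibility, and the file names are placeholders):

```
# Install Poppler's PDF utilities
$ brew install poppler

# Merge the per-page PDFs, in page order, into one file
$ pdfunite 01.pdf 02.pdf 03.pdf buch.pdf
```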
4 — Create a Markdown source file
I ran tesseract again, this time to create plain text files.
# Create plain text from an image (writes 01.txt)
$ tesseract -l deu 01.jpg 01 txt
I then concatenated all the text files into one long file in Markdown format.
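The concatenation itself can be scripted. A minimal sketch, using stand-in files since the real inputs are tesseract's per-page outputs (all file names are placeholders):

```shell
# Stand-in files; in the real project these are tesseract's 01.txt, 02.txt, ...
printf 'Seite eins\n' > 01.txt
printf 'Seite zwei\n' > 02.txt

# Zero-padded names sort in page order, so a simple glob is enough
cat [0-9][0-9].txt > buch.md
```

The zero-padding of the page numbers is what makes the plain glob safe here.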
5 — Clean up the OCR text
Now the manual labor began. OCR is never perfect, so there were a lot of small typos to correct by hand. I did this in my text editor, Visual Studio Code. A spell checker helped me catch most misspelled words. The most common issue was a “c” in place of an “e”.
Sometimes the “recognized” words were not readable at all. Here my searchable PDF came in handy: by searching for the gibberish string I could quickly jump to its visual representation in the scan.
Even now, the text still contains many typos, but I consider it good enough.
6 — Create the final output formats
After all the hard work, I was ready to create the final output formats. The popular Pandoc tool helped: it takes the Markdown file as the source and creates an .epub file. In addition, it also produced a more print-friendly .pdf file.
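The Pandoc calls might look like this (the file names and metadata value are assumptions; the PDF route additionally needs a LaTeX engine installed):

```
# Markdown -> EPUB; EPUBs want a title in their metadata
$ pandoc buch.md --metadata title="Architektur als Ideologie" -o buch.epub

# Markdown -> print-friendly PDF (Pandoc drives LaTeX under the hood)
$ pandoc buch.md -o buch.pdf
```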
7 — Publish on GitHub
The final step was to publish the text in its new format. I pushed the whole project, containing the scans, the Markdown and the e-book file, to GitHub. It is published under a Creative Commons license. Now the text is accessible to future thinkers and researchers:
Such a project is fun because of the issues you encounter along the way:
Dealing with text from before the orthography reform
In 1996, a reform of German orthography took effect, and the spell checker only knows the new rules. So I fed all the affected old spellings to the spell checker via a workspace dictionary. There is hopefully a smarter way.
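Assuming the Code Spell Checker extension for Visual Studio Code, such a workspace dictionary is just a word list in .vscode/settings.json; the entries below are example pre-reform spellings (e.g. „daß“ for today's „dass“):

```json
{
  "cSpell.language": "de",
  "cSpell.words": [
    "daß",
    "muß",
    "Einfluß"
  ]
}
```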
Setting typographic quotation marks
Quoted text in German begins with a „ and ends with a “ character. I saw no option in Pandoc to change that. It does translate dumb quotation marks into smart quotation marks, though. I did not follow up on this any further.
Using CSS to hide meta data output
I enriched the book with some front matter metadata attributes. Pandoc prints these by default, but without labels, so I decided to hide them with CSS.
Battling with UTF-16 Gremlins
My first .epub files were corrupted. I used Apple's Books.app to check the result; it showed that there was some issue and that the file could not be fully rendered, but the error message was not very clear to me. I then turned to validator.idpf.org, which makes use of EPUBcheck, and this time the error message was much clearer: it pointed directly to an invalid character.
To dig into a .epub manually, you can simply change the file extension to .zip and unzip it. Inside is a predefined directory structure; in my case the content sat in just one HTML file. Finding the character was simple: it was some white-space character that the OCR scan had made up. With the surrounding context known, I was able to remove these from the source files as well.
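EPUBcheck can also be run locally, which shortens the feedback loop (assuming Homebrew; the file name is a placeholder):

```
# Validate locally instead of uploading to validator.idpf.org
$ brew install epubcheck
$ epubcheck buch.epub

# Dig in manually: an .epub is just a ZIP archive
$ cp buch.epub buch.zip
$ unzip buch.zip -d buch/
```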
Tinkering within the .epub was the most fun. I am an avid .epub consumer; I like that they can be so small, so versatile and dynamic. The HTML structure and the CSS setup felt convenient, but also a bit antiquated. How long will that format last? The original text was produced over 50 years ago.
Next project “Die Natur der Stadt”
I have already started the next, much bigger, project. My aim is to digitize a whole book of 240 pages following the same procedure. This book was scanned by the Google Books project, but it is not accessible.
This time I also need to do the scanning myself. Luckily, there is scanning software that can detect the curvature of book pages and create the text and the searchable PDF right away, all on a phone.
Manual post-processing of the text will be work for many weekends. Let’s see. Volunteers welcome!