Document Digitization

Document scanners offer the most efficient, fastest and highest-quality digitization. These models scan both sides of an A3+ document (up to 30.7 cm wide) simultaneously, at high speed and in high quality. The roller system and scanning technology are extremely gentle, so we can safely handle poor-quality, fragmented, torn or even strongly acidic pages. We often encounter very long documents, up to 1 meter in length; our professional devices handle them without any problem. Output formats can be configured flexibly, from 200 DPI black-and-white scans to 600 DPI uncompressed TIFF.
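To give a feel for what the two ends of that output range mean in practice, the following minimal Python sketch estimates the pixel dimensions and raw (uncompressed) data size of a single A3 page. The A3 dimensions come from ISO 216; the 24-bit colour depth for the 600 DPI case and the helper names `pixel_size` and `uncompressed_bytes` are assumptions made for illustration, not settings stated above.

```python
MM_PER_INCH = 25.4
A3_MM = (297, 420)  # ISO 216 A3, width x height in millimetres


def pixel_size(width_mm, height_mm, dpi):
    """Return (width_px, height_px) of a page scanned at the given DPI."""
    return (round(width_mm / MM_PER_INCH * dpi),
            round(height_mm / MM_PER_INCH * dpi))


def uncompressed_bytes(width_px, height_px, bits_per_pixel):
    """Raw image data size; each row is padded to a whole number of bytes."""
    bytes_per_row = (width_px * bits_per_pixel + 7) // 8
    return bytes_per_row * height_px


# The 24-bit depth is an assumed value; the text only specifies the DPI.
for label, dpi, bpp in [("200 DPI, 1-bit black and white", 200, 1),
                        ("600 DPI, 24-bit colour (assumed depth)", 600, 24)]:
    w, h = pixel_size(*A3_MM, dpi)
    size = uncompressed_bytes(w, h, bpp)
    print(f"{label}: {w} x {h} px, about {size / 2**20:.1f} MiB of raw data")
```

The estimate covers image data only; a real TIFF file adds a small amount of header and tag overhead on top of it.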

The next step in processing printed documents is text recognition (OCR), in which the text shown in the scanned image is converted into machine-readable text. Today's software is highly efficient and accurate: text recognition of a 19th-century print reaches 98-99% accuracy, and for high-quality prints it can reach 99.5%. The result of automatic text recognition is the so-called double-layer PDF, in which the top layer is the scanned image and the lower layer is the text itself. This way the user sees the authentic image while the search runs on the text.
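The text does not say how these accuracy figures are measured. One common convention is character accuracy: one minus the edit distance between the OCR output and a hand-checked reference, divided by the length of the reference. The sketch below illustrates that convention only; the function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Share of reference characters recognised correctly (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, ocr_output) / len(reference)


reference = "Budapest, 1867"   # hypothetical hand-checked transcription
ocr_output = "Budapcst, 1867"  # hypothetical OCR result with one wrong character
print(f"{character_accuracy(reference, ocr_output):.1%}")
```

Under this convention, 99% accuracy corresponds to roughly one wrong character in every hundred.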

We insert bookmarks into the double-layer PDFs; a bookmark can be a title, author, date, year or the title of a book chapter. The result is a standard double-layer PDF that is suitable for publishing on the Internet.

To publish double-layer PDFs we use software developed in-house that provides sophisticated, high-speed full-text search, navigation between search hits, and display and highlighting of results. Users can search with logical operators (AND, OR, NOT) and proximity operators (two or more words next to each other), and can truncate a search word on the right, on the left or inside the word. To present the PDF pages we have developed our own viewer, which can also highlight results, scale pages and download them.
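The search features described above can be illustrated with a small Python sketch. It is not Arcanum's software, whose query syntax and implementation are not described here. The function names, the use of `*` as the truncation character and the sample sentence are all assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def positions(tokens, pattern):
    """Indexes of tokens matching a pattern; '*' marks truncation."""
    pattern = pattern.lower()
    return [i for i, tok in enumerate(tokens) if fnmatchcase(tok, pattern)]


def contains(tokens, pattern):
    """True if at least one token matches the pattern."""
    return bool(positions(tokens, pattern))


def near(tokens, first, second, max_gap=1):
    """True if a match of `second` follows a match of `first`
    within max_gap tokens (max_gap=1 means directly adjacent)."""
    second_pos = positions(tokens, second)
    return any(0 < q - p <= max_gap
               for p in positions(tokens, first) for q in second_pos)


# Hypothetical page text, used only to exercise the functions.
page = tokenize("The Hungarian Academy of Sciences published its yearbook in 1867.")

# Logical operators map onto Python's and / or / not.
print(contains(page, "academy") and not contains(page, "museum"))
# Truncation on the right, on the left, and inside the word.
print(contains(page, "publish*"), contains(page, "*book"), contains(page, "hung*an"))
# Proximity: "Hungarian" immediately followed by "Academy".
print(near(page, "hungarian", "academy"))
```

This sketch scans one page's tokens linearly and treats proximity as ordered. A production system would typically use a prebuilt positional index to reach high search speed across millions of pages.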

Arcanum’s production technology and equipment are suitable for digitizing and recognizing the text of documents of any type, size and quality, and for publishing the resulting double-layer PDFs online through a fast and sophisticated search and display system.

Applications

Arcanum Digitheca (http://adtplus.arcanum.hu): hundreds of printed Hungarian scientific and official journals, newspapers, weekly magazines and books, growing by about 4-5 million new pages a year.

The HUNGARICANA Public Library and Document Library (http://library.hungaricana.hu) contains publications held and digitized by 100 public collections: documents, archives, museum yearbooks, printed archives, school notices, etc.