

The project is currently funded by the National Endowment for the Humanities (NEH) for $75,000 for the period of 2019. Marek Rychlik is the Principal Investigator. Other project staff include Raymundo Navarette, Dwight Nwaigwe, and Sayyed Mohsen Vazirzade. The current team started studying the OCR problem in Fall 2017. Marek approached Yan about the Afghan Digital Collection, the largest digital repository related to Afghanistan and related areas, at the University of Arizona Libraries. Yan Han and his colleague Atifa Rawan have been working with ACKU for the past 11 years on providing access to and preservation of Afghanistan materials. To date, the Afghanistan Digital Collection provides preservation of and access to an extensive collection of Afghan-related materials (1.7+ million pages) and has recorded 500k pageviews since it first became available in 2010. In 2017, the project team started a regular seminar devoted to OCR, with emphasis on Pashto, Persian, and Traditional Chinese. We surveyed the available OCR systems and the most successful algorithms.

Many implementations of OCR software exist, both commercial and open source, but they do not produce useful results for Traditional Chinese, Pashto, and Persian literature. An accuracy rate of 90% or higher is the threshold for OCR to be useful. This package contains an OCR engine specifically for Pashto, Persian, and Traditional Chinese. Currently it consists of three major components: segmentation, layout analysis, and OCR. Its OCR function uses certain parts of the latest Tesseract OCR (https:\///tesseract-ocr/tesseract) for character recognition. Some of the most spectacular failures of OCR software result not from an inability to recognize characters but from an inability to perform accurate layout analysis, that is, to identify text regions, images, and text direction. The proposed advanced implementation will be a next generation of OCR software capable of handling complex layouts. Eventually, it should help build toward a large-scale, open-source, global language and culture data bank.
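The 90% usefulness threshold can be made concrete with a character-level accuracy measure. The sketch below is only an illustration, not the project's evaluation code: it assumes accuracy is defined as one minus the Levenshtein edit distance divided by the reference length, and the names `edit_distance`, `character_accuracy`, `is_useful`, and `USEFUL_THRESHOLD` are hypothetical.

```python
USEFUL_THRESHOLD = 0.90  # assumed reading of the "90%+" threshold in the text


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_accuracy(ref: str, hyp: str) -> float:
    """Assumed definition: 1 - edit_distance / len(ref), clamped at 0."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return max(0.0, 1.0 - edit_distance(ref, hyp) / len(ref))


def is_useful(ref: str, hyp: str, threshold: float = USEFUL_THRESHOLD) -> bool:
    """True when the OCR output meets the assumed usefulness threshold."""
    return character_accuracy(ref, hyp) >= threshold
```

Python strings are sequences of Unicode code points, so the same functions apply to Pashto, Persian, and Chinese text; note that combining marks count as separate characters under this definition.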

The goal is to build a next generation of OCR technology.

This paper proposes a method for handwritten Chinese/Japanese text (character string) recognition based on semi-Markov conditional random fields (semi-CRFs). The high-order semi-CRF model is defined on a lattice containing all possible segmentation-recognition hypotheses of a string, and it fuses the scores of candidate character recognition with the compatibilities of geometric and linguistic contexts by representing them in the feature functions. Based on given models of character recognition and compatibilities, the fusion parameters are optimized by minimizing the negative log-likelihood loss with a margin term on a training string sample set. A forward-backward lattice pruning algorithm is proposed to reduce the computation in training when trigram language models are used, and beam search techniques are investigated to accelerate decoding.

We evaluate the performance of the proposed method on unconstrained online handwritten text lines of three databases. On the test sets of databases CASIA-OLHWDB (Chinese) and TUAT Kondate (Japanese), the character-level correct rates are 95.20 and 95.44 percent, and the accurate rates are 94.54 and 94.55 percent, respectively. On the test set (online handwritten texts) of the ICDAR 2011 Chinese handwriting recognition competition, the proposed method outperforms the best system in the competition.
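To illustrate the idea of a segmentation-recognition lattice with fused scores, here is a minimal sketch of beam-search decoding. It is not the paper's implementation: the `Edge` fields, the `weights` dictionary, the bigram table, and the `unseen_logp` fallback are all assumptions made for the example, the fusion weights are set by hand rather than learned by minimizing a negative log-likelihood loss, and a bigram context stands in for the trigram models the abstract mentions. Because only the best path is needed, the normalization term of the semi-CRF is skipped.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One candidate character spanning segmentation points start..end."""
    start: int
    end: int
    label: str
    rec_score: float  # assumed: log-score from a character classifier
    geo_score: float  # assumed: log-score for geometric compatibility


def beam_decode(edges, n_nodes, weights, bigram_logp,
                beam_width=3, unseen_logp=-5.0):
    """Return (score, labels) of the best path from node 0 to node n_nodes - 1.

    Each step adds a weighted sum of recognition, geometric, and linguistic
    scores. Only the beam_width best partial paths are extended at each node,
    so the search is approximate.
    """
    outgoing = defaultdict(list)
    for e in edges:
        if not 0 <= e.start < e.end < n_nodes:
            raise ValueError(f"edge must point forward inside the lattice: {e}")
        outgoing[e.start].append(e)

    # beams[node] holds (score, label history) for partial paths ending there.
    beams = defaultdict(list)
    beams[0] = [(0.0, ("<s>",))]
    for node in range(n_nodes - 1):  # increasing order is a topological order
        best = sorted(beams[node], key=lambda h: h[0], reverse=True)[:beam_width]
        for score, labels in best:
            for e in outgoing[node]:
                ling = bigram_logp.get((labels[-1], e.label), unseen_logp)
                step = (weights["rec"] * e.rec_score
                        + weights["geo"] * e.geo_score
                        + weights["ling"] * ling)
                beams[e.end].append((score + step, labels + (e.label,)))

    final = beams[n_nodes - 1]
    if not final:
        return None  # no complete segmentation reaches the last node
    best_score, best_labels = max(final, key=lambda h: h[0])
    return best_score, list(best_labels[1:])  # drop the start symbol


# Two primitive segments that read either as one character or as two.
example_edges = [
    Edge(0, 1, "日", rec_score=-0.4, geo_score=-0.9),
    Edge(1, 2, "月", rec_score=-0.5, geo_score=-0.8),
    Edge(0, 2, "明", rec_score=-0.3, geo_score=-0.1),
]
example_weights = {"rec": 1.0, "geo": 1.0, "ling": 1.0}
example_bigrams = {("<s>", "明"): -1.0, ("<s>", "日"): -1.2, ("日", "月"): -2.5}
example_result = beam_decode(example_edges, 3, example_weights, example_bigrams)
```

In this toy lattice the single-character hypothesis wins because its combined recognition, geometric, and linguistic score is higher than the sum over the two-character path. Changing the hand-set weights shifts that balance, which is the role the learned fusion parameters play in the method described above.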
