DIA IMPRINT

New: Introducing interactive demos for document processing

This project investigates a set of sub-problems related to the recognition and retrieval of degraded and challenging document images in Indian languages. Traditionally, the recognition problem is called OCR. However, OCR systems are reliable only when the document is printed and reasonably clean. Many practically important documents in the Indian context (such as massive collections of manuscripts available in courts, historical newspaper articles, and handwritten notes of freedom fighters) vary in print style and are affected by ageing-related noise and varying scan settings.

We focus on content-aware image processing algorithms for robust and efficient recognition and retrieval from Indian language document images. Our image processing algorithms aim to improve the quality of document images by removing noise and low-resolution artifacts through content-aware operations. We also develop recognizers for handwritten Indian language text using state-of-the-art machine learning techniques such as deep learning. In this project, we specifically work on:

pre-processing, super-resolution and clean up
annotations, tools and data creation
recognition of handwriting
post-processing and accuracy enhancements
features and matching
retrieval from a collection.

Some of the results and publications for this project have been added here.

Information Access from Document Images of Indian Languages
