Benchmark Portal

Design of English-Hindi Translation Memory for Efficient Translation

Nisheeth Joshi, Iti Mathur

Developing parallel corpora is an important and difficult activity for Machine Translation. It requires manual annotation by human translators, and translating the same text again is wasted effort. Tools that avoid this repetition are available for European languages, but no such tool is available for Indian languages. In this paper we present a tool for Indian languages that not only retrieves previously available translations automatically but also, where a sentence has multiple translations, presents them as a ranked list of suggested translations. Moreover, the tool lets translators save their work both globally and locally so that they may share it with others, which further lightens the task.
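The ranked-suggestion idea can be illustrated with a minimal sketch. The abstract does not say how candidate translations are scored, so the similarity measure (the ratio from Python's `difflib.SequenceMatcher`), the `TranslationMemory` class, and the `threshold` parameter below are all assumptions made for illustration, not the authors' method.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: source sentences with one or more translations."""

    def __init__(self):
        # Maps a source sentence to the list of translations stored for it.
        self.entries = {}

    def add(self, source, target):
        """Store a translation, keeping multiple translations per source."""
        translations = self.entries.setdefault(source, [])
        if target not in translations:
            translations.append(target)

    def suggest(self, query, threshold=0.6):
        """Return (score, source, target) tuples, best match first.

        The score is a character-level similarity ratio in [0, 1]. This
        choice is an assumption, not the ranking used in the paper.
        """
        ranked = []
        for source, translations in self.entries.items():
            score = SequenceMatcher(None, query.lower(), source.lower()).ratio()
            if score >= threshold:
                for target in translations:
                    ranked.append((score, source, target))
        # Descending score; the sort is stable, so ties keep insertion order.
        ranked.sort(key=lambda item: -item[0])
        return ranked
```

A sentence stored with two translations yields both of them from `suggest`, each with a score of 1.0, while sentences whose similarity falls below `threshold` are left out of the list.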

Real-time scene text localization and recognition

Lukáš Neumann, Jiří Matas

An end-to-end real-time scene text localization and recognition method is presented. The real-time performance is achieved by posing the character detection problem as an efficient sequential selection from the set of Extremal Regions (ERs). The ER detector is robust to blur, illumination, color and texture variation and handles low-contrast text. In the first classification stage, the probability of each ER being a character is estimated using novel features calculated with O(1) complexity per region tested. Only ERs with locally maximal probability are selected for the second stage, where the classification is improved using more computationally expensive features. A highly efficient exhaustive search with feedback loops is then applied to group ERs into words and to select the most probable character segmentation. Finally, text is recognized in an OCR stage trained using synthetic fonts. The method was evaluated on two public datasets. On the ICDAR 2011 dataset, the method achieves state-of-the-art text localization results amongst published methods and it is the first one to report results for end-to-end text recognition. On the more challenging Street View Text dataset, the method achieves state-of-the-art recall. The robustness of the proposed method against noise and low contrast of characters is demonstrated by “false positives” caused by detected watermark text in the dataset.
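The hand-off between the two classification stages, where only regions with locally maximal probability move on, can be sketched as follows. This is a simplified illustration under stated assumptions: the input is taken to be the first-stage probabilities of one nested sequence of regions ordered by threshold, and the function name and the `min_prob` cut-off are hypothetical. The abstract does not give the exact selection criteria used in the paper.

```python
def select_local_maxima(probabilities, min_prob=0.2):
    """Return indices whose probability is a strict local maximum.

    `probabilities[i]` is the estimated character probability of the region
    obtained at the i-th threshold of one nested sequence of regions.
    `min_prob` is a hypothetical cut-off that discards weak maxima.
    """
    selected = []
    n = len(probabilities)
    for i, p in enumerate(probabilities):
        # Treat positions outside the sequence as infinitely low.
        left = probabilities[i - 1] if i > 0 else float("-inf")
        right = probabilities[i + 1] if i < n - 1 else float("-inf")
        # Strict comparison on both sides: plateaus are not selected.
        if p > left and p > right and p >= min_prob:
            selected.append(i)
    return selected
```

Only the selected indices would then be passed to the more expensive second-stage classifier, which is what keeps the overall cost low.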

Non-equilibrium thermodynamics. IV: Generalization of Maxwell, Clausius-Clapeyron and Response Functions Relations, and the Prigogine-Defay Ratio for Systems in Internal Equilibrium

P. D. Gujrati, P. P. Aung

We follow the consequences of internal equilibrium in non-equilibrium systems, introduced recently [Phys. Rev. E 81, 051130 (2010)], to obtain the generalization of Maxwell's relation and the Clausius-Clapeyron relation that are normally given for equilibrium systems. The use of Jacobians allows for a more compact way to address the generalized Maxwell relations; the latter are available for any number of internal variables. The Clausius-Clapeyron relation in the subspace of observables shows not only the non-equilibrium modification but also the modification due to internal variables that play a dominant role in glasses. Real systems do not turn directly from the supercooled liquid state L into glasses (GL), which are frozen structures; there is an intermediate state (gL) where the internal variables are not frozen. Thus, there is no single glass transition. A system possesses several kinds of glass transitions, some conventional (L \rightarrow gL; gL \rightarrow GL) in which the state changes continuously and the transition mimics a continuous or second order transition, and some apparent (L \rightarrow gL; L \rightarrow GL) in which the free energies are discontinuous so that the transition appears as a zeroth order transition, as discussed in the text. We evaluate the Prigogine-Defay ratio {\Pi} in the subspace of the observables at these transitions. We find that it is normally different from 1, except at the conventional transition L \rightarrow gL, where {\Pi}=1 regardless of the number of internal variables.
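For orientation, the Prigogine-Defay ratio is conventionally defined from the discontinuities of the response functions across the transition. The form below is the standard textbook definition, given here only as background. The abstract does not state the generalized expression that the authors evaluate in the subspace of observables, which may differ.

```latex
\Pi = \frac{\Delta C_P \, \Delta K_T}{V \, T \, (\Delta \alpha_P)^2}
```

Here \Delta C_P, \Delta K_T and \Delta \alpha_P are the jumps in the isobaric heat capacity, the isothermal compressibility and the isobaric expansion coefficient, and V and T are the volume and temperature at the transition. For an equilibrium second order transition the two Ehrenfest relations hold simultaneously, which gives {\Pi}=1.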

Non-equilibrium thermodynamics. III. Thermodynamic Principles, Entropy Continuity during Component Confinement, Energy Gap and the Residual Entropy

P. D. Gujrati

To investigate the consequences of component confinement such as at a glass transition and the well-known energy or enthalpy gap (between the glass and the perfect crystal at absolute zero, see text), we follow our previous approach [Phys. Rev. E 81, 051130 (2010)] of using the second law applied to an isolated system {\Sigma}_0 consisting of the homogeneous system {\Sigma} and the medium \widetilde{\Sigma}. We establish on general grounds the continuity of the Gibbs free energy G(t) of {\Sigma} as a function of time at fixed temperature and pressure of the medium. It immediately follows from this and the observed continuity of the enthalpy during component confinement that the entropy S of the open system {\Sigma} must remain continuous during a component confinement such as at a glass transition. We use these continuity properties and the recently developed non-equilibrium thermodynamics to formulate thermodynamic principles of additivity, reproducibility, continuity and stability that must also apply to non-equilibrium systems in internal equilibrium. We find that the irreversibility during a glass transition only justifies the residual entropy S_{R} being at least as large as that determined by disregarding the irreversibility, a common practice in the field. This justifies a non-zero residual entropy S_{R} in glasses, which is also in accordance with the energy or enthalpy gap at absolute zero. We develop a statistical formulation of the entropy of a non-equilibrium system, which results in the continuity of entropy during component confinement in accordance with the second law and sheds light on the mystery behind the residual entropy, which is consistent with the recent conclusion [Symmetry 2, 1201 (2010)] drawn by us.

A Framework for Devanagari Script-based Captcha

Sushma Yalamanchili, M. Kameswara Rao

Human Interactive Proofs (HIPs) are automatic reverse Turing tests designed to distinguish between various groups of users. The Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) is a HIP system that distinguishes between humans and malicious computer programs. Many CAPTCHAs have been proposed in the literature that are text-graphical based, audio-based, puzzle-based and mathematical questions-based. The design and implementation of CAPTCHAs fall in the realm of Artificial Intelligence. We aim to utilize CAPTCHAs as a tool to improve the security of Internet-based applications. In this paper we present a framework for a text-based CAPTCHA based on Devanagari script which can exploit the difference in reading proficiency between humans and computer programs. Our selection of a Devanagari script-based CAPTCHA is based on the fact that the script is used by a large number of Indian languages, including Hindi, which is the third most spoken language. There is potential for an exponential rise in the applications that are likely to be developed in that script, and the framework makes it easy to secure such Indian language based applications.

Segmentation of Degraded Malayalam Words: Methods and Evaluation

Devendra Sachan, Shrey Dutta, T.S. Naveen, C.V. Jawahar

In most Optical Character Recognition software, a substantial percentage of errors are caused by the incorrect segmentation of degraded words. This is especially true for recognizing old books, newspapers and historical manuscripts. In this paper, we propose multiple segmentation methods which address the problem of cuts and merges in degraded words. We have created an annotated dataset of 1034 word images with pixel level ground truth for quantitative evaluation of the methods. We compare the methods with a baseline implementation based on connected component analysis. We report substantial improvement in accuracy both at character and at word level.

Keywords: Character Segmentation; Degradation Correction; Malayalam; Indian Language
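The connected component baseline mentioned above can be sketched as follows. This is a generic illustration, not the authors' implementation: the 8-connectivity, the list-of-lists image format and the function names are assumptions. It also shows why degradation hurts such a baseline: a cut splits one character into several components, and a merge fuses neighbouring characters into one.

```python
from collections import deque


def connected_components(image):
    """Label 8-connected foreground components in a binary image.

    `image` is a list of rows containing 0 (background) or 1 (ink).
    Returns a list of components, each a list of (row, col) pixels.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_boxes(components):
    """Return (top, left, bottom, right) per component, sorted left to right."""
    boxes = []
    for pixels in components:
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return sorted(boxes, key=lambda box: box[1])
```

Each bounding box is then treated as one candidate character, which is only correct when every character is a single, isolated blob of ink.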

An Efficient Character Recognition System for Handwritten Malayalam Characters Based on Intensity Variations

Abdul Rahiman M, Rajasree M S

People start learning to read and write during the early stages of education, and as the years pass they acquire good reading and writing skills. Most people then have no difficulty reading printed or handwritten characters of any kind: light or heavy print, upside-down print, print in different fonts and styles, and handwriting whether neat or sloppy. Computers, however, may have difficulty deciphering many kinds of printed characters in different fonts and styles, as well as handwritten characters. Considerable research effort has gone into solving this problem. This paper is an attempt at the recognition of handwritten Malayalam (a South Indian language) characters. In our study we have classified the connected characters into 3 categories. We propose an algorithm which uses the inherent characteristic features of these characters to recognize them accurately by utilizing the intensity variations in the way in which they may be written. The algorithm recognizes the old script of Malayalam, whose characters are connected in nature. The input is a 24-bit BMP image, which can be inscribed using a light pen. The output is an editable version of the recognized Malayalam characters. The algorithm was tested on 3 sets of samples comprising 402 letters in a noiseless environment and achieved an accuracy of 94%.

Recognition of handwritten Malayalam characters using vertical & horizontal line positional analyzer algorithm

M Abdul Rahiman, M S Rajasree, N Masha, M Rema, R Meenakshi, G Manoj Kumar

This paper proposes an algorithm for the recognition of handwritten characters in Malayalam, a South Indian language. It introduces the salient features of Malayalam script and lists the approaches used for character recognition. Malayalam script is rich in patterns because of its complex curved forms, its larger number of basic elements and the presence of conjuncts. The combinations of such patterns make the recognition of characters much more complex, and these patterns should be exploited to arrive at the solution. An image of handwritten Malayalam characters is given as the input and an editable document of Malayalam characters in a predefined format is produced as output. The paper first presents the overall structure of the OCR system and then describes the OCR process in three modules: Pre-processing, Skeletonization and Recognition. In Pre-processing, we scan the input image and separate each character from it. In Skeletonization, we obtain a one-pixel-thick skeleton of the character. In Recognition, we classify the characters based on their features. The features of the characters are extracted based on an analysis of the position and count of the horizontal and vertical lines.
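The position-and-count feature described in the last sentence can be sketched roughly as follows. The abstract does not define what counts as a line, so this sketch assumes a horizontal or vertical line is a run of at least `min_length` consecutive skeleton pixels in a row or column. The function names and that threshold are hypothetical.

```python
def line_runs(sequence, min_length):
    """Return (start, length) for each run of 1s at least `min_length` long."""
    runs = []
    start = None
    # A trailing 0 acts as a sentinel that closes a run reaching the end.
    for i, value in enumerate(list(sequence) + [0]):
        if value == 1 and start is None:
            start = i
        elif value != 1 and start is not None:
            if i - start >= min_length:
                runs.append((start, i - start))
            start = None
    return runs


def line_features(skeleton, min_length=3):
    """Count horizontal and vertical strokes in a binary skeleton image.

    `skeleton` is a list of rows of 0/1 pixels (one-pixel-thick strokes).
    """
    horizontal = []
    for r, row in enumerate(skeleton):
        for start, length in line_runs(row, min_length):
            horizontal.append((r, start, length))
    vertical = []
    # zip(*skeleton) iterates over the columns of the image.
    for c, column in enumerate(zip(*skeleton)):
        for start, length in line_runs(column, min_length):
            vertical.append((c, start, length))
    return {
        "horizontal_count": len(horizontal),
        "vertical_count": len(vertical),
        "horizontal": horizontal,  # (row, start column, length)
        "vertical": vertical,      # (column, start row, length)
    }
```

A classifier could then compare these counts and positions against stored values for each character, although how the paper performs that comparison is not stated in the abstract.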

Word level Script Identification from Bangla and Devanagri Handwritten Texts mixed with Roman Script

Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu

India is a multi-lingual country where Roman script is often used alongside different Indic scripts in a text document. To develop a script-specific handwritten Optical Character Recognition (OCR) system, it is therefore necessary to identify the scripts of handwritten text correctly. In this paper, we present a system which automatically separates the scripts of handwritten words in a document written in Bangla or Devanagri mixed with Roman script. In this script separation technique, we first extract the text lines and words from document pages using a script-independent Neighboring Component Analysis technique. We then design a Multi Layer Perceptron (MLP) based classifier for script separation, trained with 8 different word-level holistic features. Two equal-sized datasets, one with Bangla and Roman scripts and the other with Devanagri and Roman scripts, are prepared for the system evaluation. On the respective independent text samples, word-level script identification accuracies of 99.29% and 98.43% are achieved.

Smart Bengali Cell Phone Keypad Layout

Md. Abul Kalam Azad, Rezwana Sharmeen, Shabbir Ahmad, S. M. Kamruzzaman

Nowadays the cell phone is the most common communication device used by the general public, and SMS is a cheap and popular method of communication. People naturally want to write SMS messages in their mother language, and text input in the mother language is easier when the alphabet of that language is printed on the keypad. A Bangla mobile keypad based on phonetics has been proposed earlier, but that keypad is not well founded from the point of view of letter frequency and flexibility. Since it is not a feasible solution, in this paper we propose an efficient Bengali keypad for cell phones and other cellular devices. The proposed keypad is based on the frequency of the letters in the Bengali language and also takes the structure of human finger movements into account. We considered both points in order to provide a flexible and fast cell phone keypad.
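The frequency-based part of such a layout can be illustrated with a small sketch. It assumes a multi-tap keypad on which the n-th letter of a key needs n presses, and it ignores the finger-movement criterion that the paper also considers. The function names and the round-robin assignment are illustrative assumptions rather than the authors' layout, and real letter frequencies would have to come from a Bengali corpus.

```python
def assign_keys(frequencies, num_keys=8):
    """Assign letters to keys so that frequent letters need fewer presses.

    `frequencies` maps each letter to its relative frequency. Letters are
    sorted by descending frequency and dealt round-robin across the keys,
    so the `num_keys` most frequent letters each take one press, the next
    `num_keys` take two presses, and so on.
    """
    ordered = sorted(frequencies, key=lambda letter: (-frequencies[letter], letter))
    layout = {}
    for rank, letter in enumerate(ordered):
        key = rank % num_keys            # which key holds the letter
        presses = rank // num_keys + 1   # how many presses select it
        layout[letter] = (key, presses)
    return layout


def expected_presses(layout, frequencies):
    """Average number of key presses per letter, weighted by frequency."""
    total = sum(frequencies.values())
    return sum(frequencies[letter] * layout[letter][1] for letter in layout) / total
```

Comparing `expected_presses` for two candidate layouts under the same frequency table gives a simple way to quantify which one is faster to type on.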