Because Urdu has more isolated letters than Arabic and Farsi, dedicated research on Urdu handwritten word recognition is needed. We present a novel approach that uses compound features and a Support Vector Machine (SVM) for offline Urdu word recognition. Because Urdu script is cursive, a holistic classification approach is adopted. Compound feature sets, which combine structural and gradient (directional) features, are extracted from each Urdu word. Experiments conducted on the CENPARMI Urdu Words Database achieved a high recognition accuracy of 97.00%.
A natural language (or ordinary language) is a language that is spoken, written, or signed by humans for general-purpose communication, as distinguished from formal languages (such as computer-programming languages or the "languages" used in the study of formal logic). The computational activities required to enable a computer to process information expressed in natural language are called natural language processing. We have chosen the Assamese language and check the grammar of input sentences. Our aim is to produce a technique for checking the grammatical structure of sentences in Assamese text. We have derived grammar rules by analyzing the structures of Assamese sentences. Our parsing program finds the grammatical errors, if any, in an Assamese sentence. If there is no error, the program generates the parse tree for the sentence.
Artificial Neural Networks (ANNs), which partially emulate human thinking within the domain of Artificial Intelligence, have been widely used for recognition of optically scanned characters. Prior to recognition, however, the text must be segmented into sentences, words, and characters. Segmentation of words into individual letters has been one of the major problems in handwriting recognition. Despite several successful works all over the world, development of such tools for specific languages is still an ongoing process, especially in the Indian context. This work explores the application of ANNs as an aid to the segmentation of handwritten characters in Assamese, an important language in the North Eastern part of India. It examines the performance difference between an ANN-based dynamic segmentation algorithm and projection-based static segmentation. The algorithm first trains an ANN with individual handwritten characters recorded from different individuals. Handwritten sentences are separated from the text using a static segmentation method. From each segmented line, individual characters are separated by first over-segmenting the entire line. Each of the segments thus obtained is then fed to the trained ANN. At the point where the ANN recognizes a segment, or a combination of several segments, as similar to a handwritten character, a segmentation boundary for the character is assumed to exist and segmentation is performed. The segmented character is then compared to the best available match and the segmentation boundary is confirmed.
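The merge step of the dynamic segmentation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the over-segmented pieces are stand-in strings, `toy_recognize` is a hypothetical lookup table standing in for the trained ANN, and `max_merge` and `threshold` are illustrative parameters that do not appear in the abstract.

```python
def dynamic_segment(pieces, recognize, max_merge=4, threshold=0.8):
    """Group over-segmented pieces into characters.

    pieces    : list of segments obtained by over-segmenting a line
    recognize : callable taking a list of pieces, returning (label, confidence)
    Returns a list of (start, end, label) with `end` exclusive.
    """
    characters = []
    start = 0
    while start < len(pieces):
        best = None  # (end, label, score) of the best accepted grouping
        last = min(start + max_merge, len(pieces))
        for end in range(start + 1, last + 1):
            label, score = recognize(pieces[start:end])
            if score >= threshold and (best is None or score > best[2]):
                best = (end, label, score)
        if best is None:
            # No grouping was recognized: emit one piece as unknown and move on.
            characters.append((start, start + 1, None))
            start += 1
        else:
            end, label, _ = best
            characters.append((start, end, label))  # boundary confirmed at `end`
            start = end
    return characters


# Hypothetical stand-in for the trained ANN: a template lookup with fixed scores.
TEMPLATES = {"|-": ("A", 0.85), "o": ("B", 0.95), "|-|": ("H", 0.90)}


def toy_recognize(group):
    return TEMPLATES.get("".join(group), (None, 0.0))


pieces = ["|", "-", "o", "|", "-", "|"]
segmented = dynamic_segment(pieces, toy_recognize)
```

Choosing the highest-scoring grouping among the candidates is one way to model the "best available match" confirmation; the abstract does not say how ties or unrecognized pieces are handled, so the fallback to a single unknown piece is an assumption.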
Malayalam is an Indian language with its own script, spoken by 40 million people. It has a rich literary tradition. A character recognition system for this language will be of immense help in a spectrum of applications ranging from data entry to reading aids. The Malayalam script has a large number of similar characters, which makes the recognition problem challenging. In this chapter, we present our approach to the recognition of Malayalam documents, both printed and handwritten. Classification results as well as ongoing activities are presented.
We consider a general incompressible finite model protein of size M in its environment, which we represent by a semiflexible copolymer consisting of amino acid residues classified into only two species (H and P, see text) following Lau and Dill. We allow various interactions between chemically unbonded residues in a given sequence and the solvent (water), and exactly enumerate the number of conformations W(E) as a function of the energy E on an infinite lattice under two different conditions: (i) we allow conformations that are restricted to be compact (known as Hamilton walk conformations), and (ii) we allow unrestricted conformations that can also be non-compact. It is easily demonstrated using plausible arguments that our model does not possess any energy gap even though it is supposed to exhibit a sharp folding transition in the thermodynamic limit. The enumeration allows us to investigate exactly the effects of energetics on the native state(s), and the effect of small size on protein thermodynamics and, in particular, on the differences between the microcanonical and canonical ensembles. We find that the canonical entropy is much larger than the microcanonical entropy for finite systems. We investigate the property of self-averaging and conclude that small proteins do not self-average. We also present results that (i) provide some understanding of the energy landscape, and (ii) shed light on the free energy landscape at different temperatures.
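To make the idea of exactly enumerating W(E) concrete, here is a minimal sketch for a much simpler model than the one in the abstract. It assumes a square lattice, unrestricted self-avoiding conformations, and a single energy term of -1 per H-H contact between residues that are lattice neighbours but not chain neighbours. Semiflexibility, solvent interactions, and the compact (Hamilton walk) restriction are all omitted, and the function name `count_conformations` is illustrative.

```python
from collections import Counter


def count_conformations(sequence):
    """Return {E: W(E)} for an HP chain on the square lattice.

    The first bond is fixed along +x to remove rotated copies;
    mirror images are still counted separately.
    """
    n = len(sequence)
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    counts = Counter()

    def energy(path):
        index_at = {site: i for i, site in enumerate(path)}
        contacts = 0
        for i, (x, y) in enumerate(path):
            if sequence[i] != "H":
                continue
            # Look only right and up so that each pair is counted once.
            for dx, dy in ((1, 0), (0, 1)):
                j = index_at.get((x + dx, y + dy))
                if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                    contacts += 1
        return -contacts

    def grow(path, occupied):
        if len(path) == n:
            counts[energy(path)] += 1
            return
        x, y = path[-1]
        for dx, dy in moves:
            site = (x + dx, y + dy)
            if site not in occupied:  # self-avoidance
                path.append(site)
                occupied.add(site)
                grow(path, occupied)
                path.pop()
                occupied.remove(site)

    start = [(0, 0), (1, 0)][:n]
    grow(list(start), set(start))
    return dict(counts)


w = count_conformations("HHHH")
```

In this simplified setting the microcanonical entropy at energy E would be ln W(E), and the canonical partition function is the sum of W(E) exp(-E/T) over E, so both ensembles can be compared directly from the returned dictionary. The number of conformations grows exponentially with chain length, so the sketch is only practical for short sequences.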
Hidden Markov models (HMMs) have long been a popular choice for Western cursive handwriting recognition following their success in speech recognition. Even for the recognition of Oriental scripts such as Chinese, Japanese and Korean, hidden Markov models are increasingly being used to model substrokes of characters. However, when it comes to Indic script recognition, the published work employing HMMs is limited, and generally focussed on isolated character recognition. In this effort, a data-driven HMM-based online handwritten word recognition system for Tamil, an Indic script, is proposed. The accuracies obtained ranged from 98% to 92.2% with different lexicon sizes (1K to 20K words). These initial results are promising and warrant further research in this direction. They also encourage exploring the adoption of the approach for other Indic scripts.
In a country like India, a single text line in most official documents contains words in two different scripts. Under the two-language formula, Indian documents are written in English and the official language of the state. For Optical Character Recognition (OCR) of such a document page, it is necessary to separate the words of different scripts before feeding them to the OCR systems of the individual scripts. In this paper, a robust technique is proposed for word-wise script identification in Indian doublet-form documents. First, the document is segmented into lines, and then the lines are segmented into words. Using different topological and structural features (such as the number of loops, the headline feature, water-reservoir-concept-based features, profile features, etc.), the words of each script are identified in the documents. The proposed scheme was tested on 24210 words from different doublets and obtained more than 97% accuracy on average.
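The line-then-word segmentation step can be illustrated with projection profiles. The abstract does not state which segmentation method is used, so projection profiles are an assumption here, as are the helper names `ink_runs` and `segment_words` and the `word_gap` parameter. The sketch covers only the segmentation stage, not the script-identification features.

```python
def ink_runs(profile, min_gap=1):
    """Return (start, end) ranges, end exclusive, where profile > 0.

    Runs separated by fewer than `min_gap` empty positions are merged.
    """
    spans = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


def segment_words(image, word_gap=2):
    """Split a binary image (rows of 0/1 ink pixels) into lines, then words.

    Returns one list per text line of (top, bottom, left, right) boxes,
    with bottom and right exclusive.
    """
    row_profile = [sum(row) for row in image]
    width = len(image[0]) if image else 0
    lines = []
    for top, bottom in ink_runs(row_profile):
        col_profile = [
            sum(image[r][c] for r in range(top, bottom)) for c in range(width)
        ]
        words = [
            (top, bottom, left, right)
            for left, right in ink_runs(col_profile, min_gap=word_gap)
        ]
        lines.append(words)
    return lines


# Tiny synthetic page: two text lines, two words each.
page = [
    [int(ch) for ch in row]
    for row in (
        "0000000000",
        "0110011010",
        "0110011010",
        "0000000000",
        "1110000111",
    )
]
boxes = segment_words(page)
```

The `word_gap` threshold separates gaps between words from the narrower gaps between characters inside a word; in practice it would be tuned to the document resolution.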
The bare chi characterizing polymer blends plays a significant role in their macroscopic description. Therefore, its experimental determination, especially from small-angle neutron scattering experiments on isotopic blends, is of prime importance in thermodynamic investigations. The experimentally extracted quantity, commonly known as the effective chi, is affected by thermodynamics, in particular by polymer connectivity and by density and composition fluctuations. The present work is primarily concerned with studying four possible effective chi's, one of which is closely related to the conventionally defined effective chi, to see which one serves as a reliable estimator of the bare chi. We show that the conventionally extracted effective chi is not a good measure of the bare chi in most blends. A related quantity that does not contain any density fluctuations, and one which can be easily extracted, is a good estimator of the bare chi in all blends except weakly interacting asymmetric blends (see text for definition). The density fluctuation contribution is given by (Delta v^bar)**2/2TK_T, where Delta v^bar is the difference of the partial monomer volumes and K_T is the compressibility. Our effective chi's are theory-independent. From our calculations and by explicitly treating experimental data, we show that the effective chi's, as defined here, have weak composition dependence and do not diverge in the composition wings. We elucidate the impact of compressibility and interactions on the behavior of the effective chi's and their relationship with the bare chi.
The anusaaraka system (a kind of machine translation system) makes text in one Indian language accessible through another Indian language. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents in the target language) spill over to the output. Some special notation is also devised. Anusaarakas have been built for five language pairs: Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. They are available for use through email servers. Anusaarakas follow the principle of substitutability and reversibility of the strings produced. This implies preservation of information while going from a source language to a target language. For narrow subject areas, specialized modules that produce good-quality grammatical output can be built by putting subject domain knowledge into the system. However, it should be remembered that such modules will work only in narrow areas and will sometimes go wrong. In such a situation, the anusaaraka output will still remain useful.
This paper describes the character recognition process for printed documents containing Hindi and Telugu text. Hindi and Telugu are among the most popular languages in India. The bilingual recognizer is based on Principal Component Analysis followed by support vector classification. It attains an overall accuracy of approximately 96.7%. Extensive experimentation is carried out on an independent test set of approximately 200,000 characters. Applications based on this OCR are sketched.