Script and Language Identification - IIIT Hyderabad

Script and Language Identification in Document Images and Scene Texts
Ajeet Kumar Singh
Adviser: Prof. C. V. Jawahar
Center for Visual Information Technology, IIIT-Hyderabad, India

Contents
  1. Motivation
  2. Script Identification in Wild
  3. Script and Language Identification in Document Images

Motivation
  [Figure: a text extractor or OCR has to answer "What's the script?"]
  [Figure: typical Optical Character Recognition system, with manual intervention]
  [Figure: automated Optical Character Recognition system]

Motivation: Why do we need Script and Language Identification?
  [Figure panels (a), (b), (c) with colour-coded text. In (b), Red: German; Green: French; Blue: Spanish. In (c), Purple: Hindi; Orange: Telugu; Brown: Malayalam.]

Motivation
  Script Identification: Kannada or Telugu? Bangla or Assamese?
  Language Identification: Hindi or Marathi? German or French?
  [Figure: the same examples with their answers. Scripts: Kannada, Telugu, Bangla, Assamese. Languages: Marathi, Hindi, German, French.]

Script Identification in Wild

The Problem Statement

  [Figure: Automatic Identification of Scripts, followed by the Script_1, Script_2, ..., Script_N engines, which produce the recognized text]

  Challenges
  • Lack of context
  • Stylish fonts
  • Complex background

  Contributions
  • A simple and effective solution
  • Indian Language Scene Text Dataset

Indian Language Scene Text Dataset (ILST)
  • Largest scene text dataset for the Indian languages.

  The ILST dataset can be used for:
  • Script Identification
  • Word Recognition
  • Text Localization
  Ground truth (GT): bounding box, script and text.
  [Figure: a few cropped images from the dataset]

Indian Language Scene Text Dataset (ILST): Dataset Statistics

    Script       # Scene Images   # Word Images   Mode of Collection
    Hindi                    76             514   Authors, Google Images
    Malayalam               121             515   Authors, Google Images
    Kannada                 115             534   Char74K [1]
    Tamil                    59             563   Authors
    Telugu                   79             510   Authors
    English                 128             850   Authors
    Total                   578            3486   -

  • LabelMe [2] is used for the annotations.
  • Train and test splits of the dataset.

  1. T. E. de Campos, B. R. Babu, and M. Varma, Character recognition in natural images, in VISAPP, 2009.
  2. LabelMe - The Open Annotation Tool, http://labelme.csail.mit.edu/

Proposed Method: Mid-Level Features
  • Distinctive
  • Robust
  • Better than naive bag-of-visual-words based features
  • Capture the larger context
  • Can be grouped into three categories: supervised, weakly-supervised and unsupervised

Proposed Method: Mid-Level Feature Based Representation
  Pipeline: Training Images, Local Features, Visual Words, Local Histogram of Visual Words, Mid-Level Features
  • Each image is represented as a set of descriptors.
  • The descriptors are clustered to obtain the visual words.
  • Each descriptor is assigned to a visual word to obtain the feature-visual word pairs.
  • Local histograms of visual words are computed around the features.
  • The local histograms are clustered again to obtain the mid-level feature representations.
  • The visual words and the mid-level features are then used to represent each image as a set of mid-level features.

Proposed Method: Feature Computation
  • SIFT descriptors are computed on a rectangular grid with a fixed pixel spacing.
  • Descriptors are computed over four circular support patches with different radii.
  • Multiple descriptors are learned to allow for scale variation.

Proposed Method: Selecting the Best Mid-Level Representation for the Task
  • Not all mid-level features are relevant for script identification.
  • For each mid-level feature, a relevance score is computed from its discriminativity and its representativity:
      relevance(feature) = discriminativity(feature) × representativity(feature)
  • Discriminativity and representativity are calculated by an entropy-based formula.

Experiments
  • Cropped word script identification
      Comparison with baseline methods
      Results on ILST dataset
      Results on CVSI [1] dataset
  • End-to-end pipeline

  1. N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, ICDAR2015 Competition on Video Script Identification (CVSI 2015), in ICDAR, 2015.

Cropped Word Script Identification: Comparison with Baseline Methods
  Comparison of the proposed method with various algorithms used in document script identification.
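The mid-level feature construction described under Proposed Method (descriptors, visual words, local histograms, second clustering) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: random vectors stand in for the dense SIFT descriptors, a plain NumPy k-means replaces whatever clustering was actually used, both vocabularies are learned from a single toy image instead of a training set, and the window size, vocabulary sizes and the final histogram pooling are hypothetical choices. The relevance-based feature selection is left out.

```python
import numpy as np


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centres, labels). Stand-in for any clustering routine."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    # Final assignment against the final centres.
    dists = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return centres, dists.argmin(axis=1)


def local_histograms(word_ids, grid_shape, n_words, window=3):
    """Normalised histogram of visual-word ids in every window x window block of the grid."""
    grid = word_ids.reshape(grid_shape)
    rows, cols = grid_shape
    hists = []
    for r in range(rows - window + 1):
        for c in range(cols - window + 1):
            patch = grid[r:r + window, c:c + window].ravel()
            hist = np.bincount(patch, minlength=n_words).astype(float)
            hists.append(hist / hist.sum())
    return np.array(hists)


def image_representation(hists, mid_centres):
    """Assign each local histogram to its nearest mid-level feature and pool the counts."""
    dists = ((hists[:, None, :] - mid_centres[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(dists.argmin(axis=1), minlength=len(mid_centres)).astype(float)
    return counts / counts.sum()


# Toy data: one descriptor per grid point (assumption: 16-D random vectors, not real SIFT).
rng = np.random.default_rng(0)
grid_shape = (6, 10)
descriptors = rng.normal(size=(grid_shape[0] * grid_shape[1], 16))

n_words, n_mid = 8, 4                                    # hypothetical vocabulary sizes
_, word_ids = kmeans(descriptors, n_words)               # descriptors -> visual words
hists = local_histograms(word_ids, grid_shape, n_words)  # local histograms of visual words
mid_centres, _ = kmeans(hists, n_mid)                    # second clustering -> mid-level features
rep = image_representation(hists, mid_centres)
print(rep)
```

The resulting vector could then be fed to a classifier such as the SVM mentioned in the summary; that step is not shown here.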

    Method                      Accuracy (%)
    Baseline methods
      Gabor features                   59.25
      Gradient features                47.74
      Profile features                 49.24
      Local Binary Patterns            78.08
    Proposed method                    88.67

Cropped Word Script Identification: Results on Indian Language Scene Text Dataset (ILST), Quantitative
  • Best performance: Hindi (95.71%)
  • Lowest performance: Tamil (77.00%)
  • Tamil and Malayalam are confused due to script similarity.
  • Kannada and Telugu are also confused.

Cropped Word Script Identification: Results on Indian Language Scene Text Dataset (ILST), Qualitative
  [Figure: success cases and failure cases, with word images labelled Hindi, Tamil, Kannada, Telugu, Malayalam and English]

Cropped Word Script Identification

Results on CVSI [1] Dataset
  ICDAR2015 Competition on Video Script Identification
  • 10 scripts from India (Hindi, Bangla, Arabic, English, Gujarati, Kannada, Oriya, Punjabi, Tamil, Telugu)
  • Tasks:
      Task-1: Script identification on script triplets
      Task-2: North Indian script identification
      Task-3: South Indian script identification
      Task-4: Script identification across all languages

  Accuracy (%) by method:

    Task     C-DAC   CUK     HUST    CVC-1   CVC-2   Google   Shi et al. [2]   Ours
    Task 1   91.75   -       99.07   95.92   95.91   99.51    -                99.10
    Task 2   96.79   79.50   97.69   95.73   95.91   99.19    93.80            97.99
    Task 3   86.95   79.14   97.53   95.38   95.75   98.95    96.70            96.11
    Task 4   84.66   74.06   96.69   95.88   96.00   98.91    94.30            96.70

  1. N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, ICDAR2015 Competition on Video Script Identification (CVSI 2015), in ICDAR, 2015.
  2. B. Shi, X. Bai, and C. Yao, Script Identification in the Wild via Discriminative Convolutional Neural Network, in Pattern Recognition, 2015.

End-to-End Pipeline
  Given a scene image, our goal is to:
  • Localize the text [1, 2]
  • Identify the script of the text

  Evaluation of End-to-End Script Identification
  We use precision, recall and f-measure. Let N_c be the no. of correctly identified words, N_i the total no. of identified words and N_g the total no. of ground truth words; then

    P = \frac{N_c}{N_i}, \qquad R = \frac{N_c}{N_g}, \qquad F = \frac{2 P R}{P + R}

  1. L. Gomez and D. Karatzas, A Fast Hierarchical Method for Multi-script and Arbitrary Oriented Scene Text Extraction, arXiv:1407.7504, 2014.
  2. Tesseract OCR, http://code.google.com/p/tesseract-ocr/

End-to-End Pipeline: Quantitative Results
  [Table: precision, recall and f-score for Telugu, Tamil, Malayalam, Kannada, Hindi and English. Cell values as extracted, in extraction order: 0.47 0.41 0.54 0.44 0.50 0.42 0.49 0.39 0.42 0.45 0.47 0.48 0.47 0.43 0.45 0.46 0.56 0.50]

End-to-End Pipeline: Qualitative Results
  [Figure: images showing results of successful script identification]
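As a small illustration of the end-to-end evaluation measures (precision, recall and f-measure over identified words), the computation could look like the sketch below. The function name and the word counts are hypothetical and chosen only for the example; they are not taken from the experiments.

```python
def end_to_end_scores(n_correct, n_identified, n_ground_truth):
    """Precision, recall and f-measure from word counts.

    n_correct:      words whose script was identified correctly
    n_identified:   all words the pipeline identified
    n_ground_truth: all words in the ground truth
    """
    precision = n_correct / n_identified if n_identified else 0.0
    recall = n_correct / n_ground_truth if n_ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = end_to_end_scores(n_correct=45, n_identified=100, n_ground_truth=90)
print(f"precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Because the f-measure is the harmonic mean of precision and recall, it always lies between the two and is pulled towards the smaller one.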

Script and Language Identification in Document Images

The Problem Statement
  • Can you identify the scripts? [Figure: Gurumukhi vs. Hindi]
  • Can you identify the languages? [Figure: French vs. Spanish]

Overview of Script Identification
  • Many methods have been proposed in the past, at page level, line level and word level.
  • They use the texture and orientation of image segments, such as:
      upward concavities and optical densities
      character height densities and top-bottom profiles
  • Textures have also been used extensively, for example:
      multi-channel Gabor filters
      gray-level co-occurrence matrix
      Gabor energy
      Local Binary Pattern
  • Convolutional Neural Networks (CNN) are used to learn the discriminative features.

Overview of Language Identification
  • Language identification is hard because the underlying script is the same.
  • Many attempts exist in the textual domain, for example using statistics such as n-gram probabilities.
  • In the image domain, language identification has been attempted only at page level or paragraph level.
  • One class of methods uses:
      character ascenders and descenders for recognition
      characters grouped together to form a word token
      the frequency of single words, word pairs and word trigrams
  • Document vectorization method:
      the document is converted into vertical cut vectors
      the position and number of vertical cut vectors determine the language of the document image

Recurrent Neural Networks for Script and Language Identification
  Recurrent Neural Networks with Long Short-Term Memory (LSTM networks):
  • are feed-forward neural networks with cyclical connections
  • handle sequential data
  • are a powerful classification tool
  • preserve the contextual information of previous states
  • do not require any explicit labeling of all the feature sequences

Recurrent Neural Networks for Script and Language Identification
  Recurrent Neural Networks with Bidirectional Long Short-Term Memory (BLSTM networks):
  • Two LSTM networks are used: one takes the input from beginning to end, the other takes the input from end to beginning.
  • The outputs of both networks are used to predict the final output.
  • Connectionist Temporal Classification (CTC) is used at the output.
  • For the script/language identification system, the objective function is

      O = -\sum_{(x, z)} \ln p(z \mid x)

    where x and z are the features and label respectively.

Script and Language Identification Architecture
  [Figure: architecture of the script and language identification system]

Representation of Words and Lines
  [Figure panel: grey image]
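To make the step from a grey word image to an input the recurrent network can read more concrete, here is a minimal sketch. It rests on assumptions that the slides do not state: a fixed global threshold for binarization, and raw pixel columns (in windows of hypothetical width 2) as the per-step feature vector. The actual binarization method and features used in the work may differ.

```python
import numpy as np


def binarize(grey, threshold=128):
    """Mark dark (ink) pixels as 1 and background as 0 using a fixed global threshold (assumption)."""
    return np.less(grey, threshold).astype(np.uint8)


def column_sequence(binary, window=2):
    """Cut the image left to right into windows of `window` columns.

    Returns an array of shape (steps, height * window): one flattened window per time step.
    Columns that do not fill a complete window at the right edge are dropped.
    """
    height, width = binary.shape
    steps = width // window
    trimmed = binary[:, :steps * window]
    return trimmed.reshape(height, steps, window).transpose(1, 0, 2).reshape(steps, height * window)


# Toy 4 x 6 grey image: white background with a dark 2 x 2 blob in the middle.
grey = np.full((4, 6), 255, dtype=np.uint8)
grey[1:3, 2:4] = 0

binary = binarize(grey)
sequence = column_sequence(binary, window=2)
print(sequence.shape)
print(sequence)
```

Each row of `sequence` is one time step; a bidirectional network would read the rows once from first to last and once from last to first.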

  [Figure panel: binarized image]

Datasets
  To validate the proposed method for script and language identification, we use the Indian Multilingual Dataset [1], which contains:
  • 12 Indic scripts and 3 Latin-based languages
  • 55K pages and 15M words

    Script/Language   Books   Pages   Lines   Words
    Hindi                34    5.0K    133K   1.66M
    Malayalam            31    5.0K     93K   0.96M
    Gurumukhi            33    5.0K    125K   1.62M
    Kannada              27    3.8K     90K   0.72M
    Tamil                23    4.8K     88K   0.64M
    Telugu               28    5.0K    102K   0.83M
    Bangla               14    2.8K     50K   0.95M
    Marathi              20    5.0K    127K   1.44M
    Gujarati             26    5.2K    124K   1.25M
    Assamese             19    3.5K     73K   0.59M
    Manipuri             25    3.6K     69K   0.72M
    Odiya                17    5.0K    109K   1.44M
    French                6    1.9K     51K   0.71M
    German                4    2.1K     55K   0.74M
    Spanish               5    1.9K     48K   0.63M

Datasets
  To test the generality of our method, we compare it with the method of Pati [1] on the reported dataset (D2). The dataset contains 220K words from 11 Indic scripts.

Implementation Details
  Implementation details to train and test the networks for script and language identification.

    Experiment   Type    Training*   Validation*   Testing*   Training Time*   Testing Time*
    Script       Line         120K           60K     1.003M             4.11             0.5
    Script       Words        960K          240K     11.64M             3.75             0.1
    Language     Line          30K           15K       100K             0.80             0.5
    Language     Words        600K          150K     1.000M             2.00             0.1

  * Approx.

Results and Discussions: Script Identification, Quantitative Results on Our Dataset [1]

    Script/Language   Accuracy (%), Lines   Accuracy (%), Words
    Hindi                            96.6                  85.8
    Malayalam                        99.2                  99.0
    Gurumukhi                        97.9                  93.2
    Kannada                          98.0                  93.8
    Tamil                            98.5                  98.1
    Telugu                           98.4                  96.0
    Bangla                           98.6                  98.5
    Marathi                          97.6                  95.8
    Gujarati                         98.6                  98.4
    Assamese                         95.3                  93.3
    Manipuri                         98.2                  71.4
    Odiya                            99.5                  97.5

  1. C. V. Jawahar and A. Kumar, Content-level Annotation of Large Collection of Printed Document Images, in ICDAR, 2007.

Results and Discussions: Comparison with State-of-the-Art, Pati [1]

  The words of the dataset are split into a part used for training and a part used for testing.

    Script/Language   Accuracy (%), Ours   Accuracy (%), Pati [1]
    Hindi                           93.6                     96.2
    Malayalam                       96.5                     93.3
    Gurumukhi                       93.9                     93.6
    Kannada                         94.4                     93.8
    Tamil                           96.7                     95.2
    Telugu                          93.1                     92.3
    Bangla                          94.3                     96.2
    Marathi                            -                        -
    Gujarati                        96.2                     95.5
    Assamese                           -                        -
    Manipuri                           -                        -
    Odiya                           97.1                       94

  • Pati [1] uses Gabor features with a classifier in a hierarchical setting.
  • Our method achieves an error reduction when compared with Pati [1].
  • Our method is simple and uses a multiclass classifier, unlike the handcrafted classifier architecture of Pati [1].

  1. Peeta Basa Pati and A. G. Ramakrishnan, Word Level Multi-script Identification, Pattern Recognition Letters, 2008.

Results and Discussions: Script Identification, Qualitative Results
  [Figure: correctly identified words and wrong words for each script/language (Hindi, Marathi, Malayalam, Tamil, Gurumukhi, Bangla, Kannada, Telugu, Gujarati, Assamese, Manipuri, Odiya), with the script each wrong word was mistaken for]

Results and Discussions: Script Identification, Confusion Matrix
  [Figure: confusion matrix]

Results and Discussions: Language Identification, Quantitative Results

                Conf. Matrix (Word Level)        Accuracy (%)
    Language    French   German   Spanish        Line     Word
    French       93.32     3.47      3.21        94.51    93.32
    German        5.44    92.19      2.37        94.77    92.19
    Spanish       3.63     1.70     94.67        96.47    94.67

  Lexical similarity: a measure of the degree to which the word sets of two given languages are similar.

Results and Discussions: Language Identification, Qualitative Results
  [Figure: correctly identified words and wrong words for French, Spanish and German]
  Lexical similarity [1] is considered for French and Spanish, and for French and German.

  1. Ethnologue - webpage, https://www.ethnologue.com/.

Multilingual OCR: To Separate or Not?
  Two ways to build a multilingual OCR (mOCR):
  1. A single OCR trained for all the scripts and languages.
     [Figure: Hindi, Malayalam, Gurumukhi, Tamil and Telugu inputs go to one RNN, which produces the text outputs]
     A single OCR would:
     • require a very large dataset
     • have an output space whose cardinality increases manifold
     • take a long time to train
  2. OCRs for different scripts and languages, each trained separately.
     [Figure: Hindi, Bangla and Telugu inputs pass through a script separation module to the Hindi, Bangla and Telugu OCRs, which produce the text output]
     Multiple OCRs would:
     • require less data
     • have a small output space
     • take less time to train

Experiments: Script Identification Results
  Two groups:

  • South Indian Scripts + English
  • North Indian Scripts + English
  Training: 100K words from each script
  Testing: 25K words from each script

    North Indian Scripts   Accuracy (%)      South Indian Scripts   Accuracy (%)
    English                       99.99      English                       99.98
    Hindi                         98.40      Kannada                       98.78
    Bangla                        99.16      Malayalam                     99.57
    Gurumukhi                     98.63      Tamil                         99.13
    Gujarati                      99.29      Telugu                        99.15

Experiments: Need of Script Separation Module in OCR
  • Bilingual OCRs (B) and trilingual OCRs (T) are trained.
  • Bilingual OCRs are: English + Hindi, English + Bangla, English + Kannada
  • Trilingual OCRs are: English + Hindi + Bangla, English + Kannada + Telugu
  • A hierarchical system consists of a script separation module followed by script/language-specific OCRs.

Experiments: Comparison of Bilingual and Trilingual OCRs with the Hierarchical System

                             Char. Error Rate
    OCR   Languages   (Bi/Tri)lingual OCR   Hierarchical OCR   Relative Error Reduction
    B1    L1+L2                      3.88               2.87                       26 %
    B2    L1+L2                      2.16               1.65                       24 %
    B3    L1+L2                      2.31               2.13                        8 %
    B4    L1+L2                      1.11               0.61                       45 %
    T1    L1+L2+L3                   3.85               3.31                       14 %
    T2    L1+L2+L3                   2.65               2.02                       24 %

  Hierarchical OCR is better than flat bilingual or trilingual OCRs.

Summary
  Indian Language Scene Text (ILST) Dataset
    • About 600 scene images containing about 3500 words from 5 Indic scripts
    • Text localization, text detection and text recognition in the wild
  Script Identification in the Wild
    • A simple and effective solution for script identification
    • Uses robust mid-level features with SVM
  Script and Language Identification in Document Images
    • Recurrent Neural Networks for script and language identification
    • Hierarchical multilingual OCR with a script and language identification module as a one-stop recognition solution

Possible Extensions
  Script Identification in Wild
    • Use a CNN for the end-to-end script identification task
    • Explore the usage of multiple cues from the texts
    • Extend the ILST dataset to 10-15 popular Indic scripts
  Script and Language Identification in Document Images
    • Integration of the script identification and text recognition modules
    • Script and language identification in handwritten document images
  A fully automated multilingual OCR for scene texts and document images

Related Publications
  1. Ajeet Kumar Singh and C.V. Jawahar: Can RNNs Reliably Separate Script and Languages in Document Images, 13th IAPR International Conference on Document Analysis and Recognition (ICDAR), 2015
  2. Ajeet Kumar Singh, Anand Mishra, Pranav Dabral and C.V. Jawahar: A Simple and Effective Solution for Script Identification in Wild, 12th IAPR International Workshop on Document Analysis and Systems (DAS), 2016
  3. Minesh Mathew, Ajeet Kumar Singh and C.V. Jawahar: Multilingual OCR for Indic Scripts, 12th IAPR International Workshop on Document Analysis and Systems (DAS), 2016

Thank You
