Translingual Topic Tracking with PRISE


Multilingual Information Retrieval
Doug Oard
College of Information Studies and UMIACS
University of Maryland, College Park, USA
January 14, 2019
AFIRM

Global Trade
  [Figure: exports against imports, both in trillions of USD (axes 0.0 to 2.5), for the USA, EU, China, Hong Kong, Japan, and South Korea. Source: Wikipedia (mostly 2017 estimates)]

Most Widely-Spoken Languages
  [Figure: L1 and L2 speakers, in millions (axis 0 to 1,200), for English, Mandarin Chinese, Hindi, Spanish, French, Modern Standard Arabic, Russian, Bengali, Portuguese, Indonesian, Urdu, German, Japanese, Swahili, Western Punjabi, Javanese, Wu Chinese, Telugu, Turkish, Korean, Marathi, Tamil, Yue Chinese, Vietnamese, Italian, Hausa, Thai, Persian, and Southern Min. Source: Ethnologue (SIL), 2018]

Global Internet Users and Web Pages
  [Figure: share of global internet users and share of web pages by language, for English, Chinese, Spanish, Japanese, Portuguese, German, Arabic, French, Russian, and Korean]

What Does Multilingual Mean?
  Mixed-language document
    Document containing more than one language
  Mixed-language collection
    Collection of documents in different languages
  Multi-monolingual systems
    Can retrieve from a mixed-language collection
  Cross-language system
    Query in one language finds document in another
  (Truly) multilingual system

A Story in Two Parts
  IR from the ground up in any language
    Focusing on document representation
  Cross-Language IR
    To the extent time allows

[Diagram: the Query and the Documents each pass through a Representation Function, giving a Query Representation and a Document Representation (held in the Index); a Comparison Function matches the two and returns Hits]

ASCII
  American Standard Code for Information Interchange
  ANSI X3.4-1968
  Codes 0-31 (control characters): NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
  Codes 32-63: SPACE ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
  Codes 64-95: @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _
  Codes 96-127: ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ DEL

The Latin-1 Character Set
  ISO 8859-1 8-bit characters for Western Europe
    French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English
  [Figure: two character tables, "Printable Characters, 7-bit ASCII" and "Additional Defined Characters, ISO 8859-1"]

Other ISO-8859 Character Sets
  [Figure: character tables for ISO 8859-2, -3, -4, -5, -6, -7, -8, and -9]

East Asian Character Sets
  More than 256 characters are needed
    Two-byte encoding schemes (e.g., EUC) are used
  Several countries have unique character sets
    GB in People's Republic of China, BIG5 in Taiwan, JIS in Japan, KS in Korea, TCVN in Vietnam
  Many characters appear in several languages
    Research Libraries Group developed EACC
      Unified CJK character set for USMARC records

Unicode
  Single code for all the world's characters
    ISO Standard 10646
  Separates code space from encoding
    Code space extends Latin-1
      The first 256 positions are identical
    UTF-7 encoding will pass through email
      Uses only the 64 printable ASCII characters
    UTF-8 encoding is designed for disk file systems

Limitations of Unicode
  Produces larger files than Latin-1
  Fonts may be hard to obtain for some characters
  Some characters have multiple representations
    e.g., accents can be part of a character or separate
  Some characters look identical when printed
    But they come from unrelated languages
  Encoding does not define the sort order

Strings and Segments
  Retrieval is (often) a search for concepts
    But what we actually search are character strings
  What strings best represent concepts?
    In English, words are often a good choice
      Well-chosen phrases might also be helpful
    In German, compounds may need to be split
      Otherwise queries using constituent words would fail
    In Chinese, word boundaries are not marked
      Thissegmentationproblemissimilartothatofspeech

Tokenization
  Words (from linguistics):
    Morphemes are the units of meaning
    Combined to make words
      Anti (disestablishmentarian) ism
  Tokens (from computer science)
    Doug 's running late !

Morphological Segmentation: Swahili Example
  a + li + ni + andik + ish + a
  he + past-tense + me + write + causer-effect + declarative-mode
  Credit: Ramy Eskander

Morphological Segmentation: Somali Example
  cun + t + aa
  eat + she + present-tense
  Credit: Ramy Eskander
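The "Limitations of Unicode" slide notes that some characters have multiple representations and that some look identical when printed. A minimal sketch of what that means for indexing, using only Python's standard `unicodedata` module; the helper name `normalize_term` and the sample strings are illustrative assumptions, not part of the lecture:

```python
import unicodedata

# The same visible word, stored two different ways.
precomposed = "caf\u00e9"    # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent (U+0301)

def normalize_term(text: str, form: str = "NFC") -> str:
    """Map canonically equivalent sequences to one form before indexing (hypothetical helper)."""
    return unicodedata.normalize(form, text)

print(precomposed == decomposed)                                  # False
print(normalize_term(precomposed) == normalize_term(decomposed))  # True

# Identical-looking letters from different scripts remain distinct code points.
print(unicodedata.name("A"))        # LATIN CAPITAL LETTER A
print(unicodedata.name("\u0410"))   # CYRILLIC CAPITAL LETTER A

# File size: 'é' takes one byte in Latin-1 but two bytes in UTF-8.
print(len("\u00e9".encode("latin-1")), len("\u00e9".encode("utf-8")))  # 1 2
```

Applying the same normalization form to both documents and queries is one way to keep the two spellings from being indexed as different terms.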

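The Tokenization slide contrasts linguistic words with the tokens a program actually produces. As a rough sketch of the computer-science view, a single regular expression can split off clitics and punctuation; the pattern and the function name `tokenize` are assumptions chosen for illustration, and real tokenizers handle many more cases:

```python
import re

# A word, or an apostrophe followed by letters (a clitic such as 's),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|'\w+|[^\w\s]")

def tokenize(text: str) -> list[str]:
    """Split text into simple tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Doug's running late!"))
# ['Doug', "'s", 'running', 'late', '!']
```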
Stemming
  Conflates words, usually preserving meaning
    Rule-based suffix-stripping helps for English
      {destroy, destroyed, destruction}: destr
    Prefix-stripping is needed in some languages
      Arabic: {alselam}: selam [Root: SLM (peace)]
  Imperfect: goal is to usually be helpful
    Overstemming
      {centennial, century, center}: cent
    Understemming
      {acquire, acquiring, acquired}: acquir
      {acquisition}: acquis
  Snowball: rule-based system for making stemmers

Longest Substring Segmentation
  Greedy algorithm based on a lexicon
    Start with a list of every possible term
  For each unsegmented string
    Remove the longest single substring in the list
    Repeat until no substrings are found in the list

Longest Substring Example
  Possible German compound term (!): washington
  List of German words: ach, hin, hing, sei, ton, was, wasch
  Longest substring segmentation: was-hing-ton
    Roughly translates as "What tone is attached?"

  [Figure: segmentation example whose original-script text did not survive; remaining English glosses: oil, petroleum, cymbidium goeringii, restrain, probe, survey, take samples]

Probabilistic Segmentation
  For an input string c1 c2 c3 ... cn
  Try all possible partitions into w1 w2 w3 ...
    [Figure: several alternative groupings of c1 c2 c3 ... cn, etc.]
  Choose the highest probability partition
    Compute Pr(w1 w2 w3 ...) using a language model
  Challenges: search, probability estimation

Non-Segmentation: N-gram Indexing
  Consider a Chinese document c1 c2 c3 ... cn
  Don't segment (you could be wrong!)
  Instead, treat every character bigram as a term
    c1c2, c2c3, c3c4, ..., cn-1cn
  Break up queries the same way

A Term is Whatever You Index
  [Figure: kinds of terms: word sense, token, word, stem, character n-gram, phrase]

Summary
  A term is whatever you index
    So the key is to index the right kind of terms!
  Start by finding fundamental features
    We have focused on character coded text
    Same ideas apply to handwriting, OCR, and speech
  Combine characters into easily recognized units
    Words where possible, character n-grams otherwise
  Apply further processing to optimize results

A Story in Two Parts
  IR from the ground up in any language
    Focusing on document representation
  Cross-Language IR
    To the extent time allows
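The greedy procedure from the Longest Substring Segmentation slide can be sketched as follows. This is one possible reading, not the lecturer's code: it removes the longest lexicon entry found anywhere in the string and then repeats on the pieces to its left and right. The function name and the tie-breaking rule (leftmost, then alphabetical) are assumptions.

```python
def longest_substring_segment(text, lexicon):
    """Greedy lexicon-based segmentation: split around the longest known term, then recurse."""
    if not text:
        return []
    # Lexicon entries that occur somewhere in the remaining string.
    found = [term for term in lexicon if term and term in text]
    if not found:
        return [text]  # nothing matches; keep the remainder as one unsegmented piece
    # Longest entry first; ties broken by leftmost position, then alphabetically (assumption).
    best = min(found, key=lambda term: (-len(term), text.index(term), term))
    start = text.index(best)
    left, right = text[:start], text[start + len(best):]
    return (longest_substring_segment(left, lexicon)
            + [best]
            + longest_substring_segment(right, lexicon))

german_words = ["ach", "hin", "hing", "sei", "ton", "was", "wasch"]
print("-".join(longest_substring_segment("washington", german_words)))  # was-hing-ton
```

The output reproduces the slide's deliberately wrong example: a proper name is split into three valid German words, which is why the greedy method needs care with names and other out-of-lexicon strings.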

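For the Probabilistic Segmentation slide, the search over all partitions can be done with dynamic programming rather than enumeration. The sketch below assumes a unigram language model (the slide only says "a language model"), uses an unspaced English string as a stand-in for unsegmented text, and invents the probabilities purely for illustration; the function name, `max_len`, and `unknown_prob` are likewise assumptions.

```python
import math

def segment_by_probability(text, word_prob, max_len=5, unknown_prob=1e-8):
    """Return the partition of text with the highest unigram probability."""
    # best[i] = (log probability of the best partition of text[:i], start of its last word)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            p = word_prob.get(word)
            if p is None:
                if len(word) > 1:
                    continue          # unknown multi-character strings are not words
                p = unknown_prob      # unknown single characters get a tiny probability
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Made-up probabilities, for illustration only.
toy_model = {"the": 0.05, "table": 0.01, "he": 0.02, "tab": 0.005,
             "le": 0.001, "t": 0.01, "a": 0.03}
print(segment_by_probability("thetable", toy_model))  # ['the', 'table']
```

The two challenges named on the slide show up directly: the nested loops are the search, and `toy_model` stands in for the probability estimates that a real system would have to learn.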
Query-Language CLIR
  [Diagram: Somali Document Collection -> Translation System -> English Document Collection -> Retrieval Engine, searched with English queries; the user selects from and examines the Results]

Document-Language CLIR
  [Diagram: English queries -> Translation System -> Somali queries -> Retrieval Engine over the Somali Document Collection (Somali documents); the user selects from and examines the Results]

Query vs. Document Translation
  Query translation
    Efficient for short queries (not relevance feedback)
    Limited context for ambiguous query terms
  Document translation
    Rapid support for interactive selection
    Need only be done once (if query language is same)

Indexing Time: Statistical Document Translation
  [Figure: indexing time in seconds (0 to 500) against thousands of documents, for monolingual and cross-language indexing]

Language-Neutral Retrieval
  [Diagram: Somali Query Terms pass through Query Translation and English Document Terms pass through Document Translation; Interlingual Retrieval returns a ranked list (1: 0.91, 2: 0.57, 3: 0.36)]

Translation Evidence
  Lexical Resources
    Phrase books, bilingual dictionaries, ...
  Large text collections
    Translations (parallel)
    Similar topics (comparable)
  Similarity
    Similar writing (if the character set is the same)
    Similar pronunciation
  People
    May be able to guess topic from lousy translations

Types of Lexical Resources
  Ontology
    Organization of knowledge
  Thesaurus
    Ontology specialized to support search
  Dictionary
    Rich word list, designed for use by people
  Lexicon
    Rich word list, designed for use by a machine
  Bilingual term list
    Pairs of translation-equivalent terms

  [Figure: only the legend survives: Full Query; Named entities added; Named entities from term list; Named entities removed]

Backoff Translation
  Lexicon might contain stems, surface forms, or some combination of the two.
  [Diagram: a document term (e.g., mangez) is matched against translation lexicon entries (e.g., mangez - eat, mange - eats, mangent - eat) in four pairings: surface form to surface form, stem to surface form, surface form to stem, and stem to stem]

  [Figure: the Rosetta Stone scripts: Hieroglyphic Egyptian, Demotic, Greek]

Types of Bilingual Corpora
  Parallel corpora: translation-equivalent pairs
    Document pairs
    Sentence pairs
    Term pairs
  Comparable corpora: topically related
    Collection pairs
    Document pairs

Some Modern Rosetta Stones
  News:
    DE-News (German-English)
    Hong-Kong News, Xinhua News (Chinese-English)
  Government:
    Canadian Hansards (French-English)
    Europarl (Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish)
    UN Treaties (Russian, English, Arabic, ...)
  Religion:
    Bible, Koran, Book of Mormon

Word-Level Alignment
  English: Diverging opinions about planned tax reform
  German: Unterschiedliche Meinungen zur geplanten Steuerreform

  English: Madam President , I had asked the administration
  Spanish: Señora Presidenta, había pedido a la administración del Parlamento

A Translation Model
  From word-aligned bilingual text, we induce a translation model

  $$p(f_i \mid e) \quad \text{where} \quad \sum_{f_i} p(f_i \mid e) = 1$$
  Example (the four foreign-language translations of "survey", written here as f_1 to f_4, did not survive in the text):
    p(f_1 | survey) = 0.4
    p(f_2 | survey) = 0.3
    p(f_3 | survey) = 0.25
    p(f_4 | survey) = 0.05

Using Multiple Translations
  Weighted Structured Query Translation
    Takes advantage of multiple translations and translation probabilities
  TF and DF of query term e are computed using TF and DF of its translations:
  $$tf(e, d_k) = \sum_{f_i} p(f_i \mid e) \times tf(f_i, d_k)$$
  $$df(e) = \sum_{f_i} p(f_i \mid e) \times df(f_i)$$

BM-25
  $$\sum_{e \in Q} \left[ \log \frac{N - df(e) + 0.5}{df(e) + 0.5} \right] \left[ \frac{2.2 \times tf(e, d_k)}{0.3 + 0.9 \times \frac{dl(d_k)}{avdl} + tf(e, d_k)} \right] \left[ \frac{8 \times qtf(e)}{7 + qtf(e)} \right]$$
  tf: term frequency; df: document frequency; dl: document length; avdl: average document length; qtf: term frequency in the query; N: number of documents

Retrieval Effectiveness
  [Figure: CLIR MAP as a percentage of monolingual MAP (40% to 110%) against cumulative probability threshold (0.0 to 1.0), for DAMM, IMM, and PSQ. CLEF French]

Bilingual Query Expansion
  [Diagram: the source language query is run through Source Language IR over a source language collection to give an expanded source language query (pre-translation expansion); Query Translation follows; Target Language IR over the target language collection gives expanded target language terms (post-translation expansion) and the results]

Query Expansion Effect
  [Figure: mean average precision (0.00 to 0.35) against unique Dutch terms (0 to 15,000), for four conditions: Both, Post, Pre, None. Paul McNamee and James Mayfield, SIGIR-2002]

Cognate Matching
  Dictionary coverage is inherently limited
    Translation of proper names
    Translation of newly coined terms
    Translation of unfamiliar technical terms
  Strategy: model derivational translation
    Orthography-based
    Pronunciation-based

Matching Orthographic Cognates
  Retain untranslatable words unchanged
    Often works well between European languages
  Rule-based systems
    Even off-the-shelf spelling correction can help!
  Subword (e.g., character-level) MT
    Trained using a set of representative cognates

Matching Phonetic Cognates
  Forward transliteration
    Generate all potential transliterations
  Reverse transliteration
    Guess source string(s) that produced a transliteration
  Match in phonetic space

Cross-Language Retrieval
  [Diagram: Query -> Query Translation -> Translated Query -> Search -> Ranked List]

Uses of MT in CLIR
  [Diagram: search process with the stages Query Formulation, Query Translation, Search, Selection, Examination, Use, and Query Reformulation, passing a Query, Translated Query, Ranked List, and Document between them; MT uses labeled Term Translation, Term Matching, Snippet Translation, Indicative Translation, and Informative Translation]

Interactive Cross-Language Question Answering
  [Figure: users with correct answers (0 to 8) by question number, in the order 8, 11, 13, 4, 16, 6, 14, 7, 2, 10, 15, 12, 1, 3, 9, 5. iCLEF 2004]

Questions, Grouped by Difficulty
   8  Who is the managing director of the International Monetary Fund?
  11  Who is the president of Burundi?
  13  Of what team is Bobby Robson coach?
   4  Who committed the terrorist attack in the Tokyo underground?
  16  Who won the Nobel Prize for Literature in 1994?
   6  When did Latvia gain independence?

  14  When did the attack at the Saint-Michel underground station in Paris occur?
   7  How many people were declared missing in the Philippines after the typhoon Angela?
   2  How many human genes are there?
  10  How many people died of asphyxia in the Baku underground?
  15  How many people live in Bombay?
  12  What is Charles Millon's political party?

   1  What year was Thomas Mann awarded the Nobel Prize?
   3  Who is the German Minister for Economic Affairs?
   9  When did Lenin die?
   5  How much did the Channel Tunnel cost?

For Further Reading
  Multilingual IR
    Paul McNamee et al., Addressing Morphological Variation in Alphabetic Languages, SIGIR, 2009
  African-Language IR
    Open CLIR Challenge (Swahili), IARPA, 2018
    Nkosana Malumba et al., AfriWeb: A Search Engine for a Marginalized Language, ICADL, 2015
  Cross-Language IR
    Jian-Yun Nie, Cross-Language Information Retrieval, Synthesis Lectures in HLT, Morgan & Claypool, 2010
    Jianqiang Wang and Douglas W. Oard, Matching Meaning for Cross-Language Information Retrieval, Information Processing and Management, 2012
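Worked example for the "Using Multiple Translations" and "BM-25" slides. The sketch below computes the translation-weighted TF and DF of a query term and plugs them into a BM25 term score. It assumes the standard Okapi form with k1 = 1.2, b = 0.75, and k3 = 7, which matches the constants 2.2, 0.3, 0.9, 8, and 7 on the BM-25 slide. The function names, the placeholder translation ids `f1` to `f4`, and all counts are hypothetical; only the four probabilities come from the "survey" example.

```python
import math

def psq_tf(translation_probs, doc_term_counts):
    """tf(e, d_k) = sum over translations f_i of p(f_i | e) * tf(f_i, d_k)."""
    return sum(p * doc_term_counts.get(f, 0) for f, p in translation_probs.items())

def psq_df(translation_probs, doc_freqs):
    """df(e) = sum over translations f_i of p(f_i | e) * df(f_i)."""
    return sum(p * doc_freqs.get(f, 0) for f, p in translation_probs.items())

def bm25_term_score(tf, df, qtf, num_docs, doc_len, avg_doc_len):
    """One query term's BM25 contribution, with k1 = 1.2, b = 0.75, k3 = 7 (assumed)."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    tf_part = (2.2 * tf) / (0.3 + 0.9 * doc_len / avg_doc_len + tf)
    qtf_part = (8 * qtf) / (7 + qtf)
    return idf * tf_part * qtf_part

# Probabilities from the "survey" example; f1..f4 are placeholder ids for its translations.
survey_translations = {"f1": 0.4, "f2": 0.3, "f3": 0.25, "f4": 0.05}
doc_counts = {"f1": 2, "f3": 4}                               # invented counts in one document
collection_df = {"f1": 100, "f2": 50, "f3": 20, "f4": 10}     # invented document frequencies

tf = psq_tf(survey_translations, doc_counts)       # 0.4*2 + 0.25*4 = 1.8
df = psq_df(survey_translations, collection_df)    # 40 + 15 + 5 + 0.5 = 60.5
score = bm25_term_score(tf, df, qtf=1, num_docs=1000, doc_len=120, avg_doc_len=120)
print(round(tf, 2), round(df, 2), score > 0)
```

Because the weights sum to 1, the resulting TF and DF stay on the same scale as monolingual counts, so the ranking function itself needs no change.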

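The Backoff Translation slide says the lexicon might contain stems, surface forms, or a mix. One way to read the accompanying diagram is as a four-stage lookup that tries progressively looser matches. The sketch below follows that reading; the stage order, the function names, and especially `toy_stem` (a stand-in that only handles a few French verb endings) are assumptions made for illustration.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer: maps a few French verb endings to a final 'e'."""
    for suffix in ("ent", "ez", "es"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + "e"
    return word

def backoff_translate(term, lexicon, stem=toy_stem):
    """Look up a document term, backing off from surface forms to stems."""
    # Stage 1: surface form of the term matches a surface form in the lexicon.
    if term in lexicon:
        return lexicon[term], "surface-surface"
    # Stage 2: stem of the term matches a surface form in the lexicon.
    term_stem = stem(term)
    if term_stem in lexicon:
        return lexicon[term_stem], "stem-surface"
    # Stem the lexicon's source side (the first entry wins when stems collide).
    stemmed = {}
    for source, target in lexicon.items():
        stemmed.setdefault(stem(source), target)
    # Stage 3: surface form of the term matches the stem of a lexicon entry.
    if term in stemmed:
        return stemmed[term], "surface-stem"
    # Stage 4: stem of the term matches the stem of a lexicon entry.
    if term_stem in stemmed:
        return stemmed[term_stem], "stem-stem"
    return None, "untranslated"

print(backoff_translate("mangez", {"mangez": "eat"}))   # ('eat', 'surface-surface')
print(backoff_translate("mangez", {"mange": "eats"}))   # ('eats', 'stem-surface')
print(backoff_translate("mange", {"mangent": "eat"}))   # ('eat', 'surface-stem')
print(backoff_translate("mangez", {"mangent": "eat"}))  # ('eat', 'stem-stem')
```

Trying exact surface matches first keeps precision high when the lexicon covers the term, while the later stages recover translations that would otherwise be missed.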