6. Full-Text Search - Hinkelmann


2 Information Retrieval

Motivation

Information Retrieval has been a field of activity for many years, but it was long seen as an area of narrow interest. The advent of the Web changed this perception:
- a universal repository of knowledge
- free (low-cost) universal access
- no central editorial board
- many problems, though: IR is seen as key to finding the solutions!

Prof. Dr. Knut Hinkelmann, Information Retrieval and Knowledge Organisation - 2 Information Retrieval

Motivation

Information Retrieval deals with the representation, storage, organization of, and access to information items. The emphasis is on the retrieval of information (not data); the focus is on the user's information need.

Example of an information need: Find all documents containing information about car accidents which happened in Vienna and in which people were injured.

The information need is expressed as a query.

Generic Schema of an Information System

(Diagram) Information resources are described by a representation (index/meta-data).

The user's information need is represented as a query; retrieval compares the query with the resource representations (ranking). Information Retrieval systems do not search through the documents themselves but through their representation (also called index, meta-data or description). Source: (Ferber 2004)

Example

D1: Heavy accident. Because of a heavy car accident 4 people died yesterday morning in Vienna.
D2: More vehicles. In this quarter more cars became registered in Vienna.
D3: Truck causes accident. In Vienna a trucker drove into a crowd of people. Four people were injured.

Information need: documents containing information about accidents with heavy vehicles in Vienna
Query: accident heavy vehicles vienna
Expected result: document D3
But:
- not all terms of the query occur in the document
- the occurring terms accident and heavy also occur in D1

Retrieval System

(Diagram) The interface receives the information need as a query; query processing and ranking compare the query terms against the index; the answer is a sorted list of document IDs (weighted documents).

(Diagram, continued) Indexing assigns IDs to the documents, decomposes the text into terms, and stores terms and document IDs in the index.

Each document is represented by a set of representative keywords or index terms. An index term is a document word useful for remembering the document's main themes. The index is stored in an efficient data structure. Queries are answered using the index; with the ID the document can be retrieved.

Indexing

Manual indexing (key words): the user specifies key words he/she assumes useful. Usually, key words are nouns, because nouns have meaning by themselves. There are two possibilities:
1. the user can assign any terms
2. the user can select from a predefined set of terms (controlled vocabulary)

Automatic indexing (full text search): search engines assume that all words are index terms (full text representation); the system generates the index terms from the words occurring in the text.

Automatic Indexing: 1. Decompose a Document into Terms

D1: heavy accident - because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles - in this quarter more cars became registered in vienna
D3: truck causes accident - in vienna a trucker drove into a crowd of people four people were injured
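The decomposition of the three example documents into terms and the construction of an inverted list with frequencies can be sketched in a few lines of Python (an illustration, not part of the original slides; punctuation handling and normalization are omitted):

```python
from collections import defaultdict

docs = {
    "D1": "heavy accident because of a heavy car accident 4 people died yesterday morning in vienna",
    "D2": "more vehicles in this quarter more cars became registered in vienna",
    "D3": "truck causes accident in vienna a trucker drove into a crowd of people four people were injured",
}

# inverted list: term -> {document ID: frequency of the term in that document}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():  # decomposition: blanks as separators
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

print(index["accident"])  # {'D1': 2, 'D3': 1}
print(index["vienna"])    # {'D1': 1, 'D2': 1, 'D3': 1}
```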

Rules determine how a text is decomposed into terms, by defining separators like punctuation marks, blanks or hyphens. Additional preprocessing, e.g.:
- exclude specific strings (stop words, numbers)
- generate normal forms (stemming)
- substitute characters (e.g. upper case/lower case, umlauts)

Automatic Indexing: 2. Index Represented as an Inverted List

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1, D3
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2

For each term there is a list of the documents in which the term occurs. Additional information can be stored with each document, like the frequency of occurrence or the positions of occurrence. An inverted list is similar to an index in a book.

Index as Inverted List with Frequency

In this example the inverted list contains the document identifier and the frequency of the term in the document.

Term        (document; frequency)
a           (D1,1) (D3,2)
accident    (D1,2) (D3,1)
became      (D2,1)
because     (D1,1)
car         (D1,1)
cars        (D2,1)
died        (D1,1)
heavy       (D1,2)
in          (D1,1) (D2,2) (D3,1)
more        (D2,1)
of          (D1,1) (D3,1)
people      (D1,1) (D3,2)
quarter     (D2,1)
registered  (D2,1)
truck       (D3,1)
vehicles    (D2,1)
...

Problems of Information Retrieval

Word form: A word can occur in different forms, e.g. singular or plural. Example: a query for car should also find documents containing the word cars.

Meaning: A single term can have different meanings; on the other hand, the same meaning can be expressed using different terms. Example: when searching for car, documents containing vehicle should also be found.

Wording, phrases: The same issue can be expressed in various ways. Example: searching for motorcar should also find documents containing motorized car.

Word Forms

Flexion: conjugation and declension of a word
car - cars
run - ran - running

Derivations: words having the same stem
form - format - formation

Compositions: statements like information management vs. management of information. In German, compositions are written as single words, sometimes with a hyphen: Informationsmanagement, Informations-Management.

Word Meaning and Phrases

Dealing with words having the same or a similar meaning:
- Synonyms: record - file - dossier; seldom - not often
- Variants in spelling (e.g. BE vs. AE): organisation - organization; night - nite
- Abbreviations: UN - United Nations
- Polysemes: words with multiple meanings, e.g. bank

2.1 Dealing with Word Forms and Phrases

We distinguish two ways to deal with word forms and phrases:

Indexing without preprocessing: all occurring word forms are included in the index; different word forms are unified at search time (string operations).

Indexing with preprocessing: unification of word forms during indexing; the terms are normal forms of the occurring word forms, so the index is largely independent of the concrete formulation of the text (computerlinguistic approach).

2.1.1 Indexing Without Preprocessing

Index: contains all the word forms occurring in the documents.
Query: searching for specific word forms is possible (e.g. searching for cars but not for car). To search for different word forms, string operations can be applied:
- operators for truncation and masking, e.g.
  ?  covers exactly one character
  *  covers an arbitrary number of characters
- context operators, e.g.
  [n]  exact distance between terms
  <n>  maximal distance between terms

Index Without Preprocessing and Query

Term        Document IDs
a           D1, D3
accident    D1, D3
became      D2
because     D1
car         D1
cars        D2
died        D1
heavy       D1
in          D1, D2, D3
more        D2
of          D1, D3
people      D1, D3
quarter     D2
registered  D2
truck       D3
vehicles    D2

Query: vehicle? car? people

Truncation and Masking: Searching for Different Word Forms

Truncation: wildcards cover characters at the beginning and end of words (prefix or suffix)
schreib*     finds schreiben, schreibt, schreibst, schreibe
??schreiben  finds anschreiben, beschreiben, but not verschreiben

Masking deals with characters within words; in particular in German, declension and conjugation affect not only suffix and prefix
schr??b*  can find schreiben, schrieb
h??s*     can find Haus, Häuser
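Truncation and masking operators of this kind can be implemented by translating them to regular expressions; a sketch (the operator syntax follows the slides, the implementation is illustrative):

```python
import re

def wildcard_to_regex(pattern):
    # '?' covers exactly one character, '*' an arbitrary number of characters
    escaped = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.compile(f"^{escaped}$")

terms = ["schreiben", "schreibt", "schrieb", "schrauben", "anschreiben"]
rx = wildcard_to_regex("schr??b*")
print([t for t in terms if rx.match(t)])
# ['schreiben', 'schreibt', 'schrieb', 'schrauben'] - note the unintended match 'schrauben'
```

The unintended match schrauben illustrates the disadvantage discussed next: string patterns cannot distinguish word forms from accidental look-alikes.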

Disadvantage: with truncation and masking not only the intended words are found
schr??b*  also finds schrauben
h??s*     also finds Hans, Hanse, hausen, hassen, and also words in other languages like horse

Context Operators

Context operators allow searching for variations of text phrases:
- exact word distance: Bezug [3] Telefonat finds "Bezug nehmend auf unser Telefonat"
- maximal word distance: text <2> retrieval finds "text retrieval" and "text and fact retrieval"

For context operators to be applicable, the positions of the words must be stored in the index.

Indexing Without Preprocessing: Efficiency

- Efficiency: indexing is efficient; there is overhead at retrieval time to apply the string operators.
- Word forms: the user has to codify all possible word forms and phrases in the query using truncation and masking operators; no support is given by the search engine. On the other hand, the retrieval engine is language independent.
- Phrases: variants in text phrases can be coded using context operators.

2.1.2 Preprocessing of the Index (Computerlinguistic Approach)

Each document is represented by a set of representative keywords or index terms. An index term is a document word useful for remembering the document's main themes. The index contains standard forms of useful terms:
1. Restrict the allowed terms (index terms vs. terms not for the index)
2. Normalisation: map terms to a standard form

Restricting Allowed Index Terms

Objective: increase efficiency and effectiveness by neglecting terms that do not contribute to the assessment of a document's relevance.

There are two possibilities to restrict the allowed index terms:
1. Explicitly specify the allowed index terms (controlled vocabulary)
2. Specify terms that are not allowed as index terms (stop words)

Stop Words

Stop words are terms that are not stored in the index. Candidates for stop words are:
- words that occur very frequently: a term occurring in every document is useless as an index term, because it does not tell anything about which documents the user might be interested in; a word which occurs in only 0.001% of the documents, by contrast, is quite useful, because it narrows down the space of documents which might be of interest to the user
- words with no or little meaning
- terms that are not words (e.g. numbers)

Examples: articles, conjunctions, prepositions and auxiliary verbs (to be, to have) occur very often and in general have no meaning as a search criterion. Application-specific stop words are also possible.

Normalisation of Terms

There are various possibilities to compute standard forms:
- N-grams
- stemming: removing suffixes or prefixes
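Stop-word elimination as described above can be sketched as follows (the stop-word list is illustrative, chosen to fit the example documents):

```python
STOP_WORDS = {"a", "of", "in", "this", "because", "became", "were", "4", "into"}

def index_terms(text):
    # keep only terms that are not stop words
    return [t for t in text.split() if t not in STOP_WORDS]

print(index_terms("because of a heavy car accident 4 people died yesterday morning in vienna"))
# ['heavy', 'car', 'accident', 'people', 'died', 'yesterday', 'morning', 'vienna']
```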

N-Grams

Index: sequences of characters of length N. Example: persons
3-grams (N=3): per, ers, rso, son, ons
4-grams (N=4): pers, erso, rson, sons

N-grams can also cross word boundaries ('_' stands for a blank). Example: persons from switzerland
3-grams (N=3): per, ers, rso, son, ons, ns_, s_f, _fr, fro, rom, om_, m_s, _sw, swi, wit, itz, tze, zer, erl, rla, lan, and

Stemming

Stemming: remove suffixes and prefixes to find a common stem, e.g.
- remove -ing and -ed for verbs
- remove plural -s for nouns

There are a number of exceptions, e.g.
- ing and ed may belong to the stem, as in red or ring
- irregular verbs like go - went - gone, run - ran - run

Approaches for stemming: rule-based approach, lexicon-based approach.

Rules for Stemming in English

Kuhlen (1977) derived a rule set for stemming of most English words. Each rule consists of an ending, a replacement, and a condition (in the rules, X and Y stand for any letters, C for a consonant, V for any vowel). The thirteen listed endings include ies, XYes, XYs, ies', Xes', Xs', X's, X', XYing, ied and XYed, with replacements such as y, XY, X and XYe, and conditions such as XY = Co, ch, sh, ss, zz or Xx; XY = XC, Xe, Vy, Vo, oa or ea; XY = CC, XV, Xx; XY = VC. Source: (Ferber 2003)

Problems for Stemming

In English a small number of rules covers most of the words. In German it is more difficult, because for many words the stem changes as well:
- insertion of umlauts, e.g. Haus - Häuser
- new prefixes, e.g. laufen - gelaufen
- separation/retaining of prefixes, e.g. mitbringen - er brachte den Brief mit; überbringen - er überbrachte den Brief
- irregular insertion of linking elements when building compounds: Schwein-kram, Schwein-s-haxe, Schwein-e-braten

These are problems that cannot easily be dealt with by general rules operating only on strings.

Lexicon-based Approaches for Stemming

Principal idea: a lexicon contains stems for word forms.

Complete lexicon: for each possible word form the stem is stored
persons - person
running - run
ran - run
went - go
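The N-gram decomposition shown above (with '_' standing for a blank when grams cross word boundaries) can be sketched as:

```python
def ngrams(text, n):
    s = text.replace(" ", "_")  # so that N-grams can cross word boundaries
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(ngrams("persons", 3))
# ['per', 'ers', 'rso', 'son', 'ons']
print(ngrams("persons from switzerland", 3)[:8])
# ['per', 'ers', 'rso', 'son', 'ons', 'ns_', 's_f', '_fr']
```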

going - go
gone - go

Word stem lexicon: for each stem, all the data necessary to derive all word forms are stored, with a distinction of different flexion classes and a specification of anomalies.

Example: to compute the stem of Flüssen, the last characters are removed successively and the umlaut is exchanged until a valid stem is found (Lezius 1995):

Case/Ending   normal      umlaut exchanged
-             Flüssen-    Flussen-
n             Flüsse-n    Flusse-n
en            Flüss-en    Fluss-en
sen           Flüs-sen    Flus-sen
...           ...         ...

Source: (Ferber 2003)

Index with Stemming and Stop Word Elimination

Documents:
D1: heavy accident - because of a heavy car accident 4 people died yesterday morning in vienna
D2: more vehicles - in this quarter more cars became registered in vienna
D3: truck causes accident - in vienna a trucker drove into a crowd of people four people were injured

Index:
Term       Document IDs
accident   D1, D3
car        D1, D2
cause      D3
crowd      D3
die        D1
drive      D3
four       D3
heavy      D1
injur      D3
more       D2
morning    D1
people     D1, D3
quarter    D2
register   D2
truck      D3
trucker    D3
vehicle    D2
vienna     D1, D2, D3
yesterday  D1

2.2 Classical Information Retrieval Models

Classical models:
- Boolean model
- Vector space model
- Probabilistic model

Alternative models:
- user preferences
- associative search
- social filtering

Classic IR Models - Basic Concepts

Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents. The importance of an index term is represented by a weight associated with it.

Let ti be an index term, dj be a document, and wij a weight associated with (ti, dj). The weight wij quantifies the importance of the index term for describing the document contents. (Stop words can be regarded as terms with wij = 0 for every document.)
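The index with stemming and stop-word elimination shown above can be reproduced with a small stem lexicon and stop-word list (both illustrative, chosen to fit the example documents):

```python
STOP = {"a", "of", "in", "this", "because", "became", "were", "4", "into"}
STEM = {"cars": "car", "causes": "cause", "died": "die", "drove": "drive",
        "injured": "injur", "registered": "register", "vehicles": "vehicle"}

docs = {
    "D1": "heavy accident because of a heavy car accident 4 people died yesterday morning in vienna",
    "D2": "more vehicles in this quarter more cars became registered in vienna",
    "D3": "truck causes accident in vienna a trucker drove into a crowd of people four people were injured",
}

index = {}
for doc_id, text in docs.items():
    for token in text.split():
        if token in STOP:
            continue
        term = STEM.get(token, token)  # reduce word forms to their stem
        index.setdefault(term, set()).add(doc_id)

print(sorted(index["car"]))     # ['D1', 'D2']
print(sorted(index["vienna"]))  # ['D1', 'D2', 'D3']
```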

(Baeza-Yates & Ribeiro-Neto 1999)

Classic IR Models - Basic Concepts

- ti is an index term, dj is a document, n is the total number of documents
- T = (t1, t2, ..., tk) is the set of all index terms
- wij >= 0 is a weight associated with (ti, dj); wij = 0 indicates that the term does not belong to the document
- vec(dj) = (w1j, w2j, ..., wkj) is the weighted vector associated with the document dj
- gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ti, dj)
- fi is the number of documents containing term ti

Source: teaching material of Ribeiro-Neto

Index Vectors as Matrix

      d1    d2    d3    d4
t1    w1,1  w1,2  w1,3  w1,4
t2    w2,1  w2,2  w2,3  w2,4
t3    w3,1  w3,2  w3,3  w3,4
...
tn    wn,1  wn,2  wn,3  wn,4

The vectors vec(dj) = (w1j, w2j, ..., wkj) associated with the documents dj can be represented as a matrix:
- each column represents a document vector vec(dj) = (w1j, w2j, ..., wkj); the document dj contains a term ti if wij > 0
- each row represents a term vector tvec(ti) = (wi1, wi2, ..., win); the term ti is in document dj if wij > 0

Boolean Document Vectors

d1: heavy accident - because of a heavy car accident 4 people died yesterday morning in vienna
d2: more vehicles - in this quarter more cars became registered in vienna
d3: truck causes accident - in vienna a trucker drove into a crowd of people four people were injured

Term       d1  d2  d3
accident    1   0   1
car         1   1   0
cause       0   0   1
crowd       0   0   1
die         1   0   0
drive       0   0   1
four        0   0   1
heavy       1   0   0
injur       0   0   1
more        0   1   0
morning     1   0   0
people      1   0   1
quarter     0   1   0
register    0   1   0
truck       0   0   1
trucker     0   0   1
vehicle     0   1   0
vienna      1   1   1
yesterday   1   0   0

2.2.1 The Boolean Model

A simple model based on set theory, with precise semantics and a neat formalism. The index is binary: terms are either present or absent, thus wij is 0 or 1. Queries are specified as Boolean expressions using the operators AND, OR and NOT, e.g.
q = ta AND (tb OR tc), for instance (vehicle OR car) AND accident

Boolean Retrieval Function

The retrieval function can be defined recursively:
R(ti, dj)        = TRUE,  if wij = 1 (i.e. ti is in dj)
                 = FALSE, if wij = 0 (i.e. ti is not in dj)
R(q1 AND q2, dj) = R(q1, dj) AND R(q2, dj)
R(q1 OR q2, dj)  = R(q1, dj) OR R(q2, dj)
R(NOT q, dj)     = NOT R(q, dj)
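The recursive retrieval function can be sketched directly, with queries encoded as nested tuples (an illustrative encoding, not the slides' notation):

```python
docs = {
    "d1": {"accident", "car", "die", "heavy", "morning", "people", "vienna", "yesterday"},
    "d2": {"car", "more", "quarter", "register", "vehicle", "vienna"},
    "d3": {"accident", "cause", "crowd", "drive", "four", "injur",
           "people", "truck", "trucker", "vienna"},
}

def R(query, doc_terms):
    # a query is either a term (string) or a tuple ('AND'|'OR', q1, q2) / ('NOT', q)
    if isinstance(query, str):
        return query in doc_terms
    op = query[0]
    if op == "AND":
        return R(query[1], doc_terms) and R(query[2], doc_terms)
    if op == "OR":
        return R(query[1], doc_terms) or R(query[2], doc_terms)
    return not R(query[1], doc_terms)  # NOT

q1 = ("AND", ("OR", "vehicle", "car"), "accident")
q2 = ("OR", ("AND", "vehicle", "car"), "accident")
print([d for d in docs if R(q1, docs[d])])  # ['d1']
print([d for d in docs if R(q2, docs[d])])  # ['d1', 'd2', 'd3']
```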

The Boolean function computes only the values 0 and 1, i.e. Boolean retrieval classifies documents into two categories: relevant (R = 1) and irrelevant (R = 0).

Example of Boolean Retrieval

Using the Boolean document vectors above:

Query: (vehicle OR car) AND accident
R((vehicle OR car) AND accident, d1) = 1
R((vehicle OR car) AND accident, d2) = 0
R((vehicle OR car) AND accident, d3) = 0

Query: (vehicle AND car) OR accident
R((vehicle AND car) OR accident, d1) = 1
R((vehicle AND car) OR accident, d2) = 1
R((vehicle AND car) OR accident, d3) = 1

Drawbacks of the Boolean Model

- Retrieval is based on a binary decision criterion; there is no notion of partial matching and no ranking of the documents (absence of a grading scale). The query q = t1 OR t2 OR t3 is satisfied equally by documents containing one, two or three of the terms t1, t2, t3.
- No weighting of terms: wij is 0 or 1.

- The information need has to be translated into a Boolean expression, which most users find awkward; the Boolean queries formulated by users are most often too simplistic.
- As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query.

2.2.2 Vector Space Model

The index can be regarded as an n-dimensional space, where each term corresponds to a dimension. Example:

      accident  car  vehicle
d1    4         3    1
d2    3         2    3

d1 = (4, 3, 1) and d2 = (3, 2, 3); wij > 0 whenever ti occurs in dj.

To each term ti a unitary vector vec(i) is associated. The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents). A document can be regarded as a vector starting from the origin (0, 0, 0), i.e. as a point in space.

2.2.2.1 Coordinate Matching

Documents and query are represented as vectors:
- document vectors vec(dj) = (w1j, w2j, ..., wkj)
- query vector vec(q) = (w1q, ..., wkq)

The vectors have binary values: wij = 1 if term ti occurs in document dj, wij = 0 otherwise.

Ranking: return the documents containing at least one query term, ranked by the number of occurring query terms. The ranking function is the scalar product (multiply the components and sum up):

R(q, d) = q * d = sum over i = 1, ..., n of qi * di

Coordinate Matching: Example

Term       d1  d2  d3  q
accident    1   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       1   0   0   1
injur       0   0   1   0
more        0   1   0   0
morning     1   0   0   0
people      1   0   1   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

The query vector represents the terms of the query (cf. stemming): accident heavy vehicles vienna

Result: q * d1 = 3, q * d2 = 2, q * d3 = 2

Assessment of Coordinate Matching

Advantage compared to the Boolean model: ranking. Three main drawbacks:
- the frequency of terms in documents is not considered
- no weighting of terms
- larger documents are privileged

2.2.2.2 Term Weighting

The use of binary weights is too limiting; non-binary weights allow for partial matches. These term weights are used to compute a degree of similarity between a query and each document. How to compute the weights wij and wiq? A good weight must take into account two effects:
- quantification of intra-document contents (similarity): the tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
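With binary vectors, the scalar product is simply the number of shared terms, so the coordinate matching example can be sketched with set intersections:

```python
query = {"accident", "heavy", "vehicle", "vienna"}
docs = {
    "d1": {"accident", "car", "die", "heavy", "morning", "people", "vienna", "yesterday"},
    "d2": {"car", "more", "quarter", "register", "vehicle", "vienna"},
    "d3": {"accident", "cause", "crowd", "drive", "four", "injur",
           "people", "truck", "trucker", "vienna"},
}

# scalar product of binary vectors = size of the term intersection
scores = {d: len(query & terms) for d, terms in docs.items()}
print(sorted(scores.items(), key=lambda x: -x[1]))
# [('d1', 3), ('d2', 2), ('d3', 2)]
```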

wij = tf(i,j) * idf(i)   (Baeza-Yates & Ribeiro-Neto 1999)

TF - Term Frequency

Term       d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Let freq(i,j) be the raw frequency of term ti within document dj (i.e. the number of occurrences of ti in dj).
- A simple tf factor can be computed as f(i,j) = freq(i,j).
- A normalized tf factor is given by f(i,j) = freq(i,j) / max_l freq(l,j), where the maximum is computed over all terms occurring within the document dj.

For reasons of simplicity, this example uses f(i,j) = freq(i,j).
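The normalized tf factor can be sketched as:

```python
def normalized_tf(freqs):
    # freqs: term -> raw frequency freq(i,j) within one document
    max_freq = max(freqs.values())  # max_l freq(l,j)
    return {term: f / max_freq for term, f in freqs.items()}

d1 = {"accident": 2, "car": 1, "die": 1, "heavy": 2,
      "morning": 1, "people": 1, "vienna": 1, "yesterday": 1}
print(normalized_tf(d1)["accident"])  # 1.0
print(normalized_tf(d1)["car"])       # 0.5
```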

(Baeza-Yates & Ribeiro-Neto 1999)

IDF - Inverse Document Frequency

IDF can be interpreted as the amount of information associated with the term ti: a term occurring in few documents is more useful as an index term than a term occurring in nearly every document.

Let ni be the number of documents containing term ti and N be the total number of documents.
- A simple idf factor can be computed as idf(i) = 1/ni.
- A normalized idf factor is given by idf(i) = log(N/ni); the log is used to make the values of tf and idf comparable.

Example with TF and IDF

Term       IDF    d1  d2  d3
accident   0.5     2   0   1
car        0.5     1   1   0
cause      1       0   0   1
crowd      1       0   0   1
die        1       1   0   0
drive      1       0   0   1
four       1       0   0   1
heavy      1       2   0   0
injur      1       0   0   1
more       1       0   2   0
morning    1       1   0   0
people     0.5     1   0   2
quarter    1       0   1   0
register   1       0   1   0
truck      1       0   0   1
trucker    1       0   0   1
vehicle    1       0   1   0
vienna     0.33    1   1   1
yesterday  1       1   0   0

In this example a simple tf factor f(i,j) = freq(i,j) and a simple idf factor idf(i) = 1/ni are used. It is advantageous to store IDF and TF separately.

Indexing a New Document

Changes to the indexes when adding a new document d:
- a new document vector with tf factors for d is created
- the idf factors of the terms occurring in d are adapted
- all other document vectors remain unchanged

Ranking

- Scalar product: computes co-occurrences of terms in document and query. Drawback: the scalar product privileges large documents over small ones.
- Euclidean distance between the endpoints of the vectors. Drawback: the Euclidean distance privileges small documents over large ones.
- Angle between the vectors: the smaller the angle between query and document vector, the more similar they are, and the angle is independent of the size of the document. The cosine is a good measure of the angle.

Cosine Ranking Formula

The more the directions of query q and document dj coincide, the more relevant is dj. The cosine formula takes into account the ratio of the terms, not their concrete number:

cos(q, dj) = (q * dj) / (|q| * |dj|)

Let alpha be the angle between q and dj. Because all values wij >= 0, the angle is between 0 and 90 degrees:
- the larger alpha, the smaller cos(alpha); the smaller alpha, the larger cos(alpha)
- cos(0) = 1, cos(90) = 0

The Vector Model

The best term-weighting schemes use weights given by wij = f(i,j) * log(N/ni); this strategy is called a tf-idf weighting scheme. For the query term weights, a suggestion is

wiq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N/ni)   (Baeza-Yates & Ribeiro-Neto 1999)

The vector model with tf-idf weights is a good ranking strategy for general collections; it is usually as good as the known ranking alternatives, and it is simple and fast to compute.

Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts the documents according to their degree of similarity to the query

Disadvantages:
- assumes independence of the index terms; it is not clear, though, that this is bad

(Baeza-Yates & Ribeiro-Neto 1999)
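Putting tf-idf weighting and the cosine ranking formula together for the running example (a sketch; it uses the normalized idf = log(N/ni), so terms occurring in every document, like vienna, get weight 0, and the same weighting is applied to the query for simplicity):

```python
import math

tf = {
    "d1": {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1,
           "people": 1, "vienna": 1, "yesterday": 1},
    "d2": {"car": 1, "more": 2, "quarter": 1, "register": 1,
           "vehicle": 1, "vienna": 1},
    "d3": {"accident": 1, "cause": 1, "crowd": 1, "drive": 1, "four": 1,
           "injur": 1, "people": 2, "truck": 1, "trucker": 1, "vienna": 1},
}
query = {"accident": 1, "heavy": 1, "vehicle": 1, "vienna": 1}

N = len(tf)
ni = {}  # number of documents containing each term
for doc in tf.values():
    for term in doc:
        ni[term] = ni.get(term, 0) + 1

def w(term, freqs):
    return freqs.get(term, 0) * math.log(N / ni[term])  # wij = tf * idf

def cosine(q, d):
    dot = sum(w(t, q) * w(t, d) for t in set(q) | set(d))
    norm_q = math.sqrt(sum(w(t, q) ** 2 for t in q))
    norm_d = math.sqrt(sum(w(t, d) ** 2 for t in d))
    return dot / (norm_q * norm_d)

ranking = sorted(tf, key=lambda d: cosine(query, tf[d]), reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```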

2.2.3 Extensions of the Classical Models

- Combination of the Boolean model, the vector model, and indexing with and without preprocessing
- Extended index with additional information like document format (.doc, .pdf, ...) or language
- Using information about links in hypertext: link structure, anchor text

Boolean Operators in the Vector Model

Term       d1  d2  d3  q
accident    2   0   1   1
car         1   1   0   0
cause       0   0   1   0
crowd       0   0   1   0
die         1   0   0   0
drive       0   0   1   0
four        0   0   1   0
heavy       2   0   0   1
injur       0   0   1   0
more        0   2   0   0
morning     1   0   0   0
people      1   0   2   0
quarter     0   1   0   0
register    0   1   0   0
truck       0   0   1   0
trucker     0   0   1   0
vehicle     0   1   0   1
vienna      1   1   1   1
yesterday   1   0   0   0

Many search engines allow queries with Boolean operators, e.g. (vehicle OR car) AND accident. Retrieval then proceeds in two steps:
- The Boolean operators are used to select the relevant documents; in the example, only documents containing accident and either vehicle or car are considered relevant.
- The ranking of the relevant documents is based on the vector model (tf-idf weighting, cosine ranking formula).

Queries with Wild Cards in the Vector Model

Term        d1  d2  d3  q
accident     2   0   1   1
car          1   0   0   0
cars         0   1   0   0
causes       0   0   1   0
crowd        0   0   1   0
died         1   0   0   0
drove        0   0   1   0
four         0   0   1   0
heavy        2   0   0   1
injured      0   0   1   0
more         0   2   0   0
morning      1   0   0   0
people       1   0   2   0
quarter      0   1   0   0
registered   0   1   0   0
truck        0   0   1   0
trucker      0   0   1   0
vehicles     0   1   0   1
vienna       1   1   1   1
yesterday    1   0   0   0

Here the vector model is based on an index without preprocessing: the index contains all word forms occurring in the documents. Queries allow wildcards (masking and truncation), e.g. accident heavy vehicle* vienna.

Principle of query answering:
- First, the wildcards are expanded to all matching terms (here vehicle* matches vehicles).
- Then the ranking proceeds according to the vector model.

Using Link Information in Hypertext

Ranking: the link structure is used to calculate a quality ranking for each web page
- PageRank
- HITS - Hypertext Induced Topic Selection (authority and hub)
- Hilltop

Indexing: the text of a link (anchor text) is associated both with the page the link is on and with the page the link points to.

The PageRank Calculation

PageRank was developed by Sergey Brin and Lawrence Page at Stanford University and published in 1998 1). PageRank uses the link structure of web pages.

Original version of the PageRank calculation:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where
- PR(A) is the PageRank of page A,
- PR(Ti) is the PageRank of the pages Ti that contain a link to page A,
- C(Ti) is the number of links going out of page Ti,
- d is a damping factor with 0 <= d <= 1.

1) S. Brin and L. Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Computer Networks and ISDN Systems, Vol. 30, 1998, pp. 107-117. http://www-db.stanford.edu/~backrub/google.html or http://infolab.stanford.edu/pub/papers/google.pdf

The PageRank Calculation - Explanation

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

The PageRank of page A is recursively defined by the PageRanks of those pages which link to page A.
- The PageRank of a page Ti is always weighted by the number of outbound links C(Ti) on page Ti: the more outbound links a page Ti has, the less page A benefits from a link on Ti.
- The weighted PageRanks of the pages Ti are then added up. As a consequence, an additional inbound link always increases page A's PageRank.
- Finally, the sum of the weighted PageRanks of all pages Ti is multiplied by a damping factor d, which can be set between 0 and 1.

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml

Damping Factor and the Random Surfer Model

The PageRank algorithm and its damping factor are motivated by the model of a random surfer. The random surfer reaches a page A either by following a link from a page Ti to A or by choosing a web page at random (e.g. typing its URL).
- The probability that the random surfer clicks on a particular link is determined by the number of links on that page: if a page Ti contains C(Ti) links, the probability for each link is 1/C(Ti).
- The damping factor reflects that the surfer does not click on an infinite number of links, but sometimes gets bored and jumps to a random page. d is the probability that the random surfer keeps clicking on links; this is why the sum of the weighted PageRanks is multiplied by d.
- (1-d) is the probability that the surfer jumps to a random page after he stops clicking links. Regardless of inbound links, the probability of jumping to a given page is always (1-d), so every page has a minimum PageRank. (According to Brin and Page, d = 0.85 is a good value.)

Source: http://pr.efactory.de/e-pagerank-algorithm.shtml
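The formula above can be tried out directly in a few lines of Python. This is a minimal sketch, assuming a three-page web with the link structure A→B, A→C, B→C, C→A (the structure used in the worked example that follows) and d = 0.5:

```python
# Iterative PageRank for a tiny three-page web.
# Assumed link structure: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5                       # damping factor (0.5 to match the slide example)
pr = {p: 1.0 for p in links}  # initial starting value 1 for every page

for _ in range(50):           # a few dozen cycles suffice for 3 pages
    new = {}
    for page in links:
        # sum PR(Ti)/C(Ti) over all pages Ti that link to `page`
        incoming = sum(pr[t] / len(out) for t, out in links.items() if page in out)
        new[page] = (1 - d) + d * incoming
    pr = new

print({p: round(v, 8) for p, v in pr.items()})
# converges to PR(A)=14/13, PR(B)=10/13, PR(C)=15/13
```

With d = 0.5 the values converge after a handful of iterations; on the real web, far more pages and about 100 iterations are involved.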

Calculation of the PageRank - Example

We consider a small web consisting of only three pages A, B and C with the link structure shown in the figure. To keep the calculation simple, d is set to 0.5. These are the equations for the PageRank calculation:

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A) / 2)
PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B))

Solving these equations, we get the following PageRank values for the single pages:

PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

Source: http://pr.efactory.de/e-pagerank-algorithmus.shtml

Iterative Calculation of the PageRank - Example

Because of the size of the actual web, the Google search engine uses an approximate, iterative computation of the PageRank values: each page is assigned an initial starting value, and the PageRanks of all pages are then calculated in several computation cycles.

Iteration  PR(A)       PR(B)       PR(C)
0          1           1           1
1          1           0.75        1.125
2          1.0625      0.765625    1.1484375
3          1.07421875  0.76855469  1.15283203
4          1.07641602  0.76910400  1.15365601
5          1.07682800  0.76920700  1.15381050
6          1.07690525  0.76922631  1.15383947
7          1.07691973  0.76922993  1.15384490
8          1.07692245  0.76923061  1.15384592
9          1.07692296  0.76923074  1.15384611
10         1.07692305  0.76923076  1.15384615
11         1.07692307  0.76923077  1.15384615
12         1.07692308  0.76923077  1.15384615

According to Lawrence Page and Sergey Brin, about 100 iterations are necessary to get a good approximation of the PageRank values of the whole web.

Source: http://pr.efactory.de/d-pagerank-algorithmus.shtml

Alternative Link Analysis Algorithms (I): HITS

Hypertext-Induced Topic Selection (HITS) is a link analysis algorithm proposed by J. Kleinberg in 1999 1). HITS rates web pages by their authority and hub values:
- The authority value estimates the value of the content of a page; a good authority is a page that is pointed to by many good hubs.
- The hub value estimates the value of a page's links to other pages; a good hub is a page that points to many good authorities (good link collections are examples of hubs).
Every page i is assigned a hub weight hi and an authority weight ai.

1) Jon Kleinberg: Authoritative sources in a hyperlinked environment. In: Journal of the ACM, Vol. 46, No. 5, pp. 604-632, 1999. http://www.cs.cornell.edu/home/kleinber/auth.pdf

Alternative Link Analysis Algorithms (II): Hilltop

The Hilltop algorithm 1) rates documents based on their incoming links from so-called expert pages.
- Expert pages are defined as pages that are about a topic and have links to many non-affiliated pages on that topic. Pages are defined as non-affiliated if they are from authors of non-affiliated organisations.
- Websites which have backlinks from many of the best expert pages are authorities and are ranked high.
- A good directory page is an example of an expert page (cp. hubs).
The determination of expert pages is a central point of the Hilltop algorithm.

1)

The Hilltop algorithm was developed by Bharat and Mihaila and published in 1999: Krishna Bharat, George A. Mihaila: Hilltop: A Search Engine based on Expert Documents. In 2003 Google bought the patent on the algorithm (see also http://pagerank.suchmaschinen-doktor.de/hilltop.html).

Anchor Text

Example: "The polar bear Knut was born in the zoo of Berlin"

The Google search engine uses the text of links twice:
- First, the text of a link is associated with the page that the link is on.
- In addition, it is associated with the page the link points to.

Advantages:
- Anchors provide an additional description of a web page from a user's point of view.
- Documents without text can be indexed, such as images, programs, and databases.

Disadvantage: search results can be manipulated (cf. Google bombing). A Google bomb influences the ranking of the search engine. It is created when a large number of sites link to a page with anchor text that often makes humorous, political or defamatory statements. In the meantime, Google bombs have been defused by Google.

Natural Language Queries

The query "i need information about accidents with cars and other vehicles" is equivalent to "information accident car vehicle". Natural language queries are treated like any other query:
- stop word elimination
- stemming
- but no interpretation of the meaning of the query

Searching Similar Documents

It is often difficult to express the information need as a query. An alternative search method is to search for documents similar to a given document d.

Finding Similar Documents - Principle and Example

Find the documents most similar to d1.

[Table: term-document matrix with IDF weights for the terms accident, car, cause, crowd, die, drive, four, heavy, injur, more, morning, people, quarter, register, truck, trucker, vehicle, vienna, yesterday and the documents d1, d2, d3]

Principle:
- Use the given document d as a query.
- Compare all documents di with d.

Example (scalar product):
IDF * d1 * d2 = 0.83
IDF * d1 * d3 = 2.33

The approach is the same as for a query: same index, same ranking function.

The Vector Space Model

The vector space model
- is relatively simple and clear,
- is efficient,
- ranks documents,
- can be applied to any collection of documents.

The model has many heuristic components and parameters, e.g.
- determination of the index terms
- calculation of tf and idf
- ranking function

The best parameter setting depends on the document collection.
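The scalar-product similarity used for finding similar documents can be sketched in Python. The term weights below are an assumed reconstruction of the slide's term-document matrix (the exact counts are partly illegible in the source), chosen to be consistent with the two scalar products stated above:

```python
# Scalar-product similarity, using document d1 itself as the query.
# IDF weights and term frequencies are an assumed reconstruction.
idf = {"accident": 0.5, "car": 0.5, "cause": 1, "crowd": 1, "die": 1,
       "drive": 1, "four": 1, "heavy": 1, "injur": 1, "more": 1,
       "morning": 1, "people": 0.5, "quarter": 1, "register": 1,
       "truck": 1, "trucker": 1, "vehicle": 1, "vienna": 0.33, "yesterday": 1}
d1 = {"accident": 2, "car": 1, "die": 1, "heavy": 2, "morning": 1,
      "people": 1, "vienna": 1, "yesterday": 1}
d2 = {"car": 1, "more": 2, "quarter": 1, "register": 1, "vehicle": 1, "vienna": 1}
d3 = {"accident": 1, "cause": 1, "crowd": 1, "drive": 1, "four": 1,
      "injur": 1, "people": 2, "truck": 1, "trucker": 1, "vienna": 1}

def sim(a, b):
    """IDF-weighted scalar product of two term-frequency vectors."""
    return sum(idf[t] * a[t] * b[t] for t in a if t in b)

print(round(sim(d1, d2), 2), round(sim(d1, d3), 2))  # 0.83 2.33
```

The same `sim` function would serve for an ordinary query: a query is just another (usually very sparse) term vector.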

2.3 Implementation of the Index

The vector space model is usually implemented with an inverted index:
- For each term, a pointer references a posting list with an entry for each document containing the term.
- The posting lists can be implemented as linked lists or as more efficient data structures that reduce the storage requirements (index pruning).
- To answer a query, the corresponding posting lists are retrieved and the documents are ranked; efficient retrieval of the posting lists is therefore essential.

Source: D. Grossman, O. Frieder (2004): Information Retrieval. Springer-Verlag

Implementing the Term Structure as a Trie

Sequentially scanning the index for query terms and posting lists is inefficient. A trie is a tree structure:
- each node is an array with one element for each character
- each element contains a link to another node
- the characters and their order are identical for each node; therefore they do not need to be stored explicitly

Example: structure of a node in a trie (see figure)

Source: G. Saake, K.-U. Sattler: Algorithmen und Datenstrukturen - Eine Einführung mit Java. dpunkt Verlag 2004
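The trie-based term lookup can be sketched as follows. The `TrieNode`/`Trie` classes and sample terms are hypothetical, and a dictionary of child links stands in for the fixed per-character array of the slides:

```python
# A minimal trie mapping index terms to posting lists (illustrative sketch;
# real systems use compressed variants such as Patricia trees).
class TrieNode:
    def __init__(self):
        self.children = {}    # one link per character
        self.postings = None  # posting list if an index term ends here

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, doc_id):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        if node.postings is None:
            node.postings = []
        node.postings.append(doc_id)

    def search(self, term):
        """Follow one child reference per character; None if absent."""
        node = self.root
        for ch in term:
            if ch not in node.children:
                return None  # search stops without success
            node = node.children[ch]
        return node.postings

index = Trie()
index.insert("accident", "d1"); index.insert("accident", "d3")
index.insert("car", "d1")
print(index.search("accident"))  # ['d1', 'd3']
print(index.search("truck"))     # None
```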

The Index as a Trie

The leaves of the trie are the index terms, pointing to the corresponding posting lists. Searching a term in a trie:
- the search starts at the root
- for each character of the term, the reference to the corresponding subtree is followed
- until either a leaf with the term is found or the search stops without success

(Saake, Sattler 2004)

Patricia Trees

Idea: skip irrelevant parts of the terms. This is achieved by storing in each node the number of characters to be skipped.

Example: Patricia = Practical Algorithm To Retrieve Information Coded in Alphanumeric

(Saake, Sattler 2004)

2.4 Evaluating Search Methods

[Figure: the set of all documents, divided into found documents, relevant documents, the found relevant documents, and the relevant documents that are not found]

Performance Measures of Information Retrieval: Recall and Precision

Several measures for evaluating the performance of information retrieval systems have been proposed; two important ones are:

Recall: the fraction of the relevant documents DR that is successfully retrieved,

R = |DRA| / |DR|

Precision: the fraction of the retrieved documents (answer set DA) that is relevant to the user's information need,

P = |DRA| / |DA|

where DRA is the set of relevant documents in the answer set.

F-Measure

The F-measure is a mean of precision and recall:

F = 2 · P · R / (P + R)

In this version, precision and recall are equally weighted. The more general version

Fβ = (1 + β²) · P · R / (β² · P + R)

allows giving preference to recall or precision: F2 weights recall twice as much as precision, F0.5 weights precision twice as much as recall.

Computing Recall and Precision

Evaluation:
- Perform a predefined set of queries; the search engine delivers a ranked list of documents.
- Use the first X documents of the result list as the answer set.
- Compute recall and precision for the first X documents of the ranked result list.

How do you know which documents are relevant?
1. A general reference set of documents can be used. For example, TREC (Text REtrieval Conference) is an annual event where large test collections in different domains are used to measure and compare the performance of information retrieval systems.
2. For companies it is more important to evaluate information retrieval systems on their own documents:
   1. collect a representative set of documents
   2. specify queries and associated relevant documents
   3. evaluate search engines by computing recall and precision for the query results

2.5 User Adaptation

Take into account information about a user to filter documents particularly relevant to this user:
- Relevance feedback: retrieval in multiple passes; in each pass the user refines the query based on the results of previous queries
- Explicit user profiles: subscription; user-specific weights of terms
- Social filtering: similar users get similar documents
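Recall, precision, and the F-measure can be computed directly from three sets; the document IDs below are made up for illustration:

```python
# Recall, precision, and F-measure for one query (hypothetical sets).
relevant = {"d1", "d3", "d5", "d7"}   # DR: all relevant documents
answer   = {"d1", "d2", "d3", "d6"}   # DA: documents retrieved
found    = relevant & answer          # DRA: relevant documents in answer set

recall    = len(found) / len(relevant)  # |DRA| / |DR|
precision = len(found) / len(answer)    # |DRA| / |DA|

def f_measure(p, r, beta=1.0):
    """F_beta: beta > 1 favours recall, beta < 1 favours precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(recall, precision, f_measure(precision, recall))  # 0.5 0.5 0.5
```

Here 2 of the 4 relevant documents were retrieved, and 2 of the 4 retrieved documents are relevant, so recall and precision are both 0.5.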

2.5.1 Relevance Feedback Given by the User

Example: the user specifies the relevance of each document. For the query "Pisa", only the documents about the education assessment are marked as relevant; in the next pass, the top-ranked documents are only about the education assessment. (This example is from the SmartFinder system by empolis; the mindaccess system by Insiders GmbH uses the same technology.)

Relevance Feedback: Probabilistic Model

Assumption: given a user query, there is an ideal answer set.
Idea: an initial answer is iteratively improved based on user feedback.
Approach:
- An initial set of documents is retrieved somehow.
- The user inspects these documents looking for the relevant ones (usually, only the top 10-20 need to be inspected).
- The IR system uses this information to refine the description of the ideal answer set.
- By repeating this process, it is expected that the description of the ideal answer set will improve.
The description of the ideal answer set is modeled in probabilistic terms.

(Baeza-Yates & Ribeiro-Neto 1999)
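The re-estimation step of this feedback loop (based on the counts N, ni, V and Vi defined on the following slides) can be sketched with assumed document counts:

```python
# Probabilistic relevance-feedback estimates (sketch with assumed counts).
N  = 1000                             # total number of documents
ni = {"accident": 100, "vienna": 50}  # documents containing each term
V  = 10                               # relevant documents retrieved
Vi = {"accident": 8, "vienna": 3}     # relevant documents containing term

for term in ni:
    p_rel     = Vi[term] / V                      # estimate of P(ki | relevant)
    p_not_rel = (ni[term] - Vi[term]) / (N - V)   # estimate of P(ki | not relevant)
    print(term, round(p_rel, 3), round(p_not_rel, 3))
```

A term such as "accident" that occurs in most user-marked relevant documents but in few documents overall gets a high relevance estimate, which raises its weight in the next retrieval pass.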

Probabilistic Ranking

Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find document dj interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations.

The probabilistic ranking is

sim(dj, q) = P(R | dj) / P(R̄ | dj)

Definitions:
- wij ∈ {0, 1}, i.e. the index term weights are binary
- sim(dj, q) is the similarity of document dj to the query q
- P(R | dj) is the probability that document dj is relevant
- P(R̄ | dj) is the probability that document dj is not relevant
- dj here stands for the document vector of dj

(Baeza-Yates & Ribeiro-Neto 1999)

Computing Probabilistic Ranking

Probabilistic ranking can be computed as

sim(dj, q) ≈ Σi wiq · wij · ( log[ P(ki | R) / (1 − P(ki | R)) ] + log[ (1 − P(ki | R̄)) / P(ki | R̄) ] )

where
- P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R of relevant documents
- 1 − P(ki | R) stands for the probability that the index term ki is not present in a document randomly selected from the set R of relevant documents
- wiq is the weight of term ki in the query
- wij is the weight of term ki in document dj

(Baeza-Yates & Ribeiro-Neto 1999)

Relevance Feedback: Probabilistic Model

The probabilities that a term ki is (not) present in the set of relevant documents can be estimated as

P(ki | R) ≈ Vi / V
P(ki | R̄) ≈ (ni − Vi) / (N − V)

with
- N: total number of documents
- ni: number of documents containing term ki
- V: number of relevant documents retrieved by the probabilistic model
- Vi: number of relevant documents containing term ki

There are different ways to determine the relevant documents V:
- automatically: V can be specified as the top r documents found
- by user feedback: the user specifies for each retrieved document whether it is relevant or not

2.5.2 Explicit User Profiles

Idea: use knowledge about the user to provide information that is particularly relevant for him/her
- users specify topics of interest as a set of terms
- these terms represent the user profile
- documents containing the terms of the user profile are preferred

[Figure: architecture with information need/preferences, profile acquisition, user profile, documents, document representation (index), and ranking function]

User Profiles for Subscribing to Information

[Table: term-document matrix with IDF weights for d1, d2, d3 and the user profiles U1 and U2]

User profiles are treated as queries. Example: news feed
- As soon as a new document arrives, it is tested for similarity with the user profiles.
- The vector space model can be applied.
- A document is regarded as relevant if the ranking reaches a specified threshold.

Example: user 1 is interested in any car accident; user 2 is interested in deadly car accidents with trucks.

User Profiles for Individual Queries

Users specify the importance of terms.

[Table: term-document matrix with IDF weights, the documents d1, d2, d3, the user profiles U1 and U2, and the query q]

User profiles are used as additional term weights, so different users get different rankings.

Example:
ranking for user 1:
IDF * d1 * U1 * q = 1.4
IDF * d2 * U1 * q = 1
IDF * d3 * U1 * q = 0.5
ranking for user 2:
IDF * d1 * U2 * q = 2.2
IDF * d2 * U2 * q = 0.1
IDF * d3 * U2 * q = 0.5

Acquisition and Maintenance of User Profiles

There are different ways to specify user profiles:
- Manual: the user specifies topics of interest (and weights) explicitly, by selecting predefined terms or a query. Problem: maintenance.
- User feedback: the user collects relevant documents; terms in the selected documents are regarded as important. Problem: how to motivate the user to give feedback (a similar approach is used by spam filters - classification).
- Heuristics: observing user behaviour. Example: if a user has opened a document for a long time, it is assumed that he/she read it and therefore it might be relevant. Problem: heuristics might be wrong.

Social Filtering

Idea: information is relevant if other users who showed similar behaviour regarded the information as relevant.
- Relevance is specified by the users.
- User profiles are compared.

Example: a simple variant can be found at Amazon: purchases of books and CDs are stored ("people who bought this book also bought ...").
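A user-profile-weighted ranking of this kind can be sketched as an element-wise product of IDF, document, profile and query vectors. The weights below are hypothetical small vectors, not the slide's actual profile matrix:

```python
# Profile-weighted ranking: the user profile acts as an extra per-term weight.
# All weights here are made up for illustration.
idf        = {"accident": 0.5, "car": 0.5, "die": 1.0, "truck": 1.0}
doc        = {"accident": 2,   "car": 1,   "die": 1,   "truck": 0}
query      = {"accident": 1,   "car": 1,   "die": 0,   "truck": 1}
profile_u1 = {"accident": 1.0, "car": 0.8, "die": 0.0, "truck": 0.0}
profile_u2 = {"accident": 0.5, "car": 0.5, "die": 1.0, "truck": 1.0}

def rank(doc, query, profile):
    """Sum of IDF * doc * profile * query over all index terms."""
    return sum(idf[t] * doc[t] * profile[t] * query[t] for t in idf)

print(round(rank(doc, query, profile_u1), 2))  # user 1
print(round(rank(doc, query, profile_u2), 2))  # user 2
```

The same document thus receives a different score for each user, which is exactly the effect shown in the per-user rankings above.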
