Applying Co-training to Clickthrough Data for Search Engine ...

Applying Co-training to Clickthrough Data for Search Engine ...

Searching Web Better Dr Wilfred Ng Department of Computer Science The Hong Kong University of Science and Technology 1 Outline Introduction Main Techniques (RSCF) The RSCF-based Metasearch Engine Clickthrough Data Ranking Support Vector Machine Algorithm Ranking SVM in Co-training Framework Search Engine Components

Feature Extraction Experiments Current Development 2 Search Engine Adaptation Computer Science CS terms Finance Social Science Product News Google, MSNsearch, Wisenut, Overture, Adapt the search engine by learning from implicit feedback ---- Clickthrough data 3 Clickthrough Data

Clickthrough data: data that indicates which links in the returned ranking results have been clicked by the users Formally, a triplet (q, r, c) q the input query r the ranking result presented to the user c the set of links the user clicked on Benefits: Can be obtained timely No intervention to the search activity 4 An Example of Clickthrough Data Users input query l l Clicked by the user l l l

l l l 5 Target Ranking (Preference Pairs Set ) Arising from l1 Arising from l7 Arising from l10 Empty Set l7

l10

Target Ranking (Preference Pairs Set ) Arising from l1 Arising from l7 Arising from l10 Empty Set l7

Labelled data set: l1, l2,, l10 Unlabelled data set: l11, l12, 8 The Ranking SVM Algorithm 1 l1 , l2 , l3 Three links, each described by a feature vector Target ranking: l1

l1 l3 Weight vector -- Ranker Distance between two closest projected links l3 l3 Cons: It needs a large set of labelled data 9 The Ranking SVM in Co-training Framework Divide the feature vec tor into two subvectors Rebuild each ranker fr om the augmented label led data set Training Ranker a_A Ranker a_B Selecting confident

pairs Augmented pairs Each ranker chooses several unlabelled prefe rence pairs and add the m to the labelled data se t Augmented pairs Two rankers are built over these two feature s ubvectors Labelled Preference Feedback Pairs P_l Unlabelled Preference Pairs P_u 10 Some Issues Guideline for partitioning the feature vector After the partition each subvector must be sufficient fo

r the later ranking Number of rankers Depend on the number of features When to terminate the procedure? Prediction difference: indicates the ranking differenc e between the two rankers After termination, get a final ranker on the augmented labelled data set 11 Metasearch Engine User

Receives query from user Sends query to multiple search engines Combines the retrieved results from the underlying search engines Presents a unified ranking result to user query Metasearch Engine Search Engine 1 Search Engine 2 Search Engine n Retrieved Retrieved Results 1 Results 2 Retrieved Results n Unified Ranking

Result 12 Search Engine Components Powered by Inktomi, relatively mature One of the most powerful search engines nowadays A new but growing search engine Ranks links based on the prices paid by the sponsors on t he links 13

Feature Extraction Ranking Features (12 binary features) Rank(E,T) where E {M,W,O} TM,W,O} T} T {M,W,O} T1,3,5,10} (M: MSNsearch, W: Wisenut, O} T: Overture) Indicate the ranking of the links in each underlying sea rch engine Similarity Features(4 features) Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a) URL,Title, Abstract Cover, Abstract Group Indicate the similarity between the query and the link 14 Experiments Experiment data: within the same domain Computer science Objectives: Offline experiments compared with RSVM Online experiments compared with Google

15 Prediction Error Prediction Error: difference between the rankers ranking and the target ranking 1 Target ranking: l1

l3 16 Offline Experiment (Compared with RSVM) 10 queries R A B 30 queries 60 queries The ranker trained by the RSVM algorithm on the whole feature vector The ranker trained by the RSCF algorithm on one feature subvector The ranker trained by the RSCF algorithm on another feature subvector Prediction error rise up again! The number of iterations in RSCF algorithm is about four to five! 17 Offline Experiment (Compare with RSVM) Overall comparison R C

The ranker trained by the RSVM algorithm The final ranker trained by the RSCF algorithm 18 Online Experiment (Compare with Google) Experiment data: CS terms e.g. radix sort, TREC collection, Experiment Setup Combine the results returned by RSCF and those by Google into one shuffled list Present to the users in a unified way Record the users clicks Cases More clicks More clicks on RSCF on Google Queries 26 17

Tie No clicks Total 13 2 58 19 Experimental Analysis Features Rank(M,1) Rank(M,3) Rank(M,5) Rank(M,10) Weight 0.1914 0.2498 0.1152 0.2498 Features Rank(W,1) Rank(W,3) Rank(W,5) Rank(W,10 )

Weight 0.0184 0.1014 -0.3021 -0.4367 Rank(O} T,1) Rank(O} T,3) Rank(O} T,5) Rank(O} T,10) -0.1673 -0.1229 -0.4976 0.4441 Sim_U(q,l) Sim_T(q,t) Sim_C(q,a) Sim_G(q,a) 0.5382 0.4928 0.4136 0.5010 20 Experimental Analysis Features

Rank(M,1) Rank(M,3) Rank(M,5) Rank(M,10) Weight 0.1914 0.2498 0.1152 0.2498 Features Rank(W,1) Rank(W,3) Rank(W,5) Rank(W,10 ) Weight 0.0184 0.1014 -0.3021 -0.4367 Rank(O} T,1) Rank(O} T,3) Rank(O} T,5) Rank(O} T,10) -0.1673 -0.1229

-0.4976 0.4441 Sim_U(q,l) Sim_T(q,t) Sim_C(q,a) Sim_G(q,a) 0.5382 0.4928 0.4136 0.5010 21 Experimental Analysis Features Rank(M,1) Rank(M,3) Rank(M,5) Rank(M,10) Weight 0.1914 0.2498 0.1152 0.2498 Features Rank(W,1) Rank(W,3)

Rank(W,5) Rank(W,10 ) Weight 0.0184 0.1014 -0.3021 -0.4367 Rank(O} T,1) Rank(O} T,3) Rank(O} T,5) Rank(O} T,10) -0.1673 -0.1229 -0.4976 0.4441 Sim_U(q,l) Sim_T(q,t) Sim_C(q,a) Sim_G(q,a) 0.5382 0.4928 0.4136 0.5010 22

Conclusion on RSCF Search engine adaptation The RSCF algorithm Train on clickthrough data Apply RSVM in the co-training framework The RSCF-based metasearch engine Offline experiments better than RSVM Online experiments better than Google 23 Current Development Features extraction and division Apply in different domains Search engine personalization SpyNoby Project: Personalized search engine with clickthrough analysis

24 Modified Target Ranking for Metasearch Engines If l1 and l7 are from the same underlying se arch engine, the preference pairs set arisin g from l1 should be l1

l7

Unlabelled data set: l11, l12, 26 RSCF-based Metasearch Engine - MEA User query q 1. 2. 30. ...... q MEA q q 1. 2. 30. Unified Ranking

Result 1. 2. 30. 27 RSCF-based Metasearch Engine - MEB User query q MEB q 1. 2. 30. q q q 1.

2. 30. 1. 2. 30. 1. 2. 30. Unified Ranking Result 28 Generating Clickthrough Data n Probability of being clicked on: Pr(k ) k H (V ) k: the ranking of the link in the metasearch engine n: the number of all the links in the metasearch engine

: the skewness parameter in Zipfs law n 1 Harmonic number: H (V ) i 1 i Judge the links relevance manually If the link is irrelevant not be clicked on If the link is relevant has the probability of Pr(k) to be clicked on 29 Feature Extraction Ranking Features (binary features) Rank(E,T): whether the link is ranked within ST in E where E {M,W,O} TG,M,W,O} T} T {M,W,O} T1,3,5,10,15,20,25,30} S1={M,W,O} T1}, S3={M,W,O} T2,3}, S5={M,W,O} T4,5}, S10={M,W,O} T6,7,8,9,10} (G: Google, M: MSNsearch, W: Wisenut, O} T: Overture) Indicate the ranking of the links in each underlying sear ch engine

Similarity Features(4 features) Sim_U(q,l), Sim_T(q,t), Sim_C(q,a), Sim_G(q,a) Measure the similarity between the query and the link 30 Experiments Experiment data: three different domains CS terms News E-shopping Objectives: Prediction Error better than RSVM Top-k Precision adaptation ability 31 Top-k Precision

Advantages: Precision is more easier to obtained than recall Users care only top-k links (k=10) Evaluation data: 30 queries in each domain 32 Comparison of Top-k precision News CS terms E-shopping 33 Statistical Analysis between Hypothesis Testing: Comparison Engines (two-sample hypothesis ME VS Google testing about means) ME VS MSNsearch ME VS Overture used to analyze whether there is a

ME VS Wisenut statistically significant ME VS Google difference between ME VS MSNsearch two means of two ME VS Overture samples News E-Shopping CS terms Combined A > > A

> > A > > > A > > > B >

> B > > B > > > MEB VS Wisenut >

> > MEA VS MEB 34 Comparison Results MEA can produce better search quality than Goo gle Google does not excel in every query category MEA and MEB is able to adapt to bring out the str engths of each underlying search engine MEA and MEB are better than, or comparable to all their underlying search engine components in every query category

The RSCF-based metasearch engine Comparison of prediction error better than RSVM Comparison of top-k precision adaptation ability 35 Spy Nave Bayes Motivation The problem of Joachims method l1, l2, l3 are apt to appear on the right, while l7, l10 on the left New interpretation of clickthrough data Strong assumptions Excessively penalize high-ranked links Clicked positive (P)

Unclicked unlabeled (U), containing both positive and negative samples. Arising fr Arising fr Arising fr om l1 om l7 om l10 Empty Set l7

Goal: identify Reliable Negatives (RN) from U l10

obust Run one-step SpyNB for n times and g et n sets of RNi A sample appear in at least m (m

and to appear: ACM Transactions on Internet Technology, (2006). Yiping KE, Lin DENG, Wilfred NG and Dik-Lun LEE. Web Dynamics and their Ramifications for the Development of Web Search Engines. Accepted and to appear: Computer Networks Journal - Special Issue on Web Dynamics, (2005). Qingzhao TAN, Yiping KE and Wilfred NG. WUML: A Web Usage Manipulation Language For Querying Web Log Data. International Conference on Conceptual Modeling ER 2004, Lecture Notes of Computer Science Vol.3288, Shanghai, China, page 567581, (2004). Lin DENG, Xiaoyong CHAI, Qingzhao TAN, Wilfred NG, Dik-Lun LEE. Spying Out Real User Preferences for Metasearch Engine Personalization. ACM Proceedings of WEBKDD Workshop on Web Mining and Web Usage Analysis 2004, Seattle, USA, (2004). Qingzhao TAN, Xiaoyong CHAI, Wilfred NG and Dik-Lun LEE. Applying Co-training to Clickthrough Data for Search Engine Adaptation. 9th International Conference on Database Systems for Advanced Applications DASFAA 2004, Lecture Notes of Computer Science Vol. 2973, Jeju Island, Korea, page 519532, (2004). Haofeng ZHOU, Yubo LOU, Qingqing YUAN, Wilfred NG, Wei WANG and Baile SHI. Refining Web Authoritative Resource by Frequent Structures. IEEE Proceedings of the International Database Engineering and Applications Symposium IDEAS 2003, Hong Kong, pages 236-241, (2003). Wilfred NG. Capturing the Semantics of Web Log Data by Navigation Matrices. A Book Chapter in "Semantic Issues in E-Commerce Systems", Edited by R. Meersman, K. Aberer and T. Dillon, Kluwer Academic Publishers, pages 155-170, (2003). 39

Recently Viewed Presentations

  • 1 Eric Norman Nuclear Engineering Dept. UC Berkeley

    1 Eric Norman Nuclear Engineering Dept. UC Berkeley

    The Four Elements Earth Air Fire Water The Beginning of Nuclear Science Scientists in the late 1800's and early 1900's made discoveries which would change the course of science and medicine. Henri Becquerel Marie and Pierre Curie Atomic and Nuclear...
  • CS-183 Lecture

    CS-183 Lecture

    Looping and Repetition SE-1010 Dr. Mark L. Hornick*
  • Medication Administration in the Community Administrative Rule 116

    Medication Administration in the Community Administrative Rule 116

    Communication is very important and must go both ways. It is best if the Nurse-Trainer remains, as much as possible in the teaching ,expert and supervisory role. The likelihood of staff communicating with you if you have termination responsibility is...
  • Health Care Vocab

    Health Care Vocab

    Susan has her yearly mammogram and discovers she has breast cancer. The doctors recommend a mastectomy along with chemotherapy. Right or Privilege. ... James is morbidly obese. He would like to receive gastric bypass surgery to help him lose weight.
  • Getting Started Guide What is PBIS Rewards? Reward

    Getting Started Guide What is PBIS Rewards? Reward

    Open the app, select reward amount, and scan the QR code - Fast. Track. Points are automatically tracked and viewable in the reports on the desktop portal- Simple. Redeem. Points are always available to review and redeem - Easy
  • Session One: Library Resources and Skills

    Session One: Library Resources and Skills

    LCSH = Library of Congress Subject Headings Subject headings vs. keywords For example, we might be thinking of a topic and calling it "GUN CONTROL," but the LCSH calls the same topic "FIREARMS--LAW AND LEGISLATION." When we are searching for...
  • Slide 1 SPE-190882-MS WETTABILITY ESTIMATION BY OIL ADSORPTION

    Slide 1 SPE-190882-MS WETTABILITY ESTIMATION BY OIL ADSORPTION

    Slide . SPE-190882-MS •Wettability Estimation by Oil Adsorption using QCM-D • Samuel Erzuah. WETTABILITY OVERVIEW. Wettability is the tendency of a fluid to
  • Hourly-Paid Workshop UCU  an introduction Your rights Getting

    Hourly-Paid Workshop UCU an introduction Your rights Getting

    Unless there are significant reasons why UCU should take up the case, possibly when it is a major issue that affects the wider UCU membership. UCU member Sue Birch used the Part-Time Worker Regulations to challenge her hourly rate of...