Web Mining - Sharif

Web Mining - Sharif

Web Mining Kyumars Sheykh Esmaili Modern Information Retrieval Course Sharif University of Technology Spring 2006 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms

PageRank Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering

Summary January 27, 2020 Web Mining 2 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures

Social Network Analysis Algorithms PageRank Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling

Web Search Result Clustering Summary January 27, 2020 Web Mining 3 Introduction Information Overloading on the web Size 2001 New information created: 6 exabytes (10^18 bytes) 10 billion (nonspam) e-mail messages were sent per day. 2002

2003 the public Internet contained about 1 trillion pages and was increasing at a rate of approximately 8 million pages per day. 2005 January 27, 2020 New information created: 12 exabytes (10^18 bytes) 35 billion messages per day by 2005. Web Mining 4 Challenges on WWW Interactions

Finding Relevant Information Creating knowledge from Information available Personalization of the information Learning about customers / individual users Web Mining can play an important Role! January 27, 2020 Web Mining 5 Introduction Web mining - data mining techniques to automatically discover and extract information from Web documents/services

Web mining research integrate research from several research communities : Database (DB) Information retrieval (IR) The sub-areas of machine learning (ML) Natural language processing (NLP) January 27, 2020 Web Mining 6 Web Data Web pages Intra-page structures Inter-page structures Usage data Supplemental data

Profiles Registration information Cookies January 27, 2020 Web Mining 7 Web Data Categories Free Texts HTML Files Content Data XML Files Dynamic Content Multimedia Web Data Structure Data

Static Link Dynamic Link Usage Data User Profile Data January 27, 2020 Web Mining 8 Web Mining Taxonomy Web Clustering Web Content Clustering Web Structure Clustering Web Usage Clustering Web C-S Clustering January 27, 2020

Web Mining 9 Web Mining : Subtasks Resource Task Finding of retrieving intended web-documents Information Selection & Pre-processing Automatic selection and pre-processing specific information from retrieved web resources Generalization Automatic Discovery of patterns in web sites

Analysis Validation January 27, 2020 and / or interpretation of mined patterns Web Mining 10 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining

Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT

Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 11 Feature Selection for Web Mining for the purposes of automated text classification text features should be: Relatively few in number Moderate in frequency of assignment Low in redundancy Low in noise Related in semantic scope to the classes to be assigned Relatively unambiguous in meaning January 27, 2020 Web Mining

12 Feature Selection Potential features: BODY META TITLE Snippet Means sentences attached with URL u appeared in search results Anchor Window The anchor text and text around the hyperlink v->u in the source page v MT, the union of META and TITLE content; BMT, the union of BODY, META and TITLE content. January 27, 2020 Web Mining 13 Feature Selection for Content Mining Percentage of Web Pages With Words in HTML Tags January 27, 2020

Web Mining 14 Feature Selection For Web Pages Classification performance for various representations of web pages January 27, 2020 Web Mining 15 Vector Space Model for Content-Similarity IR systems usually adopt index terms to process queries Index term: a keyword or group of selected words any word (more general)

Stemming might be used: connect: connecting, connection, connections An inverted file is built for the chosen index terms January 27, 2020 Web Mining 16 Vector Space Model - Basic Concepts Ki is an index term dj is a document t is the number of index terms K = (k1, k2, , kt) is the set of all index terms

wij >= 0 is a weight associated with (ki,dj) wij = 0 indicates that term does not belong to doc vec(dj) = (w1j, w2j, , wtj) is a weighted vector associated with the document dj gi(vec(dj)) = wij is a function which returns the weight associated with pair (ki,dj) January 27, 2020 Web Mining 17 The Vector Space Model j dj

dk i Sim(dk,dj) = cos() = [vec(dk) vec(dj)] / |dk| * |dj| = [ wik * wij] / |dk| * |dj| Since wij > 0 and wik > 0, 0 <= sim(dk,dj) <=1 A document is retrieved even if it matches the target document terms only partially January 27, 2020 Web Mining 18 The Vector Space Model: Example k2

k1 d2 d4 d1 d2 d3 d4 d5 d6 d7 k1 1 1 0 1 1 1 0 k2 k3 0 1 0 0 1 1 0 0 1 1

1 0 1 0 q 1 1 January 27, 2020 1 q dj 2 1 2 1 3 2 1 |dj| 1.41 1 1.41 1

1.73 1.41 1 |q| 1.73 Sim(dj,q) 0.82 0.58 0.82 0.58 1 0.82 0.58 Web Mining d7 d6 d5 d1 d3

k3 19 The Vector Space Model - Weighting Sim(q,dj) = [ wij * wiq] / |dj| * |q| How to compute the weights wij and wiq ? A good weight must take into account two effects: quantification of intra-document contents (similarity) quantification of inter-documents separation (dissi-milarity) tf factor, the term frequency within a document idf factor, the inverse document frequency

wij = tf(i,j) * idf(i) January 27, 2020 Web Mining 20 The Vector Model - Weighting Example: A collection includes 10,000 documents The term A appears 20 times in a particular document The maximum apperance of any term in this document is 50 The term A appears in 2,000 of the collection documents. f(i,j) = freq(i,j) / max(freq(l,j)) = 20/50 = 0.4 idf(i) = log(N/ni) = log (10,000/2,000) = log(5) = 2.32 wij = f(i,j) * log(N/ni) = 0.4 * 2.32 = 0.93 January 27, 2020 Web Mining 21

Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining

22 Social network analysis Social network is the study of social entities (people in an organization, called actors), and their interactions and relationships. The interactions and relationships can be represented with a network or graph, each vertex (or node) represents an actor and each link represents a relationship. From the network, we can study the properties of its structure, and the role, position and prestige of each social actor. We can also find various kinds of sub-graphs, e.g.,

communities formed by groups of actors. January 27, 2020 Web Mining 23 Social network and the Web Social network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network, Each page: a social actor and each hyperlink: a relationship. Many results from social network can be adapted and extended for use in the Web context. January 27, 2020

Web Mining 24 Web Structure Mining The Web consists not only of pages, but also of hyperlinks pointing from one page to another These hyperlinks contain an enormous amount of latent human annotation Assumption: link from page A to page B is a recommendation of page B by A If A and B are connected by a link, there is a higher probability that they are on the same topic January 27, 2020 Web Mining

25 Web Link Analysis Used for Ordering documents matching a user query: ranking Deciding what pages to add to a collection: crawling Page categorization Finding related pages Finding duplicated web sites January 27, 2020 Web Mining 26 Table of Contents

Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 27 Structural Similarity Measures

We must define the similarity of two nodes Method I: For page and page B, A is related to B if there is a hyperlink from A to B, or from B to A Page A Page B January 27, 2020 Not so good. Consider the home page of IBM and Microsoft. Web Mining 28 Structural Similarity Measures Method II (from Bibliometrics)

Co-citation: the similarity of A and B is measured by the number of pages cite both A and B Page A Page B Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B. Page A January 27, 2020 Page B Web Mining 29 Table of Contents Introduction Web Content Mining

Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities Web Content-Structure Clustering

Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 30 Using link structure of web (cont.) There are two famous Link-Structure based algorithms for ranking : PageRank

HITS Nearly All other algorithms are base on these ones : Salsa, Clever, . January 27, 2020 Web Mining 31 PageRank Introduced by Page et al (1998) January 27, 2020 An offline algorithm (Query independent) The weight is assigned by the rank of parents

Web Mining 32 Matrix Notation January 27, 2020 Web Mining 33 Solve the PageRank equation T P A P (15)

This is the characteristic equation of the eigensystem, where the solution to P is an eigenvector with the corresponding eigenvalue of 1. It turns out that if some conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector P is the principal eigenvector. A well known mathematical technique called power iteration can be used to find P. Problem: the above Equation does not quite suffice because the Web graph does not meet the conditions. January 27, 2020 Web Mining 34 Using Markov chain To introduce these conditions and the enhanced equation, let us derive the same Equation (15) based on the Markov chain.

In the Markov chain, each Web page or node in the Web graph is regarded as a state. A hyperlink is a transition, which leads from one state to another state with a probability. This framework models Web surfing as a stochastic process. It models a Web surfer randomly surfing the Web as state transition. January 27, 2020 Web Mining 35 Random surfing Recall we use Oi to denote the number of out-links of a node i.

Each transition probability is 1/Oi if we assume the Web surfer will click the hyperlinks in the page i uniformly at random. The back button on the browser is not used and the surfer does not type in an URL. January 27, 2020 Web Mining 36 Transition probability matrix Let A be the state transition probability matrix,, A11 A21 . A . . .

A n1 A12 . . . A22 . . . An 2 . . . .

. . A1n A2 n . . . Ann Aij represents the transition probability that the surfer in state i (page i) will move to state j (page j). Aij is defined exactly as in Equation (14). January 27, 2020 Web Mining 37 Let us start

Given an initial probability distribution vector that a surfer is at each state (or page) p0 = (p0(1), p0(2), , p0(n))T (a column vector) and an nn transition probability matrix A, we have n p (i) 1 (16) A (17) 0 i 1 n ij

1 j 1 If the matrix A satisfies Equation (17), we say that A is the stochastic matrix of a Markov chain. January 27, 2020 Web Mining 38 Back to the Markov chain In a Markov chain, a question of common interest is: Given p0 at the beginning, what is the probability that m steps/transitions later the Markov chain will be at each state j?

We determine the probability that the system (or the random surfer) is in state j after 1 step (1 transition) by using the following reasoning: n p1 ( j ) A (1) p (i) ij 0 (18) i 1 January 27, 2020 Web Mining 39 State transition January 27, 2020 Web Mining

40 Stationary probability distribution By a Theorem of the Markov chain, a finite Markov chain defined by the stochastic matrix A has a unique stationary probability distribution if A is irreducible and aperiodic. The stationary probability distribution means that after a series of transitions pk will converge to a steady-state probability vector regardless of the choice of the initial probability vector p0, i.e., lim pk k January 27, 2020 Web Mining

(21) 41 PageRank again When we reach the steady-state, we have pk = pk+1 =, and thus =AAT. is the principal eigenvector of AT with eigenvalue of 1. In PageRank, is used as the PageRank vector P. We again obtain Equation (15), which is reproduced here as Equation (22): T P A P January 27, 2020 (22) Web Mining 42

Is P = justified? Using the stationary probability distribution as the PageRank vector is reasonable and quite intuitive because January 27, 2020 it reflects the long-run probabilities that a random surfer will visit the pages. A page has a high prestige if the probability of visiting it is high. Web Mining 43 Back to the Web graph Now let us come back to the real Web context and see whether the above conditions are satisfied, i.e.,

whether A is a stochastic matrix and whether it is irreducible and aperiodic. None of them is satisfied. Hence, we need to extend the ideal-case Equation (22) to produce the actual PageRank model. January 27, 2020 Web Mining 44 A is a not stochastic matrix A is the transition matrix of the Web graph 1 Aij Oi 0

if (i, j ) E otherwise It does not satisfy equation (17) n A ij 1 j 1 because many Web pages have no out-links, which are reflected in transition matrix A by some rows of complete 0s. Such pages are called the dangling pages (nodes). January 27, 2020

Web Mining 45 An example Web hyperlink graph 0 0 0 12 12 0 0 0 1 2 0 1 2 0 0 1 0 0 0 0 A 0 1 3 0 1 3 1 3 0

0 0 0 0 0 0 0 0 0 1 2 1 2 0 January 27, 2020 Web Mining 46 Fix the problem: two possible ways 1.

2. Remove those pages with no out-links during the PageRank computation as these pages do not affect the ranking of any other page directly. Add a complete set of outgoing links from each such page i to all the pages on the Web. Let us use the second way January 27, 2020 0 12 1 2 0 0 1 A 0 0 1 6 1 6 0 0

Web Mining 12 0 0 0 12 0 0 0 0 0 0 0 1 3 0 1 3 1 3 1 6 1 6 1 6 1 6 0 1 2 1 2 0 47 A is a not irreducible Irreducible means that the Web graph G is strongly connected. Definition: A directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v V, there is a path from u to v. A general Web graph represented by A is not

irreducible because for some pair of nodes u and v, there is no path from u to v. In our example, there is no directed path from nodes 3 to 4. January 27, 2020 Web Mining 48 A is a not aperiodic A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse. Definition: A state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k.

If a state is not periodic (i.e., k = 1), it is aperiodic. A Markov chain is aperiodic if all states are aperiodic. January 27, 2020 Web Mining 49 An example: periodic Fig. 5 shows a periodic Markov chain with k = 3. Eg, if we begin from state 1, to come back to state 1 the only path is 12-3-1 for some number of times, say h. Thus any return to state 1 will take 3h transitions. January 27, 2020 Web Mining 50 Deal with irreducible and aperiodic

It is easy to deal with the above two problems with a single strategy. Add a link from each page to every page and give each link a small transition probability controlled by a parameter d. Obviously, the augmented transition matrix becomes irreducible and aperiodic January 27, 2020 Web Mining 51 Improved PageRank After this augmentation, at a page, the random surfer has two options

With probability d, he randomly chooses an out-link to follow. With probability 1-d, he jumps to a random page Equation (25) gives the improved model, E T P ((1 d ) dA ) P n T (25) where E is ee (e is a column vector of all 1s) and thus E is a nn square matrix of all 1s. January 27, 2020 Web Mining 52

Follow our example 1 6 0 7 15 7 15 E (1 d ) dAT n 1 6 0 1 6 0 1 6 0 January 27, 2020 7 15 1 6 0 1 6 0 11 12 7 15 1 60 1 60 1 60 1 6 1 6 0 1 6 1 6 0

1 6 0 19 6 0 1 6 1 6 0 1 6 0 1 6 0 1 6 7 15 1 6 0 19 6 0 1 6 7 15 1 6 0 19 6 0 1 6 1 6 0 Web Mining 1 60 1 60 53 The final PageRank algorithm (1-d)E/n + dAT is a stochastic matrix (transposed). It is also irreducible and aperiodic If we scale Equation (25) so that eTP = n, P (1 d )e dAT P (27)

PageRank for each page i is n P (i ) (1 d ) d A ji P ( j) (28) j 1 January 27, 2020 Web Mining 54 The final PageRank (cont ) (28) is equivalent to the formula given in the PageRank paper

P( j ) P (i ) (1 d ) d ( j , i )E O j The parameter d is called the damping factor which can be set to between 0 and 1. d = 0.85 was used in the PageRank paper. January 27, 2020 Web Mining 55 A Practical Example for PageRank January 27, 2020 Web Mining 56

Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining

57 What is cyber-community A community on the web is a group of web pages sharing a common interest Eg. A group of web pages talking about POP Music Eg. A group of web pages interested in data-mining Main properties: Pages in the same community should be similar to each other in contents The pages in one community should differ from the pages in another community Similar to cluster January 27, 2020 Web Mining 58 Cyber Communities January 27, 2020

Web Mining 59 Two different types of communities Explicitly-defined communities They are well known ones, such as the resource listed by Yahoo! eg. Arts Music Implicitly-defined communities They are communities unexpected or invisible to most users

January 27, 2020 Web Mining Classic Painting Pop eg. The group of web pages interested in a particular singer 60 Different types of communities The explicit communities are easy to identify Eg. Yahoo!, InfoSeek, Clever System In order to extract the implicit communities,

we need analyze the web-graph objectively In research, people are more interested in the implicit communities January 27, 2020 Web Mining 61 Methods of clustering Clustering methods based on co-citation analysis Methods derived from HITS (Kleinberg) Using co-citation matrix

CT Method January 27, 2020 Web Mining 62 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network

Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT

Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 63 HITS: Hubs and Authority Hub: web page links to a collection of prominent sites on a common topic Authority: Pages that link to a collection of authoritative pages on a broad topic; web page pointed to by hubs Mutual Reinforcing Relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities

January 27, 2020 Web Mining 64 Authority and Hubness 5 2 3 1 1 4 6 7 x(1) = y(2) + y(3) + y(4) January 27, 2020

Web Mining y(1) = x(5) + x(6) + xs(7) 65 HITS Steps (1) Creating root and base sets January 27, 2020 Web Mining 66 HITS Steps (2) Calculating Weights Authority weight : Hub weight : Matrix notation: A - adjacency matrix A(i, j) = 1 if i-th page points to j-th page

January 27, 2020 Web Mining 67 Final Result of HITS January 27, 2020 Web Mining 68 HITS Results 3D perspective January 27, 2020 Web Mining 69 A Practical Example for HITS January 27, 2020 Web Mining

70 HITS Problems From narrow topic, HITS tends to end in more general one. Specific of hub pages - many links can cause algorithm drift. They can point to authorities in different topics Pages from single domain / website can dominate result, if they point to one page - not necessary a good authority. Automatically generated links Non relevant highly connected pages Topic drift: generalisation of the query topic January 27, 2020 Web Mining 71

Difference between PageRank and HITS The PageRank is computed for all web pages stored in the database and then prior to the query; HITS is performed on the set of retrieved web pages, and for each query. HITS computes authorities and hubs; PageRank computes authorities only. PageRank: non-trivial to compute, HITS: easy to compute, but real-time execution is hard January 27, 2020 Web Mining 72 Table of Contents

Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 73 A cheaper method

Previous methods are expensive There another simple method called communities trawling (CT) It has been implemented on the graph of 200 millions pages, it worked very well January 27, 2020 Web Mining 74 Basic idea of CT Definition of communities dense directed bipartite

sub graphs Bipartite graph: Nodes are partitioned into two sets, F and C Every directed edge in the graph is directed from a node u in F to a node v in C dense if many of the possible edges between F and C are present January 27, 2020 Fans Web Mining Centers F C 75

Basic idea of CT Bipartite cores a complete bipartite subgraph with at least i nodes from F and at least j nodes from C i and j are tunable parameters A (i, j) Bipartite core Every community have such a core with a certain i and j. January 27, 2020 Web Mining A (i=3, j=3) bipartite core 76 Basic idea of CT

A bipartite core is the identity of a community To extract all the communities is to enumerate all the bipartite cores on the web. Author invent an efficient algorithm to enumerate the bipartite cores. Its main idea is iterate pruning -elimination-generation pruning January 27, 2020 Web Mining 77 Table of Contents Introduction Web Content Mining

Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities Web Content-Structure Clustering Web Usage Mining

Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 78 Content Link Clustering By CLC, each web page q in data set D is represented as 3 vectors: qOut qIn qKword with M, N and L as the vector dimension respectively

The ith item of vector qOut (and qIn) indicates whether q has the corresponding out-link as the ith one in M out-links. If yes, the ith item is1, else 0. The kth item of vector qKword indicates the frequency of the corresponding kth term of L appeared in page q. January 27, 2020 Web Mining 79 Similarity Measure The similarity of two pages Q and R is the linear combination of three parts: poutS(Qout,Rout)+ pinS(Qin,Rin)+ ptermS(Qterm,Rterm) pout +pin +pterm =1 S(Qout,Rout) is defined as Cosine of two out-link vectors. January 27, 2020 Web Mining 80 Tuning the similarity measure

By varying weighting factors in second formula, it is possible to study the effects of out-links, in-link and terms on clustering process. Results of term-based clustering is rather coarse and usually includes very general groups, which are totally different each other from semantic point of view. E.g. for topic jaguar, car group and animal group are two very general groups with very different semantic topics; January 27, 2020 Web Mining 81 Tuning the similarity measure So, term-based clustering could only roughly separate pages into general semantic groups and failed to handle the finer case Like racing car and car driver club since both pages may include some terms like car, model etc. The main reasons of poor purity of clusters produced by term-based clustering are: Noise pages are included into clusters instead of removing since noise pages share some unimportant terms with other pages;

Pages that on different finer topics (but the same general topic) are mixed together. January 27, 2020 Web Mining 82 Tuning the similarity measure Hyperlinks represent the authors view of the relationship among Web pages hyperlink-based clustering expresses association of pages. Therefore, we could say that clusters produced by link-based clustering are in finer granularity. The problem of link-based clustering is that some similar pages (e.g. new created pages) may not have enough co-citation/citation to be grouped together. That is to say, recall is some low . January 27, 2020 Web Mining 83 Tuning the similarity measure T, L and CLC to denote termsbased (with pout ,

pin and pKword as (0, 0, 1), link-based (with pout ,pin and pKword as (0.5, 0.5, 0) and contents-link coupled (with pout , pin and pKword as (0.2,0.3, 0.5) clustering approaches respectively. Parameters are Similarity threshold weighting factors The label of each cluster is identified automatically by term vector of centroid for each cluster. January 27, 2020 Web Mining 84 Content Link Mining January 27, 2020 Web Mining 85 Table of Contents

Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining 86 Web Usage Mining

Web usage mining also known as Web log mining mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of the users while surfing the web Including web log data, click-stream data, cookies, user queries, and any data related to the results of interaction between humans interaction with the web January 27, 2020 Web Mining 87 Web Usage Mining Applications

Target potential customers for electronic commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Identify potential prime advertisement locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict users actions (allows prefetching) January 27, 2020 Web Mining 88 January 27, 2020 Web Mining 89 Web Log Clustering Applications

Association rules Find pages that are often viewed together Clustering Cluster users based on browsing patterns Cluster pages based on content January 27, 2020 Web Mining 90 Server Logs Fields Client IP: 128.101.228.20 Authenticated User ID: - Time/Date: [10/Nov/1999:10:16:39 -0600] Request: "GET / HTTP/1.0"

Status: 200 Bytes: Referrer: - Agent: "Mozilla/4.61 [en] (WinNT; I)" January 27, 2020 Web Mining 92 Web Usage Mining User: The principal using a client to interactively retrieve and render resources or resource manifestations. Page view: Visual rendering of a Web page in a specific client environment at a specific point of time Click stream: a sequential series of page view request

User session: a delimited set of user clicks (click stream) across one or more Web servers. Server session (visit): a collection of user clicks to a single Web server during a user session. Episode: a subset of related user clicks that occur within a user session. January 27, 2020 Web Mining 93 WUM Pre-Processing Data Cleaning Removes log entries that are not needed for the mining process Data Integration Synchronize data from multiple server logs User Identification Associates page references with different users Session/Episode Identification Groups users page references into user sessions Path Completion Fills in page references missing due to browser and proxy caching January 27, 2020

Web Mining 94 January 27, 2020 Web Mining 95 January 27, 2020 Web Mining 96 WUM Association Rule Generation Discovers the correlations between pages that are most often referenced together in a single server session Provide the information What are the set of pages frequently accessed together by Web users? What page will be fetched next? What are paths frequently accessed by Web users? Association rule

A B [ Support = 60%, Confidence = 80% ] Example 50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html January 27, 2020 Web Mining 97 WUM Clustering Groups together a set of items having similar characteristics User Clusters Discover groups of users exhibiting similar browsing patterns Page recommendation Users partial session is classified into a single cluster The links contained in this cluster are recommended

January 27, 2020 Web Mining 98 Web Usage Clustering clients who often access /products/software/webminer.html tend to be from educational institutions. clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States. 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends. January 27, 2020 Web Mining

99 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank

Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020

Web Mining 100 Focused Crawling Only visit links from a page if that page is determined to be relevant. Classifier is static after learning phase. Components: Classifier which assigns relevance score to each page based on crawl topic. Distiller to identify hub pages. Crawler visits pages to based on crawler and distiller scores. January 27, 2020 Web Mining

101 Focused Crawling Classifier also determines how useful outgoing links are Hub Pages contain links to many relevant pages. Must be visited even if not high relevance score. January 27, 2020 Web Mining 102 Focused Crawling January 27, 2020 Web Mining 103

Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank Cyber-Communities

Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020 Web Mining

104 Motivation In the web search context: organizing web pages (search results) into groups, so that different groups correspond to different user needs search engine i.e.: engine car part Engine Corp. Why not other data mining techniques? January 27, 2020 Web Mining 105 (1) Using Contents of Documents

Creating clusters based on snippets returned by web search engines. Clusters based on snippets are almost as good as clusters created using the full text of Web documents. Suffix Tree Clustering (STC) : incremental, O(n) time algorithm Linear Incremental Overlapping Can January 27, 2020 be extended to hierarchical Web Mining 106 STC algorithm Step 1: Cleaning

Step 2: Suffix tree construction Stemming Sentence boundary identification Punctuation elimination Produces base clusters (internal nodes) Base clusters are scored based on size and phrase score (which depends on length and word quality) Step 3: Merging base clusters January 27, 2020 Highly overlapping clusters are merged Web Mining

107 (2) Using users usage logs Advantage: relevancy information is objectively reflected by the usage logs An experimental result on www.nasa.gov/ January 27, 2020 Cluster 1 /shuttle/missions/41-c/news /shuttle/missions/61-b Cluster 2

/history/apollo/sa-2/news/ /history/apollo/sa-2/images Cluster 3 /software/winvn/userguide/3_3_2.htm /software/winvn/userguide/3_3_4.htm . Web Mining 108 (3) Using hyperlinks For each URL P in search results R, we extract its all out-links as well as top n inlinks by services of AltaVista

We could get all distinct N out-links and M in-links for all URLs in R. Each page P in R (result set) is represented as 2 vectors: POut (N- dimension) PIn (Mdimension) January 27, 2020 Web Mining 109 (3) Using Hyperlinks: continued January 27, 2020 Web Mining 110 (3) Using Hyperlinks: continued January 27, 2020

Web Mining 111 Concerns on current methods Each method has pros and cons Using hyperlinks : the best accuracy and still some room to improve STC January 27, 2020 : best to browse and for incrementality. Web Mining 112 Sample systems

Scatter/Gather Grouper Carrot2 Vivisimo Mapuccino SHOC January 27, 2020 Web Mining 113 Grouper Online

Operates on query result snippets Clusters together documents with large common subphrases Suffix Tree Clustering (STC) STC induces labeling January 27, 2020 Web Mining 114 January 27, 2020 Web Mining 115 January 27, 2020 Web Mining 116 January 27, 2020 Web Mining

117 Table of Contents Introduction Web Content Mining Feature Selection and Similarity Measures Web Structure Mining Web as Social Network Features and Similarity Measures Social Network Analysis Algorithms PageRank

Cyber-Communities Web Content-Structure Clustering Web Usage Mining Some Concrete Applications of Web Mining HITS CT Focus Crawling Web Search Result Clustering Summary January 27, 2020

Web Mining 118 Summary Web Mining Web Content Mining Web Page Content Mining January 27, 2020 Web Structure Mining Search Result Mining Web Usage Mining General Access

Pattern Tracking Web Mining Customized Usage Tracking 119 Thank You January 27, 2020 Web Mining 120

Recently Viewed Presentations

  • Coasts Revision

    Coasts Revision

    CASE STUDY: Retreating Coastline, causes, impacts and management . GEOLOGICALLY means- Rock, therefore this could be as case study of an area of hard/soft rock or a discordant vs concordant coastline. You could use Pamunugama and contrast this with Holderness.
  • ESL 7 Quarter 3 Week 8 MARCH 17-21,

    ESL 7 Quarter 3 Week 8 MARCH 17-21,

    Fill in the mind map giving examples of each characteristic (Have 3 for Code of Chivalry) Devise a thesis statement (See detailed slides) Make an outline to discuss three MR characteristics in SGATGK. Write a rough draft from the outline.
  • JPEG Image Compression All of the Gory Details

    JPEG Image Compression All of the Gory Details

    JPEG Image Compression All of the Gory Details Biological Imaging Ross Whitaker University of Utah Clinical Imaging Paradigm Diagnosis/prognosis (individuals) Radiologists/experts Trends More/larger datasets (+3D) Surgery/planning Quantification Biological Imaging Understanding organisms/populations Growth-explosive Trends: instrumentation + science Very large datasets E.g....
  • The Psychology of Survey Responses: Implications for ...

    The Psychology of Survey Responses: Implications for ...

    Roger Tourangeau, SRC, JPSM Robert M. Groves, SRC, JPSM Stanley Presser, JPSM ... George W. Bush ran on the Republican ticket against John Kerry for the Democrats. In the 2004 presidential election, did things come up that kept you from...
  • The musings of an Aboriginal researcher

    The musings of an Aboriginal researcher

    Positioning of self within the literature. I am the emu. It was not until beginning to navigate the oceans of higher degree research that I truly became aware of the rips in the tide; of how influential the colonial past...
  • Daniel - First Baptist Church

    Daniel - First Baptist Church

    Daniel. Daniel's visions. 4 . beasts. Ram, goat, & little horn. Spiritual conflict. Time of the . end. God is in control
  • Using Technology to Improve Fleet Compliance Ruth Waring

    Using Technology to Improve Fleet Compliance Ruth Waring

    Area . Challenge. Technology. Walk-round checks. Done? Location? Evidence? Bits of paper. Geo-location Apps. Planned Maintenance. Scheduling, communication, have they been done?
  • Institute for Environmental Studies (IVM) Assessing adaptation: A

    Institute for Environmental Studies (IVM) Assessing adaptation: A

    Adaptation policy recommendation = specific step or steps should be taken e.g. "It is recommended that in the next five years we allocate €20M to the development of a heat wave early warning system". Adaptation policy measure = something in...