Transcription

Investigating the Change of Web Pages' Titles Over Time

Martin Klein and Michael L. Nelson
Department of Computer Science, Old Dominion University, Norfolk, VA, 23529
[email protected]
InDP'09, June 19, 2009, Austin, TX, USA.

ABSTRACT

Missing web pages are part of the browsing experience. The content of these pages, however, is often not completely lost but rather missing. Lexical signatures (LSs) generated from the web pages' textual content have been shown to be suitable as search engine queries when trying to discover a (missing) web page. Since LSs are expensive to generate, we investigate the potential of web pages' titles, as they are available at a lower cost. We present the results from studying the change of titles over time. We take titles from copies of randomly sampled web pages provided by the Internet Archive and show the frequency of change as well as the degree of change in terms of the Levenshtein score. We found very low frequencies of change and high Levenshtein scores, indicating that titles, on average, change little from their original, first observed values (rooted comparison) and even less from the values of their previous observation (sliding comparison).

Categories and Subject Descriptors
H.3.0 [Information Storage and Retrieval]

General Terms
Measurement, Performance, Design

1. INTRODUCTION

Inaccessible web pages and "404 Page Not Found" responses are part of the web browsing experience. Despite guidance for how to create "Cool URIs" that do not change [4], there are many reasons why URIs or even entire websites break [16]. However, we claim that information on the web is rarely completely lost, it is just missing. In whole or in part, content is often just moving from one URL to another. It is our intuition that major search engines like Google, Yahoo and MSN Live, as members of what we call the Web Infrastructure (WI), have likely crawled the content and possibly even stored a copy in their cache. Therefore the content is not lost, it "just" needs to be rediscovered. The WI, explored in detail in [21, 17, 11], also includes (besides search engines) non-profit archives such as the Internet Archive (IA) or the European Archive, as well as large-scale academic digital data preservation projects, e.g., CiteSeer and NSDL.

It is commonplace for content to "move" to different URIs over time. Figure 1 shows two snapshots as an example of a web page whose content has moved within a two-year period. Figure 1(a) shows the content of the original URL of the Hypertext 2006 conference as displayed in 1/2009. The original URL clearly does not hold conference-related content anymore. Our suspicion is that the website administrators did not renew the domain registration and therefore someone else took over. However, the content is not lost. Figure 1(b) shows the content which is now available at a new URL. This example describes the retrieval problem we are addressing with our research. In Figure 2 we display our scenario for discovering web pages that are considered missing. The occurrence of a 404 error is shown in the first step.
Note that a page returning unrelated content (such as in the example above) can be considered missing as well, since the user intends to retrieve the original content. Search engine caches and the IA are consequently queried with the URL requested by the user. If older copies of the page are available, they can be offered to the user. If the user's information need is satisfied, nothing further needs to be done (step (2)). If this is not the case, we proceed to step (3), where we extract titles, try to obtain tags about the URL and generate LSs from the obtained copies. These are then queried against live search engines, and the returned results are again offered to the user, as depicted in step (4) of Figure 2. In case the user is again not pleased with the outcome, more sophisticated and complex methods need to be applied (step (5)). For example, search engines can be queried to discover pages linking to the missing page. The assumption is that the aggregate of those pages is likely to be about the same topic, so a LS can be generated from this link neighborhood. At this point the approach is the same as the LS method, with the exception that the LS has been generated from a link neighborhood and not from a cached copy of the page itself. This fallback also needs to be applied in case no copies of the missing page can be found in search engine caches and the IA. The final results are provided in step (6). The important point of this scenario is that it works while the user is browsing and therefore has to provide results in real time.
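To make this fallback chain concrete, the sketch below outlines one possible orchestration of the steps in Figure 2. It is not the authors' implementation: every function and parameter name (query_caches_and_archive, extract_title, generate_ls, obtain_tags, query_search_engine, link_neighborhood_ls, user_is_satisfied) is a hypothetical placeholder for the services described above, passed in as callables so the sketch stays self-contained.

```python
from typing import Callable, List, Optional

# Hypothetical sketch of the rediscovery workflow of Figure 2.
# The callables stand in for the real services (search engine caches,
# the Internet Archive, live search engine queries, etc.).
def rediscover(
    missing_url: str,
    query_caches_and_archive: Callable[[str], List[str]],  # step (1): old copies by URL
    user_is_satisfied: Callable[[List[str]], bool],         # steps (2)/(4): user feedback
    extract_title: Callable[[str], str],                    # step (3): title from a copy
    generate_ls: Callable[[str], str],                      # step (3): lexical signature
    obtain_tags: Callable[[str], List[str]],                # step (3): tags about the URL
    query_search_engine: Callable[[str], List[str]],        # steps (3)-(4): live queries
    link_neighborhood_ls: Callable[[str], str],             # step (5): LS from in-links
) -> Optional[List[str]]:
    # Step (1): look for old copies of the page in caches and the IA.
    copies = query_caches_and_archive(missing_url)
    if copies and user_is_satisfied(copies):
        return copies                                       # step (2): done

    # Step (3): derive queries (titles, LSs, tags) from any copies found.
    results: List[str] = []
    for copy in copies:
        for query in (extract_title(copy), generate_ls(copy)):
            results.extend(query_search_engine(query))
    tags = obtain_tags(missing_url)
    if tags:
        results.extend(query_search_engine(" ".join(tags)))
    if results and user_is_satisfied(results):
        return results                                      # step (4): done

    # Step (5): no copies or unsatisfying results -- fall back to a LS
    # generated from pages that link to the missing page.
    results = query_search_engine(link_neighborhood_ls(missing_url))
    return results or None                                  # step (6): final results
```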

[Figure 1: The Content of the Website for the Conference Hypertext 2006 has Moved over Time. (a) Original URL, new (unrelated) content; (b) original content, new URL.]

[Figure 2: Process to Rediscover Missing Web Pages. (1) Query for the URL in search engine caches and the Internet Archive; (2) done if the user is satisfied; (3) identify dissimilar pages, extract titles, generate LSs, obtain tags, query search engines; (4) present results, done if the user is satisfied; (5) if no results are found, include the link neighborhood, relevance feedback and user interaction (request keywords, change the number of terms in the LS, add/delete terms from the LS, advanced search operators); (6) present final results.]

Recent research has shown that lexical signatures (LSs) generated from the textual content of web pages are suitable as search engine queries to rediscover missing pages [22, 13]. LSs are rather expensive to generate; the web pages' titles, however, are available at a lower cost. We investigated the change of web page content compressed into LSs over time in [13] and focus here on the issue of title changes over time. Our intuition is that if the frequency of change is high, titles may not be very useful after all for rediscovering a missing web page. In this paper we present the preliminary results of a study investigating the frequency and degree of change of web pages' titles over time. We predict a lower degree of change compared to LSs, since LSs are based on the content of the entire page, which presumably changes more frequently than the general topic captured by the page title. The Appendix shows three examples of web pages, their titles as observed over time by the IA, and our computed similarity scores.

2. RELATED WORK

2.1 Missing Web Pages

Missing web pages are a pervasive part of the web experience. The lack of link integrity on the web has been addressed by numerous researchers [5, 6, 1, 2]. In 1997 Brewster Kahle published an article focused on the preservation of Internet resources, claiming that the expected lifetime of a web page is 44 days [12]. A different study of web page availability performed by Koehler [14] shows that a random test collection of URLs eventually reached a "steady state" after approximately 67% of the URLs were lost over a 4-year period. Koehler estimated the half-life of a random web page to be approximately two years. Lawrence et al. [15] found in 2000 that between 23 and 53% of all URLs occurring in computer science related papers authored between 1994 and 1999 were invalid. By conducting a partially manual search on the Internet, they were able to reduce the number of inaccessible URLs to 3%. This confirms our intuition that information is rarely lost, it is just moved. This intuition is also supported by Baeza-Yates et al. [3], who show that a significant portion of the web is created based on already existing content.

Spinellis [24] conducted a study investigating the accessibility of URLs occurring in papers published in Communications of the ACM and by the IEEE Computer Society. He found that 28% of all URLs were unavailable after five years and 41% after seven years. He also found that in 60% of the cases where URLs were not accessible, a 404 error was returned. He estimated the half-life of a URL in such a paper to be four years from the publication date. Dellavalle et al. [7] examined Internet references in articles published in journals with a high impact factor (IF) given by the Institute for Scientific Information (ISI).
They found that Internet references occur frequently (in 30% of all articles) and are often inaccessible within months after publication in the highest-impact (top 1%) scientific and medical journals. They discovered that the percentage of inactive references (references that return an error message) increased over time, from 3.8% after 3 months to 10% after 15 months and up to 13% after 27 months. The majority of inactive references they found were in the .com domain (46%) and the fewest were in the .org domain (5%). By manually browsing the IA, they were able to recover information for about 50% of all inactive references.

[Table 1: URL Character Statistics]

[Figure 3: Sliding and Rooted Comparison Methods]

2.2 Search Engine Queries

The work done by Henzinger et al. [9] is related in the sense that they tried to determine the "aboutness" of news documents. They provide the user with web pages related to TV news broadcasts using a 2-term summary, which can be thought of as a LS. This summary is extracted from the closed captions of the broadcast, and various algorithms are used to compute the scores determining the most relevant terms. The terms are used to query a news search engine, and the results must contain all of the query terms. The authors found that 1-term queries return results that are too vague and 3-term queries too often return zero results. Thus they focus on creating 2-term queries.

He and Ounis' work on query performance prediction [8] is based on the TREC dataset. They measured the retrieval performance of queries in terms of average precision (AP) and found that the AP values depend heavily on the type of the query. They further found that what they call the simplified clarity score (SCS) has the strongest correlation with AP for title queries (using the titles of the TREC topics). SCS depends on the actual query length but also on global knowledge about the corpus, such as document frequency and the total number of tokens in the corpus.

2.3 The Web Infrastructure for the Preservation of Web Pages

Nelson et al. [21] present various models for the preservation of web pages based on the web infrastructure. They argue that conventional approaches to digital preservation, such as storing digital data in archives and applying methods of refreshing and migration, are, due to the implied costs, unsuitable for web-scale preservation.

McCown has done extensive research on the usability of the web infrastructure for reconstructing missing websites [17]. He also developed Warrick [19], a system that crawls web repositories such as search engine caches (characterized in [18]) and the index of the IA to reconstruct websites. His system is targeted at individuals and small-scale communities that are not involved in large-scale preservation projects and suffer the loss of websites.

3. EXPERIMENTAL SETUP

3.1 Data Gathering

The main objective of this experiment is to investigate the (degree of) change of web pages' titles over time. It is clearly unfeasible to download all pages from the web on a regular basis over time and analyze their changes. On the other hand, it has been shown that finding a small set of web pages that is representative of the entire web is not trivial [10, 23, 25]. We chose to randomly sample 6,000 URLs from the Open Directory Project at dmoz.org. There is an implicit bias in this selection, but it appears more suitable than attempting to get an unbiased sample, and therefore for the sake of simplicity it shall be sufficient.

We crawled the 6,000 pages and randomly extracted from each of the pages up to three URLs referencing locations within the same top level domain. The resulting set theoretically contains 18,000 URLs. In practice this number is lower, since a number of URLs did not contain any links or were simply inaccessible to the crawler at the time of the crawl in February of 2009.
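As an illustration of this sampling step, the sketch below extracts up to three same-site links from a crawled page. It is only a sketch under assumptions, not the authors' crawler: it uses Python's standard html.parser and urllib.parse, and it interprets "locations within the same top level domain" loosely as links sharing the page's host, since the exact rule is not specified above.

```python
import random
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags from an HTML document."""
    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def sample_internal_links(page_url: str, html: str, k: int = 3) -> List[str]:
    """Return up to k randomly chosen links that stay on the same host
    as page_url (a loose reading of 'same top level domain')."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    internal = sorted({
        urljoin(page_url, href)
        for href in parser.hrefs
        if urlparse(urljoin(page_url, href)).netloc == host
    })
    return random.sample(internal, min(k, len(internal)))
```

For each of the 6,000 seed pages, calling sample_internal_links(seed_url, fetched_html) would yield at most three candidate URLs, giving the theoretical maximum of 18,000 URLs mentioned above.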
Similar to the filters applied in [22] (also with the implicit bias towards English-language web pages), we dismissed URLs that were not from the .com, .net, .org or .edu domains. In order to investigate the temporal change of web page titles, we checked the availability of all remaining URLs in the IA and found copies for a total of 1090 URLs. We call one particular copy of a web page in the IA, identified by a time stamp, an observation. We downloaded a total of more than 100,000 observations for our 1090 URLs. Table 1 summarizes the characteristics of all 1090 URLs that have observations in the IA. The length of a URL is the number of tokens the path to the referenced object contains. For example, the URLs foo.bar/ and foo.bar/index.html have a length of one, and foo.bar/bar/ as well as foo.bar/bar/index.html have a length of two. URLs from the .com domain (70.2%) as well as URLs of length one (45.7%) and two (35%) are dominant in our sample set.

3.2 Measures of Change

With the corpus created, we analyze the change of web page titles over time with two different measures. Since we anticipate a low degree of change, we first investigate the general frequency of change, meaning how often a title is modified over the time span covered by all available IA observations.

The second measure is meant to represent the degree of change of the titles over time. We use the Levenshtein score, which is derived from the minimum number of operations needed to transform one title into another, and compute it for all titles of our corpus. A low Levenshtein score means the compared titles are very dissimilar and a high score indicates a high level of similarity. The score is different from what is known as the Levenshtein distance, where a value of 1.0 means totally dissimilar strings and 0 indicates a match.
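To make the degree-of-change measure concrete, the following sketch computes an edit distance and a normalized similarity score for a pair of titles, together with the two comparison schemes (sliding and rooted) defined in the next paragraph and illustrated in Figure 3. One assumption is made explicit: we take the "Levenshtein score" to be the complement of the length-normalized Levenshtein distance, which matches the behavior described above (1.0 for identical titles, 0.0 for completely dissimilar ones); the exact formula used by the authors' tooling is not stated and is assumed here.

```python
from typing import List

def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def levenshtein_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical titles, 0.0 for completely
    dissimilar ones (the complement of the normalized distance)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / max(len(a), len(b))

def sliding_scores(titles: List[str]) -> List[float]:
    """Compare each observation's title with the immediately preceding one."""
    return [levenshtein_score(titles[i - 1], titles[i]) for i in range(1, len(titles))]

def rooted_scores(titles: List[str]) -> List[float]:
    """Compare every later observation's title with the first observation."""
    return [levenshtein_score(titles[0], t) for t in titles[1:]]

# Example: scores for the titles of five observations of one URL.
observed_titles = ["Acme Widgets", "Acme Widgets", "Acme Widgets - Home", "Acme Widgets - Home", "Acme Inc."]
print(sliding_scores(observed_titles))
print(rooted_scores(observed_titles))
```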

We compute the score in two different ways: the sliding and the rooted comparison. To explain the two methods, let us consider a URL with five observations O1...O5. The sliding comparison computes the Levenshtein score between O1 and O2, O2 and O3, O3 and O4, and O4 and O5. It continuously slides the comparison window forward by one observation, hence the name. The rooted method (for the same example) computes the score between O1 and O2, O1 and O3, O1 and O4, and O1 and O5, hence we call it a rooted comparison. This example is visually represented in Figure 3. We used the SimMetrics library (http://www.dcs.shef.ac.uk/~sam/simmetrics.html) to compute the Levenshtein scores.

[Figure 4: Number of Title Changes and Observations in the Internet Archive of all URLs]

[Figure 5: Mean Time Delta Between all Observations in the Internet Archive and Entire Time Span of Observations (in Days) of all URLs]

4. EXPERIMENTAL RESULTS

4.1 The Number of Changes

Each time a title changes, that is, the captured title of observation On+1 is different from the title of the earlier observation On, the frequency of change is increased by one. Figure 4 shows, in semi-log scale, the number of title changes and the total number of IA observations (y-axis) of all 1090 URLs (x-axis). The URLs are sorted in increasing order by number of observations first and number of title changes second. We generally observe a rather low frequency of change. The most "inconsistent" URL accounts for 25 title changes. This result confirms the intuition that titles are more stable than, for example, LSs of web pages. The number of observations of URLs in the IA over time does not impact the number of changes of their titles. For example, we see URLs with thousands of observations having similarly few title changes as URLs with fewer than 50 observations. This means that the frequency of title changes in our sample set is not biased towards the number of available observations in the IA.

Figure 5 is also plotted in semi-log scale. It displays the mean time that has passed between all available IA observations as well as the amount of time passed between the first and the last observation. Both values are measured in days and indicated on the y-axis. The ordering of the URLs in this graph is the same as in Figure 4. We can see that with the increasing number of IA observations per URL, the time gap between observations decreases. The overall time span between the first and the last observation starts off high and slightly increases with the rising number of IA observations. This result indicates that URLs with many observations in the IA have been crawled frequently in the past, within a rather short period of time, and most likely are still being crawled with that frequency. It further points to an early start of the crawl for such URLs, since the overall time span of all observations is high. Since the web is growing and the IA claims to constantly increase the number of pages crawled [20], this observation matches our intuition. However, we are not in the position to say whether just the frequency of crawls for already indexed pages increas