Transcription

An Introduction to the WEKA Data Mining SystemZdravko MarkovCentral Connecticut State [email protected] RussellUniversity of [email protected]

Data Mining "Drowning in Data yet Starving for Knowledge"? "Computers have promised us a fountain of wisdom but delivered a flood of data"William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus Data Mining: "The non trivial extraction of implicit, previously unknown, and potentiallyuseful information from data"William J Frawley, Gregory Piatetsky-Shapiro and Christopher J Matheus Data mining finds valuable information hidden in large volumes of data. Data mining is the analysis of data and the use of software techniques for findingpatterns and regularities in sets of data. Data Mining is an interdisciplinary field involving:– Databases– Statistics– Machine Learning– High Performance Computing– Visualization– Mathematics

Data Mining SoftwareKDnuggets : Polls : Data Mining Tools You Used in2005 (May 2005) PollData mining/Analytic tools youused in 2005 [376 voters, 860 votes total] Enterprise-level: (US 10,000 and more)Fair Isaac, IBM, Insightful, KXEN, Oracle, SAS, andSPSS Department-level: (from 1,000 to 9,999)Angoss, CART/MARS/TreeNet/Random Forests,Equbits, GhostMiner, Gornik, Mineset, MATLAB,Megaputer, Microsoft SQL Server, Statsoft Statistica,ThinkAnalytics Personal-level: (from 1 to 999): Excel, See5 Free: C4.5, R, Weka, Xelopes

Weka Data Mining SoftwareKDnuggets : News : 2005 : n13 : item2SIGKDD Service Award is the highest service award in the field of data mining and knowledge discovery. It is is givento one individual or one group who has performed significant service to the data mining and knowledge discoveryfield, including professional volunteer services in disseminating technical information to the field, education, andresearch funding.The 2005 ACM SIGKDD Service Award is presented to the Weka team for their development of the freely-availableWeka Data Mining Software, including the accompanying book Data Mining: Practical Machine Learning Tools andTechniques (now in second edition) and much other documentation.The Weka team includes Ian H. Witten and Eibe Frank, and the following major contributors (in alphabetical order oflast names): Remco R. Bouckaert, John G. Cleary, Sally Jo Cunningham, Andrew Donkin, Dale Fletcher, SteveGarner, Mark A. Hall, Geoffrey Holmes, Matt Humphrey, Lyn Hunt, Stuart Inglis, Ashraf M. Kibriya, RichardKirkby, Brent Martin, Bob McQueen, Craig G. Nevill-Manning, Bernhard Pfahringer, Peter Reutemann, GabiSchmidberger, Lloyd A. Smith, Tony C. Smith, Kai Ming Ting, Leonard E. Trigg, Yong Wang, Malcolm Ware, andXin Xu.The Weka team has put a tremendous amount of effort into continuously developing and maintaining the system since1994. The development of Weka was funded by a grant from the New Zealand Government's Foundation forResearch, Science and Technology.The key features responsible for Weka's success are:– it provides many different algorithms for data mining and machine learning– is is open source and freely available– it is platform-independent– it is easily useable by people who are not data mining specialists– it provides flexible facilities for scripting experiments– it has kept up-to-date, with new algorithms being added as they appear in the research literature.

Weka Data Mining SoftwareKDnuggets : News : 2005 : n13 : item2 (cont.)The Weka Data Mining Software has been downloaded 200,000 times since it was put on SourceForge in April2000, and is currently downloaded at a rate of 10,000/month. The Weka mailing list has over 1100subscribers in 50 countries, including subscribers from many major companies.There are 15 well-documented substantial projects that incorporate, wrap or extend Weka, and no doubt manymore that have not been reported on Sourceforge.Ian H. Witten and Eibe Frank also wrote a very popular book "Data Mining: Practical Machine LearningTools and Techniques" (now in the second edition), that seamlessly integrates Weka system into teachingof data mining and machine learning. In addition, they provided excellent teaching material on the bookwebsite.This book became one of the most popular textbooks for data mining and machine learning, and is veryfrequently cited in scientific publications.Weka is a landmark system in the history of the data mining and machine learning research communities,because it is the only toolkit that has gained such widespread adoption and survived for an extended periodof time (the first version of Weka was released 11 years ago). Other data mining and machine learningsystems that have achieved this are individual systems, such as C4.5, not toolkits.Since Weka is freely available for download and offers many powerful features (sometimes not found incommercial data mining software), it has become one of the most widely used data mining systems. Wekaalso became one of the favorite vehicles for data mining research and helped to advance it by making manypowerful features available to all.In sum, the Weka team has made an outstanding contribution to the data mining field.

Using Weka to teach Machine Learning, Data and Web Mininghttp://uhaweb.hartford.edu/compsci/ccli/

Machine Learning, Data and Web Miningby Example(“learning by doing” approach) Data preprocessing and visualizationAttribute selectionClassification (OneR, Decision trees)Prediction (Nearest neighbor)Model evaluationClustering (K-means, Cobweb)Association rules

Data preprocessing and visualizationInitial Data Preparation(Weka data input) Raw data (Japanese loan data) Web/Text documents (Department data)

Data preprocessing and visualizationJapanese loan data (a sample from a loan history database of a Japanese bank)Clients: s1,., s20 Approved loan: s1, s2, s4, s5, s6, s7, s8, s9, s14, s15, s17, s18, s19Rejected loan: s3, s10, s11, s12, s13, s16, s20Clients data: unemployed clients: s3, s10, s12loan is to buy a personal computer: s1, s2, s3, s4, s5, s6, s7, s8, s9, s10loan is to buy a car: s11, s12, s13, s14, s15, s16, s17, s18, s19, s20male clients: s6, s7, s8, s9, s10, s16, s17, s18, s19, s20not married: s1, s2, s5, s6, s7, s11, s13, s14, s16, s18live in problematic area: s3, s5age: s1 18, s2 20, s3 25, s4 40, s5 50, s6 18, s7 22, s8 28, s9 40, s10 50, s11 18, s12 20,s13 25, s14 38, s15 50, s16 19, s17 21, s18 25, s19 38, s20 50money in a bank (x10000 yen): s1 20, s2 10, s3 5, s4 5, s5 5, s6 10, s7 10, s8 15, s9 20, s10 5,s11 50, s12 50, s13 50, s14 150, s15 50, s16 50, s17 150, s18 150, s19 100, s20 50monthly pay (x10000 yen): s1 2, s2 2, s3 4, s4 7, s5 4, s6 5, s7 3, s8 4, s9 2, s10 4, s11 8,s12 10, s13 5, s14 10, s15 15, s16 7, s17 3, s18 10, s19 10, s20 10months for the loan: s1 15, s2 20, s3 12, s4 12, s5 12, s6 8, s7 8, s8 10, s9 20, s10 12, s11 20,s12 20, s13 20, s14 20, s15 20, s16 20, s17 20, s18 20, s19 20, s20 30years with the last employer: s1 1, s2 2, s3 0, s4 2, s5 25, s6 1, s7 4, s8 5, s9 15, s10 0, s11 1,s12 2, s13 5, s14 15, s15 8, s16 2, s17 3, s18 2, s19 15, s20 2

Data preprocessing and visualizationRelations, attributes, tuples (instances)Loan data – CVS format(LoanData.cvs)

Data preprocessing and visualizationAttribute-Relation File Format (ARFF) - http://www.cs.waikato.ac.nz/ ml/weka/arff.html

Data preprocessing and visualizationDownload and install Weka - http://www.cs.waikato.ac.nz/ ml/weka/

Data preprocessing and visualizationRun Weka and select the Explorer

Data preprocessing and visualizationLoad data into Weka – ARFF format or CVS format (click on “Open file ”)

Data preprocessing and visualizationConverting data formats through Weka (click on “Save ”)

Data preprocessing and visualizationEditing data in Weka (click on ”Edit ”)

Data preprocessing and visualizationExamining data Attribute type and properties Class (last attribute) distribution

Data preprocessing and visualizationClick on “Visualize All”

Data preprocessing and visualizationWeb/Text documents - Department datahttp://www.cs.ccsu.edu/ markov/ Download Ch1, DMW Book Download datasets

Data preprocessing and visualizationConvert HTML to Text

Data preprocessing and visualizationLoading text data in Weka String format for ID and content One document per line Add class (nominal) if needed

Data preprocessing and visualizationConverting a string attribute into nominalChoose filters/unsupervised/attribute/StringToNominaland and set the index to 1

Data preprocessing and visualizationConverting a string attribute into nominalClick on Apply – document name is now nominal

Data preprocessing and visualizationConverting text data into TFIDF (Term Frequency – Inverted Document Frequency) attribute format Choose filters/unsupervised/attribute/StringToWordVector Set the parameters as needed (see “More”) Click on “Apply”

Data preprocessing and visualizationMake the class attribute last Choose filters/unsupervised/attribute/Copy Set the index to 2 and click on Apply Remove attribute 2

Data preprocessing and visualization Change the attributes to nominal (use NumericToBinary filter) Save data on a file for further use

Data preprocessing and visualizationARFF file representing the department data in binary format (NonSparse)Note the format (seeSparseToNonSparseinstance filter)

Attribute SelectionFinding a minimal set of attributes that preserve the class distributionAttribute relevance with respect to the class – not relevant attribute (accounting)IF accounting 1 THEN class A (Error 0, Coverage 1 instance overfitting )IF accounting 0 THEN class B (Error 10/19, Coverage 19 instances low accuracy)

Attribute SelectionAttribute relevance with respect to the class – relevant attribute (science)IF accounting 1 THEN class A (Error 0, Coverage 7 instance)IF accounting 0 THEN class B (Error 4/13, Coverage 13 instances)

Attribute Selection (with document name)

Attribute Selection (without document name)

Attribute Selection (ranking)

Attribute Selection (explanation of ranking)

Attribute Selection (using filters) Choose filters/supervised/attribute/AttributeSelection Set parameters to InfoGainAttributeEval and Ranker Click on Apply and see the attribute ordering

Attribute Selection (using filters)

Classification – creating models (hypotheses)Mapping (independent attributes - class)Inferring rudimentary rules - OneRWeather data /5sunny - noovercast - yes 0/42/5rainy - yesTotalerror4/14temperature hot - nomild - yescool - yes2/42/61/45/14humidity3/71/74/14false - yes2/85/14true - no3/5high - nonormal - yeswindy

Classification – OneR

Classification – decision treeRight click on the highlighted line in Result list and choose Visualize tree

Classification – decision treeTop-down induction of decision trees (TDIDT, old approach knowfrom pattern recognition): Select an attribute for root node and create a branch for eachpossible attribute value. Split the instances into subsets (one for each branchextending from the node). Repeat the procedure recursively for each branch, using onlyinstances that reach the branch (those that satisfy theconditions along the path from the root to the branch). Stop if all instances have the same class.ID3, C4.5, J48 (Weka): Select the attribute that minimizes theclass entropy in the split.

Classification – numeric attributesweather.arff

Classification – predicting classClick on Set Click on Open file

Classification – predicting classRight click on the highlighted line in Result list and choose Visualize classifier errorsClick on the square

Classification – predicting classClick on Save

Prediction (no model, lazy learning)test: (sunny, cool, high, TRUE, ?) K-nearest neighbor (KNN, IBk)Take the class of the nearest neighboror the majority class among K neighborsK 1 - noK 3 - noK 5 - yesK 14 - yes (Majority predictor, ZeroR) Weighted K-nearest neighborK 5 - undecidedno 1/1 1/2 1.5yes 1/2 1/2 1/2 1.5X2891112 10Distance(test,X)12222 4playnonoyesyesyes yes Distance is calculated as the number of different attribute values Euclidean distance for numeric attributes

Prediction (no model, lazy learning)

Prediction (no model, lazy inary-training

Prediction (no model, lazy learning)

Model evaluation – holdout (percentage split)Click on More options

Model evaluation – cross validation

Model evaluation – leave one out cross validation

Model evaluation – confusion (contingency) matrixactualpredictedaba21b20

Clustering – k-meansClick on Ignore attributes

Hierarchical Clustering – Cobweb

Association Rules (A B) Confidence (accuracy): P(B A) (# of tuples containing both A and B) / (# of tuples containing A). Support (coverage): P(A,B) (# of tuples containing both A and B) / (total # of tuples)

Association Rules

And many more Thank you!