Subgroup Discovery in Defect Prediction - UCL

Subgroup Discovery in Defect Prediction - UCL

Subgroup Discovery in Defect Prediction Rachel Harrison, Oxford Brookes University D a n i e l R o d r g u e z , Un i v o f A l c a l a Jo s R i q u e l m e , Un i v o f S e v i l l e Roberto Ruiz, Pablo de Olavide University Outline Supervised Description Subgroup Discovery Preliminary Experimental Work Datasets Algorithms (SD and CN2-SD) Results Conclusions and future work Descriptive Models Typically, ML algorithms have been divided into: Predictive (Classification, Regression, temporal series) Descriptive (Clustering, Association, summarisation) Recently, supervised descriptive rule discovery is being introduced in the literature. The aim is to understand the underlying phenomena, not to classify new instances, i.e., to find information about a specific value in the class attribute. The information should be useful to the domain expert and easily interpretable. Types of supervised descriptive techniques include: Contrast Set Mining (CSM) Emerging Pattern Mining (EPM) Subgroup discovery (SD) SD Definition SD algorithms aims to find subgroups of data that are statistically different given a property of interest. [Klsgen, 96; Wrobel, 97] SD lies between predictive (finding rules given historical data and a property of interest) and descriptive tasks (discovering interesting patterns in data). SD algorithms generally extract rules from subsets of the data, having previously specified the concept, for example defective modules from a software metrics repository. Rules have also the "Condition Class" where the condition is the conjunction of a set of selected variables (pairs attributevalue) among all variables. Advantages of rules include that they are well known representations easily understandable by the domain experts So far, SD has mostly been applied to the medical domain.

SD vs. Classification Classification Predictive Inductio n Output Set of classification rules (dependent rules) Purpose To learn a model for classification or prediction Subgroup Discovery Descriptive Individual Rules to describe subgroups (independent rules) To find interesting and interpretable patterns with respect to a specific attribute SD vs. Classification S3 S1 Following [Herrera et al, 2011] S2 SD Algorithms SD algorithms could be classified as: Exhaustive (e.g.: SD-map, Apriori-SD) Heuristic (e.g.: SD, CN2-SD) Fuzzy genetic algorithms (SDIGA, MESDIF, EDER-SD) Or from their origin, evolved from different communities: Extension of classification algorithms (SD, CN2-SD, etc.) Extension of association algorithms (Apriori-SD, SD4TS, SD-Map, etc.) Comprehensive survey by [Herrera et al. 2011]

Quality Measures in SD Measures of Complexity Number of rules: It measures the number of induced rules. Number of conditions: It measures the number of conditions in the antecedent of the rule. Measures of Generality Coverage: where N is the number of samples and n(Cond) is the no. of instances that satisfy the antecedent of the rule. Support: where n(Cond Class) is the no. of instances that satisfy both the condition and the class Quality Measures in SD Measures of precision Confidence: Precision Qc: Precision Qg: Measures of interest Significance: ( )=2 ( ) ( ) =1 ( ) () Other Measures Sensitivity: False alarm: ) ( ( )= = =

( ) Specificity: ( )= Unusualness: ( ) = = + () Experimental Work Datasets NASA Datasets Originally available from: http://mdp.ivv.nasa.gov/ From PROMISE, using the ARFF format (Weka data mining toolkit): http://promisedata.org/ Boetticher, T. Menzies, T. Ostrand, Promise Repository of Empirical Software Engineering Data, 2007. Bug prediction dataset http://bug.inf.usi.ch/ D'Ambros, M., Lanza, M., Robbes, Romain, Empirical Software Engineering (EMSE), In press, 2011 Datasets Characteristics Some of these datasets are highly unbalanced, with duplicates and contradictory instances, and irrelevant attributes for defect prediction. # NonDef % Def Lang inst def CM1 KC1 KC2 KC3 MC2 MW1 PC1

Eclipse JDT Core Eclipse PDEUI Equinox Lucene Mylyn 498 2,109 522 458 161 434 1,109 997 449 1,783 415 415 109 403 1,032 791 49 326 107 43 52 31 77 206 9.83 15.45 20.49 9.39 32.29 7.14 6.94 20.66 C C++ C++ Java

C++ C++ C Java 1,497 1,288 209 13.96 Java 324 691 1,862 195 627 1,617 129 64 245 39.81 9.26 13.15 Java Java Java Metrics Used from the Datasets For the NASA datasets: McCabe Halstead Metric loc v(g) ev(g) iv(g)

uniqOp uniqOpnd Unique operands, n2 totalOp Total operators, N1 totalOpnd Branch For the OO datasets: Class C&K Class Definition McCabe's Lines of code Cyclomatic complexity Essential complexity Design complexity Unique operators, n1 branchCount Metric defective? wmc dit cbo noc lcom rfc defective? Total operands N2 No. branches of the flow graph Reported defects? Definition (true/false) Weighted Method Count Depth of Inheritance Tree Coupling Between Objects No. of Children

Lack of Cohesion in Methods Response For Class Reported defects? Algorithms The algorithms used: The Subgroup Discovery algorithm (SD) [Gamberger, 02] is a covering rule induction algorithm that using beam search aims to find rules that maximise: where TP and FP are the number of true and false positives respectively and g is a generalisation parameter that allow us to control the specificity of a rule, i.e., balance between the complexity of a rule and its accuracy. The CN2-SD [Lavrac, 04] algorithm is an adaptation of the CN2 classification rule algorithm [Clark, 89]. It induces subgroups in the form of rules using as a quality measure the relation between true positives and false positives. The original algorithm consists of a search procedure using beam search within a control procedure and the control procedure that iteratively performs the search. The CN2-SD algorithm uses Weighted Relative Accuracy (explained next) as a covering measure of the quality of the induced rules. Tool: Orange: http://orange.biolab.si/ Examples Rules KC2 Dataset # SD 0 1 2 3 4 5 6 7 8 9 10 11 pd .2 4 .2 8

.2 7 .2 7 .2 7 .2 4 .2 4 .2 3 .3 1 .2 9 .2 9 .2 8 pf 0 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 .0 1 Rules

TP 26 FP 0 ev(g) > 4 totalOpnd > 117 30 5 iv(G) > 8 uniqOpnd > 34 ev(g) > 4 29 5 loc > 100 uniqOpnd > 34 ev(g) > 4 29 5 loc > 100 iv(G) > 8 ev(g) > 4 29 5 26 5 26 5 25 5 totalOpnd > 117 34 5

loc > 100 iv(G) > 8 32 5 ev(g) > 4 iv(G) > 8 32 5 ev(g) > 4 uniqOpnd > 34 30 5 loc > 100 ev(g) > 4 loc > 100 iv(G) > 8 totalOpnd > 117 iv(G) > 8 uniqOp > 11 totalOp > 80 iv(G) > 8 uniqOpnd > 34 Example Rules JDT Core Dataset SD # 0 pd .27 1 .3 2 .3 3

.29 4 .29 5 .33 6 .32 7 .33 8 .32 9 .18 1 0 1 1 .19 .18 pf .0 2 .0 2 .0 2 .0 2 .0 2 .0 3

.0 3 .0 3 .0 3 .0 2 .0 2 .0 2 TP FP 56 16 Rules lcom > 171 rfc > 88 cbo > 16 wmc > 141 62 16 rfc > 88 wmc > 141 cbo > 16 62 16 cbo > 16 wmc > 141 60 16 lcom > 171 rfc > 88 wmc > 141 60 16 lcom > 171 wmc > 141 68 24

rfc > 88 wmc > 141 66 24 rfc > 88 wmc > 141 dit 5 68 24 wmc > 141 66 24 dit <= 5 wmc > 141 38 16 wmc > 141 noc = 0 dit 5 40 16 wmc > 141 noc = 0 38 16 cbo > 16 rfc > 88 noc > 0 dit 5 Cross-validation Results (10 CV) Co v SD CN2-SD SD

Sup Size Sig Cplx Acc RAc c AUC CM1 .233 .72 20 3.045 4.548 .029 .602 .748 KC1 .079 .426 20 2.61 .023 .61 .657

KC2 .085 .533 20 2.185 16.26 6 9.581 .049 .703 .74 KC3 .294 .91 20 2.435 5.651 .037 .608 .83 MC2 .161 .647 20

2.055 2.204 .042 .643 .689 MW1 .071 .5 20 2.515 3.767 .02 .736 .678 PC1 .118 .37 20 3.515 3.697 .01 .66 .621 CM1

KC1 KC2 .113 .107 .156 .64 .607 .795 5 5 5 .023 .03 .065 .628 .634 .733 .617 .71 .816 KC3 MC2 MW1 PC1 JDT Core .126 .152 .079 .087 .082 .885 .427 .558 .661 .539 4.9

5 5 5 20 2.972 2.912 11.78 7 3.146 2.186 3.517 2.814 13.77 4 .019 .04 .02 .007 .039 .68 .593 .661 .632 .662 .797 .593 .743 .688 .726 1.3 1.1 1.6 1.295 2.32 2.02 1.86 2.485 Visualisation of SD ROC and Rule visualisation for KC2 (SD & CN2-SD) Conclusions

Rules obtained using SD are intuitive but needed to be analysed by an expert. The metrics used for classifiers cannot be directly applied in SD and need to be adapted. Current and future work Further validation and application in other software engineering domains, e.g., project management. SD is a search problem! Development of new algorithms and metrics EDER-SD (Evolutionary Decision Rules SD) in Weka Unbalanced data (ROC, AUC metrics?), etc. Feature Selection (as a pre-processing step, part of the algorithm?, which metrics really influence defects) Discretisation Different search strategies and fitness functions (and multi-objective!) Use of global optimisation (set of metrics) vs. local metrics (individual metrics) References Kralj, P., Lavrac, N., Webb GI (2009) Supervised Descriptive Rule Discovery: A Unifying Survey of Constrast Set, Emerging Pateern and Subgroup Mining. Journal of Machine Learning Research 10: 377403 Kloesgen, W. (1996), Explora: A Multipattern and Multistrategy Discovery Assistant. In: Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, pp 249271 Wrobel, S. (1997), An Algorithm for Multi-relational Discovery of Subgroups. Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, Springer, LNAI, vol 1263, pp 7887 Bay S., Pazzani, M. (2001) Detecting Group Differences: Mining Contrast Sets. Data Mining and Knowledge Discovery 5: 213246 Dong, G., Li, J. (1999) Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp 4352 Herrera, F., Carmona del Jesus, C.J., Gonzalez, P., and del Jesus, M.J., An overview on subgroup discovery: Foundations and applications, Knowledge and Information Systems, 2011 In Press. Gamberger, D., Lavrac, N.: Expert-guided subgroup discovery: methodology and application. Journal of Artificial Intelligence Research 17 (2002) 501527 Lavrac, N., Kavsek, B., Flach, P., Todorovski, L.: Subgroup discovery with CN2-SD. The Journal of Machine Learning Research 5 (2004) 153188 Clark, P., Niblett, T. (1989) , The CN2 induction algorithm, Machine Learning 3 261283

Recently Viewed Presentations

  • Focus on Asia Workshop February 26, 2015 Japan

    Focus on Asia Workshop February 26, 2015 Japan

    First and only non stop service from/to Japan. First commercial flight of Boring's new aircraft, the 787, in the entire western hemisphere. Reduced 5hours off each way comparing connection flight. 787 Dreamliner is a dream come true for markets like...
  • Distribution D - PSSRA

    Distribution D - PSSRA

    Currently have 2 active assessment tasks due in 2019.Assess aircraft elevator IAW NSTM 588 using MRC FY16 for guidance; REPAIR: 144m situational periodicity condition: 3rd rail, ace wire ropes require replacement, replace aircraft elevator wire ropes in accordance with CSWT...
  • January 2019 CONSORTIUM Strengthening Career and Technical Education

    January 2019 CONSORTIUM Strengthening Career and Technical Education

    Each consortium has a minimum of one college and one school district. We do have one consortium with 43 school districts. You can see the organization is geographical. The consortia structure joins our more than343 high schools with the 31...
  • Human Communication - Houston Community College

    Human Communication - Houston Community College

    Arial Wingdings Times New Roman MS Pゴシック Univers (W1) Capsules 1_Capsules Theories of Interpersonal Communication and Relationships Attraction Relationship Rules Social Penetration Social Exchange and Equity Culture, Technology, Work, and Relationships Culture, Technology, Work, and Relationships cont…
  • Introduction to Nonfiction

    Introduction to Nonfiction

    It seemed like an ordinary day when she got up that morning, but Lynda was about to embark on the worst day of her life. First, she fell in the bathtub because her mother forgot to rinse out the bath...
  • Naming Ionic Compounds - Dublin Unified School District

    Naming Ionic Compounds - Dublin Unified School District

    Naming Ionic Compounds Ionic compounds Compounds are created by a combination of charged ions. Generally they contain a metal bonded to a nonmetal the positive ion (cations) and the negative ion (anions) can be thought of as the first and...
  • D3 Human Evolution - Ms De Souza&#x27;s Super Awesome IB Biology Class

    D3 Human Evolution - Ms De Souza's Super Awesome IB Biology Class

    The course of our evolution can be traced through the fossil record - namely fossils of hip bones, feet, leg bones, footsteps, and skulls ... FoxP2 gene - codes for a protein that is vital for speech. ... D3 Human...
  • UNIT 5 AS and AD and International Trade

    UNIT 5 AS and AD and International Trade

    Congress takes time to pass legislation. Operational Lag-Spending/planning takes time to organize and execute ( changing taxing is quicker) Politically Motivated Policies. Politicians may use economically inappropriate policies to get reelected. Ex: A senator promises more welfare and public works...