PREDICTING ZERO-DAY SOFTWARE VULNERABILITIES THROUGH DATA MINING
SECOND PRESENTATION
Su Zhang

Outline
Quick Review
Data Source: NVD
Six Most Popular/Vulnerable Vendors For Our Experiments
Why The Six Vendors Are Chosen
Data Preprocessing
Functions Available For Our Approach
Statistical Results
Plan For Next Phase

Quick Review

Source Database: NVD
National Vulnerability Database: U.S. government repository of standards-based vulnerability management data.
Data included in each NVD entry:
  Published Date Time
  Vulnerable software (CPE Specification)
Derived data:
  Month (from Published Date Time)
  Day (from Published Date Time)
  Version diff (from the CPE versions of two adjacent vulnerabilities, diff(v1, v2))
  Software Name (from CPE Specification)
  ttpv, time to previous vulnerability (from adjacent, different Published Date Times)
  ttnv, time to next vulnerability (from adjacent, different Published Date Times)

Six Most Vulnerable/Popular Vendors
Linux: 56925 instances
Sun: 24726 instances
Cisco: 20120 instances
Mozilla: 19965 instances
Microsoft: 16703 instances
Apple: 14809 instances

Why We Only Choose Instances Of Pop Vendors: Instances Table
[Bar chart "Instances Table": number of instances per vendor; y-axis "Instances", 0 to 60000]

Why We Only Choose Instances Of Pop Vendors: Vulnerability Table
[Bar chart "Vulnerability Table": number of vulnerabilities per vendor; y-axis "Vul_Num", 0 to 2500]

Why We Only Choose Instances Of Pop Vendors
Huge size of nominal types (vendors and software) will result in a scalability issue.
Top six take up 43.4% of all instances.
We have too many vendors (10411) in NVD.
The seventh most popular/vulnerable vendor has far fewer instances than the sixth.
Vendors are independent for our approach.

Data Preprocessing
NVD data -> Training/Testing dataset
Start from 2005, since the data before that looks unstable.
Correct some obvious errors in NVD (e.g. cpe:/o:linux:linux_kernel:390).
Attributes:
  Published time: only the month and day are used.
  Version diff: a normalized difference between two versions.
  Vendor: removed.

Data Preprocessing (cont)
Attributes:
  Group vulnerabilities published on the same day, so that ttnv/ttpv are guaranteed to be non-zero.
  ttnv is the predicted attribute.
For each software:
  Delete its first bunch of instances.
  Delete its last bunch of instances.

Version diff Calculation
v1 = 3.6.4; v2 = 3.6; MaxVersionLength = 4
v1 = expand(v1, 4) = 3.6.4.0
v2 = expand(v2, 4) = 3.6.0.0
diff(v1, v2) = (3-3) * 100^{0} + (6-6) * 100^{-1} + (4-0) * 100^{-2} + (0-0) * 100^{-3} = 4E-4
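
The preprocessing steps and the version diff calculation above can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the names `expand`, `version_diff`, and `build_instances` and the record layout are hypothetical, version strings are assumed to be purely numeric and dot-separated, ttpv/ttnv are assumed to be day counts to the previous/next publication date of the same software, and "first/last bunch" is read as the first and last same-day group.

```python
from datetime import date
from itertools import groupby

MAX_VERSION_LENGTH = 4  # same value as in the worked example


def expand(version, length=MAX_VERSION_LENGTH):
    """Pad a dotted numeric version with zeros: '3.6' -> [3, 6, 0, 0]."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) > length:
        raise ValueError(f"{version!r} has more than {length} components")
    return parts + [0] * (length - len(parts))


def version_diff(v1, v2, length=MAX_VERSION_LENGTH):
    """Normalized difference: component i is weighted by 100**-i."""
    a, b = expand(v1, length), expand(v2, length)
    return sum((x - y) * 100.0 ** -i for i, (x, y) in enumerate(zip(a, b)))


def build_instances(records):
    """Turn (published_date, version) pairs of ONE software into instances.

    Entries published on the same day are merged into one group, so the
    gap between consecutive groups is never zero. The first and last
    groups are dropped because they lack a previous or a next neighbour.
    """
    records = sorted(records, key=lambda r: r[0])
    groups = [(day, [v for _, v in grp])
              for day, grp in groupby(records, key=lambda r: r[0])]
    instances = []
    for prev, cur, nxt in zip(groups, groups[1:], groups[2:]):
        ttpv = (cur[0] - prev[0]).days  # days since the previous group
        ttnv = (nxt[0] - cur[0]).days   # days until the next group
        for version in cur[1]:
            # Assumption: vdiff is measured against the last version
            # listed in the previous group; the slides do not specify this.
            vdiff = version_diff(version, prev[1][-1])
            instances.append(
                (version, cur[0].month, cur[0].day, vdiff, ttpv, ttnv))
    return instances
```

With these definitions, `version_diff("3.6.4", "3.6")` evaluates to approximately 4E-4, in line with the hand calculation. The weight base of 100 implicitly assumes that no version component exceeds 99; larger components would overlap with the next-higher position.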
An Example
vendor, soft, version, month, day, vdiff, ttpv, ttnv
linux, kernel, 2.6.18, 05, 02, 0, 70, 5
linux, kernel, 2.6.19.2, 05, 07, 1.02E-4, 5, 281

Functions Available For Our Approach On Weka
Least Mean Square
Linear Regression
Multilayer Perceptron
SMOreg
RBF Network
Gaussian Processes

Several Statistical Results
Function: Linear Regression
Training Dataset: 66% of Linux instances (randomly picked, since 2005).
Test Dataset: the remaining 34%.
Test Result:
  Correlation coefficient         0.5127
  Mean absolute error            11.2358
  Root mean squared error        25.4037
  Relative absolute error       107.629 %
  Root relative squared error    86.0388 %
  Total Number of Instances   17967

Correlation Coefficient
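
The correlation coefficient reported in the test result is the Pearson correlation between actual and predicted ttnv values. As a hedged sketch, it can be recomputed together with the four error measures from a list of actual and predicted values. The function names `pearson_r` and `regression_report` are hypothetical, and the baseline for the two relative errors is taken here to be the mean of the supplied actual values, which is a simplification and may differ slightly from what Weka uses internally.

```python
import math


def pearson_r(actual, predicted):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    # Undefined (division by zero) if either sequence is constant.
    return cov / math.sqrt(var_a * var_p)


def regression_report(actual, predicted):
    """Return the five figures shown in a regression test result."""
    n = len(actual)
    # Assumption: the baseline predictor always outputs this mean.
    mean_a = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    return {
        "correlation": pearson_r(actual, predicted),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }
```

Read this way, a relative absolute error above 100% means the model's absolute error is larger than that of the mean baseline, and a negative correlation coefficient means the predictions tend to move in the opposite direction from the actual values.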
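Several Definitions About Error
(p_i: predicted value, a_i: actual value, n: number of test instances, \bar{a}: mean actual value)
Mean absolute error: \frac{1}{n} \sum_{i=1}^{n} |p_i - a_i|
Root mean square error: \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2}

Several Definitions About Error (Cont)
Relative absolute error: \frac{\sum_{i=1}^{n} |p_i - a_i|}{\sum_{i=1}^{n} |a_i - \bar{a}|}
Root relative squared error: \sqrt{\frac{\sum_{i=1}^{n} (p_i - a_i)^2}{\sum_{i=1}^{n} (a_i - \bar{a})^2}}

Several Statistical Results
Function: Least Mean Square
Training Dataset: 66% of Linux instances (randomly picked, since 2005).
Test Dataset: the remaining 34%.
Test Result:
  Correlation coefficient        -0.1501
  Mean absolute error             7.6676
  Root mean squared error        30.6038
  Relative absolute error        73.449 %
  Root relative squared error   103.6507 %
  Total Number of Instances   17967

Several Statistical Results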
Function: Multilayer Perceptron
Training Dataset: 66% of Linux instances (randomly picked, since 2005).
Test Dataset: the remaining 34%.
Test Result:
  Correlation coefficient         0.9886
  Mean absolute error             0.4068
  Root mean squared error         4.6905
  Relative absolute error         3.7802 %
  Root relative squared error    15.1644 %
  Total Number of Instances   17967

Several Statistical Results
Function: RBF Network
Training Dataset: 66% of Linux instances (randomly picked, since 2005).
Test Dataset: the remaining 34%.
Test Result:
  Linear Regression Model: ttnv = -15.3206 * pCluster_0_1 + 21.6205
  Correlation coefficient         0.1822
  Mean absolute error            10.5857
  Root mean squared error        29.048
  Relative absolute error       101.4023 %
  Root relative squared error    98.3814 %
  Total Number of Instances   17967

Summary Of Current Results
Linear Regression: Not accurate enough, but looks promising (correlation coefficient: 0.5127).
Least Mean Square: Probably not suitable for our approach (negative correlation coefficient).
Multilayer Perceptron: Looks good, but it cannot provide us with a linear model.

Summary Of Current Results (Cont)
SMOreg: For most vendors, it takes too long to finish (usually more than 80 hours).
RBF Network: Not very accurate.
Gaussian Processes: Runs out of heap memory in most of our experiments.

Possible Ways To Improve The Accuracy Of Our Models
Add CVSS metrics as predictive attributes.
Binarize our predictive attributes (e.g. divide ttnv/ttpv into several categories).
Use regression SVM with multiple kernels.

Plan For Next Phase
Try to find an optimal model for our prediction.
If we get a good model, investigate how to apply it with MulVAL.
Otherwise, find out why it is not accurate enough.

Thank you!