CSE 544 - cs.stonybrook.edu

CSE 357 Statistical Methods for Data Science
Lecture 1: Intro and Logistics
Anshul Gandhi
Assistant Professor
Department of Computer Science

What is Data Science?
Analysis of data (using several tools/techniques)
Statistics/Data Analysis + CS

Who is a Data Scientist?
Statistics/Data Analysis + CS
Someone who is better at stats than the average CS person and someone who is better at CS than an average statistician.

Contact Info:
Anshul Gandhi
347, New CS building
[email protected]
[email protected]

Outline
1. Logistics
   Course info
   Lectures
   Course webpage
   Office hours
   Grading
2. Syllabus

Course Info
Taught previously as CSE 39x (Fall17, Spring19)
Probability theory
   Probability review (basics, conditional prob, Bayes theorem)
   Random variables (mean, variance, Geometric, Normal)
Statistical inference
   Non-parametric inference (empirical distribution, sample mean, bias, confidence intervals)
   Parametric inference (method of moments, max. likelihood)
   Hypothesis testing (truth table, various tests, p-values)
DS techniques
   Bayesian inference (Bayesian reasoning, conjugate priors)
   Regression analysis (linear regression, time series analysis)

Course Info
Prerequisites:
   Probability and Statistics: recommended (not necessary)
   Basic CS background
   Python (not necessary, but will help)
This is NOT a systems course
More of a theory + algorithms course

Course Info
Recommended texts:
Software: Available from DoIT

Lectures
Tues, Thurs: 4:00pm - 5:20pm
Frey 309
On echo (should be available)
5-min break at the halfway point
Whiteboard + maybe slides
Occasionally some programming (Python)
Interactive (please)
Carry a book, a real one!
Two guest lectures: (i) Python, (ii) Stats in medicine.

Example 1: Simple stats
X is a collection of 99 integers (positive and negative)
Mean(X) > 0
How many elements of X are > 0?
Same question, but now Median(X) > 0?
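One way to explore Example 1 is to build extreme cases in Python. The sketch below is only an illustration: the lists and the helper name count_positive are made up for this purpose and are not part of the course material. It suggests that Mean(X) > 0 guarantees only that at least one element is positive, while Median(X) > 0 forces at least 50 of the 99 elements to be positive.

```python
from statistics import mean, median

def count_positive(xs):
    """Number of strictly positive elements in xs."""
    return sum(1 for x in xs if x > 0)

# Mean(X) > 0 with as few positive elements as possible:
# a single large positive value outweighs 98 negative ones.
x_mean = [1000] + [-1] * 98

# Median(X) > 0 with as few positive elements as possible:
# the 50th smallest of 99 values is the median, so it and
# everything above it (50 values in total) must be positive.
x_median = [-1000] * 49 + [1] * 50

# With only 49 positive values the median cannot be positive.
x_too_few = [-1] * 50 + [1000] * 49

print(mean(x_mean) > 0, count_positive(x_mean))        # True 1
print(median(x_median) > 0, count_positive(x_median))  # True 50
print(median(x_too_few) > 0, count_positive(x_too_few))  # False 49
```

The mean is sensitive to the size of individual values, so one outlier can carry it; the median depends only on order, which is why it pins down a count.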

Course webpage
www.cs.stonybrook.edu/~cse357 (will redirect)
Please bookmark this page
This is your best resource!
Will be regularly updated
Piazza (link on website)
Blackboard for assignments, solutions, and grades

Office hours
Tues 1:30-2:30pm
Thurs 1:30-2:30pm
CS 347
Will re-visit after add/drop date
TA and TA office hours: TBA

Example 2: Correlation v/s Causation
[Figure labeled A and B not preserved]
Q1: Are A and B correlated?
Q2: Which of the following is true?
(i) A causes B
(ii) B causes A
(iii) Either (i) or (ii)
(iv) None of the above

Grading (also on website)
50% assignments
   6 assignments. Expect 5-6 questions/assignment.
   Later assignments will have more programming.
   Build on material taught in class, more challenging.
45% exams (2 exams)
   Similar to assignment questions, but shorter and simpler
   Mid-term 1: 20%, Mid-term 2: 25%
5% class participation
Exact %ages are somewhat tentative
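As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that the final score is a plain weighted sum; the slides do not say how scores are actually combined, and the names WEIGHTS, course_score, and the example student are hypothetical.

```python
# Weights as listed on the slide (described there as somewhat tentative).
WEIGHTS = {
    "assignments": 0.50,
    "midterm1": 0.20,
    "midterm2": 0.25,
    "participation": 0.05,
}

def course_score(scores):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical student, for illustration only.
example = {"assignments": 90, "midterm1": 80, "midterm2": 70, "participation": 100}
print(round(course_score(example), 2))  # 83.5
```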

Grading - assignments
50% assignments
6 assignments
5-6 problems per assignment
Collaboration is allowed (groups of at most 3 students)
One write-up per group.
DO NOT COPY across groups
Assignments due at the beginning of class
NO LATE SUBMISSIONS
Hard-copies only (typed/hand-written)
Some programming required for later assignments

Grading - exams
45% exams
Mid-terms 1 and 2
   20% mid-term 1
   25% mid-term 2
   Non-overlapping
   Roughly mid-way and at the end of the semester
Written exams
Closed-book exams
No programming questions
Somewhat easier than assignments
No collaborations, obviously
75 mins

Grading - class participation
5% class participation
Starts after add/drop date
Contribute to class discussions
Interactive
Very helpful for bumping your grade if you are on the border

Grading recap
50% assignments (6 assignments)
45% exams (two exams)
5% class participation

Example 3: Simpson's Paradox
A person earns above-average income in Developing Nation (A) but would earn below-average income in Developed Nation (B).
When that person moves from A to B:
Average income of A goes down
Average income of B goes down
Average income of A+B goes up!!

Example 3: Simpson's Paradox
Developing Nation (A): Person 1: 20K, Person X: 40K (X earns above-average income in A)
Developed Nation (B): Person 2: 100K, Person X: 80K (X earns below-average income in B)
Average income of A+B
Before: 160K/3 = 53.3K
After: 200K/3 = 66.7K
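The arithmetic in this example can be checked with a short Python sketch. It reads the slide as Person X earning 40K while in A and 80K after moving to B, which is the interpretation that reproduces the 160K/3 and 200K/3 totals; the dictionary layout and the helper name averages are illustrative choices, not part of the slides.

```python
from statistics import mean

# Incomes in thousands, taken from the slide.
before = {"A": {"Person 1": 20, "Person X": 40}, "B": {"Person 2": 100}}
after = {"A": {"Person 1": 20}, "B": {"Person 2": 100, "Person X": 80}}

def averages(nations):
    """Return (average income per nation, average income over everyone)."""
    per_nation = {name: mean(people.values()) for name, people in nations.items()}
    everyone = [income for people in nations.values() for income in people.values()]
    return per_nation, mean(everyone)

for label, nations in (("Before", before), ("After", after)):
    per_nation, overall = averages(nations)
    print(label, per_nation, round(overall, 1))
```

Both per-nation averages drop (A from 30K to 20K, B from 100K to 90K), yet the combined average rises from 53.3K to 66.7K because Person X's own income doubled.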

Example 3: Simpson's Paradox
Since 2000, the median US wage has risen about 1% (adjusted).
But over the same period, the median wage has decreased for each of these groups: high school dropouts, high school graduates with no college education, people with some college education, and people with Bachelor's or higher degrees.
In other words, within every educational subgroup, the median wage is lower now than it was in 2000.
How can both things be true?
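One standard way such a pattern can arise is a shift in how many people fall into each subgroup. The Python sketch below uses entirely hypothetical wages (not real US data) and only two education groups, with made-up names year_2000, now, and overall_median, to show every subgroup median falling while the overall median rises.

```python
from statistics import median

# Hypothetical wages in thousands; NOT real US data.
# Each group's wage drops, but the population shifts toward the higher-paid group.
year_2000 = {"no degree": [30] * 70, "degree": [60] * 30}
now = {"no degree": [28] * 30, "degree": [56] * 70}

def overall_median(groups):
    """Median wage over everyone, ignoring group labels."""
    return median(wage for wages in groups.values() for wage in wages)

for group in year_2000:
    print(group, median(year_2000[group]), "->", median(now[group]))
print("overall", overall_median(year_2000), "->", overall_median(now))
```

In this toy population each group's median falls (30 to 28, 60 to 56), while the overall median rises from 30 to 56 because most people now sit in the higher-paid group.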

Syllabus
www.cs.stonybrook.edu/~cse357

Next class
Probability review - 1
Basics: sample space, outcomes, probability
Events: mutually exclusive, independent
Calculating probability: sets, counting, tree diagram
