Statistics vs Big Data
Constantinos Daskalakis, CSAIL and EECS, MIT
Greek Stochastics

"YOU WANT BIG DATA? I'LL GIVE YOU BIG DATA!"

BIG data:
- Facebook: 20 petabytes of images daily
- Human genome: 40 exabytes of storage by 2025
- SKA Telescope: 1 exabyte daily

High-dimensional data: DNA microarrays, computer vision. Expensive data: experimental drugs, financial records.

What properties do your BIG distributions have?

e.g. 1: Playing the lottery. Is it uniform? Is the lottery unfair? From Hitlotto.com: "Lottery experts agree, past number histories can be the key to predicting future winners." True story! Polish lottery Multilotek: choose 20 distinct numbers uniformly at random out of 1 to 80. The initial machine was biased:

e.g., the probability of drawing numbers in 50-59 was too small. Thanks to Krzysztof Onak (pointer) and Eric Price (graph).

New Jersey Pick k (k = 3, 4) Lottery: pick k random digits in order; 10^k possible values. Data: Pick 3 - 8522 results from 5/22/75 to 10/15/00. A chi-squared test (on Excel) answers "42% confidence".

Pick 4 - 6544 results from 9/1/77 to 10/15/00. With fewer results than possible values, it is not a good idea to run the test: convergence to the limiting chi-squared distribution won't kick in at such a small sample size. (Textbook) rule of thumb to run the test: an expected number of at least 5 observations of each element in the domain under the null hypothesis.

e.g. 2: Independence Testing. Shopping patterns:
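The rule of thumb above is mechanical to check. A minimal sketch (the draw counts and domain sizes match the Pick 3 / Pick 4 figures quoted above; the helper names are mine):

```python
def expected_count_rule_of_thumb(num_draws, domain_size, min_expected=5):
    """Textbook precondition for Pearson's chi-squared test: under the
    uniform null, every domain element should be expected >= 5 times."""
    return num_draws / domain_size >= min_expected

def chi_squared_statistic(counts, num_draws, domain_size):
    """Pearson's chi-squared statistic against the uniform null,
    where counts maps domain element -> number of appearances."""
    expected = num_draws / domain_size
    return sum((counts.get(i, 0) - expected) ** 2 / expected
               for i in range(domain_size))

# Pick 3: 1000 possible values, 8522 draws -> expected count 8.522 >= 5: OK.
print(expected_count_rule_of_thumb(8522, 1000))   # True
# Pick 4: 10000 possible values, only 6544 draws -> expected 0.65: not OK.
print(expected_count_rule_of_thumb(6544, 10000))  # False
```

With the Pick 4 numbers the precondition fails, which is exactly why the slide warns against running the test there.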

Independent of zip code?

e.g. 2: Linkage Disequilibrium. Genome: locus 1, locus 2, ..., single nucleotide polymorphisms: are they independent? With many loci, each with a few possible states, the state of one's genome is a sample from some distribution over a huge domain. Question: is it a product distribution, OR far from all product distributions?

Should we expect the genomes from the 1000 Genomes Project to be sufficient? Up to how many loci?

e.g. 3: Outbreaks of disease. Similar patterns in different years? More prevalent near large airports? (Flu 2005 vs. Flu 2006.)

Distributions on BIG domains. Given samples of a distribution, we need to know, e.g.: entropy; number of distinct elements; shape (monotone, unimodal, etc.); closeness to uniform, Gaussian, Zipfian.

No assumptions on the shape of the distribution, i.e., no parametric form, smoothness, monotonicity, normality, ... Considered in statistics, information theory, machine learning, databases, algorithms, physics, biology, ... Old questions, new challenges.

Classical setting: small domain (e.g., repeated coin tosses).

Modern setting: large domain (comparatively few samples), e.g. the domain of one human genome. Classically: asymptotic analysis, computation not crucial. New challenges: samples, computation, communication, storage.

A Key Question: how many samples do you need, in terms of the domain size? Do you need to estimate the probability of each domain item -- OR -- can the sample complexity be sublinear in the size of the domain? Sublinearity rules out standard statistical techniques. Aim: algorithms with sublinear sample complexity.

STATISTICS

MACHINE LEARNING, INFORMATION THEORY, ALGORITHMS

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Problem formulation (discrete). Model: P, a family of distributions over a discrete domain; P may be non-parametric, e.g. unimodal, product, log-concave. Problem: given samples from an unknown p, with probability 0.9 distinguish

p in P vs d_TV(p, P) > eps. Objective: minimize the number of samples; computational efficiency. Here d_TV(p, q) = (1/2) sum_x |p(x) - q(x)| = max_A (p(A) - q(A)). Can the sample complexity be sublinear in the domain size?

A Well-studied Problem: (composite) hypothesis testing. Neyman-Pearson test; Kolmogorov-Smirnov test; Pearson's chi-squared test; generalized likelihood ratio test.

Quantities of interest in the classical theory: consistency; error exponents as the number of samples grows; an asymptotic regime whose results kick in only at large sample sizes. Our focus instead: Is the sample complexity sublinear in the domain size? Is there strong, finite-sample control of false positives?

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Testing Fairness of a Coin. p: the unknown probability of heads.

Question: is p = 1/2 OR |p - 1/2| >= eps? Goal: toss the coin several times, deduce the correct answer w/ probability >= 0.9. Number of samples required? Can estimate p by tossing the coin O(1/eps^2) times, then taking the empirical frequency; by concentration bounds, the estimate is within eps/2 of p w/ probability >= 0.9. Are Omega(1/eps^2) samples necessary? Suppose there were a tester using o(1/eps^2) samples. Then it could distinguish one sample from Ber(1/2)^m, where each coordinate has bias 1/2, from one sample from Ber(1/2 + eps)^m, where each coordinate has bias 1/2 + eps, w/ probability >= 0.9.

Claim: any such tester has error probability at least a constant, so Omega(1/eps^2) tosses are necessary.

Testing Uniformity. p: an unknown distribution over [n]; sample access to p. Question: is p = U_n or d_TV(p, U_n) > eps? [Paninski'03]: Theta(sqrt(n)/eps^2) samples and time.

Intuition (lower bound): suppose the null is the uniform distribution over [n] and p is either uniform on [n] or uniform on a random subset of size n/2. Unless Omega(sqrt(n)) samples are observed, there are no collisions, hence the two cases cannot be distinguished. (Upper bound): collision statistics suffice to distinguish.

Proving Lower Bounds [Le Cam'73]: consider two disjoint sets of distributions P1, P2. Suppose an algorithm is given samples from some unknown p and claims to distinguish p in P1 vs p in P2. Then its error is controlled by the total variation distance between the distributions generating samples as follows:
- choose a random distribution p from Pi (according to some distribution over Pi)
- then generate samples from p.
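The collision-count upper bound can be sketched concretely: the collision rate of m samples estimates the l2 norm ||p||_2^2, which equals 1/n exactly when p is uniform and is noticeably larger when p is far from uniform. Threshold and names below are my crude choices, not the slides':

```python
import random
from collections import Counter

def collision_estimate(samples):
    """Unbiased estimate of ||p||_2^2 from pairwise collisions."""
    m = len(samples)
    counts = Counter(samples)
    collisions = sum(c * (c - 1) // 2 for c in counts.values())
    return collisions / (m * (m - 1) / 2)

def looks_uniform(samples, n, eps=0.5):
    """Uniform p minimizes ||p||_2^2 at 1/n; eps-far distributions have
    ||p||_2^2 > (1 + 4*eps^2)/n, so threshold in between (crude sketch)."""
    return collision_estimate(samples) < (1 + eps ** 2) / n

rng = random.Random(1)
n, m = 100, 5000
uniform_samples = [rng.randrange(n) for _ in range(m)]
half_samples = [rng.randrange(n // 2) for _ in range(m)]  # far from uniform
print(looks_uniform(uniform_samples, n), looks_uniform(half_samples, n))
```

The second distribution is uniform on half the domain, exactly the hard instance from the lower-bound intuition, and its l2 norm 2/n is double the uniform one.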


Proving Lower Bounds [Le Cam'73] (cont'd). To prove the lower bound for uniformity testing, take: P1 = {the uniform distribution over [n]}, P2 = {distributions uniform on a size-n/2 subset of [n]}, and generate samples by choosing a random distribution from P2 (uniformly over subsets) and then sampling from it.

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Identity Testing (goodness of fit). p, q: distributions over [n]; q is given explicitly, sample access to p.

Question: is p = q or d_TV(p, q) > eps? [Batu-Fischer-Fortnow-Kumar-Rubinfeld-White'01], [Paninski'08, Valiant-Valiant'14]: Theta(sqrt(n)/eps^2) samples and time. p: unknown;

p = q vs d_TV(p, q) > eps. [w/ Acharya-Kamath'15]: a tolerant goodness-of-fit test with the same sample size can distinguish: d_chi2(p, q) <= eps^2/2 vs d_TV(p, q) > eps. (Cauchy-Schwarz: d_TV(p, q) <= (1/2) sqrt(d_chi2(p, q)), so chi-squared closeness is the stronger requirement.)

A new chi-squared-based Goodness of Fit Test. Goal: given q and sample access to p, distinguish Case 1: d_chi2(p, q) <= eps^2/2 vs Case 2: d_TV(p, q) > eps. Approach: draw m samples from p; let N_i = # of appearances of symbol i. Side note: Pearson's test uses the statistic sum_i (N_i - m*q_i)^2 / (m*q_i); under Poissonization the N_i are independent random variables.

Statistic: Z = sum_i ((N_i - m*q_i)^2 - N_i) / (m*q_i). Subtracting N_i in the numerator gives an unbiased estimator (under Poissonization) of m * d_chi2(p, q) and, importantly, may hugely decrease the variance. Case 1: E[Z] is small; Case 2: E[Z] is large; (chug, chug, chug) bound the variance of Z, and O(sqrt(n)/eps^2) samples suffice to distinguish.
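The modified statistic is a one-liner to compute; here is a sketch (Poissonization omitted, so Z is only nearly unbiased; names and the toy instance are mine):

```python
import random
from collections import Counter

def modified_chi2(samples, q):
    """Z = sum_i ((N_i - m*q_i)^2 - N_i) / (m*q_i).
    Subtracting N_i makes Z (nearly) unbiased for m * chi2(p, q):
    Z is ~0 when p = q, and ~m * d_chi2(p, q) when p is far from q."""
    m = len(samples)
    counts = Counter(samples)
    return sum(((counts.get(i, 0) - m * qi) ** 2 - counts.get(i, 0)) / (m * qi)
               for i, qi in enumerate(q))

rng = random.Random(7)
n, m = 50, 4000
q = [1 / n] * n                                  # hypothesis: uniform over [n]
same = [rng.randrange(n) for _ in range(m)]      # p = q
far = [rng.randrange(n // 2) for _ in range(m)]  # p uniform on half the domain
print(modified_chi2(same, q), modified_chi2(far, q))
```

For the far case d_chi2(p, q) = 1, so Z concentrates near m = 4000, while in the identical case Z hovers near 0: the gap the test thresholds on.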

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Testing Properties of Distributions. So far P = {a single distribution q}: restrictive, as we rarely know the hypothesis distribution exactly.

Natural extension: test structural properties. Monotonicity: the PDF is monotone, e.g. cancer rate vs. distance from a radiation source. Unimodality: the PDF is single-peaked, e.g. a single source of disease. Log-concavity: log p is concave. Monotone hazard rate: log(1 - F) is concave. Product distribution: e.g. testing linkage disequilibrium. Example question: given sample access to p, is p unimodal OR eps-far from all unimodal distributions?

Testing Properties of Distributions [w/ Acharya and Kamath 2015]:
1. Testing identity, monotonicity, log-concavity, monotone hazard rate, unimodality for distributions over [n] (ordered set) is doable w/ O(sqrt(n)/eps^2) samples and time.
2. Testing monotonicity/independence of a distribution over [n]^d is doable w/ O(n^(d/2)/eps^2) samples and time.

Previous best for monotonicity testing: [Bhattacharyya-Fischer-Rubinfeld-Valiant'11]. Previous best for independence: d = 2 only, with worse bounds [Batu et al.'01].
3. All bounds above are optimal: matching lower bounds for 1 and 2 via Le Cam.
4. Unified approach, computationally efficient tests.
N.B. Contemporaneous work of [Canonne et al. 2015] provides a different unified approach for testing structure, but their results are suboptimal.

A Natural Approach

Goal: given P and sample access to p, distinguish p in P vs d_TV(p, P) > eps. Step 1: choose a hypothesis q in P. Step 2: test the hypothesis (how well does q fit p?).

A Natural Approach (cont'd). Goal: given P and sample access to p, distinguish p in P vs d_TV(p, P) > eps. A Learning-Followed-By-Testing algorithm:
1. Learn a hypothesis q in P with d_TV(p, q) small whenever p in P (needs a cheap proper learner)

(q in P is automatic since the learner is proper.)
2. Reduce to tolerant goodness of fit: given q, sample access to p, and an explicit description of q, distinguish d_TV(p, q) small vs d_TV(p, q) > eps. But a tolerant tester requires an almost linear number of samples in the support size, namely Theta(n/log n) samples [Valiant-Valiant'10]. Could try investing more samples for more accurate learning, but the proper-learning complexity vs tolerant-testing complexity tradeoff does not work out to give the optimal testing complexity.

A Modified Approach. Goal: given P and sample access to p, distinguish p in P vs d_TV(p, P) > eps. A Learning-Followed-By-Testing algorithm:
1. Learn a hypothesis q in P with d_chi2(p, q) <= eps^2/2 whenever p in P (needs a cheap proper learner; q in P automatic since the learner is proper).
2. Reduce to tolerant goodness of fit:

given q, sample access to p, and an explicit description of q, distinguish d_chi2(p, q) <= eps^2/2 vs d_TV(p, q) > eps. Now tolerant testing has the right complexity of O(sqrt(n)/eps^2). Pertinent question: are there sublinear proper learners in chi-squared distance? We show that the chi-squared learning complexity is dominated by the testing complexity for all properties of distributions we consider. (Tutorial: part 2.)

Summary so far

Hypothesis testing in the small-sample regime: i.i.d. samples -> Test -> Pass/Fail?

p: an unknown distribution over some discrete set; P: a set of distributions over that set. Given: sample access to p. Goal: w/ prob >= 0.9 tell p in P vs d_TV(p, P) > eps. Properties of interest: is p uniform? unimodal? log-concave? MHR? a product measure? All of the above can be tested w/ O(sqrt(n)/eps^2) samples and time. Unified approach based on the modified Pearson goodness-of-fit statistic sum_i ((N_i - m*q_i)^2 - N_i)/(m*q_i); tight control of false positives: we want to be able to both assert and reject the null hypothesis; accommodates sublinear sample size.

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Other Distances (beyond l1). So far we focused on the l1 (a.k.a. total variation) distance: given sample access to p, w/ prob >= 0.9 distinguish p = q vs ||p - q||_1 > eps.

Stronger distances? [Acharya-D-Kamath]: the results are actually shown with tolerance in the stronger chi-squared sense; they should also extend to other distances. Weaker distances? l2 is easy to test [Goldreich-Ron], but makes less sense, e.g.:
p = (2/n, 2/n, ..., 2/n, 0, 0, ..., 0)
q = (0, 0, ..., 0, 2/n, 2/n, ..., 2/n)
l1 distance = 2 (maximal), l2 distance = 2/sqrt(n) (vanishing).

Tolerance. So far, we focused on the non-tolerant version: given a set of distributions P and sample access to p, distinguish: p in P vs d_TV(p, P) > eps.
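The disjoint-support example above can be computed exactly (n is the domain size; function names are mine):

```python
import math

def l1(p, q):
    """l1 distance: sum of coordinate-wise absolute differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def l2(p, q):
    """l2 (Euclidean) distance between the probability vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

n = 100
p = [2 / n] * (n // 2) + [0.0] * (n // 2)  # uniform on the first half
q = [0.0] * (n // 2) + [2 / n] * (n // 2)  # uniform on the second half
print(l1(p, q))  # ~2: maximal l1 distance, disjoint supports
print(l2(p, q))  # ~0.2 = 2/sqrt(100): tiny, despite disjoint supports
```

Same pair of distributions, maximally far in l1 yet nearly indistinguishable in l2: the slide's reason why l2 "makes less sense" for large domains.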

Tolerant version: distinguish d_TV(p, P) <= eps/2 vs d_TV(p, P) > eps. [Valiant-Valiant'10]: Theta(n/log n) samples are needed. Tolerant version in chi-squared: distinguish d_chi2(p, P) <= eps^2/2 vs d_TV(p, P) > eps. [w/ Acharya, Kamath'15]: O(sqrt(n)/eps^2) samples suffice. Different ratios between the two thresholds change the constants in the O() notation.

Goodness of Fit. Our goodness-of-fit test was given an explicit distribution q and sample access to a distribution p, and was asked to test p = q vs d_TV(p, q) > eps. Sometimes both distributions are unknown, e.g.: transactions of 20-30 yr olds vs transactions of 30-40 yr olds - same or different?

Goodness of Fit w/ two unknowns: p, q.

i.i.d. samples from p, i.i.d. samples from q -> Test -> Pass/Fail? Given sample access to two unknown distributions p, q, distinguish p = q vs d_TV(p, q) > eps.

Goodness of Fit w/ two unknowns.

[Batu-Fortnow-Rubinfeld-Smith-White], [P. Valiant], [Chan-Diakonikolas-Valiant-Valiant]: tight upper and lower bound of Theta(max(n^(2/3)/eps^(4/3), sqrt(n)/eps^2)). Why different from the one-unknown case? Collision statistics are all that matter, and collisions on heavy elements can hide the collision statistics of the rest of the domain: construct pairs of distributions where the heavy elements are identical, but the light elements are either identical or very different.

Continuous Distributions. What can we say about continuous distributions? Without extra assumptions such as smoothness of the PDF or parametric modeling, we cannot stick to hard distances (l1).
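The "heavy elements hide light ones" phenomenon is visible in exact arithmetic: a shared heavy element dominates the collision probability, so two distributions whose light parts differ substantially have nearly identical collision statistics. The specific masses below are my illustration, not the lower-bound construction itself:

```python
def collision_prob(p):
    """Probability that two i.i.d. draws from p collide: sum_i p_i^2."""
    return sum(pi * pi for pi in p)

n = 10_000
heavy = 0.5                        # one shared heavy element of mass 1/2
p = [heavy] + [0.5 / n] * n        # light mass spread over n elements ...
q = [heavy] + [1.0 / n] * (n // 2) + [0.0] * (n // 2)  # ... or over n/2

# The light parts are far apart in l1 (distance 0.5), yet the collision
# probabilities agree to four decimal places: 0.250025 vs 0.250050.
print(collision_prob(p), collision_prob(q))
```

A collision-only tester must see enough light-element collisions to notice the difference, which is where the extra n^(2/3) factor in the two-unknowns bound comes from.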

Instead of restricting p, q, let us switch distances. Can extend results if we switch to the Kolmogorov distance d_K(p, q) = sup_x |F_p(x) - F_q(x)| (sup over CDFs); recall d_K(p, q) <= d_TV(p, q). Now want to distinguish closeness vs farness in d_K. Claim: tolerant testing in Kolmogorov distance of any distribution property (continuous or discrete) of d-dimensional distributions can be performed from O_d(1/eps^2) samples. Importantly: Kolmogorov distance allows graceful scaling with the dimensionality of the data.

Dvoretzky-Kiefer-Wolfowitz inequality. Suppose m i.i.d. samples X_1, ..., X_m from a (single-dimensional) distribution with CDF F, and let F_m be the resulting empirical CDF, namely F_m(x) = (1/m) * #{i : X_i <= x}.

Then: Pr[ sup_x |F_m(x) - F(x)| > eps ] <= 2 exp(-2 m eps^2), i.e. O(log(1/delta)/eps^2) samples suffice to learn any single-dimensional distribution to within eps in Kolmogorov distance, w/ probability >= 1 - delta. VC inequality: the same is true for d-dimensional distributions when the number of samples grows proportionally to the VC dimension of the relevant family of sets over eps^2. After learning in Kolmogorov distance, one can tolerantly test any property. Runtime under investigation; trouble: computing/approximating the Kolmogorov distance of two high-dimensional distributions is generally a hard computational question.
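The DKW learning step above can be sketched directly: pick the sample size from the DKW bound, build the empirical CDF, and check the sup deviation. The constants, names, and the Uniform[0,1) example are my choices:

```python
import math
import random

def dkw_sample_size(eps, delta):
    """DKW: Pr[sup_x |F_m(x) - F(x)| > eps] <= 2*exp(-2*m*eps^2),
    so m = ln(2/delta) / (2*eps^2) samples suffice."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def kolmogorov_deviation(samples, true_cdf):
    """sup_x |F_hat(x) - F(x)|, attained at the sample points for a
    continuous true CDF (standard KS-statistic computation)."""
    xs = sorted(samples)
    m = len(xs)
    dev = 0.0
    for i, x in enumerate(xs):
        dev = max(dev, abs(i / m - true_cdf(x)), abs((i + 1) / m - true_cdf(x)))
    return dev

rng = random.Random(3)
eps, delta = 0.05, 0.001
m = dkw_sample_size(eps, delta)             # ~1520 samples, independent of any n
samples = [rng.random() for _ in range(m)]  # Uniform[0,1): F(x) = x
print(kolmogorov_deviation(samples, lambda x: x) <= eps)
```

Note the sample size depends only on eps and delta, which is exactly the "graceful scaling" the slide advertises, with the VC dimension entering only through the constant in higher dimensions.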

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions.

Testing in High Dimensions. Already talked about testing high-dimensional distributions in Kolmogorov distance; the sample complexity is O_d(1/eps^2). Next focus: discrete distributions, stronger distances.

High-Dimensional Discrete Distributions. Consider a source generating n-bit strings:
0011010101 (sample 1)
0101001110 (sample 2)
0011110100 (sample 3)
Are the bits/pixels independent?

Without structural assumptions, our algorithms require a number of samples exponential in n. Is a source generating graphs over v nodes Erdős-Rényi? Again, exponentially many samples are required. The exponential dependence on the dimension is unsettling, but necessary: the lower bound exploits the high possible correlation among the bits. But nature is not adversarial. Often high-dimensional systems have structure, e.g. Markov random fields (MRFs), graphical models (Bayes nets), etc. Testing high-dimensional distributions with combinatorial structure? (Think 400-bit images.)

High-Dimensional Discrete Distributions (cont'd). [w/ Dikkala, Kamath'16]: if the unknown p is known to be an Ising model, then polynomially many samples suffice to test independence and goodness of fit. (Extends to MRFs.)

Ising Model. Statistical physics, computer vision, neuroscience, social science. An Ising model is a probability distribution defined in terms of a graph, with edge potentials theta_uv and node potentials theta_u; the state space is {-1, +1}^V. High |theta|'s: strongly (anti-)correlated spins. theta = 0: independent spins.

Ising Model:

Strong vs weak ties: high temperature regime (weak ties), low temperature regime (strong ties).

Testing Ising Models. p(sigma) proportional to exp( sum_{(u,v) in E} theta_uv sigma_u sigma_v + sum_u theta_u sigma_u ), for sigma in {-1, +1}^n. Given: sample access to an Ising model p over {-1, +1}^n w/ n nodes and m edges; the potentials theta are unknown, and the graph G is unknown (supported on a known candidate set). Goal: distinguish p = uniform vs d_TV(p, uniform) > eps. [w/ Dikkala, Kamath'16]: polynomially many samples suffice.

Warmup: theta_u = 0. Under uniformity (independence), E[sigma_u sigma_v] = 0 for every pair u, v; far from uniformity, some pairwise correlation is noticeably nonzero. Hence, by estimating all pairwise correlations, one can distinguish which is the case, and even localize the departure from uniformity (independence) on some edge. Cheaper non-localizing test?
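The localizing warm-up test can be sketched as follows: estimate every pairwise correlation E[sigma_u sigma_v] and flag any pair far from 0. The threshold, names, and toy models are my choices, not the paper's:

```python
import random
from itertools import combinations

def pairwise_correlations(samples):
    """Empirical E[s_u * s_v] for every pair of coordinates, from
    i.i.d. +/-1 samples (each sample is a tuple of spins)."""
    m, n = len(samples), len(samples[0])
    return {(u, v): sum(s[u] * s[v] for s in samples) / m
            for u, v in combinations(range(n), 2)}

def flag_correlated_pairs(samples, tau=0.2):
    """Under independence of uniform +/-1 spins every correlation is 0;
    flag pairs whose empirical correlation exceeds the threshold tau."""
    return [uv for uv, c in pairwise_correlations(samples).items() if abs(c) > tau]

rng = random.Random(5)
n, m = 8, 2000
indep = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(m)]

def correlated_sample():
    s = [rng.choice((-1, 1)) for _ in range(n)]
    s[1] = s[0]  # tie spins 0 and 1 together
    return tuple(s)

corr = [correlated_sample() for _ in range(m)]
print(flag_correlated_pairs(indep))  # no pair flagged
print(flag_correlated_pairs(corr))   # the tied pair (0, 1) is flagged
```

This test localizes the violated edge, but it pays for estimating all ~n^2/2 correlations separately, which motivates the cheaper non-localizing statistic discussed next.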

Testing Ising Models (cont'd). Cheaper non-localizing test? Claim: by expending some samples, one can identify a distinguishing statistic of the form Z = sum_{u,v} c_uv sigma_u sigma_v, where |c_uv| <= 1 for all u, v. Issue: we can't bound Var(Z) intelligently, as the variables sigma_u sigma_v aren't pairwise independent. If they were, Var(Z) would be on the order of the number of terms. Otherwise, the best one can say is the trivial quadratic bound, and it is, in fact, tight: consider two disjoint cliques with super-strong ties, with all intra-clique theta_uv huge; then Z

dances around its mean by the trivial amount. That is the low temperature regime. How about high temperature? (Recall: theta = 0 gives independent spins; strong vs weak ties correspond to the low vs high temperature regimes.)

Low temperature: exponentially slow mixing of the Glauber dynamics; high temperature: rapid mixing of the Glauber dynamics.

Testing Ising Models (cont'd). p(sigma) proportional to exp( sum_{(u,v) in E} theta_uv sigma_u sigma_v + sum_u theta_u sigma_u ).

Focus: no external field (theta_u = 0). Given: sample access to an Ising model p over {-1, +1}^n, potentials and graph unknown. Goal: distinguish p = uniform vs d_TV(p, uniform) > eps. Claim: by expending some samples, can identify a distinguishing statistic of the form Z = sum_{u,v} c_uv sigma_u sigma_v, where |c_uv| <= 1 for all u, v. Issue: can't bound Var(Z) intelligently, as the sigma_u sigma_v aren't pairwise independent. If they were, Var(Z) would be linear in the number of terms. Low temperature: only the trivial quadratic bound holds. High temperature: Var(Z) = O(number of terms) for dense graphs; proof via exchangeable pairs [Stein, ..., Chatterjee 2006].

Exchangeable Pairs. Goal: given f and a random vector X, want to bound moments of f(X), or prove concentration of f(X) about its mean. Approach: define a pair of random vectors (X, X') such that: (X, X') has the same distribution as (X', X) (exchangeability); the marginal distributions of X and X' are the distribution of interest (faithfulness). Find an anti-symmetric function F (i.e. F(x, x') = -F(x', x)) such that

E[F(X, X') | X] = f(X) - E[f(X)].
Claims: 1. Var f(X) = (1/2) E[ (f(X) - f(X')) F(X, X') ]. 2. Let v(X) := (1/2) E[ |f(X) - f(X')| |F(X, X')| | X ]. If v(X) <= C a.s., then Pr[ |f(X) - E f(X)| > t ] <= 2 exp(-t^2 / (2C)).

Silly Example: f(X) = sum_i X_i, where the X_i are independent. Suppose for all i: E[X_i] = 0 and |X_i| <= 1 a.s. Goal: prove concentration of f(X) about its mean. Define the exchangeable pair as follows: sample X = (X_1, ..., X_n); pick a coordinate I u.a.r.; sample a fresh copy X_I' of X_I and set X' = (X_1, ..., X_I', ..., X_n). Choose the anti-symmetric function:

F(x, x') = n (f(x) - f(x')). Check: E[F(X, X') | X] = n * (1/n) sum_i (X_i - E X_i) = f(X) - E f(X). Bounding v(X)? Each coordinate swap changes f by |X_I - X_I'| <= 2, giving v(X) = O(n) and hence Hoeffding-style concentration.

Interesting Example: Ising. How to define the exchangeable pair? Natural approach: sample X ~ p; do one step of the Glauber dynamics from X to find X', i.e. pick a random node u and resample X_u from the marginal of p at u, conditioning on the state of all other nodes being X_{-u}. Harder question: find an anti-symmetric F s.t. E[F(X, X') | X] = f(X) - E f(X).
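One Glauber step as described above is easy to write down for an Ising model with no external field; the graph (a small cycle) and the potential value are my toy choices:

```python
import math
import random

def glauber_step(state, theta, rng):
    """Pick a uniformly random node and resample its spin from its
    conditional distribution given all other spins:
    Pr[s_u = +1 | rest] = 1 / (1 + exp(-2 * sum_v theta[u][v] * s_v))."""
    n = len(state)
    u = rng.randrange(n)
    field = sum(theta[u][v] * state[v] for v in range(n) if v != u)
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * field))
    new = list(state)
    new[u] = 1 if rng.random() < p_plus else -1
    return new

rng = random.Random(11)
n, beta = 6, 0.8
# Toy graph: a cycle with uniform edge potential beta on every edge.
theta = [[0.0] * n for _ in range(n)]
for u in range(n):
    theta[u][(u + 1) % n] = theta[(u + 1) % n][u] = beta

state = [rng.choice((-1, 1)) for _ in range(n)]
agree, steps = 0, 20000
for _ in range(steps):
    state = glauber_step(state, theta, rng)
    agree += sum(state[u] * state[(u + 1) % n] for u in range(n))
# Positive beta: neighboring spins agree on average along the run.
print(agree / (steps * n) > 0)
```

Running the chain from two different start states with a coupled choice of node is exactly the ingredient the sum-over-time construction of F needs.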

Approach that works for any f: F(x, x') := sum over t >= 0 of E[f(X_t) - f(X_t')], where (X_t), (X_t') are two (potentially coupled) executions of the Glauber dynamics starting from states x and x' respectively. Requires a good coupling. Challenging question: bound v(X). Need to show the function contracts as the Glauber dynamics unravels.

Showing Contraction. Generous coupling: choose the same node, but update independently.

Lemma 1: under the generous coupling, the discrepancy between the two executions contracts at each step. This coupling is different from the greedy coupling typically used, where the same node is chosen and the update is coordinated to maximize the probability of the same update. Lemma 2: contraction yields the desired bound on v(X).

The Menu: Motivation; Problem Formulation; Uniformity Testing, Goodness of Fit; Testing Properties of Distributions; Discussion/Road Ahead; Testing in High Dimensions; Future Directions.

Markov Chain Testing. Example: n = 52 cards, 7 riffle shuffles needed*

[Diaconis-Bayer'92]: d_TV(Uniform, 7 riffle shuffles) is about 0.334. (*riffle shuffle = Gilbert-Shannon-Reeds (GSR) model for a distribution on card permutations.) Empirical fact: real-world shuffles vs. GSR are different Markov chains! [Diaconis'03] [Ongoing work with Dikkala, Gravin] Question: how close is a real shuffle to the GSR distribution?
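The GSR model referenced above is simple to simulate: cut the deck at a Binomial(n, 1/2) position, then interleave, dropping a card from each packet with probability proportional to its remaining size. A sketch (function names are mine):

```python
import random

def gsr_riffle(deck, rng):
    """One Gilbert-Shannon-Reeds riffle shuffle of the deck."""
    n = len(deck)
    cut = sum(rng.random() < 0.5 for _ in range(n))  # cut ~ Binomial(n, 1/2)
    left, right = list(deck[:cut]), list(deck[cut:])
    out = []
    while left or right:
        # Drop from a packet with probability proportional to its size.
        if rng.randrange(len(left) + len(right)) < len(left):
            out.append(left.pop(0))
        else:
            out.append(right.pop(0))
    return out

rng = random.Random(2)
deck = list(range(52))
for _ in range(7):  # seven riffles: close to (but not exactly) uniform
    deck = gsr_riffle(deck, rng)
print(sorted(deck) == list(range(52)))  # a shuffle is still a permutation
```

Testing a real shuffler then amounts to comparing its samples against permutations generated this way.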

Given: sample access to the shuffles of a real shuffler. Goal: distinguish: the shuffle is GSR vs far from it.

Markov Chain Testing. Question: how to quantify the distance between Markov chains? Let P, Q be the transition matrices of the two chains. Object of interest: the distance between the distributions they induce on trajectories.

Pertinent question: its asymptotics as the trajectory length grows? Easier to quantify the discrepancy between P and Q row by row, for all states. So the proposed notion of distance is a spectral radius of a matrix built from P and Q. Results: testing symmetric Markov chains with a sublinear number of observed transitions. [Ongoing work with Dikkala, Gravin]

Testing Combinatorial Structure. Efficiently testing combinatorial structure: Is the phylogenetic tree assumption true? (Sapiens-Neanderthal early interbreeding [Slatkin et al.'13].) Is a graphical model a tree? [Ongoing work with Bresler, Acharya.]

Testing from a Single Sample. Given one social network, one brain, etc., how can we test the validity of a certain generative

model? Can we get many samples out of one sample?

INFORMATION THEORY, MACHINE LEARNING, STATISTICS, ALGORITHMS. Thank You!