Neural Networks
Geoff Hulten

The Human Brain (According to a Computer Scientist)
- Network of ~100 billion neurons
- Each has ~1,000 to 10,000 connections
- Neurons send electro-chemical signals
- Activation time ~10 ms
- A chain of ~100 neurons can fire in 1 second
[Image from Wikipedia]

Artificial Neural Network
- Grossly simplified approximation of how the brain works
[Figure: artificial neuron (sigmoid unit)]
- Features are used as input to an initial set of artificial neurons
- Output of artificial neurons is used as input to others

- Output of the network is used as the prediction
- Mid-2010s image processing networks: ~50-100 layers, ~10-60 million artificial neurons

Example Neural Network
- Fully connected network
- Single hidden layer
- 2,313 weights to learn
[Figure: 576 normalized pixels feed a hidden layer of four nodes (1 connection per pixel + bias for each node, 2,308 weights), which feeds an output layer (5 weights) that produces P(y = 1)]

Example of Predicting with Neural Network
[Figure: worked forward pass through a small network; the two hidden nodes output ~0.5 and ~0.75, and the network outputs P(y = 1) = 0.82]

What's Going On?
- Very limited feature engineering on the input
- Hidden nodes learn useful features instead
[Figure annotation: Positive weight?]
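The prediction example can be reproduced as a short forward pass. This is a sketch under assumptions: the inputs (1.0 and 0.5, plus a bias input of 1) and the weights below are illustrative values chosen to land near the slide's rounded outputs (~0.5, ~0.75, 0.82), not a transcription of the original figure, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used by each sigmoid unit."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, inputs):
    """Weighted sum of the inputs (weights[0] is the bias weight), then sigmoid."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return sigmoid(z)


# Assumed inputs and weights (illustrative, not taken verbatim from the slide).
x = [1.0, 0.5]
w_h1 = [0.5, -1.0, 1.0]   # bias, x1, x2 -> weighted sum 0.0
w_h2 = [1.0, 0.5, -1.0]   # bias, x1, x2 -> weighted sum 1.0
w_out = [0.5, 0.5, 1.0]   # bias, h1, h2

h1 = neuron(w_h1, x)              # exactly 0.5
h2 = neuron(w_h2, x)              # about 0.73
y_hat = neuron(w_out, [h1, h2])   # about 0.81
print(round(h1, 2), round(h2, 2), round(y_hat, 2))
```

With the slide's rounded hidden value of 0.75 in place of 0.73, the output node's weighted sum is 1.5 and the sigmoid gives about 0.82.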

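The 2,313 figure on the Example Neural Network slide can be checked by counting one weight per incoming connection plus one bias weight for every node. A minimal sketch, assuming four hidden nodes (inferred from 2,308 = 4 x 577) and a hypothetical helper named `count_weights`:

```python
def count_weights(layer_sizes):
    """Count weights in a fully connected network.

    layer_sizes lists node counts from input to output, e.g. [576, 4, 1].
    Every non-input node has one weight per node in the previous layer
    plus one bias weight.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += (n_in + 1) * n_out
    return total


single_hidden = count_weights([576, 4, 1])    # 2,308 hidden + 5 output
two_hidden = count_weights([576, 4, 4, 1])    # 2,308 + 20 + 5
print(single_hidden, two_hidden)
```

Under the same assumption, a second hidden layer of four nodes adds 20 weights, which matches the 2,333 total quoted for the two-hidden-layer network.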
[Figure: the input image (normalized) beside the weights from hidden node 1 and the weights from hidden node 2, annotated "Negative weight?"; the output P(y = 1) is a logistic regression with the hidden node responses as input]

Another Example Neural Network
- Fully connected network
- Two hidden layers
- 2,333 weights to learn
[Figure: 576 normalized pixels feed the first hidden layer (1 connection per pixel + bias for each node, 2,308 weights), then a second hidden layer (20 weights), then an output layer (5 weights) that produces P(y = 1)]

Single Network (Training Run), Multiple Tasks
- Hidden nodes learn generally useful filters
[Figure: 576 normalized pixels feed a shared hidden layer; the output layer produces several outputs]

Neural Network Architectures/Concepts
- Fully connected layers
- Recurrent networks (LSTM & attention)
- Convolutional layers

- Embeddings
- MaxPooling
- Residual networks
- Activation (ReLU)
- Batch normalization
- Softmax
- Dropout
Will explore in more detail later.

Loss We'll Use for Neural Networks
[Table of example predictions and labels with their loss; extracted values: .5, .1, 1, 0, .1, .95, 1, 1]
- All sorts of options exist for loss functions for neural networks

Optimizing Neural Nets
- Back propagation
- Gradient descent over the entire network's weight vector
- Easy to adapt to different network architectures
- Converges to a local minimum (usually won't find the global minimum)
- Training can be very slow!
  - For this week's assignment... sorry
  - For next week we'll use a package
- In general, very well suited to run on GPU
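The point about converging to a local minimum can be seen on a one-dimensional toy loss. A minimal sketch, assuming a made-up loss f(w) = w^4 - 3w^2 + w (not from the slides) that has a shallow minimum near w = 1.13 and a deeper one near w = -1.30; plain gradient descent ends in whichever basin it starts in.

```python
def loss(w):
    """Toy non-convex loss (an illustrative assumption, not from the slides)."""
    return w**4 - 3 * w**2 + w


def grad(w):
    """Derivative of the toy loss."""
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w, learning_rate=0.01, steps=1000):
    """Repeatedly step against the gradient from the starting point w."""
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


w_right = gradient_descent(2.0)    # settles in the shallow (local) minimum
w_left = gradient_descent(-2.0)    # settles in the deeper minimum
print(round(w_right, 2), round(loss(w_right), 2))
print(round(w_left, 2), round(loss(w_left), 2))
```

Both runs stop where the gradient is zero, but only the run that starts on the left reaches the lower loss. A network's weight vector has many more dimensions, yet the same dependence on the starting point applies.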

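Conceptual Backprop
1. Forward Propagation: figure out how much error the network makes on the sample.
2. Back Propagation: figure out how much each part contributes to the error.
3. Update Weights: step each weight to reduce the error it is contributing to.
[Figure: example network with inputs 1.0 and 0.5 plus a bias input of 1, hidden nodes h1 (~0.5) and h2 (~0.75), and output ~0.82]

Backprop Example
1. Forward Propagation: the network outputs ~0.82, so Error = ~0.18
2. Back Propagation:
   $\delta_o = \hat{y}(1 - \hat{y})(y - \hat{y}) = 0.027$
   $\delta_h = o_h(1 - o_h) \sum_k w_{kh}\,\delta_k$, giving $\delta_{h_2} = .005$
3. Update Weights: learning rate = 0.1
Here $\hat{y}$ is the network output, $y$ is the label, $o_h$ is the output of hidden node $h$, and $w_{kh}$ is the weight from hidden node $h$ to downstream node $k$.
[Figure: the same example network annotated with its weights, the hidden outputs ~0.5 and ~0.75, and the output ~0.82]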
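The numbers in the backprop example can be checked directly. A sketch under assumptions: the target is y = 1 (consistent with the error of ~0.18), the weight from h2 to the output is 1.0, the inputs are 1.0 and 0.5 with a bias input of 1, and 0.1 is the learning rate. These values are read from the surviving figure labels, and the variable names are hypothetical.

```python
# Values read from the example (rounded on the slide).
y = 1.0          # assumed target label
y_hat = 0.82     # network output
h2 = 0.75        # output of hidden node h2
w_h2_out = 1.0   # assumed weight from h2 to the output node
learning_rate = 0.1
inputs_with_bias = [1.0, 1.0, 0.5]   # bias input, x1, x2 (assumed)

# Output node: delta = y_hat * (1 - y_hat) * (y - y_hat)
delta_out = y_hat * (1 - y_hat) * (y - y_hat)

# Hidden node: delta = h * (1 - h) * (weight to output * output delta)
delta_h2 = h2 * (1 - h2) * w_h2_out * delta_out

# Each incoming weight of h2 moves by learning_rate * delta * input.
updates_h2 = [learning_rate * delta_h2 * x for x in inputs_with_bias]

print(round(delta_out, 3))   # 0.027
print(round(delta_h2, 3))    # 0.005
print([round(u, 5) for u in updates_h2])
```

The updates are tiny because each one is the product of a learning rate, a small error term, and an input no larger than 1.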

Backprop Algorithm
- Initialize all weights to a small random number (-0.05 to 0.05)
- While not time to stop, repeatedly loop over the training data:
  - Input a single training sample to the network and calculate the output of every neuron
  - Back propagate the errors from the output to every neuron
  - Update every weight in the network
Stopping criteria:
- # of epochs (passes through the data)
- Training set loss stops going down
- Accuracy on validation data

Backprop with Hidden Layer (or Multiple Outputs)
1. Forward Propagation
2. Back Propagation
3. Update Weights
[Figure: network with inputs 1.0 and 0.5 plus a bias input of 1, and two hidden layers with nodes h1,1, h1,2, h2,1, and h2,2]
$\delta_o = \hat{y}(1 - \hat{y})(y - \hat{y})$
$\delta_h = o_h(1 - o_h) \sum_k w_{kh}\,\delta_k$

Stochastic Gradient Descent
- Gradient descent: calculate the gradient on all samples, then step
- Stochastic gradient descent: calculate the gradient on some samples, then step
- Stochastic can make progress faster (large training set)
- Stochastic takes a less direct path to convergence
- Batch size: N instead of 1
[Figure: per-sample gradient; two panels titled Gradient Descent and Stochastic Gradient Descent]

Local Optimum and Momentum
[Figure: loss plotted against parameters; panels titled Local Optimum and Momentum]
- Why is a local optimum okay? In practice, neural networks overfit.
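The Backprop Algorithm steps, applied one sample at a time (stochastic gradient descent with a batch size of 1), can be sketched in plain Python. Assumptions not taken from the slides: squared-error loss, a single hidden layer of two sigmoid units, the OR function as toy data, a learning rate of 0.5, a fixed 2,000 epochs as the stopping criterion, and all function names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_w, out_w):
    """Return hidden outputs and network output; index 0 of each weight list is the bias."""
    xb = [1.0] + x
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in hidden_w]
    hb = [1.0] + hidden
    y_hat = sigmoid(sum(w * v for w, v in zip(out_w, hb)))
    return hidden, y_hat


def total_loss(data, hidden_w, out_w):
    """Squared error summed over the data (an assumed loss)."""
    return sum(0.5 * (y - forward(x, hidden_w, out_w)[1]) ** 2 for x, y in data)


def train(data, n_hidden=2, learning_rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    # Initialize all weights to small random numbers in (-0.05, 0.05).
    hidden_w = [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    out_w = [rng.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]

    for _ in range(epochs):  # stopping criterion: fixed number of epochs
        for x, y in data:
            # 1. Forward propagation on a single sample.
            hidden, y_hat = forward(x, hidden_w, out_w)
            # 2. Back propagation of the error to every neuron.
            delta_out = y_hat * (1 - y_hat) * (y - y_hat)
            delta_hidden = [h * (1 - h) * out_w[j + 1] * delta_out
                            for j, h in enumerate(hidden)]
            # 3. Update every weight in the network.
            for j, v in enumerate([1.0] + hidden):
                out_w[j] += learning_rate * delta_out * v
            for j, d in enumerate(delta_hidden):
                for i, v in enumerate([1.0] + x):
                    hidden_w[j][i] += learning_rate * d * v
    return hidden_w, out_w


or_data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
           ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
initial_loss = total_loss(or_data, *train(or_data, epochs=0))
final_loss = total_loss(or_data, *train(or_data))
print(initial_loss, final_loss)
```

Calling `train` with `epochs=0` returns the freshly initialized weights, so the two printed values compare the loss before and after training from the same starting point. Accumulating the updates over N samples before applying them would turn this into the batch-size-N variant.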

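Momentum adds a fraction of the previous weight change to the current one, which is how it can carry the search through small dips in the loss. A sketch of one common form, using the error terms $\delta$ from backpropagation; the momentum coefficient $\alpha$ (with $0 \le \alpha < 1$), the learning rate symbol $\eta$, and the step index $n$ are assumptions introduced here, not symbols from the slides:

```latex
\Delta w_{ji}(n) = \eta\,\delta_j\,x_{ji} + \alpha\,\Delta w_{ji}(n-1),
\qquad
w_{ji} \leftarrow w_{ji} + \Delta w_{ji}(n)
```

Here $x_{ji}$ is the input that weight $w_{ji}$ multiplies and $\delta_j$ is the error term of node $j$. Setting $\alpha = 0$ recovers the plain update.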
- Momentum can power through local optima
- Momentum may converge faster (?)

Dead Neurons & Vanishing Gradients
- Neurons can die
- Large weights cause gradients to vanish
- Test: assert if this condition occurs
- What causes this:
  - Poor initialization of weights
  - Optimization that gets out of hand
  - Unnormalized input variables

What Should You Do with Neural Networks?
As a model (similar to others we've learned):
- Fully connected networks
- Few hidden layers (1, 2, 3)
- A few dozen nodes per hidden layer
- Tune # of layers
- Tune # of nodes per layer
- Do some feature engineering
- Be careful of overfitting
- Simplify if not converging
Leveraging recent breakthroughs:
- Understand standard architectures
- Get some GPU acceleration
- Get lots of data
- Craft a network architecture
- More on this next class

Summary of Artificial Neural Networks
- Model that very crudely approximates the way human brains work
- Neural networks learn features (which we might have hand-crafted without them)
- Each artificial neuron is similar to a linear model, with a non-linear activation function
- Many options for network architectures
- Neural networks are very expressive and can learn complex concepts (and overfit)
- Backpropagation is a flexible algorithm to learn neural networks
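The Dead Neurons & Vanishing Gradients advice to assert when the condition occurs could be implemented as a check on the sigmoid's local gradient, o(1 - o), which shrinks toward zero as weighted sums grow. A minimal sketch; the threshold of 1e-3, the example weights, and the function names are arbitrary assumptions:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(z):
    """Derivative of the sigmoid at weighted sum z: o * (1 - o)."""
    o = sigmoid(z)
    return o * (1.0 - o)


def is_saturated(weights, inputs, threshold=1e-3):
    """True when the neuron's local gradient is too small to learn from."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return local_gradient(z) < threshold


inputs = [1.0, 1.0, 0.5]                             # bias input, x1, x2
small = is_saturated([0.05, -0.02, 0.03], inputs)    # small weights
large = is_saturated([10.0, 10.0, 10.0], inputs)     # large weights
print(local_gradient(0.0), small, large)
```

In a training loop, an `assert not is_saturated(...)` on each neuron would flag the problem as soon as weights or unnormalized inputs push a unit into the flat part of the sigmoid.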