Deep Learning for Speech - National Taiwan University


Deep Learning for Speech Recognition
Hung-yi Lee

Outline
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

Conventional Speech Recognition

Machine learning helps speech recognition: given audio X, find the word sequence W~. This is a structured learning problem.

Evaluation function: F(X, W) = P(W|X)

Inference:
W~ = arg max_W F(X, W)
   = arg max_W P(W|X)
   = arg max_W P(X|W) P(W) / P(X)
   = arg max_W P(X|W) P(W)

P(X|W): Acoustic Model; P(W): Language Model

Input Representation
Audio is represented by a vector sequence

X: x1 x2 x3 ... (each xi is a 39-dim MFCC vector)

Input Representation - Splice
To consider some temporal information, splice neighboring MFCC frames into one input vector.
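Splicing can be sketched as follows (the frame count and context size here are illustrative): each frame is concatenated with its neighbors, with the edges padded by repeating the first and last frame.

```python
import numpy as np

def splice(frames, context=4):
    """Concatenate each frame with `context` neighbors on each side.

    frames: (T, d) array of acoustic features (e.g. 39-dim MFCC).
    Returns a (T, (2*context+1)*d) array; the edges are padded by
    repeating the first/last frame.
    """
    T, d = frames.shape
    padded = np.concatenate([np.repeat(frames[:1], context, axis=0),
                             frames,
                             np.repeat(frames[-1:], context, axis=0)])
    return np.stack([padded[t:t + 2 * context + 1].reshape(-1)
                     for t in range(T)])

mfcc = np.random.randn(100, 39)       # 100 frames of 39-dim MFCC
spliced = splice(mfcc, context=4)     # 9 frames -> one 351-dim input vector
```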

Phoneme
Phoneme: the basic unit. Each word corresponds to a sequence of phonemes, given by a lexicon.

Lexicon: "what do you think" -> hh w aa t / d uw / y uw / th ih ng k
Different words can correspond to the same phonemes.

State
Each phoneme corresponds to a sequence of states.

"what do you think"
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: ... t-d+uw d-uw+y uw-y+uw y-uw+th ...

State: t-d+uw1 t-d+uw2 t-d+uw3 d-uw+y1 d-uw+y2 d-uw+y3 ...

State
Each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM), e.g. P(x|t-d+uw1), P(x|d-uw+y3).

Tied-state: states can share the same distribution. In implementation, each state stores a pointer, and e.g. P(x|t-d+uw1) and P(x|d-uw+y3) may point to the same address.

Acoustic Model
W~ = arg max_W P(X|W) P(W)

P(X|W) = P(X|S)
W: "what do you think"; S (state sequence): a b c d e

Assume we also know the alignment h: s1 s2 s3 s4 s5
X: x1 x2 x3 x4 x5

P(X|S, h) = prod_{t=1..T} P(s_t | s_{t-1}) P(x_t | s_t)
with P(s_t | s_{t-1}) the transition probabilities and P(x_t | s_t) the emission probabilities.

Actually, we don't know the alignment.
X: x1 x2 x3 x4 x5

P(X|S) = max_h prod_{t=1..T} P(s_t | s_{t-1}) P(x_t | s_t)
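The max over alignments h is what the Viterbi algorithm computes. A minimal sketch on a toy 3-state left-to-right HMM; all transition and emission numbers below are made up for illustration.

```python
import numpy as np

def viterbi(log_trans, log_emit):
    """Best state alignment through an HMM.

    log_trans: (S, S) log transition probabilities between states.
    log_emit:  (T, S) log emission probabilities log P(x_t | s).
    Returns (best log score, best state index sequence).
    """
    T, S = log_emit.shape
    delta = log_emit[0].copy()               # best score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[from, to]
        back[t] = scores.argmax(axis=0)      # best predecessor of each state
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]             # backtrack from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return float(delta.max()), path[::-1]

# toy 3-state left-to-right HMM over 5 frames (made-up numbers)
log_trans = np.log(np.array([[0.6, 0.4, 1e-9],
                             [1e-9, 0.6, 0.4],
                             [1e-9, 1e-9, 1.0]]))
log_emit = np.log(np.array([[0.8, 0.1, 0.1],
                            [0.7, 0.2, 0.1],
                            [0.2, 0.7, 0.1],
                            [0.1, 0.7, 0.2],
                            [0.1, 0.2, 0.7]]))
score, path = viterbi(log_trans, log_emit)   # path: [0, 0, 1, 1, 2]
```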

(computed with the Viterbi algorithm)

How to use Deep Learning?
People imagine: acoustic features -> DNN -> "Hello". This cannot be true! A DNN can only take fixed-length vectors as input.

What a DNN can do is output P(a|xi), P(b|xi), P(c|xi), ...
DNN input: one acoustic feature xi
DNN output: probability of each state
(size of output layer = number of states)

Low rank approximation
W is the M x N weight matrix between the last hidden layer (size N) and the output layer (size M, the number of states). M can be large if the outputs are the states of tri-phones.
Approximate W ≈ U V with U: M x K, V: K x N, and K < M, N: fewer parameters (M*K + K*N instead of M*N).
This is equivalent to inserting a linear layer of size K between the last hidden layer and the output layer.
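The parameter saving can be checked numerically. A sketch with illustrative sizes, using a truncated SVD to obtain the factors U and V:

```python
import numpy as np

M, N, K = 800, 200, 32             # illustrative: states x hidden size, rank
W = np.random.randn(M, N)          # stand-in for a trained output layer

# truncated SVD gives the best rank-K approximation W ≈ U V
u, s, vt = np.linalg.svd(W, full_matrices=False)
U = u[:, :K] * s[:K]               # M x K
V = vt[:K]                         # K x N

params_full = M * N                # 160,000
params_lowrank = M * K + K * N     #  32,000
print(params_lowrank / params_full)  # 0.2, i.e. 80% fewer parameters
```

In practice the factors are typically obtained from an SVD of the trained W and the factored network is then fine-tuned.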

How we use deep learning
There are three ways to use a DNN for acoustic modeling, requiring increasing effort to exploit deep learning:
Way 1. Tandem
Way 2. DNN-HMM hybrid
Way 3. End-to-end

How to use Deep Learning? Way 1: Tandem

Way 1: Tandem system
The DNN outputs P(a|xi), P(b|xi), P(c|xi), ... (size of output layer = number of states). These outputs are used as a new feature: the input of your original speech recognition system.
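A tandem sketch (the network sizes and weights below are made-up placeholders): frame-level posteriors from the DNN become the feature vectors for the original system.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# toy DNN: 39-dim MFCC -> 128 hidden units -> posteriors over 1000 states
W1, b1 = rng.standard_normal((39, 128)), np.zeros(128)
W2, b2 = rng.standard_normal((128, 1000)), np.zeros(1000)

def dnn_posteriors(x):
    h = np.maximum(0, x @ W1 + b1)        # hidden layer (ReLU)
    return softmax(h @ W2 + b2)           # P(state | x) per frame

frames = rng.standard_normal((50, 39))    # 50 MFCC frames
tandem_features = dnn_posteriors(frames)  # new features for the GMM-HMM
```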

The last hidden layer or a bottleneck layer can also be used as the new feature.

How to use Deep Learning? Way 2: DNN-HMM hybrid

Way 2: DNN-HMM Hybrid
W~ = arg max_W P(W|X) = arg max_W P(X|W) P(W)
P(X|W) = max_h prod_{t=1..T} P(s_t | s_{t-1}) P(x_t | s_t)

The DNN gives P(s|x). By Bayes' rule,
P(x|s) = P(s|x) P(x) / P(s)
P(s|x): from the DNN; P(s): count from training data; P(x) can be dropped, since it is the same for every word sequence.
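Converting the DNN posterior into a scaled likelihood is a one-liner in the log domain; the numbers below are toy values.

```python
import numpy as np

def scaled_log_likelihood(log_posterior, log_prior):
    """log P(x|s) up to a constant: log P(s|x) - log P(s).

    The P(x) term is identical for every word sequence,
    so it can be dropped from the arg max.
    """
    return log_posterior - log_prior

# toy example with 3 states
posterior = np.array([0.7, 0.2, 0.1])   # DNN output P(s|x)
prior = np.array([0.5, 0.3, 0.2])       # counted from training data
ll = scaled_log_likelihood(np.log(posterior), np.log(prior))
best = int(ll.argmax())                 # frequent states are discounted
```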

P(X|W) ≈ max_h prod_{t=1..T} P(s_t | s_{t-1}) P(s_t | x_t; θ) / P(s_t)
P(s_t | x_t; θ): from the DNN
P(s_t | s_{t-1}): from the original HMM
P(s_t): count from training data
This assembled vehicle works.

Way 2: DNN-HMM Hybrid - Sequential Training
W~ = arg max_W P(X|W; θ) P(W)

Given training data (X, W^), fine-tune the DNN parameters θ such that
P(X|W^; θ) P(W^) increases, and
P(X|W; θ) P(W) decreases
(W is any word sequence different from W^).

How to use Deep Learning? Way 3: End-to-end

Way 3: End-to-end - Character
Input: acoustic features (spectrograms)
Output: characters (and space) + null (~)
No phoneme and no lexicon (no OOV problem)

A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv:1412.5567v2, 2014.

Way 3: End-to-end - Character
The network emits one character (or the null ~) per frame; repeated characters are merged and nulls are removed, so frame outputs like "HH~IS~~ ~FRIENDS" collapse to "HIS FRIENDS".
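The collapsing rule (merge repeated symbols, then delete nulls) can be sketched as:

```python
def collapse(frame_outputs, null="~"):
    """Collapse frame-level outputs: merge consecutive repeats, drop nulls."""
    out = []
    prev = None
    for c in frame_outputs:
        if c != prev:          # merge consecutive repeats
            out.append(c)
        prev = c
    return "".join(c for c in out if c != null)

print(collapse("HHH~IIS~~  ~FFRIE~NDS~"))  # HIS FRIENDS
```

Note that the null lets the network output true double letters: "a~a" collapses to "aa", while "aa" collapses to "a".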

Graves, Alex, and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014.

Way 3: End-to-end - Word?
Output layer: one unit per word (apple, bog, cat, ...)

~50k words in the lexicon.
DNN input size = 200 times one acoustic feature; shorter segments are padded with zeros. Other systems are used to get the word boundaries.
Ref: Bengio, Samy, and Georg Heigold, "Word embeddings for speech recognition," Interspeech, 2014.

Why Deep Learning?

Deeper is Better
Word error rate (WER) drops as the number of hidden layers grows.
Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech, 2011.

For a fixed number of parameters, a deep model is clearly better than a shallow one.

What does DNN do?
Speaker normalization is automatically done in the DNN: visualizing the input acoustic features (MFCC), the 1st hidden layer, and the 8th hidden layer shows speaker differences being removed in the deeper layers.
A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," ICASSP, 2012.

What does DNN do?
In ordinary acoustic models, all the states are modeled independently. This is not an effective way to model the human voice: the sound of a vowel is controlled by only a few factors, e.g. tongue position (front/back, high/low). (http://www.ipachart.com/)

What does DNN do?
Reducing the output of a hidden layer to two dimensions places the vowels /i/, /u/, /e/, /o/, /a/ as on the IPA chart (front/back, high/low).
Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz, "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR," Interspeech, 2014.
The lower layers detect the manner of articulation.

All the states share the results from the same set of detectors. This uses the parameters effectively.

Speaker Adaptation
Speaker adaptation: use different models to recognize the speech of different speakers. Ideally, collect the audio data of each speaker and train a DNN model for each speaker.
Challenge: limited data for training. There is not enough data to directly train a per-speaker DNN model, and often not enough even to fine-tune a speaker-independent DNN model.

Categories of Methods (needing progressively less training data): conservative training, transformation methods, speaker-aware training.

Conservative Training
Initialize from a network trained on the audio data of many speakers, then fine-tune with a little data from the target speaker while keeping the adapted network's outputs (or parameters) close to the original network's.

Transformation methods
Add an extra layer with weight matrix Wa between layer i and layer i+1, train it on a little data from the target speaker, and fix all the other parameters. Commonly the extra layer is added between the input and the first layer; with input splicing, an extra layer can be applied to each frame before splicing.
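A sketch of the extra-layer idea with made-up sizes: Wa starts as the identity, so before adaptation the network behaves exactly like the speaker-independent one; only Wa would then be updated.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 39                                   # input feature dimension
W1 = rng.standard_normal((d, 128))       # fixed speaker-independent layer 1

Wa = np.eye(d)                           # extra layer, identity at start

def layer1(x):
    return np.maximum(0, (x @ Wa) @ W1)  # only Wa is speaker-dependent

x = rng.standard_normal(d)
adapted_out = layer1(x)

# with Wa = I the output equals the original speaker-independent network
original_out = np.maximum(0, x @ W1)
```

Adaptation would then apply gradient steps to Wa alone while W1 and all deeper layers stay fixed.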

A larger Wa needs more adaptation data; a smaller Wa can be trained with less data.

Transformation methods - SVD bottleneck adaptation
With the low-rank output layer W = U V (U: M x K, V: K x N), insert a K x K matrix Wa in the linear bottleneck and adapt only Wa. K is usually small, so little adaptation data is needed.

Speaker-aware Training
(Can also be noise-aware or device-aware training.)
From lots of mismatched data (data of Speaker 1, Speaker 2, Speaker 3, ...), extract a fixed-length, low-dimension vector for each speaker as speaker information. Text transcription is not needed for extracting the vectors.

Speaker-aware Training
Training data: each speaker's acoustic features are appended with that speaker's information features. Testing data: the test acoustic features are appended with the test speaker's information. All the speakers use the same DNN model.
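Appending the speaker vector to every frame is simple concatenation; the dimensions below (and the name `ivector`) are illustrative.

```python
import numpy as np

def augment(frames, speaker_vec):
    """Append the same fixed-length speaker vector to every frame."""
    T = frames.shape[0]
    return np.hstack([frames, np.tile(speaker_vec, (T, 1))])

frames = np.random.randn(50, 39)      # one utterance, 39-dim features
ivector = np.random.randn(100)        # fixed-length speaker representation
augmented = augment(frames, ivector)  # (50, 139): same DNN for all speakers
```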

Different speakers are augmented with different features.

Multi-task Learning
The multi-layer structure makes the DNN suitable for multitask learning: tasks A and B share lower layers and have separate output layers, either with a common input feature or with separate input features for task A and task B.

Multitask Learning - Multilingual
One network maps acoustic features through shared hidden layers to separate output layers for the states of French, German, Spanish, Italian, and Mandarin. Human languages share some common characteristics.

[Chart: Character Error Rate vs. hours of Mandarin training data (1 to 1000 hours); training with European languages gives lower error than Mandarin-only.]
Huang, Jui-Ting, et al., "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," ICASSP, 2013.
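The shared-hidden-layer multilingual network can be sketched as one trunk with a per-language output layer; all sizes and weights below are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# shared hidden layers (the trunk, trained on all languages)
W1 = rng.standard_normal((39, 256))
W2 = rng.standard_normal((256, 256))

# one output layer per language, over that language's states
heads = {"mandarin": rng.standard_normal((256, 3000)),
         "french": rng.standard_normal((256, 2500))}

def forward(x, language):
    h = np.tanh(np.tanh(x @ W1) @ W2)    # shared for every language
    return softmax(h @ heads[language])  # language-specific state posteriors

x = rng.standard_normal(39)
p_zh = forward(x, "mandarin")
p_fr = forward(x, "french")
```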

Multitask Learning - Different units
One network on acoustic features with two simultaneous output tasks A and B, e.g.:
A = state, B = phoneme
A = state, B = gender
A = state, B = grapheme (character)

Deep Learning for Acoustic Modeling - New acoustic features

MFCC: Waveform -> DFT -> spectrogram -> filter bank -> log -> DCT -> MFCC, the input of the DNN.

Filter-bank output: stop before the DCT and feed the log filter-bank output to the DNN. Kind of standard now.

Spectrogram: feed the spectrogram (Waveform -> DFT) to the DNN directly. Common today.
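The filter-bank step above can be sketched as triangular filters spaced on the mel scale and applied to the power spectrogram; the filter count, FFT size, and fake waveform below are illustrative.

```python
import numpy as np

def mel(f):                      # Hz -> mel
    return 2595 * np.log10(1 + f / 700)

def inv_mel(m):                  # mel -> Hz
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters, equally spaced on the mel scale."""
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

# waveform -> power spectrogram -> log filter-bank output
wave = np.random.randn(16000)                       # 1 s of fake audio
frames = wave[:400 * 40].reshape(40, 400) * np.hanning(400)
power = np.abs(np.fft.rfft(frames, n=512)) ** 2     # (40, 257)
logfb = np.log(power @ mel_filterbank().T + 1e-10)  # (40, 40) DNN input
```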

The spectrogram input gives a 5% relative improvement over the filter-bank output. Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., "Learning filter banks within a deep neural network framework," ASRU, 2013.

Waveform? If feeding the raw waveform to the DNN succeeded, there would be no need for Signals & Systems. People have tried, but it is not better than the spectrogram yet, so we still need to take Signals & Systems. Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., "Acoustic modeling with deep neural networks using raw time signal for LVCSR," INTERSPEECH, 2014.

Convolutional Neural Network (CNN)

CNN
Speech can be treated as images:

the spectrogram has a frequency axis and a time axis. Replace the DNN by a CNN: the CNN takes the spectrogram as input, like an image, and outputs the probabilities of states.

CNN
Max pooling is applied as in image CNNs; the max can also be taken over groups of units (maxout), e.g. max(a1, a2) and max(b1, b2).
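A maxout activation (max over each group of linear units) can be sketched as:

```python
import numpy as np

def maxout(z, group_size=2):
    """Maxout activation: max over each group of linear units."""
    n = z.shape[-1] // group_size
    return z[..., :n * group_size].reshape(*z.shape[:-1], n, group_size).max(-1)

z = np.array([1.0, -2.0, 0.5, 3.0])   # units a1 a2 b1 b2
print(maxout(z))                       # [1. 3.] -> max(a1, a2), max(b1, b2)
```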

Tóth, László, "Convolutional Deep Maxout Networks for Phone Recognition," Interspeech, 2014.

Applications in Acoustic Signal Processing

DNN for Speech Enhancement
Noisy speech -> DNN -> clean speech, for mobile communication or speech recognition.
Demo for speech enhancement: http://home.ustc.edu.cn/~xuyong62/demo/SE_DNN.html

DNN for Voice Conversion
Female voice -> DNN -> male voice.
Demo for voice conversion: http://research.microsoft.com/en-us/projects/vcnn/default.aspx

Concluding Remarks

Concluding Remarks
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

Thank you for your attention!

More research related to speech (find the lectures related to deep learning in the lecture recordings): speech recognition ("I would like to leave Taipei on November 2nd"), spoken content retrieval, speech summarization, information extraction, computer-assisted language learning, and dialogue ("Hi" / "Hello") all share the same core techniques.
