Reference Assisted Nucleic Acid Sequence Reconstruction from Mass Spectrometry Data

Gabriel Ilie (1), Alex Zelikovsky (2) and Ion Măndoiu (1)
(1) CSE Department, University of Connecticut
(2) CS Department, Georgia State University

MassCLEAVE assay for MS-based nucleic acid sequence analysis

Error model
  • Signed relative errors are assumed to follow a normal distribution with mean 0 and standard deviation $\sigma$ for masses and $\sigma'$ for intensities
  • Two types of error are incurred when matching compomer c to a peak of mass m and intensity i(m):
      • Relative mass error
      • Relative intensity error

Problem formulation
  Given:
  • Mass spectra MS
  • Reference sequence r, including the positions of the PCR primers
  • Maximum edit distance D
  • Standard deviations $\sigma$ and $\sigma'$, tolerance parameter $\epsilon$
  Find: target sequence t flanked by PCR primers that
  a. is within edit distance D of r, and
  b. yields a matching of compomers of CS(t) to masses of MS with minimum total relative error
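The constraint "within edit distance D of r" defines a finite set of candidate targets. As a rough illustration of how quickly that set grows, the sketch below enumerates it by repeatedly applying single substitutions, deletions, and insertions. The function names and the four-letter DNA alphabet are assumptions made for this example, and the PCR primer constraint is ignored.

```python
ALPHABET = "ACGT"  # assumption: DNA alphabet; the slides do not fix one


def neighbors(seq):
    """Return every sequence at edit distance <= 1 from seq (seq included)."""
    out = {seq}
    for i in range(len(seq) + 1):
        for b in ALPHABET:
            out.add(seq[:i] + b + seq[i:])          # insertion before position i
        if i < len(seq):
            out.add(seq[:i] + seq[i + 1:])          # deletion of position i
            for b in ALPHABET:
                out.add(seq[:i] + b + seq[i + 1:])  # substitution at position i
    return out


def within_distance(ref, d):
    """Return every sequence within edit distance d of ref.

    Breadth-first expansion: after k rounds, `seen` holds exactly the
    sequences reachable from ref with at most k single-base edits.
    """
    seen = {ref}
    frontier = {ref}
    for _ in range(d):
        frontier = set().union(*(neighbors(s) for s in frontier)) - seen
        seen |= frontier
    return seen


if __name__ == "__main__":
    ref = "ACGTACGTAC"
    for d in range(4):
        print(d, len(within_distance(ref, d)))
```

Running the script prints the candidate count for D = 0 to 3 on a short reference; the counts rise steeply with D even at this length.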

Naive Algorithm
  • Exhaustive search:
      • Generate all sequences within an edit distance of D of the reference, and
      • Compute the minimum total relative error for matching the compomers of each of these sequences to the masses in MS
  • The number of candidate sequences grows exponentially with D

3-Stage Algorithm
  1. Identify regions of the reference sequence that are unambiguously supported by MS data
      • High probability to be present in the unknown target sequence
  2. Branch-and-bound approach to fill in remaining gaps
      • Generates a set of candidate sequences with compomers supported by MS data
  3. Compute candidate sequences with minimum total relative error
      • Min-cost flow problem, currently solved as a linear program
      • With or without intensities

First stage: finding strongly supported regions of the reference
  • Chebyshev's inequality: $P(|X - \mu| \ge k\sigma) \le 1/k^2$
  • A detectable compomer $c \in CS(s)$ is strongly matched to mass $m \in MS(s)$ if the absolute relative mass error of the match is at most $\sigma / \epsilon^{0.5}$, a bound set based on a user-specified tolerance $\epsilon$
  • A strong match between compomer c and mass m is unambiguous if:
      • c has multiplicity of 1 in the reference
      • c can be strongly matched only to m
      • m can be strongly matched only to c
  • The set M of unambiguous matches can be found efficiently by binary search
  • $(c_1, m_1), \ldots, (c_n, m_n)$ = unambiguous matches for a given cut base, indexed in non-decreasing order of relative errors
  • We iteratively apply Chebyshev's inequality with tolerance $\epsilon$ to the running means of signed relative errors, which are normally distributed with mean 0 and standard deviation $\sigma / i^{0.5}$
  • If Chebyshev's inequality fails for index i, match $(c_i, m_i)$ is removed from M
  • A position in the reference sequence has strong support if:
      • All detectable compomers overlapping it can be strongly matched, and
      • At least one of these matches is in M (unambiguous and not removed)
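The first-stage tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `sigma` is the standard deviation of relative mass errors, `eps` is the tolerance parameter, the relative error is assumed to be measured against the theoretical compomer mass, and a match removed by the running-mean test is assumed not to contribute to later running means. All function names are hypothetical.

```python
import bisect
import math


def chebyshev_bound(std, eps):
    """Deviation that a zero-mean variable with this std exceeds with probability <= eps."""
    return std / math.sqrt(eps)


def signed_relative_error(compomer_mass, peak_mass):
    # Assumption: error is relative to the theoretical compomer mass.
    return (peak_mass - compomer_mass) / compomer_mass


def strong_peaks(sorted_peaks, compomer_mass, sigma, eps):
    """Peaks a compomer can be strongly matched to, located by binary search."""
    delta = chebyshev_bound(sigma, eps) * compomer_mass
    lo = bisect.bisect_left(sorted_peaks, compomer_mass - delta)
    hi = bisect.bisect_right(sorted_peaks, compomer_mass + delta)
    return sorted_peaks[lo:hi]


def filter_matches(matches, sigma, eps):
    """Apply the running-mean Chebyshev test to unambiguous matches.

    matches: list of (compomer_mass, peak_mass) pairs.
    Returns the matches that survive, in non-decreasing order of |error|.
    """
    ordered = sorted(matches, key=lambda cm: abs(signed_relative_error(*cm)))
    kept, total = [], 0.0
    for c, m in ordered:
        err = signed_relative_error(c, m)
        n = len(kept) + 1
        mean = (total + err) / n
        # The mean of n errors has standard deviation sigma / sqrt(n).
        if abs(mean) <= chebyshev_bound(sigma / math.sqrt(n), eps):
            kept.append((c, m))
            total += err
    return kept


if __name__ == "__main__":
    peaks = [998.0, 1000.5, 1003.0]
    print(strong_peaks(peaks, 1000.0, sigma=1e-4, eps=0.01))
    biased = [(1000.0, 1000.6), (2000.0, 2001.4), (5000.0, 5004.0)]
    print(filter_matches(biased, sigma=1e-4, eps=0.01))
```

In the second call all three errors have the same sign, so the running mean drifts away from 0 and the last match is dropped even though it passes the single-match test on its own.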

Positions in PCR primers are automatically marked as having strong support.

Second stage: generating candidate targets by branch-and-bound
  • Reference regions with strong support are assumed to be present in the target
  • Gaps are filled one base at a time, in left-to-right order, using branch-and-bound
  • Choice order: reference base, substitutions, deletion, insertions
  • Chebyshev test with tolerance $\epsilon$ applied to running means of signed relative errors of closest matches
  • Search is pruned when the test fails or when more than D mutations are needed

Third stage: scoring candidates by linear programming
  • Objective: minimize total relative error
  • Variables: for each $c \in CS$ and $m \in MS$, $x_{c,m}$ is set to 1 if c is matched to m, and 0 otherwise (integrality follows from total unimodularity)
  • Constraints:
      • No missing peaks: each detectable compomer $c \in CS(t)$ must be matched to one mass in MS
      • No extraneous peaks: each mass $m \in MS$ must be matched to at least one detectable compomer $c \in CS(t)$
  • Two variants: LP w/o intensities and LP with intensities
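For reference, the objective and constraints listed for the third stage can be written out as a linear program for the variant without intensities. This is a reconstruction from the stated bullet points rather than the slide's original formula; $e(c,m)$, the absolute relative mass error incurred by matching compomer $c$ to mass $m$, is notation assumed here, and $CS(t)$ is understood as the set of detectable compomers of candidate $t$.

```latex
\begin{aligned}
\min \quad & \sum_{c \in CS(t)} \sum_{m \in MS} e(c,m)\, x_{c,m} \\
\text{s.t.} \quad & \sum_{m \in MS} x_{c,m} = 1 && \forall c \in CS(t) \quad \text{(no missing peaks)} \\
& \sum_{c \in CS(t)} x_{c,m} \ge 1 && \forall m \in MS \quad \text{(no extraneous peaks)} \\
& 0 \le x_{c,m} \le 1 && \forall c \in CS(t),\ m \in MS
\end{aligned}
```

Because the slides state that integrality follows from total unimodularity, the relaxation $0 \le x_{c,m} \le 1$ is used in place of an explicit 0/1 constraint.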

Simulation setup
  • Reference length: 100-500 bp
  • Reference sequences/targets:
      • D=1: 10 random references, all sequences at edit distance 1 used as targets
      • D=2,3: 100 random reference-target pairs
  • Error-free MS data: $\sigma = \sigma' = 0$
  • Noisy MS data: $\sigma = 0.0001$, $\sigma' = 0$ to $1$
  • Tolerance parameter: $\epsilon = 0.01$

Precision and Recall
  The actual target is compared with the predicted target(s):
  • tp (true positive): prediction is unique and correct
  • fn (false negative): prediction is not unique
  • fp (false positive): prediction is unique and incorrect

Branch-and-bound vs. Naive (F-measure for D=1, error-free data, w/o intensities)
  [Figure: F-measure (65% to 100%) vs. reference length (100 to 500 bp). Series: 1 substitution, 1 deletion, and 1 insertion, each for Branch-and-Bound and for Naive.]

Branch-and-bound speed-up (D=1, error-free data, w/o intensities)

  Length   Naive    Branch-and-Bound   Speed-up
  100      18.66    0.06               307X
  150      34.95    0.08               429X
  200      49.65    0.12               418X
  250      65.72    0.16               418X
  300      82.60    0.19               430X
  350      100.25   0.25               404X
  400      120.19   0.33               368X
  450      139.90   0.50               278X
  500      161.71   0.52               314X

Results on noisy data (F-measure, D=1, $\sigma = 0.0001$, w/o intensities)
  [Figure: F-measure (70% to 100%) vs. reference length (100 to 500 bp). Series: 1 substitution, 1 deletion, and 1 insertion, each with $\sigma = 0$, $\epsilon = 0$ and with $\sigma = 0.0001$, $\epsilon = 0.01$.]

Effect of the number of mutations (F-measure, $\sigma = 0.0001$, w/o intensities)
  [Figure: F-measure (20% to 100%) vs. reference length (100 to 500 bp). Series: 1, 2, and 3 substitutions; 1, 2, and 3 deletions; 1, 2, and 3 insertions; all with $\sigma = 0.0001$, $\epsilon = 0.01$.]

Do intensities help? (F-measure, $\sigma = 0.0001$, 1 substitution)
  [Figure: F-measure (84% to 98%) vs. reference length (100 to 500 bp). Series: $\sigma' = 0$, $0.15$, $0.25$, $0.35$, $0.5$, $1$, and without intensities.]

Do intensities help? (F-measure, $\sigma = 0.0001$)
  [Figure: F-measure (50% to 100%) vs. reference length (100 to 500 bp). Series: 1, 2, and 3 substitutions, each with $\sigma' = 0.35$ and without intensities.]
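The F-measure reported in the plots can be derived from the tp, fp, and fn outcomes defined under Precision and Recall. The sketch below assumes the usual definition (harmonic mean of precision and recall) and treats an empty prediction set as "not unique"; neither detail is spelled out in the slides, and the function names are hypothetical.

```python
def classify(actual, predicted):
    """Label one experiment as 'tp', 'fp', or 'fn'.

    predicted: iterable of candidate target sequences returned by the algorithm.
    Assumption: zero predictions count as a non-unique prediction (fn).
    """
    preds = set(predicted)
    if len(preds) != 1:
        return "fn"                      # prediction is not unique
    return "tp" if actual in preds else "fp"


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    runs = [
        ("ACGT", ["ACGT"]),              # unique and correct
        ("ACGT", ["ACGA"]),              # unique and incorrect
        ("ACGT", ["ACGT", "ACGA"]),      # not unique
    ]
    counts = {"tp": 0, "fp": 0, "fn": 0}
    for actual, predicted in runs:
        counts[classify(actual, predicted)] += 1
    print(counts, f_measure(**counts))
```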

Ongoing Work
  • Experiments on EPLD clone data
  • Branch-and-bound relaxation + penalty in LP objective to handle missing/extraneous peaks
  • Intensity data normalization: correct for mass and base composition effects
