The Variant Call Format

Week 2: VCF files
GEN 8900-Computational Genomics, Fall 2017

Outline of Today's Class
I. Next-Gen (Illumina) Library Construction and Sequencing
II. Overview of a typical data processing pipeline
III. How to call variants to generate VCF files
IV. Understanding the information within a VCF
V. Exercises: 1) read a VCF file into R, 2) count the genotypes, 3) calculate % heterozygosity, and 4) convert to an alternate format

Constructing an Illumina Library
1. Start with genomic DNA.
2. Randomly fragment the DNA (usually with sonication).
3. Add forward and reverse adapters.
4. Size selection.
5. PCR amplification.
6. Final library.
[Figure: a ~500 bp library fragment flanked by adapters. Labels: flow cell binding site, forward primer site, reverse primer site, overhang.]

Sequencing-By-Synthesis
Library: Illumina flow cell with 8 lanes.
- The flow cell is pre-coated with small DNA oligos.
- Fragments in the library are bound to the oligos on the flow cell (via the recognition sequence on the ends of the adapters).
- In situ amplification creates clusters of identical copies of each fragment.
- Labeled nucleotides (A, C, G, T) are added.
[Figure: cross section of one cluster (notice there is a mutation/PCR error at one position) and an overhead view of the same cluster.]

Output for the cluster:
  Base   Prob. Wrong
  T      <1%
  A      <1%
  T      <1%
  G      30%

Illumina Output: Fastq Format
Single cluster, sequencing-by-synthesis:
  Base   Prob.    Phred Score   Code
  T      0.01%    40            I
  A      0.1%     30            ?
  T      0.05%    33            B
  G      30%      5             &
  A      1%       20            5
  C      5%       13            .

Phred Score (Q) = -10 x log10(Prob. of Error)
The code comes from a subset of ASCII characters. The table uses the common Phred+33 encoding, in which the code is the ASCII character with value Q + 33.
Note that different versions of Illumina use slightly different sets of codes.

Fastq format (.fq or .fastq): a text file with 4 lines per sequence.
  @Seqname:Flowcell:Lane:X1:Y1
  TATGAC
  +Seqname:Flowcell:Lane:X1:Y1
  I?B&5.
  @Seqname:Flowcell:Lane:X2:Y2
  AAAGGG
  +Seqname:Flowcell:Lane:X2:Y2
  HH??AB
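The Phred relationship and the 4-line Fastq layout can be checked with a few lines of code. The following is a minimal sketch in Python (the class exercises themselves use R). It assumes the Phred+33 encoding; the function names are illustrative and do not come from any particular library.

```python
import math

PHRED_OFFSET = 33  # assumption: Phred+33 encoding; older Illumina output used other offsets


def error_prob_to_phred(p):
    """Phred score Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def phred_to_char(q, offset=PHRED_OFFSET):
    """ASCII code character for a Phred score."""
    return chr(q + offset)


def decode_quality(qual, offset=PHRED_OFFSET):
    """Convert a Fastq quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qual]


def read_fastq(lines):
    """Yield (name, sequence, scores) tuples; each record is exactly 4 lines."""
    records = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(records) - 3, 4):
        name, seq, _plus, qual = records[i:i + 4]
        yield name[1:], seq, decode_quality(qual)  # drop the leading '@'


example = [
    "@Seqname:Flowcell:Lane:X1:Y1",
    "TATGAC",
    "+Seqname:Flowcell:Lane:X1:Y1",
    "I?B&5.",
]
for name, seq, scores in read_fastq(example):
    print(name, seq, scores)
```

A base with a 30% chance of being wrong gets Q = 5, which is why low scores are usually trimmed before mapping.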

Data Processing Pipeline
1. RawData.fq (FASTQ)
2. Quality check: FastQC
3. Trimming: Trimmomatic (remove leftover adapters)
4. CleanData.fq (FASTQ)
5. Map to a reference: BWA, NAST, Bowtie2, soap2, Gmap
6. AlignedReads.sam (SAM/BAM; also .msa, .bed, .psl)
7. Find variant sites between individual aligned files: GATK, Samtools, varFilter, freeBayes
   - Single Nucleotide Polymorphisms (SNPs)
   - Insertions/Deletions (InDels)
8. VCF

SNP Calling
[Figure: mapped reads from Samples 1, 2, and 3 aligned under the reference genome, with variant positions SNP1 to SNP5 marked.]

  SNP Position   Sample 1   Sample 2   Sample 3
  1              0/0        0/0        1/1
  2              1/1        0/1        0/0
  3              0/0        0/1        0/0
  4              ./.        0/0        1/1

VCF Files
At its core, a VCF file is just a tab-delimited text file.

  ##fileformat=VCFv4.2
  ##FORMAT=
  ##FORMAT=
  ##FORMAT=
  #CHROM  POS      ID       REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMP001      SAMP002
  20      1291018  rs11449  G    A    20    PASS    .     GT      0/0          0/1
  20      2300608  rs84825  C    T    30    PASS    .     GT:GP   0/1:.        0/1:0.03,0.97,0
  20      2301308  rs84823  T    G    30    PASS    .     GT:PL   1/1:26,3,0   1/1:10,5,0

## denotes a meta-information line. These lines can define the FILTER, INFO, and FORMAT terms, depending on what program created the VCF file (so not all VCF files are exactly the same!). The first line will always specify which VCF version a file is.

# denotes the header line. The first NINE columns should always be the same for every VCF (unless you have a really old version). Then, there will be one column for every individual in your sample (i.e. these columns will change for each data set). The names for these columns are usually taken from your input file names.

The remaining rows have the information about each SNP position, with 1 row per VARIANT site (i.e. sites with data, but no observed differences, are NOT in the VCF file by default!).
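The three kinds of lines (## meta-information, # header, and data rows) can be separated with simple prefix checks. This is a minimal Python sketch (the class exercises use R); the function name `read_vcf` and the example constant are illustrative, and the sketch assumes a well-formed, tab-delimited, uncompressed file.

```python
def read_vcf(lines):
    """Split VCF text into meta lines, header fields, sample names, and data rows."""
    meta, header, rows = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):      # meta-information line (test this prefix first)
            meta.append(line)
        elif line.startswith("#"):     # the single header line
            header = line[1:].split("\t")
        else:                          # one row per variant site
            rows.append(line.split("\t"))
    if header is None:
        raise ValueError("no #CHROM header line found")
    samples = header[9:]               # everything after the nine fixed columns
    return meta, header, samples, rows


EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMP001\tSAMP002\n"
    "20\t1291018\trs11449\tG\tA\t20\tPASS\t.\tGT\t0/0\t0/1\n"
    "20\t2300608\trs84825\tC\tT\t30\tPASS\t.\tGT:GP\t0/1:.\t0/1:0.03,0.97,0\n"
    "20\t2301308\trs84823\tT\tG\t30\tPASS\t.\tGT:PL\t1/1:26,3,0\t1/1:10,5,0\n"
)

meta, header, samples, rows = read_vcf(EXAMPLE_VCF.splitlines())
print(samples)    # ['SAMP001', 'SAMP002']
print(len(rows))  # 3
```

Checking for "##" before "#" matters, because every meta-information line also starts with a single "#".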

CHROM: The chromosome (or scaffold) where the SNP is located. Comes from the names within your .fasta reference genome file.

POS: The position of the SNP within the chromosome (positions start at 1).

ID: A database ID for each SNP (if there is one). Often this may be blank: .

REF: The allele for the reference genome at the SNP position. If the position has an InDel mutation, the REF may be a string instead of a single letter.

ALT: The alternate (SNP) allele found at the position. If there are more than 2 alleles at a site, ALT will have a comma-delimited list of all possible alleles.

QUAL: The quality or likelihood score given to the site by the program used to call variants. Often a phred-scaled score, but sometimes a ln(likelihood) or other score in a very different range.

FILTER: If you run a filter on the VCF after calling the SNPs, this will say whether each SNP passed or failed (rather than deleting SNPs that fail the filter).

INFO: Often this field is blank (.), unless you have run some additional analysis, such as annotation prediction.
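The fixed columns described above can be turned into typed values, with "." mapped to a missing value and ALT split on commas. The following Python sketch is illustrative only (the class exercises use R): `Site`, `parse_site`, and `is_indel` are hypothetical names, and the InDel check is a simple length heuristic that does not handle symbolic alleles.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    chrom: str
    pos: int                 # 1-based position
    id: Optional[str]        # None when the column is "."
    ref: str
    alt: List[str]           # one entry per alternate allele
    qual: Optional[float]
    filter: Optional[str]
    info: Optional[str]


def parse_site(fields):
    """Build a Site from the first eight columns of a split VCF data row."""
    def missing(value):
        return None if value == "." else value

    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    return Site(
        chrom=chrom,
        pos=int(pos),
        id=missing(vid),
        ref=ref,
        alt=[] if alt == "." else alt.split(","),
        qual=None if qual == "." else float(qual),
        filter=missing(flt),
        info=missing(info),
    )


def is_indel(site):
    """Heuristic (assumption): any ALT allele whose length differs from REF."""
    return any(len(a) != len(site.ref) for a in site.alt)


row = "20\t1291018\trs11449\tG\tA\t20\tPASS\t.\tGT\t0/0\t0/1".split("\t")
site = parse_site(row)
print(site.pos, site.alt, site.qual, site.info)
```

QUAL is kept as a float because its scale depends on the variant caller, so any threshold should be chosen per program.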

The FORMAT column tells us exactly what fields we can expect in each of our sample columns. Each field is separated by a : and the fields are typically defined in the meta-information lines. Different programs can return different information, but the piece we are most interested in is the GT field, which is the actual genotype.

For a diploid sample, individual genotypes are given in the form allele1/allele2:
  0 = reference allele
  1 = alternate allele
  0/0 = homozygous ref; 0/1 = heterozygous; 1/1 = homozygous alt

A ./. means that the genotype is missing for that individual.
If a site is multi-allelic, then there will be additional encodings (e.g. 0/2, 2/2, etc.).
If a sample is polyploid, the genotype will give all of the alleles: 0/0/1/1 (tetraploid).

VCF Files and R
Since the VCF format is essentially a text file, it is easily readable by R. The things to watch out for are the special characters:
- VCF files have ## and # headers, which most computing languages do NOT differentiate by default.
- The use of . as a missing data character can trip up some regular expression searches.
- The use of | in phased VCF files can also mess up regular expression searches.

It is also important to keep in mind that it is almost impossible to write a script that will work correctly with every version of the VCF format; early versions in particular might cause problems. This is also true of software with dedicated teams of programmers (like GATK): a change in version can break certain package functions! So, don't feel bad; just be aware of the issue!

A very detailed guide to VCF files can be found here: https://samtools.github.io/hts-specs/VCFv4.1.pdf
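Counting genotypes and computing % heterozygosity comes down to pulling GT out of each sample column and splitting it on either separator. The sketch below is in Python rather than R, uses illustrative function names, and assumes that % heterozygosity means heterozygous calls divided by all non-missing calls. Putting / and | inside a character class sidesteps the regular-expression pitfalls with | and . noted above.

```python
import re
from collections import Counter


def extract_gt(format_field, sample_field):
    """Pair FORMAT keys with the sample's values and return the GT entry."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values)).get("GT", ".")


def classify(gt):
    """Label a genotype string as hom_ref, het, hom_alt, or missing."""
    alleles = re.split(r"[/|]", gt)   # '/' is unphased, '|' is phased
    if "." in alleles:                # literal comparison, not a regex wildcard
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def genotype_summary(rows):
    """Count genotype classes over all samples and sites; return (counts, % het)."""
    counts = Counter()
    for fields in rows:
        fmt = fields[8]               # FORMAT is the ninth column
        for sample in fields[9:]:
            counts[classify(extract_gt(fmt, sample))] += 1
    called = sum(counts.values()) - counts["missing"]
    pct_het = 100.0 * counts["het"] / called if called else float("nan")
    return counts, pct_het


EXAMPLE_ROWS = [
    "20\t1291018\trs11449\tG\tA\t20\tPASS\t.\tGT\t0/0\t0/1".split("\t"),
    "20\t2300608\trs84825\tC\tT\t30\tPASS\t.\tGT:GP\t0/1:.\t0/1:0.03,0.97,0".split("\t"),
    "20\t2301308\trs84823\tT\tG\t30\tPASS\t.\tGT:PL\t1/1:26,3,0\t1/1:10,5,0".split("\t"),
]
counts, pct_het = genotype_summary(EXAMPLE_ROWS)
print(dict(counts), pct_het)
```

On the three example rows this gives 1 homozygous ref, 3 heterozygous, and 2 homozygous alt calls, so 50% of called genotypes are heterozygous.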
