author = {Li, Heng and Handsaker, Bob and Wysoker, Alec and Fennell, Tim and Ruan, Jue and Homer, Nils and Marth, Gabor and Abecasis, Goncalo and Durbin, Richard},
41
+
year = {2009},
42
+
month = jun,
43
+
pages = {2078--2079}
44
+
}
45
+
46
+
@article{DePristo2011,
47
+
title = {A framework for variation discovery and genotyping using next-generation DNA sequencing data},
48
+
volume = {43},
49
+
ISSN = {1546-1718},
50
+
url = {http://dx.doi.org/10.1038/ng.806},
51
+
DOI = {10.1038/ng.806},
52
+
number = {5},
53
+
journal = {Nature Genetics},
54
+
publisher = {Springer Science and Business Media LLC},
55
+
author = {DePristo, Mark A and Banks, Eric and Poplin, Ryan and Garimella, Kiran V and Maguire, Jared R and Hartl, Christopher and Philippakis, Anthony A and del Angel, Guillermo and Rivas, Manuel A and Hanna, Matt and McKenna, Aaron and Fennell, Tim J and Kernytsky, Andrew M and Sivachenko, Andrey Y and Cibulskis, Kristian and Gabriel, Stacey B and Altshuler, David and Daly, Mark J},
56
+
year = {2011},
57
+
month = apr,
58
+
pages = {491--498}
59
+
}
60
+
61
+
@article{Cingolani2012,
62
+
title = {A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3},
63
+
volume = {6},
64
+
ISSN = {1933-6942},
65
+
url = {http://dx.doi.org/10.4161/fly.19695},
66
+
DOI = {10.4161/fly.19695},
67
+
number = {2},
68
+
journal = {Fly},
69
+
publisher = {Informa UK Limited},
70
+
author = {Cingolani, Pablo and Platts, Adrian and Wang, Le Lily and Coon, Melissa and Nguyen, Tung and Wang, Luan and Land, Susan J. and Lu, Xiangyi and Ruden, Douglas M.},
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
8
+
This allows for identification of differences between the genome from your sample and the reference genome.
9
+
This information can be used, for example, for comparative analyses such as phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015] ([https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358)).
10
+
11
+
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
12
+
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
13
+
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
14
+
15
+
We will be using the Burrows-Wheeler Aligner ([@Li2010] – [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)).
16
+
There are different algorithms implemented for different types of data (e.g. different read lengths).
17
+
Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100 bp.
18
+
Other algorithms are _bwa mem_ and _bwa bwasw_ (BWA-SW), both intended for longer reads.
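
As a rough orientation, a single-end BWA backtrack run usually consists of three commands: `bwa index`, `bwa aln`, and `bwa samse`. The sketch below only prints these commands instead of running them. The helper name `print_mapping_plan`, the file names, and the values given for `-n` (allowed mismatches) and `-l` (seed length) are placeholders chosen for illustration, not the settings used in this session.

```bash
# Print (do not run) the commands of a single-end bwa aln workflow.
# Arguments: reference FASTA, FASTQ file, output prefix, -n value, -l value.
print_mapping_plan() {
  local ref="$1" reads="$2" prefix="$3" n="$4" l="$5"
  # 1. Build the index of the reference genome.
  echo "bwa index ${ref}"
  # 2. Find alignment coordinates; -n and -l control mapping strictness.
  echo "bwa aln -n ${n} -l ${l} ${ref} ${reads} > ${prefix}.sai"
  # 3. Convert the coordinates into alignments in SAM format.
  echo "bwa samse ${ref} ${prefix}.sai ${reads} > ${prefix}.sam"
}

# Hypothetical file names and parameter values, for illustration only.
print_mapping_plan reference.fasta sample.fastq.gz sample_strict 0.1 32
```

Printing the plan first makes it easy to compare parameter sets side by side before committing to a run that may take several minutes.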
19
+
20
+
Your learning objectives:
21
+
22
+
1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
23
+
2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
24
+
3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis with MultiVCFAnalyzer using its standard settings. You will be able to inspect results in the IGV viewer.
25
+
4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the output of the genotyping tool MultiVCFAnalyzer.
26
+
5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
One way of reconstructing genomic information from DNA sequencing reads is mapping/aligning them to a reference genome. This allows for identification of differences between the genome from your sample and the reference genome. This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
27
-
28
-
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets. We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal. These commands usually run in the background when you apply DNA sequencing data processing pipelines.
29
-
30
-
We will be using the Burrows-Wheeler Aligner
31
-
(Li et al. 2009 – [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)). There are
32
-
different algorithms implemented for different types of data (e.g. different read lengths).
33
-
Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
34
-
Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
35
-
36
46
## Reference Genome
37
47
38
-
For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species. The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
48
+
For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
49
+
The selection of the correct reference genome is highly relevant. For example, if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
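
Before mapping, it can help to check what a reference FASTA file actually contains. The following is a minimal sketch on a toy file: the sequences and the helper names `count_records` and `total_bases` are made up for illustration and are not part of the course data.

```bash
# Count sequence records: every FASTA record starts with a '>' header line.
count_records() {
  grep -c '^>' "$1"
}

# Sum the lengths of all non-header lines (total number of bases).
total_bases() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Toy reference with two records (hypothetical sequences).
toy_ref="$(mktemp)"
cat > "$toy_ref" <<'EOF'
>contig_1 toy chromosome
ACGTACGTAC
GTACGT
>contig_2 toy plasmid
TTGACA
EOF

echo "records: $(count_records "$toy_ref")"
echo "bases:   $(total_bases "$toy_ref")"
```

The same two functions can be pointed at a real reference file to see, for instance, whether plasmids are included as separate records.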
39
50
Reference genomes can be retrieved from comprehensive databases such as [NCBI](https://www.ncbi.nlm.nih.gov/).
40
51
41
52
In your directory, you can find 2 samples and your reference.
@@ -122,7 +133,7 @@ Convert SAM file to binary format (BAM file):
The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
172
-
For this we use the _Genome Analysis Toolkit (GATK)_ (DePristo et al. 2011 – [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))
183
+
For this we use the _Genome Analysis Toolkit (GATK)_ ([@DePristo2011] – [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/)).
173
184
174
185
It uses the reference genome and the mapping as input and produces an output in _Variant Call Format (VCF)_ ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
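
To get a feeling for the VCF layout, here is a small sketch on a hand-written toy file. A VCF consists of header lines starting with `#` followed by tab-separated records whose columns include `POS`, `REF` and `ALT`. The records and the helper names `count_vcf_records` and `list_alt_calls` are assumptions made for illustration; real GATK output carries many more header lines and INFO fields.

```bash
# Count data records, i.e. all lines that are not header lines.
count_vcf_records() {
  grep -vc '^#' "$1"
}

# Print POS, REF and ALT for records that carry an alternative allele
# (an ALT of "." means no alternative allele was called at that position).
list_alt_calls() {
  awk -F'\t' '!/^#/ && $5 != "." { print $2 "\t" $4 "\t" $5 }' "$1"
}

# Toy VCF with three records (hypothetical positions and alleles).
toy_vcf="$(mktemp)"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'contig_1\t120\t.\tA\tG\t50\tPASS\t.\n'
  printf 'contig_1\t121\t.\tC\t.\t40\tPASS\t.\n'
  printf 'contig_1\t3050\t.\tC\tT\t60\tPASS\t.\n'
} > "$toy_vcf"

echo "records: $(count_vcf_records "$toy_vcf")"
list_alt_calls "$toy_vcf"
```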
175
186
@@ -195,6 +206,8 @@ Let's now continue with mapping and genotyping for the other samples and paramet
195
206
196
207
::: {.callout-note}
197
208
This is a larger file and lenient mapping takes longer, so this step will likely take a few minutes. If you are short on time, proceed with the other sample/parameter settings first and come back to this later if there is time.
209
+
210
+
The entire code block can be copied as it is and executed. It’s composed of all the steps we executed individually earlier.
198
211
:::
199
212
200
213
```bash
@@ -270,7 +283,7 @@ In order to combine the results from multiple samples and parameter settings we
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ (Cingolani et al. 2012 - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
286
+
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ ([@Cingolani2012] - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
274
287
275
288
Run _MultiVCFAnalyzer_ on all 4 files at once.
276
289
First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference.
459
472
- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics.
460
473
- The chosen mapping parameters can have a strong influence on the results of any downstream analysis.
461
-
- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.
474
+
- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.