Commit cb4c2f1

Merge pull request #98 from SPAAM-community/Tessa_mapping_markdown
mapping markdown - 2024 summer school version
2 parents 293a81c + 5a28a20 commit cb4c2f1

2 files changed: +108 -19 lines changed

assets/references/genome-mapping.bib

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
+@article{Reinert2015,
+title = {Alignment of Next-Generation Sequencing Reads},
+volume = {16},
+ISSN = {1545-293X},
+url = {http://dx.doi.org/10.1146/annurev-genom-090413-025358},
+DOI = {10.1146/annurev-genom-090413-025358},
+number = {1},
+journal = {Annual Review of Genomics and Human Genetics},
+publisher = {Annual Reviews},
+author = {Reinert, Knut and Langmead, Ben and Weese, David and Evers, Dirk J.},
+year = {2015},
+month = aug,
+pages = {133–151}
+}
+
+@article{Li2010,
+title = {Fast and accurate long-read alignment with Burrows–Wheeler transform},
+volume = {26},
+ISSN = {1367-4803},
+url = {http://dx.doi.org/10.1093/bioinformatics/btp698},
+DOI = {10.1093/bioinformatics/btp698},
+number = {5},
+journal = {Bioinformatics},
+publisher = {Oxford University Press (OUP)},
+author = {Li, Heng and Durbin, Richard},
+year = {2010},
+month = jan,
+pages = {589–595}
+}
+
+@article{Li2009,
+title = {The Sequence Alignment/Map format and SAMtools},
+volume = {25},
+ISSN = {1367-4803},
+url = {http://dx.doi.org/10.1093/bioinformatics/btp352},
+DOI = {10.1093/bioinformatics/btp352},
+number = {16},
+journal = {Bioinformatics},
+publisher = {Oxford University Press (OUP)},
+author = {Li, Heng and Handsaker, Bob and Wysoker, Alec and Fennell, Tim and Ruan, Jue and Homer, Nils and Marth, Gabor and Abecasis, Goncalo and Durbin, Richard},
+year = {2009},
+month = jun,
+pages = {2078–2079}
+}
+
+@article{DePristo2011,
+title = {A framework for variation discovery and genotyping using next-generation DNA sequencing data},
+volume = {43},
+ISSN = {1546-1718},
+url = {http://dx.doi.org/10.1038/ng.806},
+DOI = {10.1038/ng.806},
+number = {5},
+journal = {Nature Genetics},
+publisher = {Springer Science and Business Media LLC},
+author = {DePristo, Mark A and Banks, Eric and Poplin, Ryan and Garimella, Kiran V and Maguire, Jared R and Hartl, Christopher and Philippakis, Anthony A and del Angel, Guillermo and Rivas, Manuel A and Hanna, Matt and McKenna, Aaron and Fennell, Tim J and Kernytsky, Andrew M and Sivachenko, Andrey Y and Cibulskis, Kristian and Gabriel, Stacey B and Altshuler, David and Daly, Mark J},
+year = {2011},
+month = apr,
+pages = {491–498}
+}
+
+@article{Cingolani2012,
+title = {A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3},
+volume = {6},
+ISSN = {1933-6942},
+url = {http://dx.doi.org/10.4161/fly.19695},
+DOI = {10.4161/fly.19695},
+number = {2},
+journal = {Fly},
+publisher = {Informa UK Limited},
+author = {Cingolani, Pablo and Platts, Adrian and Wang, Le Lily and Coon, Melissa and Nguyen, Tung and Wang, Luan and Land, Susan J. and Lu, Xiangyi and Ruden, Douglas M.},
+year = {2012},
+month = apr,
+pages = {80–92}
+}

genome-mapping.qmd

Lines changed: 34 additions & 19 deletions
@@ -1,7 +1,30 @@
 ---
 title: Genome Mapping
-author: Alexander Herbig and Alina Hiß
+author: Alexander Herbig, Alina Hiß, and Teresa Zeibig
+bibliography: assets/references/genome-mapping.bib
+
 ---
+Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
+This allows for identification of differences between the genome from your sample and the reference genome.
+This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015] [https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
+
+In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
+We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
+These commands usually run in the background when you apply DNA sequencing data processing pipelines.
+
+We will be using the Burrows-Wheeler Aligner ([@Li2010] [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)).
+There are different algorithms implemented for different types of data (e.g. different read lengths).
+Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
+Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
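
As a rough sketch of what a BWA backtrack run involves, the helper below only prints the three commands (`bwa index`, `bwa aln`, `bwa samse`) for one single-end sample instead of running them. The function name, the file names, and the `-n`/`-l` values are placeholders chosen for illustration and are not necessarily the settings used in this chapter.

```bash
# Print (not run) the BWA backtrack commands for one single-end sample.
# Arguments: reference FASTA, FASTQ file, value for -n, value for -l.
bwa_aln_commands() {
    ref="$1"; reads="$2"; n="$3"; l="$4"
    # Strip everything after the first dot: sample.fastq.gz -> sample
    prefix="${reads%%.*}"
    printf 'bwa index %s\n' "$ref"
    printf 'bwa aln -n %s -l %s %s %s > %s.sai\n' "$n" "$l" "$ref" "$reads" "$prefix"
    printf 'bwa samse %s %s.sai %s > %s.sam\n' "$ref" "$prefix" "$reads" "$prefix"
}

# Placeholder inputs; swap in your own files and parameter set.
bwa_aln_commands reference.fasta sample.fastq.gz 0.01 32
```

Printing the commands first makes it easy to compare parameter sets side by side before committing compute time. The prefix logic assumes the FASTQ path contains no dots other than those in its file extension.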
+
+Your learning objectives:
+
+1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
+2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
+3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via MultiVCFAnalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
+4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool MultiVCFAnalyzer.
+5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
+

 ::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
 For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
@@ -20,22 +43,10 @@ conda env create -f genome-mapping.yml
 conda activate genome-mapping
 ```
 :::
-
-## Introduction
-
-One way of reconstructing genomic information from DNA sequencing reads is mapping/aligning them to a reference genome. This allows for identification of differences between the genome from your sample and the reference genome. This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
-
-In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets. We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal. These commands usually run in the background when you apply DNA sequencing data processing pipelines.
-
-We will be using the Burrows-Wheeler Aligner
-(Li et al. 2009 – [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)). There are
-different algorithms implemented for different types of data (e.g. different read lengths).
-Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
-Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
-
 ## Reference Genome

-For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species. The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
+For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
+The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
 Reference genomes can be retrieved from comprehensive databases such as [NCBI](https://www.ncbi.nlm.nih.gov/).

 In your directory, you can find 2 samples and your reference.
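
Before mapping, it can help to check what a reference FASTA actually contains. The following is a minimal sketch, assuming an uncompressed FASTA on standard input; the function name and the toy sequences are made up for illustration.

```bash
# Summarise a FASTA stream: number of sequences and total bases.
fasta_summary() {
    awk '
        /^>/ { n++; next }             # a header line starts a new sequence
        { bases += length($0) }        # every other line is sequence
        END { printf "%d sequences, %d bases\n", n, bases }
    '
}

# Toy input with two made-up sequences (4 + 2 and 3 bases).
printf '>chr\nACGT\nAC\n>plasmid\nGGG\n' | fasta_summary
```

On a real file this would be called as `fasta_summary < your_reference.fasta`, where the file name is a placeholder. A reference with several sequences (e.g. a chromosome plus plasmids) shows up as more than one sequence here.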
@@ -122,7 +133,7 @@ Convert SAM file to binary format (BAM file):
 samtools view -b -S reads_mapped.sam > reads_mapped.bam
 ```

-For processing of _sam_ and _bam_ files we use _SAMtools_ (Li et al. 2009[http://samtools.sourceforge.net/](http://samtools.sourceforge.net/)).
+For processing of _sam_ and _bam_ files we use _SAMtools_ ([@Li2009] [http://samtools.sourceforge.net/](http://samtools.sourceforge.net/)).

 `-b` specifies to output in BAM format.
 (`-S` specifies input is SAM, can be omitted in recent versions.)
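
SAM is tab-delimited text, which is why it can still be inspected with ordinary command-line tools before conversion to BAM. As a minimal sketch, the function below counts mapped records by checking the FLAG column (bit `0x4` marks an unmapped read). The function name and the two toy records are invented for illustration; on real data `samtools view -c -F 4` should give the same kind of count.

```bash
# Count mapped records in SAM text read from standard input.
count_mapped() {
    awk -F '\t' '
        /^@/ { next }                       # skip header lines
        int($2 / 4) % 2 == 0 { mapped++ }   # FLAG bit 0x4 unset means mapped
        END { print mapped + 0 }
    '
}

# Toy SAM input: one header line, one mapped read (FLAG 0), one unmapped read (FLAG 4).
printf '@SQ\tSN:ref\tLN:100\nr1\t0\tref\t5\t37\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\n' | count_mapped
```

The integer arithmetic on the FLAG value is used instead of a bitwise operator so that the sketch does not depend on a particular awk implementation.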
@@ -169,7 +180,7 @@ samtools idxstats reads_mapped_sorted_dedup.bam
 ## Genotyping

 The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
-For this we use the _Genome Analysis Toolkit (GATK)_ (DePristo et al. 2011[http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))
+For this we use the _Genome Analysis Toolkit (GATK)_ ([@DePristo2011] [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))

 It uses the reference genome and the mapping as input and produces an output in _Variant Call Format (VCF)_ ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
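
To illustrate how SNPs appear in a VCF file, here is a minimal sketch that lists records whose REF and ALT columns are both a single base. The function name and the three toy records are made up; records with `.` as ALT, indels, and multi-allelic sites are deliberately skipped, so this is not a substitute for a proper genotyping filter.

```bash
# List SNP records from VCF text on standard input as CHROM:POS REF>ALT.
list_snps() {
    awk -F '\t' '
        /^#/ { next }                            # skip meta and header lines
        $4 ~ /^[ACGT]$/ && $5 ~ /^[ACGT]$/ {     # single-base REF and ALT only
            print $1 ":" $2 " " $4 ">" $5
        }
    '
}

# Toy VCF: a SNP, a site without an alternative allele, and a deletion.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nref\t10\t.\tA\tG\t50\t.\t.\nref\t11\t.\tC\t.\t40\t.\t.\nref\t12\t.\tAT\tA\t30\t.\t.\n' | list_snps
```

Only the first toy record is reported, because the other two are not single-base substitutions.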

@@ -195,6 +206,8 @@ Let's now continue with mapping and genotyping for the other samples and paramet

 ::: {.callout-note}
 This is a larger file and lenient mapping takes longer so this file will likely take a few minutes. If you are short on time, proceed with the other sample/parameter settings first and come back to this later if there is time.
+
+The entire code block can be copied as it is and executed. It’s composed of all the steps we executed individually earlier.
 :::

 ```bash
@@ -270,7 +283,7 @@ In order to combine the results from multiple samples and parameter settings we
 For this we will use the software
 _MultiVCFAnalyzer_ ([https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)).

-It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ (Cingolani et al. 2012 - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
+It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ ([@Cingolani2012] - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).

 Run _MultiVCFAnalyzer_ on all 4 files at once.
 First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
@@ -458,5 +471,7 @@ conda remove --name genome-mapping --all -y
 - Mapping results are the basis for genotyping, i.e. the detection of differences to the reference.
 - The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics.
 - The chosen mapping parameters can have a strong influence on the results of any downstream analysis.
-- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.
+- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.
+
+## References
