You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
8
-
This allows for identification of differences between the genome from your sample and the reference genome.
9
-
This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015][https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
10
-
11
-
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
12
-
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
13
-
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
14
6
15
-
We will be using the Burrows-Wheeler Aligner ([@Li2010]– [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)).
16
-
There are different algorithms implemented for different types of data (e.g. different read lengths).
17
-
Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
18
-
Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
19
-
20
-
Your learning objectives:
21
-
22
-
1.**Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
23
-
2.**Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
24
-
3.**Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
25
-
4.**Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool multiVCFanalycer.
26
-
5.**Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
31
9
32
10
Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
26
+
This allows for identification of differences between the genome from your sample and the reference genome.
27
+
This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015].
28
+
29
+
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
30
+
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
31
+
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
32
+
33
+
We will be using the Burrows-Wheeler Aligner [@Li2010, [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)].
34
+
There are different algorithms implemented for different types of data (e.g. different read lengths).
35
+
Here, we use BWA backtrack (`bwa aln`), which is suitable for Illumina sequences up to 100bp.
36
+
Other algorithms are `bwa mem` and `bwa sw` for longer reads.
37
+
38
+
Your learning objectives:
39
+
40
+
1.**Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
41
+
2.**Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
42
+
3.**Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
43
+
4.**Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool multiVCFanalycer.
44
+
5.**Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
45
+
46
46
## Reference Genome
47
47
48
48
For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
49
49
The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
50
-
Reference genomes can be retrieved from comprehensive databases such as [NCBI](https://www.ncbi.nlm.nih.gov/).
50
+
Reference genomes can be retrieved from comprehensive databases such provided by the NCBI ([https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/)).
51
51
52
52
In your directory, you can find 2 samples and your reference.
53
53
As a first step we will index our reference genome (make sure you are inside your directory).
We will be using _bwa aln_, but we need to specify parameters.
75
+
We will be using `bwa aln`, but we need to specify parameters.
76
76
For now we will concentrate on the "seed length" and the "maximum edit distance". We will use the default setting for all other parameters during this session. The choice of the right parameters depend on many factors such as the type of data and the specific use case. One aspect is the mapping sensitivity, i.e. how different a read can be from the chosen reference and still be mapped. In this context we generally differentiate between _strict_ and _lenient_ mapping parameters.
77
77
78
-
As many other mapping algorithms _bwa_ uses a so-called "seed-and-extend" approach. I.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
78
+
As many other mapping algorithms `bwa` uses a so-called "seed-and-extend" approach. I.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
79
79
80
80
A short seed length will generate more such candidate positions and therefore mapping will take longer, but it will also be more sensitive, i.e. there can be more differences between the read and the genome. Long seeds are less sensitive but the mapping procedure is faster.
81
81
82
82
In this session we will use the following two parameter sets:
83
83
84
84
**Lenient**
85
85
86
-
Allow for more mismatches → -n 0.01
86
+
Allow for more mismatches → `-n 0.01`
87
87
88
-
Short seed length → -l 16
88
+
Short seed length → `-l 16`
89
89
90
90
**Strict**
91
91
92
-
Allow for less mismatches → -n 0.1
92
+
Allow for less mismatches → `-n 0.1`
93
93
94
-
Long seed length → -l 32
94
+
Long seed length → `-l 32`
95
95
96
96
We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.
97
97
@@ -112,13 +112,13 @@ Go into the corresponding folder:
112
112
cd sample1_lenient
113
113
```
114
114
115
-
Perform the _bwa_ alignment, here for sample1, and specify lenient mapping parameters:
115
+
Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters:
Proceed with writing the mapping in _sam_ format ([https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)):
121
+
Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)]:
For processing of _sam_ and _bam_ files we use _SAMtools_ ([@Li2009] – [http://samtools.sourceforge.net/](http://samtools.sourceforge.net/)).
136
+
For processing of `sam` and `bam` files we use `samtools`[@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].
137
137
138
138
`-b` specifies to output in BAM format.
139
139
(`-S` specifies input is SAM, can be omitted in recent versions.)
140
140
141
-
Now we sort the _bam_ file → Sort alignments by leftmost coordinates:
141
+
Now we sort the `bam` file → Sort alignments by leftmost coordinates:
@@ -171,7 +171,7 @@ samtools view reads_mapped_sorted_dedup.bam | less -S
171
171
172
172
(exit by pressing <kbd>q</kbd>)
173
173
174
-
We can also get a summary about the number of mapped reads. For this we use the _samtools idxstats_ command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
174
+
We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
183
-
For this we use the _Genome Analysis Toolkit (GATK)_ ([@DePristo2011] – [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))
183
+
For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/)]
184
184
185
-
It uses the reference genome and the mapping as input and produces an output in _Variant Call Format (VCF)_ ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
185
+
It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format (VCF)' ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
In order to combine the results from multiple samples and parameter settings we need to agregate and comparatively analyse the information from all the _vcf_ files.
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ ([@Cingolani2012] - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
286
+
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff`[@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].
287
287
288
-
Run _MultiVCFAnalyzer_ on all 4 files at once.
288
+
Run `multivcfanalyzer` on all 4 files at once.
289
289
First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
290
290
291
291
```bash
@@ -304,7 +304,7 @@ mkdir vcf_out
304
304
multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
305
305
```
306
306
307
-
Let’s have a look in the ‘vcf_out’ directory (`cd` into it):
307
+
Let’s have a look in the `vcf_out/` directory (`cd` into it):
308
308
309
309
```bash
310
310
cd vcf_out
@@ -351,15 +351,15 @@ The first column contains the dataset name and the second column the number of c
351
351
## Exploring the Results
352
352
353
353
For visual exploration of mapping results so-called "Genome Browsers" are used.
354
-
Here we will use the _Integrative Genomics Viewer (IGV)_ ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
354
+
Here we will use the 'Integrative Genomics Viewer' (`igv`)' ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
355
355
356
-
To open IGV, simply type the following command and the app will open:
356
+
To open `igv`, simply type the following command and the app will open:
357
357
358
358
```bash
359
359
igv
360
360
```
361
361
362
-
Note that you cannot use the terminal while IGV is open. If you want to use it anyways, open a second terminal via the bar on the bottom.
362
+
Note that you cannot use the terminal while `igv` is open. If you want to use it anyways, open a second terminal via the bar on the bottom.
363
363
364
364
Load your reference (`YpestisCO92.fa`):
365
365
@@ -390,7 +390,7 @@ Have a look at `snpTable.tsv`.
390
390
Can you identify SNPs that were called with lenient but not with strict parameters or vice versa?
0 commit comments