 For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.

-To do this, use `wget` or right-click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack it
+To do this, use `wget` or right-click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack it.

 ```bash
 tar xvf genome-mapping.tar.gz
 cd genome-mapping/
 ```

-You can then create and subsequently activate the environment with
+You can then create and subsequently activate the environment with the following.

 ```bash
 conda env create -f genome-mapping.yml
@@ -58,13 +58,13 @@ The first index we will generate is for `bwa`.
 bwa index YpestisCO92.fa
 ```

-The second index will be used by the genome browser we will apply to our results later on:
+The second index will be used by the genome browser we will apply to our results later on.

 ```bash
 samtools faidx YpestisCO92.fa
 ```
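
As a quick sanity check after the two indexing commands above, the following sketch lists any index file that is still missing. It assumes the default output names: `bwa index` writes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the reference, and `samtools faidx` writes a `.fai` file. The function name `missing_index_files` is a hypothetical helper and not part of the chapter.

```bash
# Hypothetical helper (not part of the chapter): print every index file
# that is expected next to the reference FASTA but does not exist yet.
missing_index_files() {
    ref="$1"
    # bwa index writes .amb/.ann/.bwt/.pac/.sa; samtools faidx writes .fai
    for ext in amb ann bwt pac sa fai; do
        if [ ! -e "${ref}.${ext}" ]; then
            echo "${ref}.${ext}"
        fi
    done
}

# Prints nothing when both indexes are in place
missing_index_files YpestisCO92.fa
```

If the call prints any file names, re-running the corresponding indexing command is probably the simplest fix.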

-We need to build a third index that is necessary for the genotyping step, which comes later after mapping:
+We need to build a third index that is necessary for the genotyping step, which comes later after mapping.

 ```bash
 picard CreateSequenceDictionary R=YpestisCO92.fa
@@ -96,7 +96,7 @@ Long seed length → `-l 32`
 We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.

 We will map each file once with lenient and once with strict parameters.
-For this, we will make 4 separate directories, to avoid mixing up files:
+For this, we will make 4 separate directories, to avoid mixing up files.
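
One possible way to create the four directories is a pair of nested loops, sketched below. The directory names follow the `sample1_lenient`, `sample1_strict`, `sample2_lenient`, `sample2_strict` pattern used in the `multivcfanalyzer` call later in the chapter. The loop itself is only an illustration, and a single `mkdir` with four arguments would work equally well.

```bash
# One directory per sample/parameter combination.
# -p makes the command safe to re-run if a directory already exists.
for sample in sample1 sample2; do
    for mode in lenient strict; do
        mkdir -p "${sample}_${mode}"
    done
done
```
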
-Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)]:
+Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)].
 For processing of `sam` and `bam` files we use `samtools` [@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].

 `-b` specifies that the output is in BAM format.
-(`-S` specifies that the input is SAM; it can be omitted in recent versions.)
+(`-S` specifies that the input is SAM; it can be omitted in recent versions).

-Now we sort the `bam` file → Sort alignments by leftmost coordinates:
+Now we sort the `bam` file → Sort alignments by leftmost coordinates.
@@ -163,15 +163,15 @@ samtools index reads_mapped_sorted_dedup.bam
 Duplicated reads are usually a consequence of amplification of the DNA fragments in the lab. Therefore, they are not biologically meaningful.

 We have now completed the mapping procedure.
-Let's have a look at our mapping results:
+Let's have a look at our mapping results.

 ```bash
 samtools view reads_mapped_sorted_dedup.bam | less -S
 ```

 (exit by pressing <kbd>q</kbd>)

-We can also get a summary of the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
+We can also get a summary of the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)).

 ```bash
 samtools idxstats reads_mapped_sorted_dedup.bam
@@ -184,7 +184,7 @@ For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://w

 It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format (VCF)' ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
@@ -286,31 +286,31 @@ For this we will use the software
 It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff` [@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].

 Run `multivcfanalyzer` on all 4 files at once.
-First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
+First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.).

 ```bash
 cd ..
 ```

-Then make a new directory…
+Then make a new directory.

 ```bash
 mkdir vcf_out
 ```

-…and run the programme:
+And run the programme.

 ```bash
 multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
 ```
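
Because the command above takes four input VCFs, a missing or empty file is easier to spot with a small check than from the program's output. The sketch below can be run from the same directory before calling `multivcfanalyzer`. The function name `missing_vcfs` is a hypothetical helper and not part of the chapter; the paths are the ones passed on the command line above.

```bash
# Hypothetical pre-flight check (not part of the chapter): print every
# expected input VCF that is missing or empty.
missing_vcfs() {
    for dir in sample1_lenient sample1_strict sample2_lenient sample2_strict; do
        vcf="${dir}/mysnps.vcf"
        # -s is true only for files that exist and are non-empty
        [ -s "$vcf" ] || echo "$vcf"
    done
}

# Prints nothing when all four inputs are present
missing_vcfs
```
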

-Let’s have a look in the `vcf_out/` directory (`cd` into it):
+Let’s have a look in the `vcf_out/` directory (`cd` into it).

 ```bash
 cd vcf_out
 ```

-Check the parameters we set earlier:
+Check the parameters we set earlier.

 ```bash
 less -S info.txt
@@ -326,7 +326,7 @@ less -S snpStatistics.tsv

 (exit by pressing <kbd>q</kbd>)

-The file content should look like this:
+The file content should look like this.

 ```bash
 SNP statistics for 4 samples.
@@ -353,7 +353,7 @@ The first column contains the dataset name and the second column the number of c
 For visual exploration of mapping results, so-called "Genome Browsers" are used.
 Here we will use the 'Integrative Genomics Viewer' (`igv`) ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).

-To open `igv`, simply type the following command and the app will open:
+To open `igv`, simply type the following command and the app will open.
-- Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps.
-- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference.
-- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics.
-- The chosen mapping parameters can have a strong influence on the results of any downstream analysis.
-- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.
+- Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps
+- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference
+- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics
+- The chosen mapping parameters can have a strong influence on the results of any downstream analysis
+- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms. This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses