
Commit b508c20

committed
Minor formatting fixes
1 parent cb4c2f1 commit b508c20

3 files changed

+132
-58
lines changed

assets/references/genome-mapping.bib

Lines changed: 47 additions & 1 deletion
@@ -71,4 +71,50 @@ @article{Cingolani2012
7171
year = {2012},
7272
month = apr,
7373
pages = {80–92}
74-
}
74+
}
75+
76+
@ARTICLE{Bos2014-xe,
77+
title = "Pre-Columbian mycobacterial genomes reveal seals as a source of
78+
New World human tuberculosis",
79+
author = "Bos, Kirsten I and Harkins, Kelly M and Herbig, Alexander and
80+
Coscolla, Mireia and Weber, Nico and Comas, Iñaki and Forrest,
81+
Stephen A and Bryant, Josephine M and Harris, Simon R and
82+
Schuenemann, Verena J and Campbell, Tessa J and Majander, Kerttu
83+
and Wilbur, Alicia K and Guichon, Ricardo A and Wolfe Steadman,
84+
Dawnie L and Cook, Della Collins and Niemann, Stefan and Behr,
85+
Marcel A and Zumarraga, Martin and Bastida, Ricardo and Huson,
86+
Daniel and Nieselt, Kay and Young, Douglas and Parkhill, Julian
87+
and Buikstra, Jane E and Gagneux, Sebastien and Stone, Anne C and
88+
Krause, Johannes",
89+
journal = "Nature",
90+
volume = 514,
91+
number = 7523,
92+
pages = "494--497",
93+
abstract = "Modern strains of Mycobacterium tuberculosis from the Americas are
94+
closely related to those from Europe, supporting the assumption
95+
that human tuberculosis was introduced post-contact. This notion,
96+
however, is incompatible with archaeological evidence of
97+
pre-contact tuberculosis in the New World. Comparative genomics of
98+
modern isolates suggests that M. tuberculosis attained its
99+
worldwide distribution following human dispersals out of Africa
100+
during the Pleistocene epoch, although this has yet to be
101+
confirmed with ancient calibration points. Here we present three
102+
1,000-year-old mycobacterial genomes from Peruvian human
103+
skeletons, revealing that a member of the M. tuberculosis complex
104+
caused human disease before contact. The ancient strains are
105+
distinct from known human-adapted forms and are most closely
106+
related to those adapted to seals and sea lions. Two independent
107+
dating approaches suggest a most recent common ancestor for the M.
108+
tuberculosis complex less than 6,000 years ago, which supports a
109+
Holocene dispersal of the disease. Our results implicate sea
110+
mammals as having played a role in transmitting the disease to
111+
humans across the ocean.",
112+
month = oct,
113+
year = 2014,
114+
url = "http://dx.doi.org/10.1038/nature13591",
115+
doi = "10.1038/nature13591",
116+
pmc = "PMC4550673",
117+
pmid = 25141181,
118+
issn = "0028-0836,1476-4687",
119+
language = "en"
120+
}

authentication.qmd

Lines changed: 32 additions & 0 deletions
@@ -1011,6 +1011,38 @@ And click "Open folder"
10111011
You can double-click on the pdf files to visualise them.
10121012
:::
10131013

1014+
1015+
## (Optional) clean-up
1016+
1017+
Let's clean up our working directory by removing all the data and output from this chapter.
1018+
1019+
The command below will remove the `/<PATH>/<TO>/authentication` directory _as well as all of its contents_.
1020+
1021+
::: {.callout-tip}
1022+
## Pro Tip
1023+
Always be VERY careful when using `rm -r`.
1024+
Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
1025+
:::
1026+
1027+
```bash
1028+
rm -r /<PATH>/<TO>/authentication*
1029+
```
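
Because the trailing `*` matches every entry whose name starts with the given prefix, it can help to preview the matches before running `rm -r`. The sketch below is one possible way to do that; the helper name `list_rm_targets` is hypothetical and not part of this chapter's tooling, and it only prints paths, it never deletes anything.

```bash
# Hypothetical helper (an assumption, not part of the chapter's tooling):
# print every entry that `rm -r PREFIX*` would match, without deleting anything.
list_rm_targets() {
  prefix="$1"
  # Refuse prefixes that would expand to far more than intended.
  if [ -z "$prefix" ] || [ "$prefix" = "/" ]; then
    echo "refusing to expand an empty or root prefix" >&2
    return 1
  fi
  for entry in "$prefix"*; do
    # An unmatched glob stays literal, so only print entries that exist.
    [ -e "$entry" ] && printf '%s\n' "$entry"
  done
  return 0
}
```

For example, `list_rm_targets /<PATH>/<TO>/authentication` would list the chapter directory together with anything else sharing that prefix. If the list contains only what we expect, we can go ahead with the `rm -r` command above.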
1030+
1031+
Once deleted we can move elsewhere (e.g. `cd ~`).
1032+
1033+
We can also get out of the `conda` environment with:
1034+
1035+
```bash
1036+
conda deactivate
1037+
```
1038+
1039+
Then, to delete the conda environment:
1040+
1041+
```bash
1042+
conda remove --name authentication --all -y
1043+
```
1044+
1045+
10141046
## Summary
10151047

10161048
In addition, we:

genome-mapping.qmd

Lines changed: 53 additions & 57 deletions
@@ -2,31 +2,9 @@
22
title: Genome Mapping
33
author: Alexander Herbig, Alina Hiß, and Teresa Zeibig
44
bibliography: assets/references/genome-mapping.bib
5-
65
---
7-
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
8-
This allows for identification of differences between the genome from your sample and the reference genome.
9-
This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015] [https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
10-
11-
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
12-
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
13-
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
146

15-
We will be using the Burrows-Wheeler Aligner ([@Li2010][http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)).
16-
There are different algorithms implemented for different types of data (e.g. different read lengths).
17-
Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
18-
Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
19-
20-
Your learning objectives:
21-
22-
1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
23-
2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
24-
3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
25-
4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool multiVCFanalycer.
26-
5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
27-
28-
29-
::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
7+
:::{.callout-note collapse="true" title="Self guided: chapter environment setup"}
308
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
319

3210
To do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack
@@ -43,16 +21,38 @@ conda env create -f genome-mapping.yml
4321
conda activate genome-mapping
4422
```
4523
:::
24+
25+
Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
26+
This allows for identification of differences between the genome from your sample and the reference genome.
27+
This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015].
28+
29+
In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets.
30+
We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal.
31+
These commands usually run in the background when you apply DNA sequencing data processing pipelines.
32+
33+
We will be using the Burrows-Wheeler Aligner [@Li2010, [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)].
34+
There are different algorithms implemented for different types of data (e.g. different read lengths).
35+
Here, we use BWA backtrack (`bwa aln`), which is suitable for Illumina sequences up to 100bp.
36+
Other algorithms are `bwa mem` and `bwa sw` for longer reads.
37+
38+
Your learning objectives:
39+
40+
1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.
41+
2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.
42+
3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.
43+
4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool MultiVCFAnalyzer.
44+
5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
45+
4646
## Reference Genome
4747

4848
For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
4949
The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
50-
Reference genomes can be retrieved from comprehensive databases such as [NCBI](https://www.ncbi.nlm.nih.gov/).
50+
Reference genomes can be retrieved from comprehensive databases such as those provided by the NCBI ([https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/)).
5151

5252
In your directory, you can find 2 samples and your reference.
5353
As a first step we will index our reference genome (make sure you are inside your directory).
5454

55-
The first index we will generate is for _bwa_.
55+
The first index we will generate is for `bwa`.
5656

5757
```bash
5858
bwa index YpestisCO92.fa
@@ -72,26 +72,26 @@ picard CreateSequenceDictionary R=YpestisCO92.fa
7272

7373
## Mapping Parameters
7474

75-
We will be using _bwa aln_, but we need to specify parameters.
75+
We will be using `bwa aln`, but we need to specify parameters.
7676
For now we will concentrate on the "seed length" and the "maximum edit distance". We will use the default setting for all other parameters during this session. The choice of the right parameters depends on many factors such as the type of data and the specific use case. One aspect is the mapping sensitivity, i.e. how different a read can be from the chosen reference and still be mapped. In this context we generally differentiate between _strict_ and _lenient_ mapping parameters.
7777

78-
As many other mapping algorithms _bwa_ uses a so-called "seed-and-extend" approach. I.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
78+
Like many other mapping algorithms, `bwa` uses a so-called "seed-and-extend" approach. That is, it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
7979

8080
A short seed length will generate more such candidate positions and therefore mapping will take longer, but it will also be more sensitive, i.e. there can be more differences between the read and the genome. Long seeds are less sensitive but the mapping procedure is faster.
8181

8282
In this session we will use the following two parameter sets:
8383

8484
**Lenient**
8585

86-
Allow for more mismatches → -n 0.01
86+
Allow for more mismatches → `-n 0.01`
8787

88-
Short seed length → -l 16
88+
Short seed length → `-l 16`
8989

9090
**Strict**
9191

92-
Allow for less mismatches → -n 0.1
92+
Allow for fewer mismatches → `-n 0.1`
9393

94-
Long seed length → -l 32
94+
Long seed length → `-l 32`
9595

9696
We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.
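
Two samples and two parameter sets give four mapping runs in total. As a quick overview, the sketch below prints (but does not execute) the `bwa aln` call for each combination. The function name `print_mapping_commands` and the loop are illustrative assumptions. The flags, file names and directory names are the ones used in this chapter.

```bash
# Dry run: print the bwa aln call for every sample/parameter combination.
# Nothing is executed, so this works even without bwa installed.
# The function name and loop are illustrative assumptions; flags, file names
# and directory names are taken from this chapter.
print_mapping_commands() {
  for sample in sample1 sample2; do
    for combo in "lenient:-n 0.01 -l 16" "strict:-n 0.1 -l 32"; do
      name="${combo%%:*}"   # text before the colon: parameter set name
      flags="${combo#*:}"   # text after the colon: bwa aln flags
      echo "(cd ${sample}_${name} && bwa aln ${flags} ../YpestisCO92.fa ../${sample}.fastq.gz > reads_file.sai)"
    done
  done
}

print_mapping_commands
```

Each printed line assumes that the corresponding output directory (e.g. `sample1_lenient`) already exists. In the session itself we run these steps one by one so that each intermediate file can be inspected.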
9797

@@ -112,13 +112,13 @@ Go into the corresponding folder:
112112
cd sample1_lenient
113113
```
114114

115-
Perform the _bwa_ alignment, here for sample1, and specify lenient mapping parameters:
115+
Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters:
116116

117117
```bash
118118
bwa aln -n 0.01 -l 16 ../YpestisCO92.fa ../sample1.fastq.gz > reads_file.sai
119119
```
120120

121-
Proceed with writing the mapping in _sam_ format ([https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)):
121+
Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)]:
122122

123123
```bash
124124
bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample1.fastq.gz > reads_mapped.sam
@@ -133,12 +133,12 @@ Convert SAM file to binary format (BAM file):
133133
samtools view -b -S reads_mapped.sam > reads_mapped.bam
134134
```
135135

136-
For processing of _sam_ and _bam_ files we use _SAMtools_ ([@Li2009][http://samtools.sourceforge.net/](http://samtools.sourceforge.net/)).
136+
For processing of `sam` and `bam` files we use `samtools` [@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].
137137

138138
`-b` specifies to output in BAM format.
139139
(`-S` specifies input is SAM, can be omitted in recent versions.)
140140

141-
Now we sort the _bam_ file → Sort alignments by leftmost coordinates:
141+
Now we sort the `bam` file → Sort alignments by leftmost coordinates:
142142

143143
```bash
144144
samtools sort reads_mapped.bam > reads_mapped_sorted.bam
@@ -171,7 +171,7 @@ samtools view reads_mapped_sorted_dedup.bam | less -S
171171

172172
(exit by pressing <kbd>q</kbd>)
173173

174-
We can also get a summary about the number of mapped reads. For this we use the _samtools idxstats_ command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
174+
We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
175175

176176
```bash
177177
samtools idxstats reads_mapped_sorted_dedup.bam
@@ -180,9 +180,9 @@ samtools idxstats reads_mapped_sorted_dedup.bam
180180
## Genotyping
181181

182182
The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
183-
For this we use the _Genome Analysis Toolkit (GATK)_ ([@DePristo2011][http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))
183+
For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/)]
184184

185-
It uses the reference genome and the mapping as input and produces an output in _Variant Call Format (VCF)_ ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
185+
It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format (VCF)' ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
186186

187187
Perform genotyping on the mapping file:
188188

@@ -281,11 +281,11 @@ gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam
281281

282282
In order to combine the results from multiple samples and parameter settings we need to aggregate and comparatively analyse the information from all the _vcf_ files.
283283
For this we will use the software
284-
_MultiVCFAnalyzer_ ([https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)).
284+
`multivcfanalyzer` [@Bos2014-xe, [https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)].
285285

286-
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ ([@Cingolani2012] - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
286+
It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff` [@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].
287287

288-
Run _MultiVCFAnalyzer_ on all 4 files at once.
288+
Run `multivcfanalyzer` on all 4 files at once.
289289
First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
290290

291291
```bash
@@ -304,7 +304,7 @@ mkdir vcf_out
304304
multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
305305
```
306306

307-
Let’s have a look in the vcf_out directory (`cd` into it):
307+
Let’s have a look in the `vcf_out/` directory (`cd` into it):
308308

309309
```bash
310310
cd vcf_out
@@ -351,15 +351,15 @@ The first column contains the dataset name and the second column the number of c
351351
## Exploring the Results
352352

353353
For visual exploration of mapping results so-called "Genome Browsers" are used.
354-
Here we will use the _Integrative Genomics Viewer (IGV)_ ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
354+
Here we will use the 'Integrative Genomics Viewer' (`igv`) ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
355355

356-
To open IGV, simply type the following command and the app will open:
356+
To open `igv`, simply type the following command and the app will open:
357357

358358
```bash
359359
igv
360360
```
361361

362-
Note that you cannot use the terminal while IGV is open. If you want to use it anyways, open a second terminal via the bar on the bottom.
362+
Note that you cannot use the terminal while `igv` is open. If you want to use it anyway, open a second terminal via the bar on the bottom.
363363

364364
Load your reference (`YpestisCO92.fa`):
365365

@@ -390,7 +390,7 @@ Have a look at `snpTable.tsv`.
390390
Can you identify SNPs that were called with lenient but not with strict parameters or vice versa?
391391
:::
392392

393-
Let’s check out some of these in IGV.
393+
Let’s check out some of these in `igv`.
394394

395395
::: {.callout-tip title="Question" appearance="simple"}
396396
Do you observe certain patterns in these genomic regions?
@@ -433,33 +433,29 @@ Such regions can be fairly large. For example, see this 20 kb region around posi
433433

434434
## (Optional) clean-up
435435

436-
Let's clean up your working directory by removing all the data and output from this chapter.
437-
438-
When closing your `jupyter` notebook(s), say no to saving any additional files.
436+
Let's clean up our working directory by removing all the data and output from this chapter.
439437

440-
Press <kbd>ctrl</kbd> + <kbd>c</kbd> on your terminal, and type <kbd>y</kbd> when requested.
441-
Once completed, the command below will remove the `/<PATH>/<TO>/genome-mapping directory` **as well as all of its contents**.
438+
The command below will remove the `/<PATH>/<TO>/genome-mapping` directory _as well as all of its contents_.
442439

443440
::: {.callout-tip}
444441
## Pro Tip
445-
Always be VERY careful when using `rm -r`. Check 3x that the path you are
446-
specifying is exactly what you want to delete and nothing more before pressing
447-
ENTER!
442+
Always be VERY careful when using `rm -r`.
443+
Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
448444
:::
449445

450446
```bash
451447
rm -r /<PATH>/<TO>/genome-mapping*
452448
```
453449

454-
Once deleted you can move elsewhere (e.g. `cd ~`).
450+
Once deleted we can move elsewhere (e.g. `cd ~`).
455451

456-
We can also get out of the `conda` environment with
452+
We can also get out of the `conda` environment with:
457453

458454
```bash
459455
conda deactivate
460456
```
461457

462-
To delete the conda environment
458+
Then, to delete the conda environment:
463459

464460
```bash
465461
conda remove --name genome-mapping --all -y
