Commit 29f050f

Further minior formatting
1 parent a5b2963 commit 29f050f

1 file changed: +28 -28 lines changed
genome-mapping.qmd

Lines changed: 28 additions & 28 deletions
@@ -7,14 +7,14 @@ bibliography: assets/references/genome-mapping.bib
 :::{.callout-note collapse="true" title="Self guided: chapter environment setup"}
 For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.

-Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack
+Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack.

 ```bash
 tar xvf genome-mapping.tar.gz
 cd genome-mapping/
 ```

-You can then create the subsequently activate environment with
+You can then create the subsequently activate environment with the following.

 ```bash
 conda env create -f genome-mapping.yml
@@ -58,13 +58,13 @@ The first index we will generate is for `bwa`.
 bwa index YpestisCO92.fa
 ```

-The second index will be used by the genome browser we will apply to our results later on:
+The second index will be used by the genome browser we will apply to our results later on.

 ```bash
 samtools faidx YpestisCO92.fa
 ```

-We need to build a third index that is necessary for the genotyping step, which comes later after mapping:
+We need to build a third index that is necessary for the genotyping step, which comes later after mapping.

 ```bash
 picard CreateSequenceDictionary R=YpestisCO92.fa
@@ -96,7 +96,7 @@ Long seed length → `-l 32`
 We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.

 We will map each file once with lenient and once with strict parameters.
-For this, we will make 4 separate directories, to avoid mixing up files:
+For this, we will make 4 separate directories, to avoid mixing up files.

 ```bash
 mkdir sample1_lenient sample2_lenient sample1_strict sample2_strict
@@ -106,19 +106,19 @@ mkdir sample1_lenient sample2_lenient sample1_strict sample2_strict

 Let’s begin with a lenient mapping of sample1.

-Go into the corresponding folder:
+Go into the corresponding folder.

 ```bash
 cd sample1_lenient
 ```

-Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters:
+Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters.

 ```bash
 bwa aln -n 0.01 -l 16 ../YpestisCO92.fa ../sample1.fastq.gz > reads_file.sai
 ```

-Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)]:
+Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)].

 ```bash
 bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample1.fastq.gz > reads_mapped.sam
@@ -127,7 +127,7 @@ bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa r
 Note that we have specified the sequencing platform (Illumina) by creating a so-called "Read Group" (`-r`).
 This information is used later during the genotyping step.

-Convert SAM file to binary format (BAM file):
+Convert SAM file to binary format (BAM file).

 ```bash
 samtools view -b -S reads_mapped.sam > reads_mapped.bam
@@ -136,21 +136,21 @@ samtools view -b -S reads_mapped.sam > reads_mapped.bam
 For processing of `sam` and `bam` files we use `samtools` [@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].

 `-b` specifies to output in BAM format.
-(`-S` specifies input is SAM, can be omitted in recent versions.)
+(`-S` specifies input is SAM, can be omitted in recent versions).

-Now we sort the `bam` file → Sort alignments by leftmost coordinates:
+Now we sort the `bam` file → Sort alignments by leftmost coordinates.

 ```bash
 samtools sort reads_mapped.bam > reads_mapped_sorted.bam
 ```

-The sorted bam file needs to be indexed → more efficient for further processing:
+The sorted bam file needs to be indexed → more efficient for further processing.

 ```bash
 samtools index reads_mapped_sorted.bam
 ```

-Deduplication → Removal of reads from duplicated fragments:
+Deduplication → Removal of reads from duplicated fragments.

 ```bash
 samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam
@@ -163,15 +163,15 @@ samtools index reads_mapped_sorted_dedup.bam
 Duplicated reads are usually a consequence of amplification of the DNA fragments in the lab. Therefore, they are not biologically meaningful.

 We have now completed the mapping procedure.
-Let's have a look at our mapping results:
+Let's have a look at our mapping results.

 ```bash
 samtools view reads_mapped_sorted_dedup.bam | less -S
 ```

 (exit by pressing <kbd>q</kbd>)

-We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
+We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)).

 ```bash
 samtools idxstats reads_mapped_sorted_dedup.bam
@@ -184,7 +184,7 @@ For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://w

 It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format (VCF)' ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).

-Perform genotyping on the mapping file:
+Perform genotyping on the mapping file.

 ```bash
 gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam --output_mode EMIT_ALL_SITES -o mysnps.vcf
@@ -286,31 +286,31 @@ For this we will use the software
 It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff` [@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].

 Run `multivcfanalyzer` on all 4 files at once.
-First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
+First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.).

 ```bash
 cd ..
 ```

-Then make a new directory
+Then make a new directory.

 ```bash
 mkdir vcf_out
 ```

-…and run the programme:
+And run the programme.

 ```bash
 multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
 ```

-Let’s have a look in the `vcf_out/` directory (`cd` into it):
+Let’s have a look in the `vcf_out/` directory (`cd` into it).

 ```bash
 cd vcf_out
 ```

-Check the parameters we set earlier:
+Check the parameters we set earlier.

 ```bash
 less -S info.txt
@@ -326,7 +326,7 @@ less -S snpStatistics.tsv

 (exit by pressing <kbd>q</kbd>)

-The file content should look like this:
+The file content should look like this.

 ```bash
 SNP statistics for 4 samples.
@@ -353,7 +353,7 @@ The first column contains the dataset name and the second column the number of c
 For visual exploration of mapping results so-called "Genome Browsers" are used.
 Here we will use the 'Integrative Genomics Viewer' (`igv`)' ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).

-To open `igv`, simply type the following command and the app will open:
+To open `igv`, simply type the following command and the app will open.

 ```bash
 igv
@@ -463,11 +463,11 @@ conda remove --name genome-mapping --all -y

 ## Summary

-- Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps.
-- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference.
-- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics.
-- The chosen mapping parameters can have a strong influence on the results of any downstream analysis.
-- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms.This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses.
+- Mapping DNA sequencing reads to a reference genome is a complex procedure that requires multiple steps
+- Mapping results are the basis for genotyping, i.e. the detection of differences to the reference
+- The genotyping results can be aggregated from multiple samples and comparatively analysed e.g. in the context of phylogenomics
+- The chosen mapping parameters can have a strong influence on the results of any downstream analysis
+- This is particularly true when dealing with ancient DNA samples as they tend to contain DNA from multiple organisms.This can lead to mismapped reads and therefore incorrect genotypes, which can further influence downstream analyses

 ## References
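
The per-sample mapping commands appear only as scattered context lines in the hunks of this diff. As a reading aid, the sketch below prints that command sequence for each sample and parameter set without running any tool. The helper names `mapping_params` and `print_mapping_commands` are hypothetical and not part of the chapter. The strict `-n 0.1` value is an assumption: only `-l 32` is visible in this diff.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the `bwa aln` parameters for a mapping mode.
# The lenient values are taken from the diff; the strict `-n` value is an
# ASSUMPTION (only `-l 32` is visible in a hunk header).
mapping_params() {
  case "$1" in
    lenient) echo "-n 0.01 -l 16" ;;
    strict)  echo "-n 0.1 -l 32" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (not execute) the mapping steps for one
# sample and mode, in the order they appear in the chapter.
print_mapping_commands() {
  local sample="$1" mode="$2" params
  params="$(mapping_params "$mode")"
  printf '%s\n' \
    "cd ${sample}_${mode}" \
    "bwa aln ${params} ../YpestisCO92.fa ../${sample}.fastq.gz > reads_file.sai" \
    "bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../${sample}.fastq.gz > reads_mapped.sam" \
    "samtools view -b -S reads_mapped.sam > reads_mapped.bam" \
    "samtools sort reads_mapped.bam > reads_mapped_sorted.bam" \
    "samtools index reads_mapped_sorted.bam" \
    "samtools rmdup -s reads_mapped_sorted.bam reads_mapped_sorted_dedup.bam" \
    "samtools index reads_mapped_sorted_dedup.bam" \
    "cd .."
}

# Dry run over the four sample/mode combinations used in the chapter.
for sample in sample1 sample2; do
  for mode in lenient strict; do
    print_mapping_commands "$sample" "$mode"
  done
done
```

Because the script only prints text, it can be reviewed or redirected to a file before any step is run by hand.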
