Minor formatting fixes

jfy133 · jfy133 · commit b508c20f0508 · 2024-08-06T09:46:02.000Z
diff --git a/assets/references/genome-mapping.bib b/assets/references/genome-mapping.bib
@@ -71,4 +71,50 @@ @article{Cingolani2012
   year = {2012},
   month = apr,
   pages = {80–92}
-}
+}
+
+@ARTICLE{Bos2014-xe,
+  title    = "Pre-Columbian mycobacterial genomes reveal seals as a source of
+              New World human tuberculosis",
+  author   = "Bos, Kirsten I and Harkins, Kelly M and Herbig, Alexander and
+              Coscolla, Mireia and Weber, Nico and Comas, Iñaki and Forrest,
+              Stephen A and Bryant, Josephine M and Harris, Simon R and
+              Schuenemann, Verena J and Campbell, Tessa J and Majander, Kerttu
+              and Wilbur, Alicia K and Guichon, Ricardo A and Wolfe Steadman,
+              Dawnie L and Cook, Della Collins and Niemann, Stefan and Behr,
+              Marcel A and Zumarraga, Martin and Bastida, Ricardo and Huson,
+              Daniel and Nieselt, Kay and Young, Douglas and Parkhill, Julian
+              and Buikstra, Jane E and Gagneux, Sebastien and Stone, Anne C and
+              Krause, Johannes",
+  journal  = "Nature",
+  volume   =  514,
+  number   =  7523,
+  pages    = "494--497",
+  abstract = "Modern strains of Mycobacterium tuberculosis from the Americas are
+              closely related to those from Europe, supporting the assumption
+              that human tuberculosis was introduced post-contact. This notion,
+              however, is incompatible with archaeological evidence of
+              pre-contact tuberculosis in the New World. Comparative genomics of
+              modern isolates suggests that M. tuberculosis attained its
+              worldwide distribution following human dispersals out of Africa
+              during the Pleistocene epoch, although this has yet to be
+              confirmed with ancient calibration points. Here we present three
+              1,000-year-old mycobacterial genomes from Peruvian human
+              skeletons, revealing that a member of the M. tuberculosis complex
+              caused human disease before contact. The ancient strains are
+              distinct from known human-adapted forms and are most closely
+              related to those adapted to seals and sea lions. Two independent
+              dating approaches suggest a most recent common ancestor for the M.
+              tuberculosis complex less than 6,000 years ago, which supports a
+              Holocene dispersal of the disease. Our results implicate sea
+              mammals as having played a role in transmitting the disease to
+              humans across the ocean.",
+  month    =  oct,
+  year     =  2014,
+  url      = "http://dx.doi.org/10.1038/nature13591",
+  doi      = "10.1038/nature13591",
+  pmc      = "PMC4550673",
+  pmid     =  25141181,
+  issn     = "0028-0836,1476-4687",
+  language = "en"
+}
diff --git a/authentication.qmd b/authentication.qmd
@@ -1011,6 +1011,38 @@ And click "Open folder"
 You can double-click on the pdf files to visualise them.
 :::
 
+
+## (Optional) clean-up
+
+Let's clean up our working directory by removing all the data and output from this chapter.
+
+The command below will remove the `/<PATH>/<TO>/authentication` directory _as well as all of its contents_.
+
+::: {.callout-tip}
+## Pro Tip
+Always be VERY careful when using `rm -r`.
+Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
+:::
+
+```bash
+rm -r /<PATH>/<TO>/authentication*
+```
+
+Once deleted we can move elsewhere (e.g. `cd ~`).
+
+We can also get out of the `conda` environment with:
+
+```bash
+conda deactivate
+```
+
+Then, to delete the `conda` environment:
+
+```bash
+conda remove --name authentication --all -y
+```
+
+
 ## Summary
 
 In addition, we:
diff --git a/genome-mapping.qmd b/genome-mapping.qmd
@@ -2,31 +2,9 @@
 title: Genome Mapping
 author: Alexander Herbig, Alina Hiß, and Teresa Zeibig
 bibliography: assets/references/genome-mapping.bib
-
 ---
-Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
-This allows for identification of differences between the genome from your sample and the reference genome.
-This information can be used for example for comparative analyses such as in phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015] [https://doi.org/10.1146/annurev-genom-090413-025358](https://doi.org/10.1146/annurev-genom-090413-025358).
-
-In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets. 
-We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal. 
-These commands usually run in the background when you apply DNA sequencing data processing pipelines.
 
-We will be using the Burrows-Wheeler Aligner ([@Li2010]– [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)).
-There are different algorithms implemented for different types of data (e.g. different read lengths).
-Here, we use BWA backtrack (_bwa aln_), which is suitable for Illumina sequences up to 100bp.
-Other algorithms are _bwa mem_ and _bwa sw_ for longer reads.
-
-Your learning objectives:
-
-1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.  
-2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.  
-3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.  
-4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool multiVCFanalycer.  
-5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
-
-
-::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
+:::{.callout-note collapse="true" title="Self guided: chapter environment setup"}
 For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.
 
 Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.8413204](https://doi.org/10.5281/zenodo.8413204), and unpack
@@ -43,16 +21,38 @@ conda env create -f genome-mapping.yml
 conda activate genome-mapping
 ```
 :::
+
+Mapping/aligning to a reference genome is one way of reconstructing genomic information from DNA sequencing reads.
+This allows for identification of differences between the genome from your sample and the reference genome.
+This information can be used, for example, for comparative analyses such as phylogenetics. For a detailed explanation of the read alignment problem and an overview of concepts for solving it, please see [@Reinert2015].
+
+In this session we will map two samples to the _Yersinia pestis_ (plague) genome using different parameter sets. 
+We will do this "manually" in the sense that we will use all necessary commands one by one in the terminal. 
+These commands usually run in the background when you apply DNA sequencing data processing pipelines.
+
+We will be using the Burrows-Wheeler Aligner [@Li2010, [http://bio-bwa.sourceforge.net](http://bio-bwa.sourceforge.net)].
+There are different algorithms implemented for different types of data (e.g. different read lengths).
+Here, we use BWA backtrack (`bwa aln`), which is suitable for Illumina sequences up to 100bp.
+Other algorithms are `bwa mem` and `bwa sw` for longer reads.
+
+Your learning objectives:
+
+1. **Understand the Basics**: You will be able to define mapping and describe the basic principles of metagenomic mapping and the different parameters used.  
+2. **Apply Mapping Techniques**: You will be able to apply metagenomic mapping techniques to align raw sequence data to a reference genome in a step-by-step manner.  
+3. **Use Bioinformatics Tools**: You will be able to use the command line to apply different metagenomic mappers and perform genotype analysis via multivcfanalyzer in the standard settings. You will be able to inspect results in the IGV viewer.  
+4. **Interpret Results**: You will be able to interpret the results of a mapping experiment and discuss their implications. You will also be able to understand the genotyping tool MultiVCFAnalyzer.  
+5. **Be Aware and Able to Read Up**: You will know about the existence of multiple mapping algorithms and the importance of parameter research and adjustment. You will know that the IGV viewer is one option to inspect mapping results but not the only one.
+
 ## Reference Genome
 
 For mapping we need a reference genome in FASTA format. Ideally we use a genome from the same species that our data relates to or, if not available, a closely related species.
 The selection of the correct reference genome is highly relevant. E.g. if the chosen genome differs too much from the organism the data relates to, it might not be possible to map most of the reads.
-Reference genomes can be retrieved from comprehensive databases such as [NCBI](https://www.ncbi.nlm.nih.gov/).
+Reference genomes can be retrieved from comprehensive databases such as those provided by the NCBI ([https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/)).
 
 In your directory, you can find 2 samples and your reference.
 As a first step we will index our reference genome (make sure you are inside your directory).
 
-The first index we will generate is for _bwa_.
+The first index we will generate is for `bwa`.
 
 ```bash
 bwa index YpestisCO92.fa
@@ -72,26 +72,26 @@ picard CreateSequenceDictionary R=YpestisCO92.fa
 
 ## Mapping Parameters
 
-We will be using _bwa aln_, but we need to specify parameters.
+We will be using `bwa aln`, but we need to specify parameters.
 For now we will concentrate on the "seed length" and the "maximum edit distance". We will use the default setting for all other parameters during this session. The choice of the right parameters depend on many factors such as the type of data and the specific use case. One aspect is the mapping sensitivity, i.e. how different a read can be from the chosen reference and still be mapped. In this context we generally differentiate between _strict_ and _lenient_ mapping parameters.
 
-As many other mapping algorithms _bwa_ uses a so-called "seed-and-extend" approach. I.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
+Like many other mapping algorithms, `bwa` uses a so-called "seed-and-extend" approach, i.e. it initially maps the first _N_ nucleotides of each read to the genome with relatively few mismatches and thereby determines candidate positions for the more time-intensive full alignment.
 
 A short seed length will generate more such candidate positions and therefore mapping will take longer, but it will also be more sensitive, i.e. there can be more differences between the read and the genome. Long seeds are less sensitive but the mapping procedure is faster.
 
 In this session we will use the following two parameter sets:
 
 **Lenient**
 
-Allow for more mismatches → -n 0.01
+Allow for more mismatches → `-n 0.01`
 
-Short seed length → -l 16
+Short seed length → `-l 16`
 
 **Strict**
 
-Allow for less mismatches → -n 0.1
+Allow for fewer mismatches → `-n 0.1`
 
-Long seed length → -l 32
+Long seed length → `-l 32`
 
 We will be working with pre-processed files (`sample1.fastq.gz`, `sample2.fastq.gz`), i.e. any quality filtering and removal of sequencing adapters is already done.
 
@@ -112,13 +112,13 @@ Go into the corresponding folder:
 cd sample1_lenient
 ```
 
-Perform the _bwa_ alignment, here for sample1, and specify lenient mapping parameters:
+Perform the `bwa` alignment, here for sample1, and specify lenient mapping parameters:
 
 ```bash
 bwa aln -n 0.01 -l 16 ../YpestisCO92.fa ../sample1.fastq.gz > reads_file.sai
 ```
 
-Proceed with writing the mapping in _sam_ format ([https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)):
+Proceed with writing the mapping in `sam` format [@Li2009, [https://en.wikipedia.org/wiki/SAM\_(file_format)](<https://en.wikipedia.org/wiki/SAM_(file_format)>)]:
 
 ```bash
 bwa samse -r '@RG\tID:all\tLB:NA\tPL:illumina\tPU:NA\tSM:NA' ../YpestisCO92.fa reads_file.sai ../sample1.fastq.gz > reads_mapped.sam
@@ -133,12 +133,12 @@ Convert SAM file to binary format (BAM file):
 samtools view -b -S reads_mapped.sam > reads_mapped.bam
 ```
 
-For processing of _sam_ and _bam_ files we use _SAMtools_ ([@Li2009] – [http://samtools.sourceforge.net/](http://samtools.sourceforge.net/)).
+For processing of `sam` and `bam` files we use `samtools` [@Li2009, [https://github.com/samtools/samtools](https://github.com/samtools/samtools)].
 
 `-b` specifies to output in BAM format.
 (`-S` specifies input is SAM, can be omitted in recent versions.)
 
-Now we sort the _bam_ file → Sort alignments by leftmost coordinates:
+Now we sort the `bam` file → Sort alignments by leftmost coordinates:
 
 ```bash
 samtools sort reads_mapped.bam > reads_mapped_sorted.bam
@@ -171,7 +171,7 @@ samtools view reads_mapped_sorted_dedup.bam | less -S
 
 (exit by pressing <kbd>q</kbd>)
 
-We can also get a summary about the number of mapped reads. For this we use the _samtools idxstats_ command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
+We can also get a summary about the number of mapped reads. For this we use the `samtools idxstats` command ([http://www.htslib.org/doc/samtools-idxstats.html](http://www.htslib.org/doc/samtools-idxstats.html)):
 
 ```bash
 samtools idxstats reads_mapped_sorted_dedup.bam
@@ -180,9 +180,9 @@ samtools idxstats reads_mapped_sorted_dedup.bam
 ## Genotyping
 
 The next step we need to perform is genotyping, i.e. the identification of all SNPs that differentiate the sample from the reference.
-For this we use the _Genome Analysis Toolkit (GATK)_ ([@DePristo2011] – [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/))
+For this we use the 'Genome Analysis Toolkit' (`gatk`) [@DePristo2011, [http://www.broadinstitute.org/gatk/](http://www.broadinstitute.org/gatk/)].
 
-It uses the reference genome and the mapping as input and produces an output in _Variant Call Format (VCF)_ ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
+It uses the reference genome and the mapping as input and produces an output in 'Variant Call Format' (VCF) ([https://en.wikipedia.org/wiki/Variant_Call_Format](https://en.wikipedia.org/wiki/Variant_Call_Format)).
 
 Perform genotyping on the mapping file:
 
@@ -281,11 +281,11 @@ gatk3 -T UnifiedGenotyper -R ../YpestisCO92.fa -I reads_mapped_sorted_dedup.bam
 
 In order to combine the results from multiple samples and parameter settings we need to agregate and comparatively analyse the information from all the _vcf_ files.
 For this we will use the software
-_MultiVCFAnalyzer_ ([https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)).
+`multivcfanalyzer` [@Bos2014-xe, [https://github.com/alexherbig/MultiVCFAnalyzer](https://github.com/alexherbig/MultiVCFAnalyzer)].
 
-It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program _SnpEff_ ([@Cingolani2012] - [http://snpeff.sourceforge.net/](http://snpeff.sourceforge.net/)).
+It produces various output files and summary statistics and can integrate gene annotations for SNP effect analysis as done by the program `SnpEff` [@Cingolani2012, [https://github.com/pcingola/SnpEff](https://github.com/pcingola/SnpEff)].
 
-Run _MultiVCFAnalyzer_ on all 4 files at once.
+Run `multivcfanalyzer` on all 4 files at once.
 First `cd` one level up (if you type `ls` you should see your 4 directories, reference, etc.):
 
 ```bash
@@ -304,7 +304,7 @@ mkdir vcf_out
 multivcfanalyzer NA YpestisCO92.fa NA vcf_out F 30 3 0.9 0.9 NA sample1_lenient/mysnps.vcf sample1_strict/mysnps.vcf sample2_lenient/mysnps.vcf sample2_strict/mysnps.vcf
 ```
 
-Let’s have a look in the ‘vcf_out’ directory (`cd` into it):
+Let’s have a look in the `vcf_out/` directory (`cd` into it):
 
 ```bash
 cd vcf_out
@@ -351,15 +351,15 @@ The first column contains the dataset name and the second column the number of c
 ## Exploring the Results
 
 For visual exploration of mapping results so-called "Genome Browsers" are used.
-Here we will use the _Integrative Genomics Viewer (IGV)_ ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
+Here we will use the 'Integrative Genomics Viewer' (`igv`) ([https://software.broadinstitute.org/software/igv/](https://software.broadinstitute.org/software/igv/)).
 
-To open IGV, simply type the following command and the app will open:
+To open `igv`, simply type the following command and the app will open:
 
 ```bash
 igv
 ```
 
-Note that you cannot use the terminal while IGV is open. If you want to use it anyways, open a second terminal via the bar on the bottom.
+Note that you cannot use the terminal while `igv` is open. If you want to use it anyway, open a second terminal via the bar at the bottom.
 
 Load your reference (`YpestisCO92.fa`):
 
@@ -390,7 +390,7 @@ Have a look at `snpTable.tsv`.
 Can you identify SNPs that were called with lenient but not with strict parameters or vice versa?
 :::
 
-Let’s check out some of these in IGV.
+Let’s check out some of these in `igv`.
 
 ::: {.callout-tip title="Question" appearance="simple"}
 Do you observe certain patterns in these genomic regions?
@@ -433,33 +433,29 @@ Such regions can be fairly large. For example, see this 20 kb region around posi
 
 ## (Optional) clean-up
 
-Let's clean up your working directory by removing all the data and output from this chapter.
-
-When closing your `jupyter` notebook(s), say no to saving any additional files.
+Let's clean up our working directory by removing all the data and output from this chapter.
 
-Press <kbd>ctrl</kbd> + <kbd>c</kbd> on your terminal, and type <kbd>y</kbd> when requested. 
-Once completed, the command below will remove the `/<PATH>/<TO>/genome-mapping directory` **as well as all of its contents**. 
+The command below will remove the `/<PATH>/<TO>/genome-mapping` directory _as well as all of its contents_.
 
 ::: {.callout-tip}
 ## Pro Tip
-Always be VERY careful when using `rm -r`. Check 3x that the path you are
-specifying is exactly what you want to delete and nothing more before pressing
-ENTER!
+Always be VERY careful when using `rm -r`.
+Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
 :::
 
 ```bash
 rm -r /<PATH>/<TO>/genome-mapping*
 ```
 
-Once deleted you can move elsewhere (e.g. `cd ~`).
+Once deleted we can move elsewhere (e.g. `cd ~`).
 
-We can also get out of the `conda` environment with
+We can also get out of the `conda` environment with:
 
 ```bash
 conda deactivate
 ```
 
-To delete the conda environment
+Then, to delete the `conda` environment:
 
 ```bash
 conda remove --name genome-mapping --all -y