Commit fd603b0

Merge pull request #50 from LouisLeNezet/nf-core-template-merge-2.14.1
Nf core template merge 2.14.1
2 parents e9337f4 + e8cb498 commit fd603b0

307 files changed: 15016 additions & 626 deletions

.editorconfig

Lines changed: 6 additions & 0 deletions
@@ -31,3 +31,9 @@ indent_size = unset
 # ignore python and markdown
 [*.{py,md}]
 indent_style = unset
+
+[/docs/*.xml]
+indent_style = unset
+
+[/docs/images/metro/*.xml]
+indent_style = unset

.github/workflows/ci.yml

Lines changed: 7 additions & 1 deletion
@@ -1,5 +1,6 @@
 name: nf-core CI
 # This workflow runs the pipeline with the minimal test dataset to check that it completes without any syntax errors
+
 on:
 push:
 branches:
@@ -26,6 +27,11 @@ jobs:
 NXF_VER:
 - "23.04.0"
 - "latest-everything"
+TEST_PROFILE:
+- "test"
+- "test_sim"
+- "test_quilt"
+- "test_stitch"
 steps:
 - name: Check out pipeline code
 uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
@@ -43,4 +49,4 @@ jobs:
 # For example: adding multiple test runs with different parameters
 # Remember that you can parallelise this by using strategy.matrix
 run: |
-nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results
+nextflow run ${GITHUB_WORKSPACE} -profile "${{ matrix.TEST_PROFILE }}",docker --outdir ./results
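
To make the effect of the new `TEST_PROFILE` entry concrete, here is a minimal Python sketch of how the job matrix would expand. It assumes that both `NXF_VER` and `TEST_PROFILE` sit under `strategy.matrix` with no `include`/`exclude` rules (the hunk does not show the enclosing keys). `expand_matrix` is a hypothetical helper for illustration only, not part of the workflow.

```python
from itertools import product

# Values copied from the workflow hunk above.
NXF_VER = ["23.04.0", "latest-everything"]
TEST_PROFILE = ["test", "test_sim", "test_quilt", "test_stitch"]


def expand_matrix(nxf_versions, test_profiles):
    """Return one job description per (NXF_VER, TEST_PROFILE) combination."""
    jobs = []
    for nxf_ver, profile in product(nxf_versions, test_profiles):
        # Mirrors the run line of the workflow with the matrix value substituted.
        command = (
            f"nextflow run ${{GITHUB_WORKSPACE}} "
            f"-profile {profile},docker --outdir ./results"
        )
        jobs.append({"NXF_VER": nxf_ver, "TEST_PROFILE": profile, "command": command})
    return jobs


jobs = expand_matrix(NXF_VER, TEST_PROFILE)
for job in jobs:
    print(job["NXF_VER"], "|", job["command"])
```

Under these assumptions the single CI run per Nextflow version becomes four, one per test profile.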

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -6,3 +6,5 @@ results/
 testing/
 testing*
 *.pyc
+*.code-workspace
+.nf-test*

CHANGELOG.md

Lines changed: 15 additions & 0 deletions
@@ -9,8 +9,23 @@ Initial release of nf-core/phaseimpute, created with the [nf-core](https://nf-co
 
 ### `Added`
 
+### `Changed`
+
+- [#18](https://github.com/nf-core/phaseimpute/pull/18)
+- Maps and region by chromosome
+- update tests config files
+- correct meta map propagation
+- Test impute and test sim works
+- [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
+- [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
+- [#22](https://github.com/nf-core/phaseimpute/pull/20) - Add validation step for concordance analysis. Input channels changed to match inputs steps. Outdir folder organised by steps. Modules config by subworkflows.
+- [#26](https://github.com/nf-core/phaseimpute/pull/26) - Added QUILT method
+
 ### `Fixed`
 
+- [#15](https://github.com/nf-core/phaseimpute/pull/15) - Changed test csv files to point to nf-core repository
+- [#16](https://github.com/nf-core/phaseimpute/pull/16) - Removed outdir from test config files
+
 ### `Dependencies`
 
 ### `Deprecated`

CITATIONS.md

Lines changed: 14 additions & 2 deletions
@@ -10,9 +10,21 @@
 
 ## Pipeline tools
 
-- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
+- [QUILT](https://pubmed.ncbi.nlm.nih.gov/34083788/)
 
-> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
+> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
+
+- [GLIMPSE](https://www.nature.com/articles/s41588-020-00756-0)
+
+> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
+
+- [Shapeit](https://odelaneau.github.io/shapeit5/)
+
+> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
+
+- [bcftools](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3198575/)
+
+> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
 
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 

README.md

Lines changed: 41 additions & 23 deletions
@@ -19,50 +19,43 @@
 
 ## Introduction
 
-**nf-core/phaseimpute** is a bioinformatics pipeline that ...
+**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Different steps are available, each corresponding to a dedicated mode.
 
-<!-- TODO nf-core:
-Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-major pipeline sections and the types of output it produces. You're giving an overview to someone new
-to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+### Main steps of the pipeline
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
+The **phaseimpute** pipeline consists of 5 main steps:
 
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+| Metro map | Modes |
+| --- | --- |
+| <img src="docs/images/metro/MetroMap.png" alt="metromap" width="800"/> | - **Pre-processing**: Phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Phase**: Phasing of the target dataset on the reference panel <br> - **Simulate**: Simulation of the target dataset from high-quality target data <br> - **Concordance**: Concordance between the target dataset and a truth dataset <br> - **Post-processing**: Variant filtering based on their imputation quality |
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-Explain what rows and columns represent. For instance (please edit as appropriate):
-
+The basic usage of this pipeline is to impute a target dataset based on a phased panel.
 First, prepare a samplesheet with your input data that looks as follows:
 
 `samplesheet.csv`:
 
 ```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
+sample,bam,bai
+1_BAM_1X,/path/to/.bam,/path/to/.bai
 ```
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
-
--->
+Each row represents a BAM file with its index file.
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/phaseimpute \
 -profile <docker/singularity/.../institute> \
 --input samplesheet.csv \
+--genome "GRCh38" \
+--panel <phased_reference_panel.vcf.gz> \
+--steps "impute" \
+--tools "glimpse1" \
 --outdir <OUTDIR>
 ```
 

@@ -72,6 +65,19 @@ nextflow run nf-core/phaseimpute \
 
 For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).
 
+## Description of the different modes of the pipeline
+
+Here is a short description of the different modes of the pipeline.
+For more information, please refer to the [documentation](https://nf-core.github.io/phaseimpute/usage/).
+
+| Mode | Flow chart | Description |
+| --- | --- | --- |
+| **Preprocessing** | <img src="docs/images/metro/PreProcessing.png" alt="phase_metro" width="600"/> | The preprocessing mode prepares the input files that will be used by the phasing process. <br> The main processes are: <br> - **Haplotype phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Filtering** the reference panel to select only the necessary variants. <br> - **Chunking the reference panel** into subsets of regions for all the chromosomes. <br> - **Extracting** the positions at which to perform the imputation. |
+| **Phasing** | <img src="docs/images/metro/Phase.png" alt="phase_metro" width="600"/> | The phasing mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Phasing**: Phasing of the target dataset on the reference panel using either: <br> &emsp; - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) <br> &emsp; This requires computing the genotype likelihoods of the target dataset. <br> &emsp; This step is done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup) <br> &emsp; - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) For this step the reference panel is transformed to binary chunks. <br> &emsp; - [**Stitch**](https://github.com/rwdavies/stitch) <br> &emsp; - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together. <br> - **Sampling** (optional) |
+| **Simulate** | <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/> | The simulation mode is used to create artificial, less informative genetic data from high-density data. This allows the imputed result to be compared to a _truth_ and therefore the quality of the imputation to be evaluated. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsampling** BAM or CRAM files using [SAMTOOLS_view -s]() at different depths <br> - Genotype data by **SNP selection** of the positions used by a designated SNP chip. <br> The simulation mode will also compute the **genotype likelihoods** of the high-density data. |
+| **Concordance** | <img src="docs/images/metro/Concordance.png" alt="concordance_metro" width="600"/> | This mode compares two VCF files to compute a summary of the differences between them. <br> To do so it uses either: <br> - The [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) concordance process. <br> - The [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. <br> - Or it converts the two VCF files to `.zarr` using [**scikit-allel**](https://scikit-allel.readthedocs.io/en/stable/) and [**anndata**](https://anndata.readthedocs.io/en/latest/) before comparing the SNPs. |
+| **Postprocessing** | <img src="docs/images/metro/PostProcessing.png" alt="postprocessing_metro" width="600"/> | This final process makes it possible to loop over the whole pipeline to improve the performance of the imputation. To do so, it selects the best-imputed positions and reruns the analysis using these positions. |
+
 ## Pipeline output
 
 To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
@@ -80,16 +86,20 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/phaseimpute was originally written by LouisLeNezet.
+nf-core/phaseimpute was originally written by Louis Le Nézet.
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+- Anabella Trigila
+- Saul Pierotti
+- Eugenia Fontecha
+- Matias Romero Victorica
 
 ## Contributions and Support
 
 If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
 
+For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
 For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
 
 ## Citations
@@ -99,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
 
 <!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
 
+You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:
+
+> **Rapid genotype imputation from sequence with reference panels.**
+>
+> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
+>
+> _Nature genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)
+
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 You can cite the `nf-core` publication as follows:

assets/chr_rename_add.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+1 chr1
+2 chr2
+3 chr3
+4 chr4
+5 chr5
+6 chr6
+7 chr7
+8 chr8
+9 chr9
+10 chr10
+11 chr11
+12 chr12
+13 chr13
+14 chr14
+15 chr15
+16 chr16
+17 chr17
+18 chr18
+19 chr19
+20 chr20
+21 chr21
+22 chr22
+23 chr23
+24 chr24
+25 chr25
+26 chr26
+27 chr27
+28 chr28
+29 chr29
+30 chr30
+31 chr31
+32 chr32
+33 chr33
+34 chr34
+35 chr35
+36 chr36
+37 chr37
+38 chr38
+X chrX

assets/chr_rename_del.txt

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+chr1 1
+chr2 2
+chr3 3
+chr4 4
+chr5 5
+chr6 6
+chr7 7
+chr8 8
+chr9 9
+chr10 10
+chr11 11
+chr12 12
+chr13 13
+chr14 14
+chr15 15
+chr16 16
+chr17 17
+chr18 18
+chr19 19
+chr20 20
+chr21 21
+chr22 22
+chr23 23
+chr24 24
+chr25 25
+chr26 26
+chr27 27
+chr28 28
+chr29 29
+chr30 30
+chr31 31
+chr32 32
+chr33 33
+chr34 34
+chr35 35
+chr36 36
+chr37 37
+chr38 38
+chr39 X
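
The two rename files read as lookup tables of `old new` contig names, one adding the `chr` prefix and one removing it. If they are meant to be inverses of each other (an assumption, the commit does not say so), a short Python sketch can check that every pair in one file is mapped back by the other. `parse_rename_map` and `non_inverse_pairs` are hypothetical helper names, and only the last two rows of each file are used here.

```python
def parse_rename_map(text):
    """Parse whitespace-separated 'old new' pairs, one per line, into a dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        old, new = line.split()
        mapping[old] = new
    return mapping


def non_inverse_pairs(add_map, del_map):
    """Return (old, new) pairs of add_map that del_map does not map back to old."""
    return [(old, new) for old, new in add_map.items() if del_map.get(new) != old]


# Last two rows of each file, copied from the hunks above.
add_map = parse_rename_map("38 chr38\nX chrX\n")
del_map = parse_rename_map("chr38 38\nchr39 X\n")

print(non_inverse_pairs(add_map, del_map))
```

On this excerpt the check reports the `X`/`chrX` pair, because the removal file lists `chr39 X` rather than `chrX X`. Whether that is intended cannot be determined from the diff alone.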

assets/panel.csv

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+panel,chr,vcf,index
+1000GP,chr21,1000GP_21.vcf,1000GP_21.vcf.csi
+1000GP,chr22,1000GP_22.vcf,1000GP_22.vcf.csi

assets/regionsheet.csv

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+chr,start,end
+20,20000000,2200000
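
A region sheet with `chr,start,end` columns can be sanity-checked before use. The following Python sketch assumes that a valid region has `start` strictly below `end`; whether the pipeline enforces this is not shown in the commit, and `invalid_regions` is a hypothetical helper.

```python
import csv
import io


def invalid_regions(csv_text):
    """Return the rows whose start coordinate is not strictly below end."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if int(row["start"]) >= int(row["end"])]


# Content of assets/regionsheet.csv as added in the hunk above.
regionsheet = "chr,start,end\n20,20000000,2200000\n"

for row in invalid_regions(regionsheet):
    print(f"suspicious region: {row['chr']}:{row['start']}-{row['end']}")
```

Under that assumption the single example row is flagged, since its end coordinate (2200000) is smaller than its start (20000000).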

assets/samplesheet.csv

Lines changed: 3 additions & 3 deletions
@@ -1,3 +1,3 @@
-sample,fastq_1,fastq_2
-SAMPLE_PAIRED_END,/path/to/fastq/files/AEG588A1_S1_L002_R1_001.fastq.gz,/path/to/fastq/files/AEG588A1_S1_L002_R2_001.fastq.gz
-SAMPLE_SINGLE_END,/path/to/fastq/files/AEG588A4_S4_L003_R1_001.fastq.gz,
+sample,bam,bai
+1_BAM_1X,/path/to/.bam,/path/to/.bai
+1_BAM_SNP,/path/to/.bam,/path/to/.bai

assets/schema_input.json

Lines changed: 8 additions & 12 deletions
@@ -1,7 +1,7 @@
 {
 "$schema": "http://json-schema.org/draft-07/schema",
 "$id": "https://raw.githubusercontent.com/nf-core/phaseimpute/master/assets/schema_input.json",
-"title": "nf-core/phaseimpute pipeline - params.input schema",
+"title": "nf-core/phaseimpute pipeline - params.input",
 "description": "Schema for the file provided with params.input",
 "type": "array",
 "items": {
@@ -13,21 +13,17 @@
 "errorMessage": "Sample name must be provided and cannot contain spaces",
 "meta": ["id"]
 },
-"fastq_1": {
+"file": {
 "type": "string",
-"format": "file-path",
-"exists": true,
-"pattern": "^\\S+\\.f(ast)?q\\.gz$",
-"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+"pattern": "^\\S+\\.(bam)|((vcf|bcf)(\\.gz))?$",
+"errorMessage": "BAM, VCF or BCF file must be provided, cannot contain spaces and must have extension '.bam' or '.vcf', '.bcf' with optional '.gz' extension"
 },
-"fastq_2": {
+"index": {
+"errorMessage": "Input file index must be provided, cannot contain spaces and must have extension '.bai', '.tbi' or '.csi'",
 "type": "string",
-"format": "file-path",
-"exists": true,
-"pattern": "^\\S+\\.f(ast)?q\\.gz$",
-"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
+"pattern": "^\\S+\\.(bai|tbi|csi)$"
 }
 },
-"required": ["sample", "fastq_1"]
+"required": ["sample", "file", "index"]
 }
 }
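
The new `file` pattern mixes alternation and an optional group, so its behaviour is worth checking against a few names. The Python sketch below applies the pattern with search semantics, since JSON Schema `pattern` is not implicitly anchored. Python's `re` is used as a stand-in for the validator's regex engine, which is an assumption. `STRICTER` is a hypothetical alternative written to match the wording of the error message; it is not part of the commit.

```python
import re

# Pattern from the schema hunk above, with JSON string escaping removed.
ORIGINAL = r"^\S+\.(bam)|((vcf|bcf)(\.gz))?$"
# Hypothetical stricter variant: the alternation is grouped, '.gz' is optional.
STRICTER = r"^\S+\.(bam|(vcf|bcf)(\.gz)?)$"


def accepts(pattern, value):
    """Return True if the pattern matches anywhere in value (search semantics)."""
    return re.search(pattern, value) is not None


for name in ["sample.bam", "sample.vcf.gz", "sample.bcf", "notes.txt"]:
    print(f"{name}: original={accepts(ORIGINAL, name)} stricter={accepts(STRICTER, name)}")
```

In this sketch the original pattern accepts every name, including `notes.txt`, because its second alternative is entirely optional and can match the empty string at the end of the input. The grouped variant rejects `notes.txt` while still accepting `.bam`, `.vcf`, `.bcf` and their `.gz` forms.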
