- [#19](https://github.com/nf-core/phaseimpute/pull/19) - Changed reference panel to accept a csv, update modules and subworkflows (glimpse1/2 and shapeit5)
20
+
- [#20](https://github.com/nf-core/phaseimpute/pull/20) - Added automatic detection of vcf contigs for the reference panel and automatic renaming available
21
+
- [#22](https://github.com/nf-core/phaseimpute/pull/22) - Add validation step for concordance analysis. Input channels changed to match input steps. Outdir folder organised by steps. Modules config by subworkflows.
> Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].
15
+
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., ... & Myers, S. (2021). Rapid genotype imputation from sequence with reference panels. Nature genetics, 53(7), 1104-1111.
> Rubinacci, S., Ribeiro, D. M., Hofmeister, R. J., & Delaneau, O. (2021). Efficient phasing and imputation of low-coverage sequencing data using large reference panels. Nature Genetics, 53(1), 120-126.
20
+
21
+
- [Shapeit](https://odelaneau.github.io/shapeit5/)
22
+
23
+
> Hofmeister RJ, Ribeiro DM, Rubinacci S., Delaneau O. (2023). Accurate rare variant phasing of whole-genome and whole-exome sequencing data in the UK Biobank. Nature Genetics doi: https://doi.org/10.1038/s41588-023-01415-w
> Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993.
README.md: 41 additions & 23 deletions
@@ -19,50 +19,43 @@
19
19
20
20
## Introduction
21
21
22
-
**nf-core/phaseimpute** is a bioinformatics pipeline that ...
22
+
**nf-core/phaseimpute** is a bioinformatics pipeline to phase and impute genetic data. Several steps are available, each corresponding to a dedicated mode.
23
23
24
-
<!-- TODO nf-core:
25
-
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
26
-
major pipeline sections and the types of output it produces. You're giving an overview to someone new
27
-
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
28
-
-->
24
+
### Main steps of the pipeline
29
25
30
-
<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
31
-
workflows use the "tube map" design for that. See https://nf-co.re/docs/contributing/design_guidelines#examples for examples. -->
32
-
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
26
+
The **phaseimpute** pipeline consists of 5 main steps:
| <img src="docs/images/metro/MetroMap.png" alt="metromap" width="800"/> | - **Pre-processing**: Phasing, QC, variant filtering, variant annotation of the reference panel <br> - **Phase**: Phasing of the target dataset on the reference panel <br> - **Simulate**: Simulation of the target dataset from high quality target data <br> - **Concordance**: Concordance between the target dataset and a truth dataset <br> - **Post-processing**: Variant filtering based on their imputation quality |
36
31
37
32
## Usage
38
33
39
34
> [!NOTE]
40
35
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
41
36
42
-
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
43
-
Explain what rows and columns represent. For instance (please edit as appropriate):
44
-
37
+
The basic usage of this pipeline is to impute a target dataset based on a phased panel.
45
38
First, prepare a samplesheet with your input data that looks as follows:
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
55
-
56
-
-->
47
+
Each row represents a bam file with its index file.
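As a rough illustration of the sentence above, the snippet below writes a two-row samplesheet and checks its shape. The column names `sample`, `file` and `index` are assumed from the input schema changed in this pull request, and every sample name and path is a hypothetical placeholder, so adapt them to your data.

```bash
# Write a minimal samplesheet. The header uses the assumed column names
# (sample, file, index); all sample names and paths are hypothetical.
cat > samplesheet.csv <<'EOF'
sample,file,index
SAMPLE_1,/path/to/sample_1.bam,/path/to/sample_1.bam.bai
SAMPLE_2,/path/to/sample_2.bam,/path/to/sample_2.bam.bai
EOF

# Sanity check: every row should have exactly three comma-separated fields.
awk -F',' 'NF != 3 { bad = 1 } END { exit bad }' samplesheet.csv \
    && echo "samplesheet.csv looks well-formed"
```

The check only verifies the number of fields per row. It does not confirm that the files exist or that the pipeline accepts them.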
57
48
58
49
Now, you can run the pipeline using:
59
50
60
-
<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
61
-
62
51
```bash
nextflow run nf-core/phaseimpute \
   -profile <docker/singularity/.../institute> \
   --input samplesheet.csv \
   --genome "GRCh38" \
   --panel <phased_reference_panel.vcf.gz> \
   --steps "impute" \
   --tools "glimpse1" \
   --outdir <OUTDIR>
```
68
61
@@ -72,6 +65,19 @@ nextflow run nf-core/phaseimpute \
72
65
73
66
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/phaseimpute/usage) and the [parameter documentation](https://nf-co.re/phaseimpute/parameters).
74
67
68
+
## Description of the different modes of the pipeline
69
+
70
+
Here is a short description of the different modes of the pipeline.
71
+
For more information please refer to the [documentation](https://nf-core.github.io/phaseimpute/usage/).
|**Preprocessing**| <img src="docs/images/metro/PreProcessing.png" alt="phase_metro" width="600"/> | The preprocessing mode prepares the input files that will be used by the phasing process. <br> The main processes are: <br> - **Haplotype phasing** of the reference panel using [**Shapeit5**](https://odelaneau.github.io/shapeit5/). <br> - **Filtering** the reference panel to keep only the necessary variants. <br> - **Chunking the reference panel** into regions across all the chromosomes. <br> - **Extracting** the positions at which to perform the imputation. |
76
+
|**Phasing**| <img src="docs/images/metro/Phase.png" alt="phase_metro" width="600"/> | The phasing mode is the core mode of this pipeline. <br> It consists of 3 main steps: <br> - **Phasing**: Phasing of the target dataset on the reference panel using either: <br>   - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) <br>   This tool requires computing the genotype likelihoods of the target dataset. <br>   This step is done using [BCFTOOLS_mpileup](https://samtools.github.io/bcftools/bcftools.html#mpileup) <br>   - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) For this step the reference panel is transformed to binary chunks. <br>   - [**Stitch**](https://github.com/rwdavies/stitch) <br>   - [**Quilt**](https://github.com/rwdavies/QUILT) <br> - **Ligation**: all the different chunks are merged together. <br> - **Sampling** (optional) |
77
+
|**Simulate**| <img src="docs/images/metro/Simulate.png" alt="simulate_metro" width="600"/> | The simulation mode is used to create artificial low-information genetic data from high-density data. This allows the imputed result to be compared to a _truth_ and therefore the quality of the imputation to be evaluated. <br> For the moment it is possible to simulate: <br> - Low-pass data by **downsampling** BAM or CRAM files using [SAMTOOLS_view -s]() at different depths <br> - Genotype data by **selecting the SNP** positions used by a designated SNP chip. <br> The simulation mode will also compute the **genotype likelihoods** of the high-density data. |
78
+
|**Concordance**| <img src="docs/images/metro/Concordance.png" alt="concordance_metro" width="600"/> | This mode compares two VCF files to compute a summary of the differences between them. <br> To do so, it uses either: <br> - [**Glimpse1**](https://odelaneau.github.io/GLIMPSE/glimpse1/index.html) concordance process. <br> - [**Glimpse2**](https://odelaneau.github.io/GLIMPSE/glimpse2/index.html) concordance process. <br> - Or converts the two VCF files to `.zarr` using [**scikit-allel**](https://scikit-allel.readthedocs.io/en/stable/) and [**anndata**](https://anndata.readthedocs.io/en/latest/) before comparing the SNPs. |
79
+
|**Postprocessing**| <img src="docs/images/metro/PostProcessing.png" alt="postprocessing_metro" width="600"/> | This final process makes it possible to loop the whole pipeline in order to improve the performance of the imputation. To do so, it selects the best-imputed positions and reruns the analysis using these positions. |
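The Simulate mode above downsamples BAM or CRAM files with `samtools view -s`, where the integer part of the value is the random seed and the fractional part is the fraction of reads to keep. The sketch below shows one way to derive that value from a known mean depth and a target depth. The helper `subsample_arg` and the file names are hypothetical, and this is not necessarily how the pipeline computes the fraction internally.

```bash
# Build the value passed to `samtools view -s`:
# integer part = seed, fractional part = fraction of reads kept.
subsample_arg() {
    # $1 = seed (integer), $2 = current mean depth, $3 = target mean depth
    awk -v seed="$1" -v cur="$2" -v tgt="$3" 'BEGIN {
        if (cur <= 0) exit 1                             # invalid input depth
        frac = tgt / cur
        if (frac < 0.0001 || frac >= 0.9999) exit 1      # nothing sensible to downsample
        printf "%d.%04d\n", seed, int(frac * 10000 + 0.5)
    }'
}

# Example: bring a 30x file down to roughly 1x with seed 42.
subsample_arg 42 30 1

# Hypothetical use with samtools (commented out so the sketch runs without it):
# samtools view -b -s "$(subsample_arg 42 30 1)" -o low_pass.bam high_depth.bam
```

The fraction is rounded to four decimals, so the resulting depth is approximate. The function returns a non-zero status when the target depth is not meaningfully lower than the current one.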
80
+
75
81
## Pipeline output
76
82
77
83
To see the results of an example test run with a full size dataset refer to the [results](https://nf-co.re/phaseimpute/results) tab on the nf-core website pipeline page.
@@ -80,16 +86,20 @@ For more details about the output files and reports, please refer to the
80
86
81
87
## Credits
82
88
83
-
nf-core/phaseimpute was originally written by LouisLeNezet.
89
+
nf-core/phaseimpute was originally written by Louis Le Nézet.
84
90
85
91
We thank the following people for their extensive assistance in the development of this pipeline:
86
92
87
-
<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
93
+
- Anabella Trigila
94
+
- Saul Pierotti
95
+
- Eugenia Fontecha
96
+
- Matias Romero Victorica
88
97
89
98
## Contributions and Support
90
99
91
100
If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
92
101
102
+
93
103
For further information or help, don't hesitate to get in touch on the [Slack `#phaseimpute` channel](https://nfcore.slack.com/channels/phaseimpute) (you can join with [this invite](https://nf-co.re/join/slack)).
94
104
95
105
## Citations
@@ -99,6 +109,14 @@ For further information or help, don't hesitate to get in touch on the [Slack `#
99
109
100
110
<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
101
111
112
+
You can cite one of the main imputation methods ([`QUILT`](https://github.com/rwdavies/QUILT)) as follows:
113
+
114
+
> **Rapid genotype imputation from sequence with reference panels.**
115
+
>
116
+
> Davies, R. W., Kucka, M., Su, D., Shi, S., Flanagan, M., Cunniff, C. M., Chan, Y. F., & Myers, S.
117
+
>
118
+
> _Nature genetics_ 2021 June 03. doi: [10.1038/s41588-021-00877-0](https://doi.org/10.1038/s41588-021-00877-0)
119
+
102
120
An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
103
121
104
122
You can cite the `nf-core` publication as follows:
"description": "Schema for the file provided with params.input",
6
6
"type": "array",
7
7
"items": {
@@ -13,21 +13,17 @@
13
13
"errorMessage": "Sample name must be provided and cannot contain spaces",
14
14
"meta": ["id"]
15
15
},
16
-
"fastq_1": {
16
+
"file": {
17
17
"type": "string",
18
-
"format": "file-path",
19
-
"exists": true,
20
-
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
21
-
"errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
18
+
"pattern": "^\\S+\\.(bam|(vcf|bcf)(\\.gz)?)$",
19
+
"errorMessage": "BAM, VCF or BCF file must be provided, cannot contain spaces and must have extension '.bam' or '.vcf', '.bcf' with optional '.gz' extension"
22
20
},
23
-
"fastq_2": {
21
+
"index": {
22
+
"errorMessage": "Input file index must be provided, cannot contain spaces and must have extension '.bai', '.tbi' or '.csi'",
24
23
"type": "string",
25
-
"format": "file-path",
26
-
"exists": true,
27
-
"pattern": "^\\S+\\.f(ast)?q\\.gz$",
28
-
"errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'"
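The error message for the `file` field in the schema above describes the accepted names: no spaces, and an extension of `.bam`, `.vcf` or `.bcf`, the last two optionally followed by `.gz`. The shell sketch below expresses that stated rule, using a POSIX character class in place of `\S`. `is_valid_input` is a hypothetical helper for checking names before a run, not part of the pipeline.

```bash
# Return 0 when a file name follows the rule stated in the schema error message:
# no whitespace, ending in .bam, .vcf, .bcf, .vcf.gz or .bcf.gz.
is_valid_input() {
    printf '%s\n' "$1" | grep -Eq '^[^[:space:]]+\.(bam|(vcf|bcf)(\.gz)?)$'
}

# Try a few hypothetical names.
for name in sample.bam panel.vcf.gz calls.bcf reads.fastq.gz "my file.bam"; do
    if is_valid_input "$name"; then
        echo "accepted: $name"
    else
        echo "rejected: $name"
    fi
done
```

Grouping the alternatives inside one pair of parentheses matters here: without it, `^` and `$` each bind to only one branch of the alternation, and the optional branch can match an empty string at the end of any name.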