Nextflow pipeline for downloading raw GWAS summary statistics for subsequent processing with `nf-munge-sumstats`

comp-med/nf-download-sumstats

nf-download-sumstats

Introduction

This is a Nextflow pipeline that streamlines downloading GWAS summary statistics. The downloaded files can then be processed with the nf-munge-sumstats pipeline, which converts them into a harmonized format (GWAS-VCF) and lifts them to both the GRCh37 and GRCh38 genome builds.

The pipeline covers the following steps:

  1. Work with local files or download summary statistics from openGWAS, GWAS Catalog or an arbitrary link

Input Table

The pipeline requires a single input table in CSV format, whose path is specified in the input_table parameter in nextflow.config.

Create the input table with the following columns.

library(data.table)
input_table <- data.table(
    phenotype_id = NA_character_,
    data_source = NA_character_,
    data_id = NA_character_,
    data_link = NA_character_,
    data_location = NA_character_
)

Make sure that the table contains exactly these columns and that the contents conform to the following:

  • All entries in phenotype_id must be unique, since this will be the name of the output directory for the formatted summary statistics
  • data_source must be one of gwas_catalog, open_gwas, other or local. Depending on its value, the row must contain a valid entry in one of the remaining columns:
    • When open_gwas: data_id contains an openGWAS study accession (e.g. ieu-b-5118)
    • When gwas_catalog: data_id contains a GWAS Catalog study accession (e.g. GCST90204201) with full summary statistics available from their FTP servers
    • When other: data_link contains a download link that can be directly accessed via wget
    • When local: data_location contains a valid absolute path to the corresponding file

Currently, there are no checks in place for incorrect inputs, and the pipeline will simply fail if any single input is faulty!

# Make sure `phenotype_id` is unique!
stopifnot(
    "`phenotype_id` must be unique!" = !any(
        duplicated(input_table$phenotype_id)
    )
)

# Make sure `data_source` contains valid entries! 
stopifnot(
    "Unexpected entry in `data_source`!" = all(
        unique(input_table$data_source) %in% c(
            "gwas_catalog", 
            "open_gwas",
            "other",
            "local"
        )
    )
)

Software

Currently, the pipeline does not automatically set up all required software and R packages, so make sure the required software dependencies are installed and configured in nextflow.config.

Getting Started

Create an input table with the source for each phenotype, save it and provide the path to the input table in the input_table parameter.

library(data.table)
input_table <- data.table(
    phenotype_id = c("atrial_fibrillation", "body_mass_index"),
    data_source = c("gwas_catalog", "open_gwas"),
    data_id = c("GCST90204201", "ieu-b-5118"),
    data_link = NA_character_,
    data_location = NA_character_
)
fwrite(input_table, "</PATH/TO/>input_table.csv", sep = ",")
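
Because the pipeline does not validate its input, it can help to sanity-check the saved CSV from the shell before starting a run. The following is a minimal sketch, not part of the pipeline: the function name `check_input_table` is hypothetical, and it assumes the column order shown above, a header row, missing values written as empty fields (or `NA`), and no quoted fields that contain commas.

```shell
# Hypothetical helper (not part of the pipeline): checks that every row of the
# input table has a unique phenotype_id, a known data_source, and a value in
# the column that this data_source requires.
# Assumes columns in the order:
#   phenotype_id,data_source,data_id,data_link,data_location
check_input_table() {
    awk -F',' '
        # Strip one pair of surrounding double quotes, if present
        function clean(s) { gsub(/^"|"$/, "", s); return s }
        BEGIN {
            # Column index required by each data_source
            need["open_gwas"] = 3; need["gwas_catalog"] = 3
            need["other"] = 4;     need["local"] = 5
            name[3] = "data_id"; name[4] = "data_link"; name[5] = "data_location"
        }
        NR == 1 { next }                      # skip the header row
        {
            sub(/\r$/, "")                    # tolerate Windows line endings
            id = clean($1); src = clean($2)
            if (id in seen) {
                printf "row %d: duplicated phenotype_id %s\n", NR - 1, id
                bad = 1
            }
            seen[id] = 1
            if (!(src in need)) {
                printf "row %d (%s): unexpected data_source %s\n", NR - 1, id, src
                bad = 1
                next
            }
            value = clean($(need[src]))
            if (value == "" || value == "NA") {
                printf "row %d (%s): %s must be set for %s\n", NR - 1, id, name[need[src]], src
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }            # non-zero exit if any row failed
    ' "$1"
}

# Usage (path is a placeholder):
# check_input_table "</PATH/TO/>input_table.csv" && echo "input table looks OK"
```

The check only confirms that the required field is non-empty; it does not verify that an accession exists, that a link is reachable, or that a local path points to a file.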

Make sure to add all required parameters and the path to the input table in nextflow.config. Afterwards, you can run the pipeline.

First, check that the pipeline executes by starting a dry-run.

# When running the pipeline locally, e.g. on your laptop
nextflow run main.nf -stub-run 

# When on an HPC with SLURM, use the `cluster` profile
nextflow run main.nf -stub-run -profile cluster

If that finished successfully, run the actual pipeline.

# Again, when on an HPC with SLURM, add the `-profile cluster` flag
nextflow run main.nf -profile cluster

Output

By default, all output is saved in the ./output directory in the pipeline's base directory. Raw summary statistics files can be found in its raw subdirectory (./output/raw). Files for each phenotype are saved in subdirectories named after the entries in the phenotype_id column of params.input_table.

output/
└── raw
    ├── phenotype_1
    │   └── raw_sumstat_file.gz
    ├── phenotype_2
    │   └── raw_sumstat_file.vcf.gz
[...]
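
After a run, the layout above can be used to confirm that every phenotype in the input table produced at least one raw file. The following is a rough sketch under stated assumptions: the function name `check_raw_output` is hypothetical, the default ./output/raw location is assumed unless a second argument is given, and phenotype_id is taken to be the first CSV column.

```shell
# Hypothetical helper (not part of the pipeline): reports phenotypes from the
# input table whose raw output directory is missing or empty.
# Arguments: 1) path to the input table, 2) raw output directory (optional)
check_raw_output() {
    input_table=$1
    raw_dir=${2:-output/raw}
    missing=0
    {
        read -r header                        # skip the header row
        while IFS=, read -r phenotype_id rest || [ -n "$phenotype_id" ]; do
            # Strip surrounding double quotes, if any
            phenotype_id=${phenotype_id%\"}
            phenotype_id=${phenotype_id#\"}
            [ -n "$phenotype_id" ] || continue
            # An absent or empty directory yields no listing
            if [ -z "$(ls -A "$raw_dir/$phenotype_id" 2>/dev/null)" ]; then
                echo "no raw files for: $phenotype_id"
                missing=1
            fi
        done
    } < "$input_table"
    return "$missing"
}

# Usage (path is a placeholder):
# check_raw_output "</PATH/TO/>input_table.csv" output/raw
```

The function returns a non-zero status if any phenotype has no files, so it can be chained with other commands in a script.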
