nf-index

Nextflow pipelines to build an index of accessible elements and perform follow-up analyses

Requirements

Quick start

  • Follow the first steps of the General usage section below.
  • Create the files required by the pipeline (e.g. samples_meta, samples_order, etc.).
  • Fill in the parameters required by the pipeline in params.config.
  • Run the pipeline with the corresponding command as described in the Usage section. For example, to run NMF, use nextflow run nmf.nf -profile Altius -resume.

Description of pipelines:

  • build_masterlist.nf - Build an index of accessible elements using the approach described in Meuleman et al.
  • generate_matrices.nf - Use the constructed index as a scaffold to generate count (number of reads overlapping each DHS) and binary (presence/absence of a peak) matrices.
  • filter_peaks.nf - Filter peaks and convert the data to numpy binary format for follow-up analysis. We filter out:
    1. Peaks overlapping ENCODE blacklisted regions
    2. Low-signal singletons (low-signal peaks identified in only one sample)
    3. Peaks located on non-autosomal chromosomes (for NMF analysis and normalization)
  • normalize_signal.nf - Normalize the filtered count matrix with lowess normalization followed by a DESeq2 variance-stabilizing transformation (VST). A separate workflow applies an existing set of normalization parameters to new samples.
  • nmf.nf - Run non-negative matrix factorization (NMF) for a set of matrices. More details below.
  • variance_partition.nf - Run variance partitioning on the normalized matrix.
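Each of these pipelines can also be launched on its own with the invocation pattern from the General usage section below; for example, building only the index would presumably look like:

nextflow run build_masterlist.nf -profile Altius -resume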

Main workflow

main.nf - runs the build_masterlist, generate_matrices, filter_peaks, and normalize_signal pipelines, and annotates the resulting index with genomic annotations.
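Assuming nextflow.config and params.config have been filled in as described under General usage, the main workflow is launched with the same pattern as the individual pipelines:

nextflow run main.nf -profile Altius -resume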

Usage

General usage

  1. (Optional) Create a conda environment from the environment.yml file with mamba env create -n super-index -f environment.yml and activate it with conda activate super-index.
  2. Modify nextflow.config to match your computing environment.
  3. Fill in the parameter paths in params.config. You can also specify parameters on the command line. A detailed explanation of the parameters is given in the Config section.
  4. Run the pipeline with nextflow run <workflow.nf> -profile Altius -resume.
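Put together, a typical first run might look like the sketch below; the workflow name (main.nf here) and all paths in the config files are placeholders for your own setup:

mamba env create -n super-index -f environment.yml   # optional, one-time setup
conda activate super-index
# edit nextflow.config and params.config for your environment, then:
nextflow run main.nf -profile Altius -resume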

nmf.nf

The pipeline consists of two parts:

  • Performing NMF
  • Running QC visualizations

To run both stages of the pipeline use:

nextflow run nmf.nf -profile Altius -resume

To run only the last, visualization step (the previous command is expected to have been run first):

nextflow run nmf.nf -profile Altius -entry visualize --nmf_results_path <launchDir>/output/nmf

The --nmf_results_path option can be omitted if you run the visualization step in the same folder where nextflow run nmf.nf -profile Altius was originally run.

Note that output files are named according to the prefix and n_components provided in nmf_params_list. No warnings are issued in case of name collisions.

TODO:

Add descriptions of the other workflows here

Config

There are two config files in the repository.

Parameters for each process can be specified either in the params.config file or on the command line. See below for a detailed description of the parameters for each workflow.
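For example, any parameter defined in params.config can be overridden on the command line with a double-dash option; here --outdir (described below) is used purely as an illustration:

nextflow run nmf.nf -profile Altius --outdir /path/to/results -resume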

Params

Common params

  • samples_file: sample metadata in tsv format. The file should contain id (a unique identifier of the sample) and sample_label columns. Other columns are permitted and ignored.

  • outdir: directory to save results into. Defaults to the output folder in the launch directory.

  • conda: (optional) path to the conda environment installed from environment.yml. If not provided, Nextflow creates the environment from environment.yml (not tested).
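As an illustration only, a minimal params.config covering these common parameters might look like the sketch below; the actual file in the repository defines additional workflow-specific parameters, and all paths are placeholders:

params {
    // sample metadata in tsv format with id and sample_label columns
    samples_file = "/path/to/samples_meta.tsv"

    // where results are written; defaults to the output folder in the launch directory
    outdir = "$launchDir/output"

    // optional: pre-built conda environment created from environment.yml
    conda = "/path/to/conda/envs/super-index"
}

The samples_file itself could look like this (tab-separated; the labels are made up for illustration):

      id sample_label
      sample1 label1
      sample2 label2
      ... ...
      sampleX labelX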

nmf.nf params

  • nmf_params_list: a tsv file with the information required to run NMF. It should contain all required columns; NA values in optional columns are permitted. Other, non-specified columns are permitted and ignored. See the column descriptions below and the example file after this list:

    • (required) n_components - number of components for NMF.

    • (required) prefix: prefix for all input files. n_components will be added to prefix.

    • (required) matrix_path: path to the matrix to run NMF on, in .npy format. Expected shape: DHSs x samples. For fast conversion from txt format (using the datatable package), you can use the python3 bin/convert_to_numpy.py <matrix> <converted-matrix.npy> --dtype <matrix-dtype> script.

    • (required) sample_names: a one-column txt file without a header that contains the names of the samples. The names should match values in the id column of the sample metadata (samples_file option) and should be a subset of the samples defined in samples_file.
      File format:

      sample1
      sample2
      ...
      sampleX
    • (required) dhs_meta: metadata for the DHSs (rows) in tsv format without a header. The first 4 columns are treated as chr, start, end, dhs_id, where dhs_id is a unique identifier of the DHS. Other columns are ignored.

    • (optional) samples_weights: sample weights in tsv format. NMF prioritizes the reconstruction of samples with larger weights. Useful when you have class imbalance, e.g. an overabundance of samples of a specific cell type or condition.

      Expected to be a two column tsv file:

      id weight
      Sample1 0.9
      Sample2 0.3
      ... ...
      SampleN 1.0
    • (optional) peaks_weights: weights for the DHSs in tsv format. NMF prioritizes the reconstruction of peaks with larger weights. Useful when you have different confidence in different DHSs (rows of the matrix). id corresponds to dhs_id (the 4th column in dhs_meta).

      Expected to be a two column tsv file:

      id weight
      chunk0001 0.9
      chunk0002 0.3
      ... ...
      chunk9999 1.0
  • dhs_annotations: (optional, used only for visualizations) a tsv file with DHS annotations. It should contain dhs_id and dist_tss columns; other columns are permitted and ignored. If provided, the cumulative distance to TSS is plotted for the DHSs of each component.
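For reference, a hypothetical nmf_params_list.tsv describing two NMF runs could look like this (tab-separated; the prefix and all paths are placeholders, and optional columns may be set to NA):

      n_components prefix matrix_path sample_names dhs_meta samples_weights peaks_weights
      16 my_nmf /path/to/matrix.npy /path/to/sample_names.txt /path/to/dhs_meta.tsv NA NA
      24 my_nmf /path/to/matrix.npy /path/to/sample_names.txt /path/to/dhs_meta.tsv /path/to/samples_weights.tsv NA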

TODO: add details about other workflows
