Nextflow pipelines to build an index of accessible elements and do follow-up analysis
Requirements:
- Nextflow (https://www.nextflow.io/)
- conda (https://conda.io/projects/conda/en/latest/index.html)
- Follow the first steps according to the General usage section.
- Create the files required by the pipeline (e.g. `samples_meta`, `samples_order`, etc.).
- Fill in the params required by the pipeline in `params.config`.
- Run the pipeline using the corresponding command as described in the Usage section. For example, to run NMF use `nextflow run nmf.nf -profile Altius -resume`.
The repository contains the following workflows:
- build_masterlist.nf - Build an index of accessible elements using the approach described in Meuleman et al.
- generate_matrices.nf - Use the constructed index as a scaffold to generate count (# of reads overlapping each DHS) and binary (absence/presence of a peak) matrices.
- filter_peaks.nf - Filter peaks and convert data to numpy binary format for follow-up analysis. We filter out:
- Peaks overlapping ENCODE blacklisted regions
- Low-signal singletons (low signal peaks identified in only one sample)
- Peaks located on non-autosomal chromosomes (for NMF analysis and normalization)
- normalize_signal.nf - Normalize the filtered count matrix by running lowess normalization followed by the DESeq2 variance-stabilizing transformation (VST). There is also a workflow to apply normalization with existing parameters to new samples.
- nmf.nf - Run non-negative matrix factorization (NMF) for a set of matrices. More details below.
- variance_partition.nf - Run variance partition using the normalized matrix.
- main.nf - Run the build_masterlist, generate_matrices, filter_peaks and normalize_signal pipelines, and annotate the resulting index with genomic annotations.
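For example, following the command pattern from the General usage section below, the end-to-end index-building workflow can be run with:

```bash
nextflow run main.nf -profile Altius -resume
```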
General usage:
- (Optional) Create a conda environment from the `environment.yml` file with `mamba env create -n super-index -f environment.yml`. Activate the environment with `conda activate super-index`.
- Modify `nextflow.config` according to your computing environment specifications.
- Fill in the param paths in `params.config`. You can also specify parameters on the command line. Please find a detailed explanation of the parameters in the Config section.
- Run the pipeline with `nextflow run <workflow.nf> -profile Altius -resume`.
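Put together, a typical first run looks like this (the environment name and workflow are the examples used above):

```bash
# One-time setup: create and activate the conda environment
mamba env create -n super-index -f environment.yml
conda activate super-index

# After editing nextflow.config and params.config, run a workflow;
# -resume reuses cached results if the pipeline is restarted
nextflow run nmf.nf -profile Altius -resume
```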
The NMF pipeline (nmf.nf) consists of two parts:
- Performing NMF
- Running QC visualizations

To run both stages of the pipeline use:
```
nextflow run nmf.nf -profile Altius -resume
```
To run just the last, visualization step (the previous command is expected to have been run first):
```
nextflow run nmf.nf -profile Altius -entry visualize --nmf_results_path <launchDir>/output/nmf
```
The `--nmf_results_path` option can be omitted if you are running the pipeline in the same folder where `nextflow run nmf.nf -profile Altius -resume` was originally run.
Note that output files are named according to the `prefix` and `n_components` provided in `nmf_params_list`. No warnings are issued in case of name collisions.
Add other workflows description here
There are two config files in the repository:
- `nextflow.config` - contains the environment configuration. A detailed explanation can be found at https://www.nextflow.io/docs/latest/config.html.
- `params.config` - specifies thresholds and paths to input files.

Parameters for each process can be specified either in the `params.config` file or on the command line. See below for a detailed description of the parameters for each workflow.
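For orientation, here is a hypothetical `params.config` sketch using the parameter names described below; all paths are placeholders and the actual file in the repository may differ:

```
// params.config (hypothetical sketch; all paths are placeholders)
params {
    samples_file    = "/path/to/samples_meta.tsv"       // samples metadata in tsv format
    outdir          = "output"                           // results folder, relative to the launch directory
    conda           = "/path/to/conda/envs/super-index"  // optional: pre-built conda environment
    nmf_params_list = "/path/to/nmf_params_list.tsv"     // per-run NMF settings (nmf.nf)
}
```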
- `samples_file`: samples metadata in tsv format. The file should contain `id` (a unique identifier of the sample) and `sample_label` columns. Other columns are permitted and ignored. See the example after this list.
- `outdir`: directory to save results into. Defaults to the `output` folder in the launch directory.
- `conda`: (optional) path to an installed conda environment (created from environment.yml). If not present, Nextflow creates the environment from environment.yml (this was not tested).
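A minimal `samples_file` sketch (the `sample_label` values are hypothetical):

```
id       sample_label
Sample1  CD4_T_cells
Sample2  K562
...      ...
SampleN  fibroblasts
```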
- `nmf_params_list`: a tsv file with the information required to run NMF. It should contain all required columns; NA values in optional columns are permitted. Other, non-specified columns are permitted and ignored. See the column descriptions below and the example after this list:
  - (required) `n_components`: number of components for NMF.
  - (required) `prefix`: prefix for all output files. `n_components` will be added to the prefix.
  - (required) `matrix_path`: path to the matrix to run NMF on, in `.npy` format. Expected shape: `DHSs x samples`. For fast conversion from txt format (using the datatable package), you can use the `python3 bin/convert_to_numpy.py <matrix> <converted-matrix.npy> --dtype <matrix-dtype>` script.
  - (required) `sample_names`: a one-column txt file without a header that contains the names of the samples. The names should match values in the `id` column of the samples metadata (the `samples_file` option) and should be a subset of the samples defined in `samples_file`. File format:
    ```
    sample1
    sample2
    ...
    sampleX
    ```
  - (required) `dhs_meta`: metadata for the DHSs (rows) in tsv format without a header. The first 4 columns are treated as `chr`, `start`, `end`, `dhs_id`, where `dhs_id` is a unique identifier of a DHS. Other columns are ignored.
  - (optional) `samples_weights`: sample weights in tsv format. NMF prioritizes reconstruction of samples with larger weights. Useful when you have class imbalance, e.g. an abundance of samples of some specific cell type/condition. Expected to be a two-column tsv file:
    ```
    id       weight
    Sample1  0.9
    Sample2  0.3
    ...      ...
    SampleN  1.0
    ```
  - (optional) `peaks_weights`: weights for the DHSs in tsv format. NMF prioritizes reconstruction of peaks with larger weights. Useful when you have different confidence in different DHSs (rows of the matrix). `id` corresponds to `dhs_id` (the 4th column in `dhs_meta`). Expected to be a two-column tsv file:
    ```
    id         weight
    chunk0001  0.9
    chunk0002  0.3
    ...        ...
    chunk9999  1.0
    ```
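A hypothetical `nmf_params_list` describing a single NMF run (all paths and values are placeholders; optional columns are left as NA):

```
n_components  prefix    matrix_path          sample_names               dhs_meta               samples_weights  peaks_weights
16            my_index  /path/to/matrix.npy  /path/to/sample_names.txt  /path/to/dhs_meta.tsv  NA               NA
```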
- `dhs_annotations`: (optional; used only for visualizations) a tsv file with DHS annotations. It should contain `dhs_id` and `dist_tss` columns. Other columns are permitted and ignored. If provided, a cumulative distance-to-TSS plot is made for the DHSs of each component.
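A minimal `dhs_annotations` sketch (values are hypothetical; `dist_tss` is assumed to be the distance from the DHS to the nearest TSS):

```
dhs_id     dist_tss
chunk0001  13500
chunk0002  250
...        ...
chunk9999  78000
```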