This is a repository with tutorials and reproducibility notebooks for the worm glia scRNA-seq atlas available at https://wormglia.org.
To run the tutorials, clone this repository and follow the environment setup below.
git clone https://github.com/settylab/worm-glia-atlas.git
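Option 1: create the environment from the conda environment file:
envName=worm-glia-atlas
conda env create -n "$envName" --file envs/environment.yaml
conda activate "$envName"
Option 2: create a fresh conda environment and install the packages with pip:
envName=<your-environment-name>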
conda create -n "$envName" python=3.8.10 pip=21.1.3
conda activate "$envName"
pip install -r envs/requirements.txt
The notebooks require the following packages, which can also be installed by following the environment setup above:
- scanpy: https://scanpy.readthedocs.io/en/stable/installation.html
- sklearn: https://scikit-learn.org/stable/install.html
- plotly: https://plotly.com/python/getting-started/
- tqdm: https://github.com/tqdm/tqdm#installation
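After setting up the environment, a quick check like the following can confirm that the packages listed above are importable. This is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository.

```python
import importlib.util

# Import names of the packages listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["scanpy", "sklearn", "plotly", "tqdm"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing packages:", missing if missing else "none")
```

If the list is not empty, re-run the environment setup before opening the notebooks.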
The tutorial notebook for pairwise differential expression analysis is available here. This analysis compares clusters pairwise to identify cluster-enriched genes, rather than using a one-vs-all approach.
The input is an anndata object with normalized, log-transformed data and an obs variable containing information about the clusters for pairwise comparison.
- anndata.obs[LEIDEN_NAME]: where LEIDEN_NAME is the name of the column in anndata.obs containing the groups to be used for the analysis (Leiden clusters in this analysis).
The anndata object is updated with the following information:
- anndata.varm['pairwise_cluster_count']: Gene X Cluster matrix indicating how many pairwise comparisons each gene is differential in.
- anndata.varm['cluster_means']: Gene X Cluster matrix of the mean expression of each gene per cluster.
- HTML files of the pairwise analysis results can be saved using the plot_pairwise_results() function by specifying a path in the save parameter.
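To illustrate the shape of these two outputs, here is a minimal numpy sketch. It assumes a simple threshold on the difference of cluster means as the criterion for "differential", which is a stand-in for whatever test the notebook applies. The function name `pairwise_cluster_summary` and the `min_diff` parameter are hypothetical, and plain arrays stand in for the expression matrix and for anndata.obs[LEIDEN_NAME].

```python
import numpy as np


def pairwise_cluster_summary(X, labels, min_diff=1.0):
    """Summarize pairwise cluster comparisons.

    X: cells x genes matrix of normalized, log-transformed expression.
    labels: one cluster label per cell.
    Returns (clusters, cluster_means, pairwise_count); both matrices are
    genes x clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Mean expression of each gene in each cluster (genes x clusters).
    means = np.stack([X[labels == c].mean(axis=0) for c in clusters], axis=1)
    # diff[g, i, j] = mean of gene g in cluster i minus its mean in cluster j.
    diff = means[:, :, None] - means[:, None, :]
    # For each gene and cluster, count the other clusters it exceeds by min_diff.
    counts = (diff >= min_diff).sum(axis=2)
    return clusters, means, counts


# Toy data: 3 clusters with 2 cells each, 2 genes.
X = np.array([[3.0, 0.0], [3.0, 0.0],
              [0.0, 0.0], [0.0, 0.0],
              [0.0, 2.0], [0.0, 2.0]])
labels = ["a", "a", "b", "b", "c", "c"]
clusters, cluster_means, pairwise_count = pairwise_cluster_summary(X, labels)
```

In this toy example the first gene is enriched in cluster "a" against both other clusters, so its count for "a" is 2, while a one-vs-all comparison would pool "b" and "c" into a single background.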
The tutorial notebook for Sheath/Socket marker analysis is available here, and the notebook for Pan-Glia marker analysis is available here. A logistic regression model is trained and used for binary classification of cells based on gene expression. The Sheath/Socket notebook describes the identification of markers for both classes, whereas the Pan-Glia notebook describes the identification of markers only for the positive class.
The features learned by the model are then ranked to identify genes that are highly informative for, and correlated with, the specified target class or cell type.
These analyses can be readily extended to other datasets by providing the appropriate inputs, as outlined below.
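The idea of training a binary classifier and ranking its learned features can be sketched as follows. This is not the notebooks' implementation: it fits a small L2-regularized logistic regression by gradient descent in plain numpy so that it runs without scikit-learn, and it ranks genes by their signed coefficient, which is an assumption about the ranking criterion. All function names, gene names, and hyperparameters are hypothetical.

```python
import numpy as np


def fit_logistic(X, y, l2=0.01, lr=0.5, n_iter=2000):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class == 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)  # gradient step on the weights
        b -= lr * np.mean(p - y)                # gradient step on the intercept
    return w, b


def rank_features(w, names):
    """Order features by coefficient, most positive-class-associated first."""
    return [names[i] for i in np.argsort(-w)]


# Synthetic data: 50 negative and 50 positive cells, 3 genes.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 50)
X = rng.normal(size=(100, 3))
X[:, 0] += 2.0 * y  # geneA is higher in the positive class
X[:, 2] -= 2.0 * y  # geneC is higher in the negative class

w, b = fit_logistic(X, y)
ranking = rank_features(w, ["geneA", "geneB", "geneC"])
```

With a signed ranking, the top of the list holds candidate markers for the positive class and the bottom holds candidates for the negative class, which mirrors the two-class (Sheath/Socket) and positive-class-only (Pan-Glia) use cases described above.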
The key input for these analyses is an anndata object with normalized and log-transformed counts and imputed gene expression values, along with the following anndata fields:
- anndata.obs[CLASS_LABELS]: where CLASS_LABELS is the name of the column in anndata.obs containing the ground truth label for each cell in the anndata object.
- anndata.obs[CLUSTER_LABELS]: where CLUSTER_LABELS is the name of the column in anndata.obs containing the cluster label for each cell in the anndata object.
- anndata.var[USE_GENES]: where USE_GENES is the name of the column in anndata.var containing boolean values that specify whether a gene is used for the analysis or ignored (the default is the highly_variable genes column).
- anndata.layers[USE_LAYER]: where USE_LAYER is a key in the anndata.layers dictionary corresponding to the imputed Cell X Gene count matrix. If not specified, the normalized and log-transformed counts are used as values for the constructed feature matrix and the feature ranking analysis.
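The way these inputs combine into a feature matrix can be sketched as below. Plain numpy arrays and a dictionary stand in for the anndata fields so the example runs without anndata installed; the function name `build_feature_matrix` and the layer key "imputed" are hypothetical.

```python
import numpy as np


def build_feature_matrix(X, layers, use_genes, use_layer=None):
    """Select the genes flagged in use_genes from the chosen matrix.

    X: cells x genes normalized, log-transformed counts.
    layers: dict of matrices with the same shape as X (stand-in for anndata.layers).
    use_genes: one boolean per gene (stand-in for anndata.var[USE_GENES]).
    use_layer: key into layers, or None to fall back to X.
    """
    source = X if use_layer is None else layers[use_layer]
    mask = np.asarray(use_genes, dtype=bool)
    return np.asarray(source)[:, mask]


X = np.arange(6.0).reshape(2, 3)   # 2 cells x 3 genes
layers = {"imputed": X + 0.5}      # hypothetical imputed layer
use_genes = [True, False, True]    # e.g. a highly_variable flag per gene

features_default = build_feature_matrix(X, layers, use_genes)
features_imputed = build_feature_matrix(X, layers, use_genes, use_layer="imputed")
```

Both calls keep only the first and third gene; they differ only in which matrix the values are read from.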
The output of the analysis is a trained logistic regression model and an updated anndata object as follows:
- anndata.uns['<target_class>_marker_results']: A dictionary object stored in anndata.uns, containing the results of the feature ranking analysis for the designated target class.
- anndata.uns['<target_class>_probEst_Summary']: A DataFrame object stored in anndata.uns, containing the mean probability estimates for each cluster belonging to the specified target class.
- anndata.uns['<target_class>_AUROCC_Summary']: A DataFrame object stored in anndata.uns, containing summary information about the AUROCC (Area Under the Receiver Operating Characteristic Curve) scores for each cluster belonging to the specified target class.
- anndata.obs['data_splits']: A new column in anndata.obs containing labels indicating whether each cell belongs to the training, validation, or test dataset after the feature matrix and target vector are split accordingly.
- anndata.uns['model_selection_metrics']: A DataFrame object stored in anndata.uns, containing the mean accuracy scores of the trained regularized models on the training, validation, and test datasets.
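Two of the outputs above, the per-cell split labels and the per-cluster mean probability estimates, can be illustrated with a short numpy sketch. The split fractions, the random seed, the label strings, and both function names are assumptions made for illustration; the notebooks may split the data differently (for example, with stratification).

```python
import numpy as np


def assign_splits(n_cells, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly label each cell as train, validation, or test."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_cells)
    n_train = int(fractions[0] * n_cells)
    n_val = int(fractions[1] * n_cells)
    splits = np.empty(n_cells, dtype=object)
    splits[order[:n_train]] = "train"
    splits[order[n_train:n_train + n_val]] = "validation"
    splits[order[n_train + n_val:]] = "test"  # remaining cells
    return splits


def mean_probability_per_cluster(prob, clusters):
    """Average the positive-class probability over the cells of each cluster."""
    prob = np.asarray(prob, dtype=float)
    clusters = np.asarray(clusters)
    return {c.item(): float(prob[clusters == c].mean()) for c in np.unique(clusters)}


splits = assign_splits(10)
summary = mean_probability_per_cluster(
    [1.0, 0.5, 0.0, 0.5],
    ["sheath", "sheath", "socket", "socket"],
)
```

A per-cluster summary of this kind makes it easy to spot clusters whose cells the model assigns to the target class with low average confidence.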
The worm glia atlas manuscript is available on bioRxiv. Please cite our paper if you use these tutorials for your analyses:
@article {Purice2023.03.21.533668,
author = {Maria D. Purice and Elgene J.A. Quitevis and R. Sean Manning and Liza J. Severs and Nina-Tuyen Tran and Violet Sorrentino and Manu Setty and Aakanksha Singhvi},
title = {Molecular heterogeneity of C. elegans glia across sexes},
elocation-id = {2023.03.21.533668},
year = {2023},
doi = {10.1101/2023.03.21.533668},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/03/24/2023.03.21.533668},
eprint = {https://www.biorxiv.org/content/early/2023/03/24/2023.03.21.533668.full.pdf},
journal = {bioRxiv}
}