Azimuth is a web application that uses an annotated reference dataset to automate the processing, analysis, and interpretation of a new single-cell RNA-seq experiment. Azimuth leverages a 'reference-based mapping' pipeline that inputs a counts matrix of gene expression in single cells, and performs normalization, visualization, cell annotation, and differential expression (biomarker discovery). All results can be explored within the app, and easily downloaded for additional downstream analysis.
The development of Azimuth is led by the New York Genome Center Mapping Component as part of the NIH Human BioMolecular Atlas Program (HuBMAP). Eleven molecular reference maps are currently available, with more coming soon.
- Upload a single-cell gene expression matrix, or click the Load demo dataset button.
- If desired, filter cells based on common QC metrics in the Preprocessing tab.
- Click the Map cells to reference button to launch analysis. A query dataset of 10,000 cells will typically finish processing in less than 1 minute.
- View results.
  - “Cell Plots” tab: Visualize query cells and annotations projected onto the reference UMAP.
  - “Feature Plots” tab: Explore the expression of individual features (genes) in your data, and automatically identify differentially expressed genes and biomarkers.
- If desired, download files for further analysis from the “Download Results” tab.
Run Azimuth Locally
You can also bypass the web application and run Azimuth on your local computer, directly in R. The following vignette demonstrates how to download a reference and map new data (in Seurat, h5, or h5ad format) in only a few commands. Check it out here.
Frequently Asked Questions
The app didn’t work!
To respect user privacy, we only collect basic usage statistics and do not store logs from user sessions of the app. We aim to clearly document the requirements for user-uploaded data, provide a detailed FAQ here, and display descriptive error messages in the app whenever possible.
If the app returns an error message, you can also perform the identical analysis using Seurat v4 by following the mapping vignette, and find support for any problems that arise during your use of the Seurat package here.
If you are receiving an error message or are unable to generate output with the app, please read the FAQ below to ensure that your dataset meets our requirements. If this doesn’t resolve your issue, and you’d like to help us improve the app, please file a GitHub issue describing the problem here.
Can I run the app myself?
The app won’t load!
Can I open multiple apps simultaneously?
What is HuBMAP and how can I learn more?
How do I cite Azimuth if I use it in my own work?
How can I explore gene expression patterns in the reference?
What file types can I upload?
We accept the following file types as input:
- Seurat objects as RDS
- 10x Genomics H5
- Matrix/matrix/data.frame as RDS
If uploading a Seurat object, it must contain an assay named ‘RNA’ with raw data in the ‘counts’ slot. Note that Azimuth uses only the (unnormalized) counts matrix.
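Because Azimuth uses only the unnormalized counts, it can help to sanity-check a matrix before uploading it. The following is a minimal sketch in base R, assuming a genes-by-cells numeric matrix. The helper name `looks_like_raw_counts` is hypothetical and is not part of Azimuth or Seurat. It converts the input to a dense matrix, so it is only suitable for small objects or subsets.

```r
# Hypothetical helper (not part of Azimuth or Seurat): returns TRUE when a
# matrix is plausibly raw counts, i.e. finite, non-negative whole numbers.
looks_like_raw_counts <- function(mat) {
  values <- as.numeric(as.matrix(mat))
  all(is.finite(values)) && all(values >= 0) && all(values == round(values))
}

# Toy genes-by-cells matrices, for illustration only.
raw <- matrix(c(0, 3, 1, 0, 7, 2), nrow = 2,
              dimnames = list(c("GeneA", "GeneB"), c("cell1", "cell2", "cell3")))
normalized <- log1p(raw)  # log-normalized values are no longer whole numbers

looks_like_raw_counts(raw)         # TRUE
looks_like_raw_counts(normalized)  # FALSE
```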
Can you provide me with a sample input?
How big can my uploaded dataset be?
Uploads must be smaller than 1GB and contain fewer than 100,000 cells. If your dataset is larger than 100,000 cells, you can divide it into smaller chunks for mapping, or you can map your dataset locally using Seurat v4.
If you would like to upload an existing Seurat object, you can use DietSeurat to pare down the Seurat object before uploading it. This will preserve the RNA counts data and cell metadata, but discard everything else.

    DefaultAssay(object) <- "RNA"
    object <- DietSeurat(object = object, assays = "RNA")
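To divide a dataset that exceeds the cell limit into chunks for mapping, one option is to split the cell (column) indices into consecutive groups. This is a minimal sketch in base R. The helper `chunk_cells` and the default of 99,999 cells per chunk are assumptions chosen to stay under the stated limit, not part of Azimuth.

```r
# Hypothetical helper (not part of Azimuth): split cell indices 1..n_cells
# into consecutive chunks, each holding at most `max_cells` cells.
chunk_cells <- function(n_cells, max_cells = 99999) {
  stopifnot(n_cells >= 1, max_cells >= 1)
  chunk_id <- ceiling(seq_len(n_cells) / max_cells)
  unname(split(seq_len(n_cells), chunk_id))
}

chunks <- chunk_cells(n_cells = 250000)
lengths(chunks)  # 99999 99999 50002

# For a genes-by-cells matrix `counts`, chunk i would be counts[, chunks[[i]]].
```

Each chunk can then be saved and uploaded as a separate query.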
What datasets can I map?
Should I filter the genes in my dataset before uploading?
Should I map my batches separately or combined?
What optimizations are in the app that are not default in Seurat?
To optimize the web app time and resource consumption, we made several changes to the base Seurat mapping workflow.
- When fitting generalized linear models, we use a representative set of 2000 genes and 2000 cells.
- To further speed up GLM model fitting, we use the recently developed glmGamPoi package from Constantin Ahlmann-Eltze and Wolfgang Huber.
- For many references, we leverage a downsampled reference. Downsampling is done to ensure good representation of all datasets and celltypes present in the full reference.
- We leverage a previously computed and cached neighbor index and neighbor list for the reference. This speeds up the neighbor-finding steps in the mapping algorithm.
- For the approximate nearest neighbor finding steps in the algorithm, we use n.trees = 20, which provides a speedup compared to the default n.trees = 50 with minimal impact on the quality of downstream results.
- We leverage the presto package from Ilya Korsunsky and Soumya Raychaudhuri for differential expression.
Can I preprocess my data myself?
I can’t map cells after filtering in the “Preprocessing” tab
Can I map at different levels of resolution?
Annotation levels are selected in the Reference metadata to transfer box of the Preprocessing tab. For example, with the human PBMC reference, you can annotate cells based on a broad cell type definition (celltype.l1), but also at two additional levels of granularity (celltype.l2 and celltype.l3).
How long will mapping take?
Can I run the mapping algorithm myself?
Can I map to a different reference?
What do the columns in the biomarkers table mean?
The top 10 biomarkers for predicted cell type clusters with at least 15 query cells are calculated by differential expression analysis with the presto package. The columns of the table are:
- avgExpr: mean value of feature for cells in cluster
- auc: area under the ROC curve
- padj: Benjamini-Hochberg adjusted p value
- pct_in: percent of cells in the cluster with nonzero feature value
- pct_out: percent of cells out of the cluster with nonzero feature value
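The column definitions above can be illustrated on a toy feature. This is a sketch in base R that reproduces the definitions only; it does not call presto, and the expression values, cluster labels, and p-values are made up for illustration. The AUC is computed with the rank-sum (Mann-Whitney) identity, which is an assumption about how to obtain an area under the ROC curve, not a description of presto's internals.

```r
# Toy expression values for one feature across 8 cells; not real data.
expr <- c(5, 3, 0, 4, 0, 1, 0, 0)
in_cluster <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

x_in <- expr[in_cluster]
x_out <- expr[!in_cluster]

avgExpr <- mean(x_in)            # mean value of the feature in the cluster
pct_in <- 100 * mean(x_in > 0)   # percent of in-cluster cells with nonzero value
pct_out <- 100 * mean(x_out > 0) # percent of out-of-cluster cells with nonzero value

# AUC via the rank-sum identity; tied values receive average ranks.
ranks <- rank(expr)
n_in <- length(x_in)
n_out <- length(x_out)
auc <- (sum(ranks[in_cluster]) - n_in * (n_in + 1) / 2) / (n_in * n_out)

# Benjamini-Hochberg adjustment across several features' p-values (made up).
pvals <- c(0.01, 0.04, 0.03, 0.20)
padj <- p.adjust(pvals, method = "BH")

c(avgExpr = avgExpr, auc = auc, pct_in = pct_in, pct_out = pct_out)
# avgExpr = 3, auc = 0.84375, pct_in = 75, pct_out = 25
```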
What if my query dataset contains cell types that aren't present in the reference?
Can I visualize my own metadata?
Why aren’t all the predicted cell types available in the biomarkers table?
Where do the imputed protein values come from?
Can I save my results?
How can I tell if my mapping results are accurate?
Azimuth computes a series of metrics that relate to QC for the mapping procedure. We’ve found that a single metric is insufficient to describe the quality of mapping, and therefore compute each of the metrics below. We emphasize that users should not limit their evaluation of mapping to these QC metrics, and should explore their results. In particular, we encourage users to explore whether the differentially expressed genes associated with each annotated biological population are consistent with their biological expectations. We intend to support these metrics and QC analyses with additional visualizations in future versions.
- % of query cells with anchors: The Azimuth reference-mapping procedure first identifies a set of ‘anchors’, or pairwise correspondences between cells predicted to be in a similar biological state, between query and reference datasets. Here we report the percentage of query cells participating in an anchor correspondence. Typically, we observe values >15% when mapping is successful. Processing mismatched biological datasets (e.g., mapping a query dataset of human brain cells onto a reference dataset of human blood cells) will return few anchors (<5%). However, in some cases, when there is a large batch effect between query and reference datasets from the same tissue, this metric can fall below 15% even when mapping is successful.
- Cluster preservation score: For each query dataset, we downsample to at most 5,000 cells, and perform an unsupervised clustering. This score reflects the preservation of the unsupervised cluster structure, and is based on the entropy of unsupervised cluster labels in each query cell’s local neighborhood after mapping. Scores are scaled from 0 (poor) to 5 (best). This metric relies on the unsupervised clustering corresponding to biologically distinct cell states. If the query dataset consists of a homogeneous group of cells, or if the query dataset contains cells from multiple batches (which would be corrected by Azimuth), this metric may return a low value even in cases where mapping is successful. The score is calculated using the ClusterPreservationScore function in Azimuth.
- Prediction scores: Cell prediction scores range from 0 to 1 and reflect the confidence associated with each annotation. Cells with high-confidence annotations (for example, prediction scores > 0.75) reflect predictions that are supported by multiple consistent anchors. Prediction scores can be visualized on the Feature Plots tab, or downloaded on the Download Results tab. The prediction score depends on the specific annotation for each cell. Therefore, if you are mapping cells at multiple levels of resolution (for example level 1/2/3 annotations in the Human PBMC reference), each level will be associated with a different prediction score.
- Mapping scores: This value from 0 to 1 reflects confidence that this cell is well represented by the reference. The “mapping.score” column is available to plot in the Feature Plots tab, and is provided in the download TSV file. The mapping score is independent of a specific annotation, is calculated using the MappingScore function in Seurat, and reflects how well the unique structure of a cell’s local neighborhood is preserved during reference mapping.
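Two of these metrics are simple to compute by hand. The sketch below, in base R, shows the percentage of query cells participating in at least one anchor, and a filter for high-confidence annotations using the 0.75 example threshold. The anchor table layout, cell names, and scores are hypothetical and do not reflect how Azimuth or Seurat store anchors internally.

```r
# Hypothetical anchor table: each row pairs one query cell with one reference cell.
anchors <- data.frame(
  query_cell = c("q1", "q1", "q3", "q4", "q4"),
  reference_cell = c("r7", "r2", "r2", "r9", "r1")
)
all_query_cells <- paste0("q", 1:10)

# Percentage of query cells that take part in at least one anchor.
pct_with_anchors <- 100 * mean(all_query_cells %in% anchors$query_cell)
pct_with_anchors  # 30

# Hypothetical per-cell prediction scores; keep the cells whose annotation
# exceeds the 0.75 example threshold.
prediction_score <- c(q1 = 0.95, q2 = 0.40, q3 = 0.80, q4 = 0.76, q5 = 0.10)
high_confidence <- names(prediction_score)[prediction_score > 0.75]
high_confidence  # "q1" "q3" "q4"
```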