Gene ID → Gene Symbol conversion in dynamo
This notebook is a practical companion for a very common preprocessing step in single-cell workflows:
converting feature identifiers (e.g. Ensembl gene IDs like ENSG00000141510) into human-readable gene symbols
(e.g. TP53) and storing the result in your AnnData object.
Why this matters:
Many public datasets (and some pipelines) ship with Ensembl IDs as adata.var_names.
Downstream steps like marker inspection, gene set scoring, plotting, and cross-dataset integration are usually easier when gene symbols are used consistently.
Some methods/tools also expect symbols, or at least benefit from a standardized identifier space.
In this tutorial we demonstrate two typical scenarios:
Human (Hematopoiesis): convert Ensembl IDs in a dynamo.sample_data dataset.
Zebrafish: convert Ensembl IDs (often with version suffixes like .1) and show how to specify the expected ID “scope” / database release.
Tip: ID conversion is never perfectly lossless. Always check the mapping rate, and decide how to handle duplicates after conversion (two Ensembl IDs mapping to the same symbol).
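The two checks in the tip can be sketched with plain pandas. This is a minimal illustration, not dynamo's own code: the mapping table below is a hand-written stand-in shaped like a conversion result (indexed by query, with a symbol column), and the last two IDs are fictitious placeholders added only to produce an unmapped row and a duplicate.

```python
import pandas as pd

# Hypothetical mapping table: indexed by the queried ID, one "symbol" column.
# None marks an ID for which no symbol was found.
# The last two IDs are made-up placeholders, not real Ensembl genes.
mapping = pd.DataFrame(
    {"symbol": ["TP53", "CD3D", None, "CD3D"]},
    index=pd.Index(
        ["ENSG00000141510", "ENSG00000167286", "ENSG00000999999", "ENSG00000888888"],
        name="query",
    ),
)

# Mapping rate: fraction of queried IDs that received a symbol.
mapped = mapping["symbol"].notna()
mapping_rate = mapped.mean()

# Duplicates: symbols that more than one queried ID maps to.
dup_mask = mapped & mapping["symbol"].duplicated(keep=False)
duplicated_symbols = sorted(mapping.loc[dup_mask, "symbol"].unique())

print(f"mapped {mapped.sum()}/{len(mapping)} IDs ({mapping_rate:.0%})")
print("symbols shared by several IDs:", duplicated_symbols)
```

How to resolve the duplicates (drop, keep the first, or aggregate counts) depends on the analysis, so the sketch only reports them.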
1. Setup
We start by importing dynamo and setting plotting defaults.
This follows the style of the official dynamo tutorials: keep the notebook reproducible, and make figures look
consistent across machines (especially important when sharing results).
If you run into font warnings on a server (common on Linux/HPC), you can ignore them; it won’t affect the analysis.
2. Load an example dataset (human)
Here we use dynamo.sample_data.hematopoiesis_raw() as a compact, real-world example.
It contains raw count matrices and metadata needed for typical preprocessing.
In many datasets like this, genes can be indexed by Ensembl IDs instead of symbols.
We’ll inspect adata.var_names first to understand what identifier system we are starting from.
%load_ext autoreload
%autoreload 2
import dynamo as dyn
3. convert2gene_symbol
dynamo.pp.convert2gene_symbol performs batch ID mapping and returns a table containing at least:
query: the ID used for the query (often the Ensembl ID without version suffix)
symbol: the mapped gene symbol
Under the hood, the conversion relies on an identifier query service (the same concept as MyGene.info-style “batch query” APIs). The key idea is that you must tell the service what kind of IDs you are providing via the scopes argument (e.g. ensembl.gene or ensembl.transcript).
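To make the scope idea concrete, here is a small sketch that guesses a scope name from the shape of an ID. The helper guess_scope is hypothetical (it is not part of dynamo), and it assumes the usual Ensembl stable-ID layout: "ENS", optional species letters, a feature letter (G, T or P), then 11 digits and an optional version suffix. IDs from databases that do not follow this layout will simply return None.

```python
import re

# Assumed layout: ENS + optional species letters + feature letter + 11 digits,
# optionally followed by a ".<version>" suffix.
_ENSEMBL_ID = re.compile(r"^ENS[A-Z]*?([GTP])\d{11}(?:\.\d+)?$")

# Feature letter -> MyGene.info-style scope name.
_SCOPE_BY_FEATURE = {
    "G": "ensembl.gene",
    "T": "ensembl.transcript",
    "P": "ensembl.protein",
}


def guess_scope(identifier):
    """Return a scope name for an Ensembl-style ID, or None if it does not match."""
    match = _ENSEMBL_ID.match(identifier)
    return _SCOPE_BY_FEATURE[match.group(1)] if match else None


for example in ["ENSG00000141510", "ENSDARG00000035558.1", "ENST00000269305", "TP53"]:
    print(example, "->", guess_scope(example))
```

A check like this is useful mainly as a sanity test before querying: if most IDs in a dataset come back as transcripts, querying them as genes will map poorly.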
4. Conversion test
A typical safe pattern is:
Store the “query-ready” ID in adata.var['query'] (often stripping version suffix).
Call convert2gene_symbol(...).
Merge the returned table back into adata.var.
Subset to successfully mapped genes (optional but common).
Set adata.var_names = adata.var['symbol'].
After you set adata.var_names to symbols, consider also keeping the original IDs in a separate column (e.g.
adata.var['ensembl_id']) for traceability.
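The pattern above can be sketched offline with pandas alone. In this illustration the var DataFrame stands in for adata.var and the mapping DataFrame stands in for the table returned by the conversion call, so no network access or dynamo install is needed; the version suffixes and the third ID are made up for the example. Duplicate symbols are not handled here.

```python
import pandas as pd

# Stand-in for adata.var, indexed by versioned Ensembl gene IDs.
# Version suffixes and the third ID are illustrative placeholders.
var = pd.DataFrame(
    index=pd.Index(
        ["ENSG00000167286.5", "ENSG00000156738.17", "ENSG00000999999.1"],
        name="gene_id",
    )
)

# Stand-in for the conversion result: indexed by query, with a symbol column.
mapping = pd.DataFrame(
    {"symbol": ["CD3D", "MS4A1"]},
    index=pd.Index(["ENSG00000167286", "ENSG00000156738"], name="query"),
)

# Keep the original IDs for traceability, then build the query-ready ID.
var["ensembl_id"] = var.index.to_numpy()
var["query"] = [gene_id.split(".")[0] for gene_id in var.index]

# Merge the returned table back in (left join on the query column).
var = var.join(mapping["symbol"], on="query")

# Subset to successfully mapped genes.
keep = var["symbol"].notna()
var = var.loc[keep].copy()

# Use the symbols as the new index.
var.index = var["symbol"].to_numpy()

print(var)
```

With a real AnnData object, the subsetting step would apply to the whole object (for example adata = adata[:, keep].copy()) so that the expression matrix stays aligned with adata.var.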
dyn.preprocessing.convert2gene_symbol(
    ['ENSG00000167286', 'ENSG00000156738'],
    # ensembl_release=109,
)
|-----> Auto-detected species: human
|-----> Conversion finished. Found 2/2 symbols.
| query | symbol | _score |
|---|---|---|
| ENSG00000167286 | CD3D | 1.0 |
| ENSG00000156738 | MS4A1 | 1.0 |
5. Zebrafish example (when IDs are ENSDARG...)
For non-human datasets, pay extra attention to species and annotation version.
Zebrafish Ensembl gene IDs typically start with ENSDARG. Many pipelines also append a version suffix (e.g. .1),
so we strip it before conversion.
Depending on your pipeline, you may also want to pass an organism-specific Ensembl release (or otherwise match the annotation build you used for quantification). If the mapping looks unexpectedly poor, the most common causes are:
using the wrong release / annotation build
providing transcript IDs while querying as gene IDs (or vice versa)
keeping the version suffix
The goal of this section is not to claim one “correct” release universally, but to show the pattern for making the mapping explicit and reproducible.
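Two of the three causes listed above (version suffixes and gene/transcript mix-ups) can be checked before any query is sent. The following is a rough pre-flight sketch using pandas string methods; the ENSDART transcript ID is a made-up placeholder, and the prefixes assume standard zebrafish Ensembl IDs. The release mismatch cannot be detected this way and still has to be checked against the annotation used for quantification.

```python
import pandas as pd

# Example IDs as they might appear in var_names.
# The ENSDART entry is an illustrative placeholder, not a verified transcript.
ids = pd.Series(
    ["ENSDARG00000035558.1", "ENSDARG00000035558", "ENSDART00000051950.2"]
)

# Which IDs still carry a ".<version>" suffix?
has_version = ids.str.contains(r"\.\d+$")

# Which IDs look like zebrafish genes versus transcripts?
is_gene = ids.str.match(r"^ENSDARG\d{11}")
is_transcript = ids.str.match(r"^ENSDART\d{11}")

# Query-ready form: drop the version suffix, if any.
stripped = ids.str.replace(r"\.\d+$", "", regex=True)

report = {
    "with_version_suffix": int(has_version.sum()),
    "gene_like": int(is_gene.sum()),
    "transcript_like": int(is_transcript.sum()),
}
print(report)
print(stripped.tolist())
```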
dyn.preprocessing.convert2gene_symbol(
    ['ENSDARG00000035558'],
    ensembl_release=77,
)
|-----> Auto-detected species: zebrafish
|-----> Conversion finished. Found 1/1 symbols.
| query | symbol | _score |
|---|---|---|
| ENSDARG00000035558 | gps2 | 1.0 |
adata = dyn.sample_data.hematopoiesis_raw()
adata.var.head()
|-----> Downloading raw hematopoiesis adata
|-----> Downloading data to ./data/hematopoiesis_raw.h5ad
|-----> File ./data/hematopoiesis_raw.h5ad already exists.
| gene_id | gene_name_mapping |
|---|---|
| ENSG00000000003 | None |
| ENSG00000000005 | None |
| ENSG00000000419 | None |
| ENSG00000000457 | None |
| ENSG00000000460 | None |
6. (Optional) Automatic ID conversion during preprocessing
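Once your gene identifiers are standardized, you can proceed with your preferred preprocessing recipe.
Here we show the monocle recipe run through dyn.pp.Preprocessor, which performs typical steps (filtering, normalization, feature selection, PCA)
and is commonly used in dynamo workflows. If the gene index still holds Ensembl IDs, the recipe converts them to gene symbols automatically, as the log output below shows.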
Note: If you change var_names after preprocessing, you may break assumptions in downstream cached results. In practice, it’s best to do ID conversion before running the main preprocessing pipeline.
preprocessor = dyn.pp.Preprocessor()
preprocessor.config_monocle_recipe(
    adata,
    n_top_genes=2000,
)
preprocessor.preprocess_adata_monocle(
    adata,
    tkey="time",
    experiment_type="one-shot",
)
adata.var.head()
|-----> Running monocle preprocessing pipeline...
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index.
Dynamo is converting those names to official gene names.
|-----> Auto-detected species: human
|-----> Conversion finished. Found 24635/26193 symbols.
|-----------> filtered out 0 outlier cells
|-----------> filtered out 23299 outlier genes
|-----> PCA dimension reduction
|-----> <insert> X_pca to obsm in AnnData Object.
|-----> [Preprocessor-monocle] completed [1.6145s]
| | gene_name_mapping | query | scopes | symbol | _score | nCells | nCounts | pass_basic_filter | log_m | log_cv | score | frac | use_for_pca | ntr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TSPAN6 | None | ENSG00000000003 | ENSG00000000003 | TSPAN6 | 1.0 | 16 | 16.0 | False | NaN | NaN | NaN | 0.000004 | False | 0.238095 |
| TNMD | None | ENSG00000000005 | ENSG00000000005 | TNMD | 1.0 | 0 | 0.0 | False | NaN | NaN | NaN | 0.000000 | False | 0.000000 |
| DPM1 | None | ENSG00000000419 | ENSG00000000419 | DPM1 | 1.0 | 161 | 180.0 | True | -3.630229 | 2.004751 | 0.019326 | 0.000046 | True | 0.601476 |
| SCYL3 | None | ENSG00000000457 | ENSG00000000457 | SCYL3 | 1.0 | 76 | 79.0 | False | NaN | NaN | NaN | 0.000024 | False | 0.364486 |
| C1orf112 | None | ENSG00000000460 | ENSG00000000460 | C1orf112 | 1.0 | 73 | 81.0 | False | NaN | NaN | NaN | 0.000020 | False | 0.500000 |