Gene ID → Gene Symbol conversion in dynamo

This notebook is a practical companion for a very common preprocessing step in single-cell workflows: converting feature identifiers (e.g. Ensembl gene IDs like ENSG00000141510) into human-readable gene symbols (e.g. TP53) and storing the result in your AnnData object.

Why this matters:

  • Many public datasets (and some pipelines) ship with Ensembl IDs as adata.var_names.

  • Downstream steps like marker inspection, gene set scoring, plotting, and cross-dataset integration are usually easier when gene symbols are used consistently.

  • Some methods/tools also expect symbols, or at least benefit from a standardized identifier space.

In this tutorial we demonstrate two typical scenarios:

  1. Human (Hematopoiesis): convert Ensembl IDs in a dynamo.sample_data dataset.

  2. Zebrafish: convert Ensembl IDs (often with version suffixes like .1) and show how to specify the expected ID “scope” / database release.

Tip: ID conversion is rarely lossless. Always check the mapping rate, and decide how to handle duplicates after conversion (for example, two Ensembl IDs mapping to the same symbol).
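
The check described in the tip can be sketched in a few lines of plain Python. This is a minimal illustration, not part of dynamo: the helper name summarize_mapping and the toy IDs and symbols are hypothetical, and the mapping is assumed to be a dict from query ID to symbol, with None for unmapped IDs.

```python
from collections import Counter


def summarize_mapping(mapping):
    """Summarize a query ID -> symbol mapping (None means unmapped)."""
    mapped = {q: s for q, s in mapping.items() if s is not None}
    counts = Counter(mapped.values())
    # Symbols that more than one query ID maps to, with the IDs involved.
    duplicated = {
        symbol: sorted(q for q, s in mapped.items() if s == symbol)
        for symbol, n in counts.items()
        if n > 1
    }
    rate = len(mapped) / len(mapping) if mapping else 0.0
    return {
        "n_total": len(mapping),
        "n_mapped": len(mapped),
        "rate": rate,
        "duplicated": duplicated,
    }


# Toy, hand-written mapping for illustration only (not real query results).
toy = {"ID_A": "GENE1", "ID_B": "GENE1", "ID_C": "GENE2", "ID_D": None}
report = summarize_mapping(toy)
print(report)
```

What to do with the duplicated symbols (drop all but one, sum counts, or make names unique) depends on the downstream analysis, so the sketch only reports them.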

1. Setup

We start by importing dynamo and setting plotting defaults.
This follows the style of the official dynamo tutorials: keep the notebook reproducible, and make figures look consistent across machines (especially important when sharing results).

If you run into font warnings on a server (common on Linux/HPC), you can ignore them; they do not affect the analysis.

2. Load an example dataset (human)

Here we use dynamo.sample_data.hematopoiesis_raw() as a compact, real-world example.

  • It contains raw count matrices and metadata needed for typical preprocessing.

  • In many datasets like this, genes can be indexed by Ensembl IDs instead of symbols.

We’ll inspect adata.var_names first to understand what identifier system we are starting from.

%load_ext autoreload
%autoreload 2
import dynamo as dyn

3. convert2gene_symbol

dynamo.pp.convert2gene_symbol performs batch ID mapping and returns a table containing at least:

  • query: the ID used for the query (often the Ensembl ID without version suffix)

  • symbol: the mapped gene symbol

Under the hood, the conversion relies on an identifier query service, in the style of the MyGene.info batch query API. The key idea is that you must tell the service what kind of IDs you are providing via the scopes argument (e.g. ensembl.gene or ensembl.transcript).
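
To make the idea of a scope concrete, here is a small sketch that guesses the scope from the shape of an Ensembl stable ID. The helper guess_scope is hypothetical and not part of dynamo. It assumes the usual stable-ID layout: "ENS", an optional species prefix, a feature letter (G, T or P), 11 digits, and an optional version suffix.

```python
import re

# "ENS" + optional species prefix + feature letter + 11 digits + optional ".version"
_ENSEMBL_ID = re.compile(r"^ENS([A-Z]*)([GTP])(\d{11})(?:\.\d+)?$")

# Feature letter -> MyGene.info-style query scope.
_SCOPE_BY_FEATURE = {
    "G": "ensembl.gene",
    "T": "ensembl.transcript",
    "P": "ensembl.protein",
}


def guess_scope(identifier):
    """Return a likely scope for an Ensembl stable ID, or None if unrecognized."""
    match = _ENSEMBL_ID.match(identifier)
    if match is None:
        return None
    return _SCOPE_BY_FEATURE[match.group(2)]


for example in ["ENSG00000141510", "ENSDARG00000035558.1", "TP53"]:
    print(example, guess_scope(example))
```

A guess like this is only a sanity check before querying. If most of your var_names come back as transcripts, or as None, that is worth resolving before you call the conversion.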

4. Conversion test

A typical safe pattern is:

  1. Store the “query-ready” ID in adata.var['query'] (often stripping version suffix).

  2. Call convert2gene_symbol(...).

  3. Merge the returned table back into adata.var.

  4. Subset to successfully mapped genes (optional but common).

  5. Set adata.var_names = adata.var['symbol'].

After you set adata.var_names to symbols, consider also keeping the original IDs in a separate column (e.g. adata.var['ensembl_id']) for traceability.
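
The five steps above can be sketched offline with pandas, using a small DataFrame as a stand-in for adata.var and a hand-built table as a stand-in for the result of convert2gene_symbol. The two mapped symbols are the ones shown in the output below. The ".1" version suffixes and the third ID are hypothetical and only exercise the unmapped case.

```python
import pandas as pd

# Stand-in for adata.var. The ".1" suffixes and the last ID are hypothetical.
var = pd.DataFrame(
    index=pd.Index(
        ["ENSG00000167286.1", "ENSG00000156738.1", "ENSG99999999999.1"],
        name="gene_id",
    )
)

# Step 1: keep the original IDs and store a version-free query ID.
var["ensembl_id"] = var.index
var["query"] = var.index.str.replace(r"\.\d+$", "", regex=True)

# Step 2: stand-in for the table returned by convert2gene_symbol(...),
# indexed by query ID with a "symbol" column.
mapping = pd.DataFrame(
    {"symbol": ["CD3D", "MS4A1"]},
    index=pd.Index(["ENSG00000167286", "ENSG00000156738"], name="query"),
)

# Step 3: merge the symbols back; unmapped IDs become NaN.
var["symbol"] = var["query"].map(mapping["symbol"])

# Step 4: keep only successfully mapped genes.
mapped = var[var["symbol"].notna()].copy()

# Step 5: use symbols as the index (the analogue of setting adata.var_names).
mapped.index = mapped["symbol"].to_numpy()

print(mapped[["ensembl_id", "symbol"]])
```

With a real AnnData object, step 4 would typically be a column subset such as adata[:, adata.var['symbol'].notna()].copy(), followed by the var_names assignment in step 5.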

dyn.preprocessing.convert2gene_symbol(
    ['ENSG00000167286', 'ENSG00000156738'],
    # ensembl_release=109,
)
|-----> Auto-detected species: human
|-----> Conversion finished. Found 2/2 symbols.
                 symbol  _score
query
ENSG00000167286    CD3D     1.0
ENSG00000156738   MS4A1     1.0

5. Zebrafish example (when IDs are ENSDARG...)

For non-human datasets, pay extra attention to species and annotation version.

Zebrafish Ensembl gene IDs typically start with ENSDARG. Many pipelines also append a version suffix (e.g. .1), so we strip it before conversion.

Depending on your pipeline, you may also want to pass an organism-specific Ensembl release (or otherwise match the annotation build you used for quantification). If the mapping looks unexpectedly poor, the most common causes are:

  • using the wrong release / annotation build

  • providing transcript IDs while querying as gene IDs (or vice versa)

  • keeping the version suffix

The goal of this section is not to claim one “correct” release universally, but to show the pattern for making the mapping explicit and reproducible.
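
Two of the three causes listed above (version suffixes and gene/transcript mix-ups) can be spotted from the ID strings alone, before any query is sent. The sketch below is a hypothetical helper, not a dynamo function. It assumes standard Ensembl stable-ID formatting, and the transcript ID in the example is made up for illustration. A wrong release cannot be detected this way.

```python
import re
from collections import Counter

# "ENS" + optional species prefix + feature letter + 11 digits + optional ".version"
_ENSEMBL_ID = re.compile(r"^ENS([A-Z]*)([GTP])\d{11}(\.\d+)?$")


def diagnose_ids(ids):
    """Count feature types, species prefixes and version suffixes in a list of IDs."""
    features = Counter()
    prefixes = Counter()
    n_versioned = 0
    n_unrecognized = 0
    for identifier in ids:
        match = _ENSEMBL_ID.match(identifier)
        if match is None:
            n_unrecognized += 1
            continue
        prefixes[match.group(1)] += 1   # e.g. "DAR" for zebrafish, "" for human
        features[match.group(2)] += 1   # "G" gene, "T" transcript, "P" protein
        if match.group(3):
            n_versioned += 1
    return {
        "features": dict(features),
        "prefixes": dict(prefixes),
        "n_versioned": n_versioned,
        "n_unrecognized": n_unrecognized,
    }


# "ENSDART00000000001" is a made-up transcript-style ID for illustration.
ids = ["ENSDARG00000035558.1", "ENSDARG00000035558", "ENSDART00000000001", "gps2"]
diagnosis = diagnose_ids(ids)
print(diagnosis)
```

If n_versioned is non-zero, strip the suffixes first. If features contains more than one letter, the input mixes genes and transcripts and should be split or queried with a matching scope.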

dyn.preprocessing.convert2gene_symbol(
    ['ENSDARG00000035558'],
    ensembl_release=77,
)
|-----> Auto-detected species: zebrafish
|-----> Conversion finished. Found 1/1 symbols.
                    symbol  _score
query
ENSDARG00000035558    gps2     1.0
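
Next, we load the human hematopoiesis dataset introduced in section 2 and inspect adata.var. The genes are indexed by Ensembl IDs, and the gene_name_mapping column is empty:

adata = dyn.sample_data.hematopoiesis_raw()
adata.var.head()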
|-----> Downloading raw hematopoiesis adata
|-----> Downloading data to ./data/hematopoiesis_raw.h5ad
|-----> File ./data/hematopoiesis_raw.h5ad already exists.
                 gene_name_mapping
gene_id
ENSG00000000003               None
ENSG00000000005               None
ENSG00000000419               None
ENSG00000000457               None
ENSG00000000460               None

6. (Optional) Automatic ID conversion during preprocessing

You do not always have to convert IDs by hand. When adata.var_names are not official gene names, dynamo's monocle preprocessing pipeline converts them to gene symbols as its first step. In the log below, 24635 of 26193 IDs (about 94%) are mapped.

Here we use dyn.pp.Preprocessor with the monocle recipe, which performs typical steps (filtering, normalization, feature selection, PCA) and is commonly used in dynamo workflows.

Note: If you change var_names after preprocessing, you may break assumptions in downstream cached results. In practice, it’s best to do ID conversion before, or as part of, the main preprocessing pipeline.

preprocessor = dyn.pp.Preprocessor()
preprocessor.config_monocle_recipe(
    adata,
    n_top_genes=2000,
)
preprocessor.preprocess_adata_monocle(
    adata,
    tkey="time",
    experiment_type="one-shot",
)
adata.var.head()
|-----> Running monocle preprocessing pipeline...
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index. 
Dynamo is converting those names to official gene names.
|-----> Auto-detected species: human
|-----> Conversion finished. Found 24635/26193 symbols.
|-----------> filtered out 0 outlier cells
|-----------> filtered out 23299 outlier genes
|-----> PCA dimension reduction
|-----> <insert> X_pca to obsm in AnnData Object.
|-----> [Preprocessor-monocle] completed [1.6145s]
          gene_name_mapping            query           scopes    symbol  _score  nCells  nCounts  pass_basic_filter      log_m    log_cv     score      frac  use_for_pca       ntr
TSPAN6                 None  ENSG00000000003  ENSG00000000003    TSPAN6     1.0      16     16.0              False        NaN       NaN       NaN  0.000004        False  0.238095
TNMD                   None  ENSG00000000005  ENSG00000000005      TNMD     1.0       0      0.0              False        NaN       NaN       NaN  0.000000        False  0.000000
DPM1                   None  ENSG00000000419  ENSG00000000419      DPM1     1.0     161    180.0               True  -3.630229  2.004751  0.019326  0.000046         True  0.601476
SCYL3                  None  ENSG00000000457  ENSG00000000457     SCYL3     1.0      76     79.0              False        NaN       NaN       NaN  0.000024        False  0.364486
C1orf112               None  ENSG00000000460  ENSG00000000460  C1orf112     1.0      73     81.0              False        NaN       NaN       NaN  0.000020        False  0.500000