Gene ID → Gene Symbol conversion in `dynamo`

This notebook is a practical companion for a very common preprocessing step in single-cell workflows: converting feature identifiers (e.g. Ensembl gene IDs like ENSG00000141510) into human-readable gene symbols (e.g. TP53) and storing the result in your AnnData object.

Why this matters:

- Many public datasets (and some pipelines) ship with Ensembl IDs as adata.var_names.
- Downstream steps like marker inspection, gene set scoring, plotting, and cross-dataset integration are usually easier when gene symbols are used consistently.
- Some methods/tools also expect symbols, or at least benefit from a standardized identifier space.

In this tutorial we demonstrate two typical scenarios:

- Human (Hematopoiesis): convert Ensembl IDs in a dynamo.sample_data dataset.
- Zebrafish: convert Ensembl IDs (often carrying version suffixes like .1) and show how to make the Ensembl release used for the lookup explicit.

Tip: ID conversion is rarely lossless. Always check the mapping rate, and decide how to handle duplicates after conversion (for example, two Ensembl IDs mapping to the same symbol).
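
As a minimal sketch of the two checks mentioned in the tip, the helper below computes a mapping rate and lists symbols that occur more than once. The function name `mapping_report` and the toy inputs are assumptions for illustration (the second `MS4A1` entry and the unmapped ID are made up); it is not part of dynamo.

```python
from collections import Counter


def mapping_report(ids, symbols):
    """Summarize an ID -> symbol mapping given as two parallel lists.

    `symbols` holds None for IDs that did not map.
    Returns (mapping_rate, sorted list of symbols that appear more than once).
    """
    mapped = [s for s in symbols if s is not None]
    rate = len(mapped) / len(ids) if ids else 0.0
    counts = Counter(mapped)
    duplicated = sorted(s for s, n in counts.items() if n > 1)
    return rate, duplicated


# Toy data: the last two IDs are hypothetical placeholders, not real Ensembl IDs.
ids = ["ENSG00000167286", "ENSG00000156738", "HYPOTHETICAL_ID_A", "HYPOTHETICAL_ID_B"]
symbols = ["CD3D", "MS4A1", "MS4A1", None]

rate, duplicated = mapping_report(ids, symbols)
print(f"mapping rate: {rate:.0%}; duplicated symbols: {duplicated}")
```

How to resolve the duplicates (keep the first, sum counts, or make names unique) is a per-dataset decision; the sketch only surfaces them.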

1. Setup

We start by importing dynamo and setting plotting defaults.
This follows the style of the official dynamo tutorials: keep the notebook reproducible, and make figures look consistent across machines (especially important when sharing results).

If you run into font warnings on a server (common on Linux/HPC), you can ignore them; they do not affect the analysis.

2. Load an example dataset (human)

Here we use dynamo.sample_data.hematopoiesis_raw() as a compact, real-world example.

It contains raw count matrices and metadata needed for typical preprocessing.
In many datasets like this, genes can be indexed by Ensembl IDs instead of symbols.

We’ll inspect adata.var_names first to understand what identifier system we are starting from.
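
One quick way to do this inspection is a rough heuristic that counts how many names look like Ensembl IDs. The helper `looks_like_ensembl` below is a hypothetical illustration, and the list is a toy sample standing in for `list(adata.var_names)`.

```python
def looks_like_ensembl(name):
    """Heuristic: 'ENS' prefix and an 11-digit tail, ignoring any '.version' suffix."""
    core = name.split(".")[0]
    return len(core) >= 15 and core.startswith("ENS") and core[-11:].isdigit()


# In the notebook this would be list(adata.var_names); the values here are a toy sample.
var_names = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419", "TP53"]

share = sum(looks_like_ensembl(n) for n in var_names) / len(var_names)
print(f"{share:.0%} of names look like Ensembl IDs")
```

A share close to 100% suggests the index holds Ensembl IDs; a mixed result is a hint to look at the individual names before converting.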

%load_ext autoreload
%autoreload 2

import dynamo as dyn

3. `convert2gene_symbol`

dynamo.pp.convert2gene_symbol performs batch ID mapping and returns a table containing at least:

- query: the ID used for the query (often the Ensembl ID without version suffix)
- symbol: the mapped gene symbol

Under the hood, the conversion relies on an identifier query service (the same concept as MyGene.info-style “batch query” APIs). The key point is that the service must be told what kind of IDs it is receiving, via the scopes argument (e.g. ensembl.gene or ensembl.transcript).

4. Conversion test

A typical safe pattern is:

1. Store the “query-ready” ID in adata.var['query'] (often stripping version suffix).
2. Call convert2gene_symbol(...).
3. Merge the returned table back into adata.var.
4. Subset to successfully mapped genes (optional but common).
5. Set adata.var_names = adata.var['symbol'].

After you set adata.var_names to symbols, consider also keeping the original IDs in a separate column (e.g. adata.var['ensembl_id']) for traceability.
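
The steps above can be sketched on a plain pandas DataFrame standing in for adata.var. This is an illustration under stated assumptions: the version suffixes and the third ID are made up, and the `mapping` table is a hand-written stand-in for the result of convert2gene_symbol so that the example runs without network access.

```python
import pandas as pd

# Toy stand-in for adata.var, indexed by versioned Ensembl IDs.
# The version suffixes and the last ID are hypothetical.
var = pd.DataFrame(
    index=pd.Index(
        ["ENSG00000167286.5", "ENSG00000156738.17", "ENSG99999999999.1"],
        name="gene_id",
    )
)

# 1. Query-ready IDs: strip the version suffix.
var["query"] = var.index.str.split(".").str[0]

# 2. Hand-written stand-in for the table returned by convert2gene_symbol,
#    indexed by the queried ID.
mapping = pd.DataFrame(
    {"symbol": ["CD3D", "MS4A1"]},
    index=pd.Index(["ENSG00000167286", "ENSG00000156738"], name="query"),
)

# 3. Merge the mapping back; IDs without a hit get NaN in 'symbol'.
var = var.join(mapping, on="query")

# Keep the original IDs in their own column for traceability.
var["ensembl_id"] = var.index

# 4. Subset to successfully mapped genes.
#    (With a real AnnData you would subset the object itself, e.g. adata[:, mask].)
mask = var["symbol"].notna()
var = var[mask].copy()

# 5. Use the symbols as the new index.
var.index = var["symbol"].to_numpy()

print(var)
```

Keeping `ensembl_id` as a column means the original identifiers survive the index change, which makes it possible to trace a symbol back to its source ID later.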

dyn.preprocessing.convert2gene_symbol(
    ['ENSG00000167286', 'ENSG00000156738'],
    # ensembl_release=109,
)

|-----> Auto-detected species: human
|-----> Conversion finished. Found 2/2 symbols.

	symbol	_score
query
ENSG00000167286	CD3D	1.0
ENSG00000156738	MS4A1	1.0

5. Zebrafish example (when IDs are `ENSDARG...`)

For non-human datasets, pay extra attention to species and annotation version.

Zebrafish Ensembl gene IDs typically start with ENSDARG. Many pipelines also append a version suffix (e.g. .1), so we strip it before conversion.
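
A minimal sketch of the suffix stripping, assuming a hypothetical helper named `strip_version`. It only touches names that start with "ENS", so symbols that happen to contain a dot are left alone.

```python
def strip_version(ensembl_id):
    """Drop a trailing '.<digits>' version suffix from an Ensembl ID.

    Names that do not start with 'ENS', or that have no numeric suffix,
    are returned unchanged.
    """
    base, dot, version = ensembl_id.rpartition(".")
    if ensembl_id.startswith("ENS") and dot and version.isdigit():
        return base
    return ensembl_id


print(strip_version("ENSDARG00000035558.1"))
print(strip_version("ENSDARG00000035558"))
```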

Depending on your pipeline, you may also want to pass an organism-specific Ensembl release (or otherwise match the annotation build you used for quantification). If the mapping looks unexpectedly poor, the most common causes are:

- using the wrong release / annotation build
- providing transcript IDs while querying as gene IDs (or vice versa)
- keeping the version suffix
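
Two of these causes can be checked before any query is sent. The sketch below relies on the Ensembl stable ID layout, in which a feature letter (G for gene, T for transcript, P for protein) precedes the 11-digit number. The `diagnose` helper is a hypothetical illustration that covers only these three feature types, and the transcript ID in the example is made up.

```python
import re

# "ENS" + optional species code + feature letter + 11 digits + optional ".version"
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*?([GTP])\d{11}(\.\d+)?$")
FEATURE = {"G": "gene", "T": "transcript", "P": "protein"}


def diagnose(ids):
    """Count feature types and leftover version suffixes in a list of IDs."""
    report = {"gene": 0, "transcript": 0, "protein": 0, "versioned": 0, "unrecognized": 0}
    for identifier in ids:
        match = ENSEMBL_ID.match(identifier)
        if match is None:
            report["unrecognized"] += 1
            continue
        report[FEATURE[match.group(1)]] += 1
        if match.group(2):
            report["versioned"] += 1
    return report


# The second ID is a made-up transcript ID used only to exercise the check.
print(diagnose(["ENSDARG00000035558", "ENSDART00000000004.3", "gps2"]))
```

A non-zero transcript or versioned count in a list that is about to be queried as gene IDs points at the second or third cause; the release mismatch cannot be detected from the IDs alone.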

The goal of this section is not to claim one “correct” release universally, but to show the pattern for making the mapping explicit and reproducible.

dyn.preprocessing.convert2gene_symbol(
    ['ENSDARG00000035558'],
    ensembl_release=77,
)

|-----> Auto-detected species: zebrafish
|-----> Conversion finished. Found 1/1 symbols.

	symbol	_score
query
ENSDARG00000035558	gps2	1.0

adata = dyn.sample_data.hematopoiesis_raw()
adata.var.head()

|-----> Downloading raw hematopoiesis adata
|-----> Downloading data to ./data/hematopoiesis_raw.h5ad
|-----> File ./data/hematopoiesis_raw.h5ad already exists.

	gene_name_mapping
gene_id
ENSG00000000003	None
ENSG00000000005	None
ENSG00000000419	None
ENSG00000000457	None
ENSG00000000460	None

6. (Optional) Automatic ID conversion during preprocessing

Once your gene identifiers are standardized, you can proceed with your preferred preprocessing recipe.

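Here we use dyn.pp.Preprocessor with the monocle recipe, which performs typical steps (filtering, normalization, feature selection, PCA) and is commonly used in dynamo workflows. As the log output shows, the recipe detects that the gene index holds Ensembl IDs and converts them to gene symbols automatically.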

Note: If you change var_names after preprocessing, you may break assumptions in downstream cached results. In practice, it’s best to do ID conversion before running the main preprocessing pipeline.

preprocessor = dyn.pp.Preprocessor()
preprocessor.config_monocle_recipe(
    adata,
    n_top_genes=2000,
)
preprocessor.preprocess_adata_monocle(
    adata,
    tkey="time",
    experiment_type="one-shot",
)
adata.var.head()

|-----> Running monocle preprocessing pipeline...
|-----> convert ensemble name to official gene name
|-----? Your adata object uses non-official gene names as gene index. 
Dynamo is converting those names to official gene names.
|-----> Auto-detected species: human
|-----> Conversion finished. Found 24635/26193 symbols.
|-----------> filtered out 0 outlier cells
|-----------> filtered out 23299 outlier genes
|-----> PCA dimension reduction
|-----> <insert> X_pca to obsm in AnnData Object.
|-----> [Preprocessor-monocle] completed [1.6145s]

	gene_name_mapping	query	scopes	symbol	_score	nCells	nCounts	pass_basic_filter	log_m	log_cv	score	frac	use_for_pca	ntr
TSPAN6	None	ENSG00000000003	ENSG00000000003	TSPAN6	1.0	16	16.0	False	NaN	NaN	NaN	0.000004	False	0.238095
TNMD	None	ENSG00000000005	ENSG00000000005	TNMD	1.0	0	0.0	False	NaN	NaN	NaN	0.000000	False	0.000000
DPM1	None	ENSG00000000419	ENSG00000000419	DPM1	1.0	161	180.0	True	-3.630229	2.004751	0.019326	0.000046	True	0.601476
SCYL3	None	ENSG00000000457	ENSG00000000457	SCYL3	1.0	76	79.0	False	NaN	NaN	NaN	0.000024	False	0.364486
C1orf112	None	ENSG00000000460	ENSG00000000460	C1orf112	1.0	73	81.0	False	NaN	NaN	NaN	0.000020	False	0.500000
