More interpretable error in get_esm_embedding #211

@MUCDK

Description of feature

Hi @guillaumehu,

While writing a tutorial that uses get_esm_embedding, I ran into the following error:

HTTPError                                 Traceback (most recent call last)
Cell In[55], line 1
----> 1 get_esm_embedding(adata, gene_key="gene_ensembl", gene_emb_key="esm_embeddings", null_value = "control", esm_model_name="esm2_t36_3B_UR50D")

File /ictstr01/home/icb/dominik.klein/git_repos/cell_flow_perturbation/src/cellflow/preprocessing/_gene_emb.py:359, in get_esm_embedding(adata, gene_key, null_value, gene_emb_key, copy, esm_model_name, toks_per_batch, trunc_len, truncation, use_cuda, cache_dir)
    357     genes_todo.extend(adata.obs[col].unique().tolist())
    358 unique_genes = list(set(genes_todo) - {null_value, None})
--> 359 results, metadata = protein_features_from_genes(
    360     genes=unique_genes,
    361     esm_model_name=esm_model_name,
    362     toks_per_batch=toks_per_batch,
    363     trunc_len=trunc_len,
    364     truncation=truncation,
    365     use_cuda=use_cuda,
    366     cache_dir=cache_dir,
    367 )
    368 adata.uns[gene_emb_key] = results
    369 adata.uns[gene_emb_key + "_metadata"] = metadata

File /ictstr01/home/icb/dominik.klein/git_repos/cell_flow_perturbation/src/cellflow/preprocessing/_gene_emb.py:274, in protein_features_from_genes(genes, esm_model_name, toks_per_batch, trunc_len, truncation, use_cuda, cache_dir)
    269 if os.getenv("HF_HOME") is None and cache_dir is None:
    270     logger.warning(
    271         "HF_HOME environment variable is not set and `cache_dir` is None. \
    272             Cache will be stored in the current directory."
    273     )
--> 274 metadata = prot_sequence_from_ensembl(genes)
    275 to_emb = metadata[metadata.protein_sequence.notnull()]
    276 use_cuda = use_cuda and torch.cuda.is_available()

File /ictstr01/home/icb/dominik.klein/git_repos/cell_flow_perturbation/src/cellflow/preprocessing/_gene_emb.py:119, in prot_sequence_from_ensembl(ensembl_gene_id)
    117 df = pd.DataFrame(columns=columns)
    118 for gene_id in ensembl_gene_id:
--> 119     gene_info = GeneInfo(gene_id)
    120     results[gene_id] = gene_info.protein_sequence
    121     data = [
    122         [
    123             gene_id,
   (...)
    129         ]
    130     ]

File <string>:4, in __init__(self, gene_id)

File /ictstr01/home/icb/dominik.klein/git_repos/cell_flow_perturbation/src/cellflow/preprocessing/_gene_emb.py:83, in GeneInfo.__post_init__(self)
     81 self.transcript_id: str | None = None
     82 self.display_name: str | None = None
---> 83 self.canonical_transcript_info = fetch_canonical_transcript_info(self.gene_id)
     84 if self.canonical_transcript_info:
     85     self.transcript_id = self.canonical_transcript_info["transcript_id"]

File /ictstr01/home/icb/dominik.klein/git_repos/cell_flow_perturbation/src/cellflow/preprocessing/_gene_emb.py:43, in fetch_canonical_transcript_info(ensembl_gene_id)
     41 response = requests.get(server + ext, headers=headers)
     42 if not response.ok:
---> 43     response.raise_for_status()
     45 gene_data = response.json()
     46 transcripts = gene_data.get("Transcript", [])

File ~/mambaforge/envs/cellflow/lib/python3.12/site-packages/requests/models.py:1024, in Response.raise_for_status(self)
   1019     http_error_msg = (
   1020         f"{self.status_code} Server Error: {reason} for url: {self.url}"
   1021     )
   1023 if http_error_msg:
-> 1024     raise HTTPError(http_error_msg, response=self)

HTTPError: 400 Client Error: Bad Request for url: https://rest.ensembl.org/lookup/id/nan?expand=1

Can we raise an error that states which gene the request failed for? This would help users a lot.
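
One possible shape for this, as a rough sketch rather than a proposed patch: the URL in the traceback ends in `/lookup/id/nan`, which suggests the failing entry is a missing value in `gene_ensembl` rather than a real gene ID. The two helpers below are hypothetical (`split_valid_gene_ids` and `fetch_with_gene_context` do not exist in cellflow). The first filters out missing IDs before any request is made, and the second re-raises a failed lookup with the gene ID in the message. The `fetch` argument stands in for `fetch_canonical_transcript_info`, and the broad `except Exception` is an assumption made to keep the sketch free of dependencies; in the real code it would presumably be `requests.HTTPError`.

```python
import math


def split_valid_gene_ids(gene_ids):
    """Separate usable gene IDs from missing or empty entries (hypothetical helper)."""
    valid, invalid = [], []
    for gene_id in gene_ids:
        is_float_nan = isinstance(gene_id, float) and math.isnan(gene_id)
        # A NaN that was cast to str shows up as the literal "nan".
        is_bad_str = isinstance(gene_id, str) and gene_id.strip().lower() in {"", "nan"}
        if gene_id is None or is_float_nan or is_bad_str or not isinstance(gene_id, str):
            invalid.append(gene_id)
        else:
            valid.append(gene_id)
    return valid, invalid


def fetch_with_gene_context(gene_id, fetch):
    """Call fetch(gene_id); on failure, re-raise with the gene ID in the message."""
    try:
        return fetch(gene_id)
    except Exception as err:  # assumption: narrow this to requests.HTTPError in practice
        raise ValueError(
            f"Could not retrieve Ensembl information for gene {gene_id!r}: {err}"
        ) from err
```

With something like this in `prot_sequence_from_ensembl`, the invalid entries could be reported in a single warning or error up front, and any remaining HTTP failure would name the offending gene while keeping the original exception attached via `from err`.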

Labels: enhancement (New feature or request)
