Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Anndata Object with counts? #23

Open
Demond-dev opened this issue Jan 12, 2025 · 4 comments
Open

Anndata Object with counts? #23

Demond-dev opened this issue Jan 12, 2025 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@Demond-dev
Copy link

Demond-dev commented Jan 12, 2025

Hi,

I was wondering if there was anyway to incorporate the output from the ENACT pipeline into an anndata object with the binned counts and the whole slide image for use in spatial plots with other packages? I've tried running the pipeline on data our lab has generated and it only resulted in an anndata object with 1000 variables and no image to use for spatial plotting.

Thanks!

@LucFrancisENX
Copy link

LucFrancisENX commented Jan 28, 2025

Hi @Demond-dev , you can try this function I adapted from the bin2cell read_visium function. It tracks back the changes made to the pixel coordinates within enact so that the spots fit to the high resolution image

All you need is the object_path (Path to the enact adata.h5 output)
The spaceranger_path (Path to spaceranger 002_binned output directory for visium datafiles)
The source_image_path (Path to your btf file)

adata = read_visium_enact(enact_object_path=object_path,
                          spaceranger_path=spaceranger_path,
                          source_image_path=source_image_path)
def read_visium_enact(
    enact_object_path: Path | str,
    spaceranger_path: Path | str,
    genome: str | None = None,
    *,
    count_file: str = "filtered_feature_bc_matrix.h5",
    library_id: str | None = None,
    load_images: bool | None = True,
    source_image_path: Path | str | None = None,
    spaceranger_image_path: Path | str | None = None,
) -> AnnData:
    """\
    Read 10x-Genomics-formatted visum dataset with ENACT pipeline output.

    In addition to reading regular 10x output,
    this looks for the `spatial` folder and loads images,
    coordinates and scale factors.
    Based on the `Space Ranger output docs`_.

    See :func:`~scanpy.pl.spatial` for a compatible plotting function.

    .. _Space Ranger output docs: https://support.10xgenomics.com/spatial-gene-expression/software/pipelines/latest/output/overview
    
    The function b2c.read_visium function has been adapted with the following changes

    1. Read in enact output h5 file
    2. When reading in spaceranger output, find the pixel boundaries of the cropped image
    3. The pixel boundary minimum values are then added to each cell coordinate within the enact output

    #### This should result in enact output object complete with associated image data ready for squidpy downstream analysis

    Parameters
    ----------
    enact_object_path
        Path to enact output h5 output file. 
        Should be in a similar format to: ".../output_files/{study}/tmap/{segmentation-method}|cell_typist_cells_adata.h5" (if celltypist step was performed)
    spaceranger_path
        Path to spaceranger 002_binned output directory for visium datafiles.
    genome
        Filter expression to genes within this genome.
    count_file
        Which file in the passed directory to use as the count file. Typically would be one of:
        'filtered_feature_bc_matrix.h5' or 'raw_feature_bc_matrix.h5'.
    library_id
        Identifier for the visium library. Can be modified when concatenating multiple adata objects.
    source_image_path
        Path to the high-resolution tissue image. This would be your btf file
    spaceranger_image_path
        Path to the folder containing the spaceranger output hires/lowres tissue images. If `None`, 
        will go with the `spatial` folder of the provided `spaceranger_path`.

    Returns
    -------
    Annotated data matrix, where observations/cells are named by their
    barcode and variables/genes by gene name. Stores the following information:

    :attr:`~anndata.AnnData.X`
        The data matrix is stored
    :attr:`~anndata.AnnData.obs_names`
        Cell names
    :attr:`~anndata.AnnData.var_names`
        Gene names for a feature barcode matrix, probe names for a probe bc matrix
    # To be added:
    :attr:`~anndata.AnnData.var`\\ `['gene_ids']`
        Gene IDs
    :attr:`~anndata.AnnData.var`\\ `['feature_types']`
        Feature types
    #
    :attr:`~anndata.AnnData.var`
        Any additional metadata present in /matrix/features is read in.
    :attr:`~anndata.AnnData.uns`\\ `['spatial']`
        Dict of spaceranger output files with 'library_id' as key
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['images']`
        Dict of images (`'hires'` and `'lowres'`)
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['scalefactors']`
        Scale factors for the spots
    :attr:`~anndata.AnnData.uns`\\ `['spatial'][library_id]['metadata']`
        Files metadata: 'chemistry_description', 'software_version', 'source_image_path'
    :attr:`~anndata.AnnData.obsm`\\ `['spatial']`
        Spatial spot coordinates, usable as `basis` by :func:`~scanpy.pl.embedding`.
    """
 

    path = Path(spaceranger_path)
    #if not provided, assume the hires/lowres images are in the same folder as everything
    #except in the spatial subdirectory
    if spaceranger_image_path is None:
        spaceranger_image_path = path / "spatial"
    else:
        spaceranger_image_path = Path(spaceranger_image_path) 

    adata = read_h5ad(enact_object_path)   

    adata.uns["spatial"] = dict()

    from h5py import File

    with File(path / count_file, mode="r") as f:
        attrs = dict(f.attrs)
    if library_id is None:
        library_id = str(attrs.pop("library_ids")[0], "utf-8")

    adata.uns["spatial"][library_id] = dict()

    with File(path / count_file, mode="r") as f:
        # Extract data from the HDF5 file
        feature_names = f['matrix/features/name'][:]  # Load the full dataset
        gene_ids = f['matrix/features/id'][:]
        feature_types = f['matrix/features/feature_type'][:]
        genome = f['matrix/features/genome'][:]

    # Decode byte strings (if necessary) and create a pandas DataFrame
    adata_var = pd.DataFrame({
        "gene_ids": [x.decode('utf-8') for x in gene_ids],
        "feature_types": [x.decode('utf-8') for x in feature_types],
        "genome": [x.decode('utf-8') for x in genome]
    })

    # Set feature names as the index (assuming they are unique)
    adata_var.index = [x.decode('utf-8') for x in feature_names]

    ## Some genes are duplicated in adata_var. During troubleshooting found these TBCE HSPA14 TMSB15B
    ## Since not linked to any gene expression data - can drop
    # # Drop the duplicated rows based on the index (keeping only the first occurrence of each)
    adata_var = adata_var[~adata_var.index.duplicated(keep='first')]

    adata.var = adata_var

    if load_images:
        tissue_positions_file = (
            path / "spatial/tissue_positions.csv"
            if (path / "spatial/tissue_positions.csv").exists()
            else path / "spatial/tissue_positions.parquet" if (path / "spatial/tissue_positions.parquet").exists()
            else path / "spatial/tissue_positions_list.csv"
        )
        files = dict(
            tissue_positions_file=tissue_positions_file,
            scalefactors_json_file=path / "spatial/scalefactors_json.json",
            hires_image=spaceranger_image_path / "tissue_hires_image.png",
            lowres_image=spaceranger_image_path / "tissue_lowres_image.png",
        )
        # check if files exists, continue if images are missing
        for f in files.values():
            if not f.exists():
                if any(x in str(f) for x in ["hires_image", "lowres_image"]):
                    logg.warning(
                        f"You seem to be missing an image file.\n"
                        f"Could not find '{f}'."
                    )
                else:
                    raise OSError(f"Could not find '{f}'")

        adata.uns["spatial"][library_id]["images"] = dict()
        for res in ["hires", "lowres"]:
            try:
                adata.uns["spatial"][library_id]["images"][res] = imread(
                    str(files[f"{res}_image"])
                )
            except Exception:
                raise OSError(f"Could not find '{res}_image'")

       # read json scalefactors
        adata.uns["spatial"][library_id]["scalefactors"] = json.loads(
            files["scalefactors_json_file"].read_bytes()
        )

        adata.uns["spatial"][library_id]["metadata"] = {
            k: (str(attrs[k], "utf-8") if isinstance(attrs[k], bytes) else attrs[k])
            for k in ("chemistry_description", "software_version")
            if k in attrs
        }  

        # read coordinates
        if files["tissue_positions_file"].name.endswith(".csv"):
            positions = pd.read_csv(
                files["tissue_positions_file"],
                header=0 if tissue_positions_file.name == "tissue_positions.csv" else None,
                index_col=0,
            )
        elif files["tissue_positions_file"].name.endswith(".parquet"):
            positions = pd.read_parquet(files["tissue_positions_file"])
            #need to set the barcode to be the index
            positions.set_index("barcode", inplace=True)
        positions.columns = [
            "in_tissue",
            "array_row",
            "array_col",
            "pxl_row_in_fullres", # the original column in bin2cell.read_visium, was set to pxl_col_in_fullres 
            "pxl_col_in_fullres", # the orginal column in bin2cell_read_visium was set to pcl_row_in_fullres
        ]
        # see link below for structure of tissue_positions_file.parquet file
        # https://www.10xgenomics.com/support/software/space-ranger/latest/analysis/outputs/spatial-outputs

        # cleaning up, removing negative coords,removing out of tissue bins
        tissue_pos_list_filt = positions[positions.in_tissue == 1]
        tissue_pos_list_filt["pxl_row_in_fullres"] = tissue_pos_list_filt[
            "pxl_row_in_fullres"
        ].astype(int)
        tissue_pos_list_filt["pxl_col_in_fullres"] = tissue_pos_list_filt[
            "pxl_col_in_fullres"
        ].astype(int)
        tissue_pos_list_filt = tissue_pos_list_filt.loc[
            (tissue_pos_list_filt.pxl_row_in_fullres >= 0)
            & (tissue_pos_list_filt.pxl_col_in_fullres >= 0)
        ]
        x_min = tissue_pos_list_filt["pxl_col_in_fullres"].min()
        y_min = tissue_pos_list_filt["pxl_row_in_fullres"].min()
        x_max = tissue_pos_list_filt["pxl_col_in_fullres"].max()
        y_max = tissue_pos_list_filt["pxl_row_in_fullres"].max()

        # adding crop boundary minimums to the enact output cell coordinates
        adata.obsm['spatial']['cell_x'] = (adata.obsm['spatial']['cell_x'] + x_min) 
        adata.obsm['spatial']['cell_y'] = (adata.obsm['spatial']['cell_y'] + y_min) 

        # convert to an array for compatibility with squidpy
        adata.obsm['spatial'] = adata.obsm['spatial'].to_numpy()

        # put image path in uns
        if source_image_path is not None:
            # get an absolute path
            source_image_path = str(Path(source_image_path).resolve())
            adata.uns["spatial"][library_id]["metadata"]["source_image_path"] = str(
                source_image_path
            )
        return adata

@LucFrancisENX
Copy link

This was used on the output of the human colon cancer visium hd dataset, and I was able to perform standard downstream processing using scanpy/squidpy.

One issue I am having is that if I try to save to raw beforehand (as below), my kernel crashes when running any downstream analysis (e.g. filter_cells).

adata.raw = adata.copy()

Would anyone have a solution for this please? I don't encounter any such issues with the bin2cell output

@AlbertPlaPlanas
Copy link

@Demond-dev in our latest release we have added a notebook example that may help you visualize the outputs in an image: https://github.com/Sanofi-Public/enact-pipeline/blob/main/ENACT_outputs_demo.ipynb (note that in git it does not render the image as it is an html interactive plot, but once you load it in jupyter or vscode it should be visible).

Also, the latest release incorporates compatibility with TIssUUmaps for output visualization. The execution generates a .tmap file that allows you to load your outcomes in TissUUmaps: https://github.com/Sanofi-Public/enact-pipeline/blob/main/ENACT_outputs_demo.ipynb

@AlbertPlaPlanas
Copy link

was able to perform standard downstream processing using scanpy/squidpy.

One issue I am having is that if I try to

Have you checked your memory usage? it seems you may be running out of memory.

@AlbertPlaPlanas AlbertPlaPlanas added the help wanted Extra attention is needed label Jan 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants