Question: Raster Tile Merging and TIF File Output #489

Open
RickLeite opened this issue Dec 15, 2023 · 4 comments

Comments

@RickLeite

How can I merge raster tiles and write them to a TIFF file?

Is there already a way to do that, or is it planned to be introduced?


My Current Approach:

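import mosaic as mos
from pyspark.sql.functions import collect_list

# Assumes Mosaic and its GDAL support are already enabled on the cluster.
# Read every .tif under the directory and gather all tiles into one array column.
df = (
    spark.read.format("gdal").option("extensions", "tif")
    .load("dbfs:/FileStore/temp/rastersfile/extracted")
    .groupBy().agg(collect_list("tile").alias("tile"))
)

# Merge the collected tiles into a single tile.
merged_tile = df.select(mos.rst_merge("tile").alias("merged"))

# collect() pulls the whole merged raster into driver memory.
result = merged_tile.collect()[0]

# The "raster" field is treated here as the raw file bytes and written out unchanged.
raster_bytes = bytes(result["merged"]["raster"])

output_path = "/dbfs/FileStore/temp/rastersfile/merged/mergedrasters.tif"
with open(output_path, "wb") as output_file:
    output_file.write(raster_bytes)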
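
Before writing the bytes out, it can help to confirm that the `raster` payload really is a complete TIFF file and not some encoded text form of one. The sketch below checks the first four bytes against the standard TIFF and BigTIFF signatures. `looks_like_tiff` is a hypothetical helper for illustration, not part of Mosaic or rasterio.

```python
# Standard file signatures: byte-order mark ("II" or "MM") followed by
# the magic number 42 (classic TIFF) or 43 (BigTIFF).
TIFF_SIGNATURES = (
    b"II*\x00",  # classic TIFF, little-endian
    b"MM\x00*",  # classic TIFF, big-endian
    b"II+\x00",  # BigTIFF, little-endian
    b"MM\x00+",  # BigTIFF, big-endian
)

def looks_like_tiff(data) -> bool:
    """Return True when data starts with a TIFF or BigTIFF signature."""
    return bytes(data[:4]) in TIFF_SIGNATURES

print(looks_like_tiff(b"II*\x00" + b"\x00" * 8))  # raw TIFF header: True
print(looks_like_tiff(b"SUkqAA=="))               # base64 text of that header: False
```

A passing check only says the payload starts like a TIFF. It says nothing about whether the georeferencing metadata survived.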

@RickLeite RickLeite changed the title Question: Raster Tile Merging and TIFF File Output Question: Raster Tile Merging and TIF File Output Dec 15, 2023
@RickLeite
Author

Clearly, my current approach loses all of the file metadata. In addition, handling a large number of rasters causes kernel issues because of memory constraints. I've tried the latest rasterio UDFs, but I'm unsure how to proceed after merging the tiles.

@RickLeite
Author

RickLeite commented Dec 16, 2023

Using a rasterio UDF

import mosaic as mos
import rasterio
from io import BytesIO
from pathlib import Path
from pyspark.sql.functions import collect_list, lit, udf
from rasterio.io import MemoryFile

df = (
    spark.read.format("gdal").option("extensions", "tif")
    .load('/FileStore/temp/esri')
    .groupBy().agg(collect_list("tile").alias("tile"))
)

merged_tile = df.select(mos.rst_merge("tile").alias('merged'))

@udf("string")
def write_raster(raster, parent_dir):
  # Open the in-memory raster bytes as a rasterio dataset.
  with MemoryFile(BytesIO(raster)) as memfile:
    with memfile.open() as dataset:
      Path(parent_dir).mkdir(parents=True, exist_ok=True)
      # Invert rasterio's extension -> driver map to find an extension for this driver.
      extensions_map = rasterio.drivers.raster_driver_extensions()
      driver_map = {v: k for k, v in extensions_map.items()}
      extension = driver_map[dataset.driver]
      file_id = 5234476790949929865   # Manually set UUID
      path = f"{parent_dir}/{file_id}.{extension}"
      print(f" parent_dir: {parent_dir}, file_id: {file_id}, extension: {extension}")

      # Write the dataset back to disk with the same profile as the source.
      with rasterio.open(path, "w", **dataset.profile) as dst:
        dst.write(dataset.read())
        print(f"written to: {path}")
      return path

Since the returned merged tiles only provide the index_id, raster, parentPath, and driver, I set the UUID manually.
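
As an alternative to hard-coding the identifier, a unique file name could be generated on each call with Python's standard `uuid` module. This is only a sketch: `make_output_path` is a hypothetical helper, not something Mosaic or rasterio provides.

```python
import uuid

def make_output_path(parent_dir: str, extension: str) -> str:
    """Build a unique output path under parent_dir (hypothetical helper)."""
    file_id = uuid.uuid4().hex  # random 32-character hexadecimal identifier
    return f"{parent_dir.rstrip('/')}/{file_id}.{extension.lstrip('.')}"

print(make_output_path("/dbfs/FileStore/temp/esri/rastermerged", "tiff"))
```

Inside the UDF, the result could stand in for the manually built `path`, so repeated runs do not overwrite each other.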


merged_tile.select(write_raster("merged.raster", lit("dbfs:/FileStore/temp/esri/rastermerged"))).show(truncate=False)

Apparently it is a little buggy: it wrote to 'dbfs:' as if that were a local folder named 'dbfs:', and surprisingly I can't reach the file by browsing DBFS from the Databricks catalog. Anyway, I was able to copy the file to the desired location with shutil.

import shutil
shutil.copy('dbfs:/FileStore/temp/esri/rastermerged/5234476790949929865.tiff', '/dbfs/FileStore/temp/esri/rastermerged/5234476790949929865.tiff')
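
The `dbfs:` folder most likely appears because rasterio and `pathlib` use ordinary local file APIs, which treat `dbfs:/...` as a relative path and not as a DBFS URI. On Databricks, DBFS is normally reachable from local file APIs through the `/dbfs` mount, so one possible workaround is to translate the URI before passing it to the UDF. A minimal sketch, where `to_local_dbfs_path` is a hypothetical helper:

```python
def to_local_dbfs_path(path: str) -> str:
    """Translate a 'dbfs:/...' URI into its '/dbfs/...' mount path (hypothetical helper)."""
    prefix = "dbfs:"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):].lstrip("/")
    return path  # not a dbfs: URI, leave unchanged

print(to_local_dbfs_path("dbfs:/FileStore/temp/esri/rastermerged"))
# prints: /dbfs/FileStore/temp/esri/rastermerged
```

Passing the translated path as `parent_dir` should make the separate copy step unnecessary, assuming the `/dbfs` mount is available on the workers.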

But when I downloaded the merged file, it corresponded to only one of the rasters in the directory (the first one). This is strange because I merged them, and my first approach, which writes the raster bytes straight to a file, does give me the merged rasters.

@milos-colic
Contributor

@RickLeite thank you for your question.

The parent behaviour you are describing is the current behaviour, which we plan to adjust.
At the moment only one parent is reported, even though there may be many parents.
In upcoming versions we will update the schema to capture a list of parents, as opposed to the single string parent path we have now.

So your output file is a merged raster, but it only selects the first parent from the collected set at runtime (which won't be the same value between reruns).

This is currently planned for version 0.4.1.

Kind regards
Milos

@RickLeite
Author

Hi @milos-colic,

Appreciate your response! Excited for what's ahead!
