Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Non-deterministic row counts for spatial joins in DuckDB 1.2.0 #1060

Open
grobins opened this issue Feb 26, 2025 · 1 comment
Open

Non-deterministic row counts for spatial joins in DuckDB 1.2.0 #1060

grobins opened this issue Feb 26, 2025 · 1 comment

Comments

@grobins
Copy link

grobins commented Feb 26, 2025

Description
I'm experiencing inconsistent query results when running the same spatial query with DuckDB on EC2. When executing the identical query multiple times in succession, I get different row counts:

First run: 35,817 records
Second run: 291,614 records
Third run: 9,555 records

The correct number of records - which I get from running locally with 1.1.3 is 1,632,012

This appears to be a non-deterministic behavior introduced in DuckDB 1.2.0.
Environment

Local: DuckDB 1.1.3 (gives consistent results)
EC2: DuckDB 1.2.0
R package: duckdb
Extensions: spatial, aws, httpfs, icu

I'm not able to installed 1.1.3 into the EC2, closest I get is v1.1.3-dev165 but then the spatial extensions aren't working

Error: IO Error: Failed to install 'spatial'
The file was built for DuckDB version 'v1.1.3', but we can only load extensions built for DuckDB version '19864453f7'.

SQL I'm running

WITH events AS (SELECT *
                     , ST_Point(longitude, latitude) geom
                     , timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
                     , CAST(event_timestamp AS DATE) dom
                     , strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') timegroup
                     , extract(dow FROM event_timestamp_nz) dow
                    FROM read_parquet('MYLOCATION')
)
   , filtered_route_ways AS (SELECT *
                                  , st_buffer(ST_Envelope(geom), 0.001) bbox_geom
                                 FROM route_ways)
   , ranked_matches AS (SELECT
                            e.*
                          , r.way_id
                          , r.highway
                          , r.maxspeed
                          , r.osm_tuple_length_m
                          , r.osm_way_length_m
                          , r.osm_tuple_direction
                          , ROW_NUMBER() OVER (
            PARTITION BY e.machine_gid, e.event_timestamp
            ORDER BY ST_Distance(e.geom, r.geom)
        ) AS rank
                            FROM events e
                                     JOIN filtered_route_ways r
                                          ON ST_Intersects(e.geom, r.bbox_geom))
 
select * from ranked_matches;

SETUP before running

con <- dbConnect(duckdb::duckdb())
dbExecute(con, "INSTALL spatial; LOAD spatial;")
dbExecute(con, "INSTALL aws; LOAD aws;")
dbExecute(con, "INSTALL httpfs; LOAD httpfs;")
dbExecute(con, "INSTALL icu; LOAD icu;")
dbExecute(con, "SET TimeZone = 'Pacific/Auckland';")
dbWriteTable(con, "route_ways", route_ways)
dbExecute(con, "ALTER TABLE route_ways ADD COLUMN geom GEOMETRY;")
dbExecute(con, "UPDATE route_ways SET geom = ST_GeomFromText(linestring);")
dbExecute(con, "CREATE INDEX idx_geom ON route_ways USING RTREE (geom);")
dbExecute(con, "CALL load_aws_credentials('robinsight-data-hub');")

The 'events' source is exactly the same location in S3 that the local and EC2 versions are querying. the 'route_ways' dataframe is exactly the same, and I've confirmed both the EC2 and local queries return the correct results for the count from the 'events' CTE and 'filtered_route_ways' CTE.

UPDATE:
I replaced the ST_Intersects function with the following SQL and I'm getting consistent results between local and EC2 versions

WITH events AS (
    SELECT *
         , longitude
         , latitude
         , ST_Point(longitude, latitude) AS geom
         , timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
         , CAST(event_timestamp AS DATE) AS dom
         , strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') AS timegroup
         , extract(dow FROM event_timestamp_nz) AS dow
    FROM read_parquet('MYLOCATION')
),

route_ways_bounds AS (
    SELECT *
         , geom
         , ST_Buffer(ST_Envelope(geom), 0.001) AS bbox_geom
         , ST_XMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lon
         , ST_XMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lon
         , ST_YMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lat
         , ST_YMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lat
    FROM route_ways
),

filtered_events AS (
    SELECT e.*, r.way_id, r.highway, r.maxspeed, r.osm_tuple_length_m, r.osm_way_length_m, r.osm_tuple_direction, r.geom AS route_geom
    FROM events e
    JOIN route_ways_bounds r
        ON e.longitude BETWEEN r.min_lon AND r.max_lon
        AND e.latitude BETWEEN r.min_lat AND r.max_lat
),

ranked_matches AS (
    SELECT
        f.*,
        ROW_NUMBER() OVER (
            PARTITION BY f.machine_gid, f.event_timestamp
            ORDER BY ST_Distance(f.geom, f.route_geom)
        ) AS rank
    FROM filtered_events f
)
@eitsupi
Copy link
Contributor

eitsupi commented Feb 26, 2025

If you post a non-R-specific issue to this repository, doesn't it take a long time to resolve?
I think it would be worthwhile for you to post the issue to the upstream https://github.com/duckdb/duckdb to see if the issue reproduces on non-R clients.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants