You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Description
I'm experiencing inconsistent query results when running the same spatial query with DuckDB on EC2. When executing the identical query multiple times in succession, I get different row counts:
First run: 35,817 records
Second run: 291,614 records
Third run: 9,555 records
The correct number of records - which I get from running locally with 1.1.3 is 1,632,012
This appears to be a non-deterministic behavior introduced in DuckDB 1.2.0.
Environment
I'm not able to installed 1.1.3 into the EC2, closest I get is v1.1.3-dev165 but then the spatial extensions aren't working
Error: IO Error: Failed to install 'spatial'
The file was built for DuckDB version 'v1.1.3', but we can only load extensions built for DuckDB version '19864453f7'.
SQL I'm running
WITH events AS (SELECT *
, ST_Point(longitude, latitude) geom
, timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
, CAST(event_timestamp AS DATE) dom
, strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') timegroup
, extract(dow FROM event_timestamp_nz) dow
FROM read_parquet('MYLOCATION')
)
, filtered_route_ways AS (SELECT *
, st_buffer(ST_Envelope(geom), 0.001) bbox_geom
FROM route_ways)
, ranked_matches AS (SELECT
e.*
, r.way_id
, r.highway
, r.maxspeed
, r.osm_tuple_length_m
, r.osm_way_length_m
, r.osm_tuple_direction
, ROW_NUMBER() OVER (
PARTITION BY e.machine_gid, e.event_timestamp
ORDER BY ST_Distance(e.geom, r.geom)
) AS rank
FROM events e
JOIN filtered_route_ways r
ON ST_Intersects(e.geom, r.bbox_geom))
select * from ranked_matches;
The 'events' source is exactly the same location in S3 that the local and EC2 versions are querying. the 'route_ways' dataframe is exactly the same, and I've confirmed both the EC2 and local queries return the correct results for the count from the 'events' CTE and 'filtered_route_ways' CTE.
UPDATE:
I replaced the ST_Intersects function with the following SQL and I'm getting consistent results between local and EC2 versions
WITH events AS (
SELECT *
, longitude
, latitude
, ST_Point(longitude, latitude) AS geom
, timezone('Pacific/Auckland', event_timestamp::TIMESTAMP WITH TIME ZONE) AS event_timestamp_nz
, CAST(event_timestamp AS DATE) AS dom
, strftime(date_trunc('hour', event_timestamp_nz), '%H:%M') AS timegroup
, extract(dow FROM event_timestamp_nz) AS dow
FROM read_parquet('MYLOCATION')
),
route_ways_bounds AS (
SELECT *
, geom
, ST_Buffer(ST_Envelope(geom), 0.001) AS bbox_geom
, ST_XMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lon
, ST_XMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lon
, ST_YMin(ST_Buffer(ST_Envelope(geom), 0.001)) AS min_lat
, ST_YMax(ST_Buffer(ST_Envelope(geom), 0.001)) AS max_lat
FROM route_ways
),
filtered_events AS (
SELECT e.*, r.way_id, r.highway, r.maxspeed, r.osm_tuple_length_m, r.osm_way_length_m, r.osm_tuple_direction, r.geom AS route_geom
FROM events e
JOIN route_ways_bounds r
ON e.longitude BETWEEN r.min_lon AND r.max_lon
AND e.latitude BETWEEN r.min_lat AND r.max_lat
),
ranked_matches AS (
SELECT
f.*,
ROW_NUMBER() OVER (
PARTITION BY f.machine_gid, f.event_timestamp
ORDER BY ST_Distance(f.geom, f.route_geom)
) AS rank
FROM filtered_events f
)
The text was updated successfully, but these errors were encountered:
If you post a non-R-specific issue to this repository, doesn't it take a long time to resolve?
I think it would be worthwhile for you to post the issue to the upstream https://github.com/duckdb/duckdb to see if the issue reproduces on non-R clients.
Description
I'm experiencing inconsistent query results when running the same spatial query with DuckDB on EC2. When executing the identical query multiple times in succession, I get different row counts:
First run: 35,817 records
Second run: 291,614 records
Third run: 9,555 records
The correct number of records - which I get from running locally with 1.1.3 is 1,632,012
This appears to be a non-deterministic behavior introduced in DuckDB 1.2.0.
Environment
Local: DuckDB 1.1.3 (gives consistent results)
EC2: DuckDB 1.2.0
R package: duckdb
Extensions: spatial, aws, httpfs, icu
I'm not able to installed 1.1.3 into the EC2, closest I get is v1.1.3-dev165 but then the spatial extensions aren't working
SQL I'm running
SETUP before running
The 'events' source is exactly the same location in S3 that the local and EC2 versions are querying. the 'route_ways' dataframe is exactly the same, and I've confirmed both the EC2 and local queries return the correct results for the count from the 'events' CTE and 'filtered_route_ways' CTE.
UPDATE:
I replaced the ST_Intersects function with the following SQL and I'm getting consistent results between local and EC2 versions
The text was updated successfully, but these errors were encountered: