bug(bq,sf,rs|clustering):ST_CLUSTERKMEANS remove duplicated coords #491

Merged

vdelacruzb
Contributor

Description

Shortcut

Some tables containing duplicate points might produce fewer clusters than expected. To fix this, we remove duplicates before processing.
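As a rough illustration of the deduplication step described above, here is a minimal Python sketch. It is not the code changed in this PR: the helper name `dedupe_coords` and the choice to treat points as exact `(lon, lat)` tuples are assumptions made for the example. The sample points are the ones used in the Snowflake and Redshift acceptance tests below.

```python
from typing import Iterable, List, Tuple

Coord = Tuple[float, float]  # (longitude, latitude); assumed representation


def dedupe_coords(coords: Iterable[Coord]) -> List[Coord]:
    """Drop exact duplicate coordinates, keeping first-seen order.

    Hypothetical helper: only exact matches are treated as duplicates,
    so near-identical points are kept.
    """
    seen = set()
    unique: List[Coord] = []
    for lon, lat in coords:
        key = (lon, lat)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Same six points as the acceptance tables: two, three and one copies.
points = [
    (-72.3539, -37.47262),
    (-72.3539, -37.47262),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.61442, -35.39392),
    (-71.33815, -29.9541),
]

unique_points = dedupe_coords(points)
print(len(points), len(unique_points))  # 6 3
```

Deduplicating first means the clustering step only sees distinct locations, which matches the acceptance results below where six input rows yield three output rows.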

Type of change

  • Fix

Acceptance

### BigQuery

with a as (
SELECT `carto-un`.carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 40) arr
from `cartodb-data-engineering-team`.vdelacruz_carto.clustering_table_copy
)
select count(distinct(element.cluster)) from a, UNNEST(arr) element;
-- 36

with a as (
SELECT `cartodb-data-engineering-team`.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(geom ORDER BY ST_ASBINARY(geom)), 40) arr
from `cartodb-data-engineering-team`.vdelacruz_carto.clustering_table_copy
)
select count(distinct(element.cluster)) from a, UNNEST(arr) element;
-- 40

### Snowflake

CREATE table  CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds AS
SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)') as geom
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-72.3539 -37.47262)')
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOGFROMTEXT('POINT(-71.33815 -29.9541)');

with a as (
SELECT CARTO_DEV_DATA.carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING), 6) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select count(*) FROM a, LATERAL FLATTEN(input=>arr);
-- 6 rows

with a as (
SELECT CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.ST_CLUSTERKMEANS(ARRAY_AGG(ST_ASGEOJSON(geom)::STRING), 6) arr
from CARTO_DATA_ENGINEERING_TEAM.vdelacruz_carto.clustering_table_duplicateds
)
select count(*) FROM a, LATERAL FLATTEN(input=>arr);
-- 3 rows. Looked at the results and it's keeping the distinct points

### Redshift

CREATE table  vdelacruz_carto.clustering_table_duplicateds AS
SELECT ST_GEOMFROMTEXT('POINT(-72.3539 -37.47262)') as geom
UNION ALL SELECT ST_GEOMFROMTEXT('POINT(-72.3539 -37.47262)')
UNION ALL SELECT ST_GEOMFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOMFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOMFROMTEXT('POINT(-71.61442 -35.39392)')
UNION ALL SELECT ST_GEOMFROMTEXT('POINT(-71.33815 -29.9541)');

SELECT get_array_length( carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))', 9)));
-- 6

SELECT get_array_length( vdelacruz_carto.ST_CLUSTERKMEANS(ST_GEOMFROMTEXT('MULTIPOINT ((-72.3539 -37.47262), (-72.3539 -37.47262), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.61442 -35.39392), (-71.33815 -29.9541))', 6)));
-- 3. Looked at the results and it is producing distinct values

@vdelacruzb vdelacruzb requested a review from Jesus89 April 3, 2024 17:10
@vdelacruzb vdelacruzb merged commit 8e0a749 into main Apr 4, 2024
17 checks passed
@vdelacruzb vdelacruzb deleted the bug/sc-397702/internal-st-cluster-k-means-component-result branch April 4, 2024 07:43
@vdelacruzb vdelacruzb mentioned this pull request Apr 18, 2024