Pre-aggregations containing nullable decimal measures fail in Athena Driver #9010

Open
Larissa-Rocha opened this issue Dec 2, 2024 · 5 comments
Assignees: igorlukanin
Labels: bug (Something isn't working), driver:athena (Issues related to the AWS Athena driver), pre-aggregations (Issues related to pre-aggregations)

Comments

Larissa-Rocha commented Dec 2, 2024

Describe the bug
I'm trying to pre-aggregate a measure that contains a decimal value while other measures can be NULL, and I'm getting the following error during pre-aggregation:

Error: Error during create table: CREATE TABLE prod_pre_aggregations.fct_my_table_name_main20241001_asdfghjkl ('list of fields and data types') WITH (input_format = 'csv_no_header', delimiter = '^A', build_range_end = '2024-10-31T23:59:59.999') LOCATION ?, ?: Create table failed: Can't parse '\N
4010' column value for 'fct_user_progress_contents__value_workload_total_completed' column: Internal: invalid float literal
    at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
    at WebSocket.emit (node:events:519:28)
    at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
    at Receiver.emit (node:events:519:28)
    at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
    at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
    at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
    at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
    at writeOrBuffer (node:internal/streams/writable:570:12)
    at _write (node:internal/streams/writable:499:10)
    at Receiver.Writable.write (node:internal/streams/writable:508:10)
    at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
    at TLSSocket.emit (node:events:519:28)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)

To Reproduce
Steps to reproduce the behavior:

  1. Create a cube YAML file with the following measures:
- name: value_workload_total
  sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
  type: sum

- name: value_workload_total_completed
  sql: 1000 * ROUND(value_workload_minutes/cast(value_workload_minutes_count as double), 3)
  type: sum
  filters:
    - sql: flag_enrollment_completed_pathway = TRUE

  2. Add both measures to a pre-aggregation (a sketch follows below)
  3. Run the pre-aggregation
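
For reference, a minimal sketch of the pre-aggregation used in step 2; the time dimension name and the monthly partitioning are assumptions, inferred from the partition suffix in the table name in the error above:

pre_aggregations:
  - name: main
    measures:
      - value_workload_total
      - value_workload_total_completed
    # assumed time dimension; the real model uses its own time column
    time_dimension: date_reference
    granularity: day
    partition_granularity: month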

Expected behavior
I expect the measures to pre-aggregate successfully.

Version:
1.0.5 (Cube Cloud)

Additional context
I found a similar issue for the Snowflake driver (#6693); I wonder how to solve it for Athena.

igorlukanin added the question, driver:athena, and pre-aggregations labels on Dec 4, 2024
igorlukanin self-assigned this on Dec 4, 2024
igorlukanin (Member) commented Dec 4, 2024

Hi @Larissa-Rocha 👋

I've tried to reproduce this and while I wasn't able to get exactly the same error, I acknowledge that this case could have been handled better.

The data model that I used:

cubes:
  - name: test
    sql: >
      SELECT
        NULL AS value, 
        NULL AS value_workload_minutes,
        NULL AS value_workload_minutes_count,
        TRUE AS flag_enrollment_completed_pathway

    dimensions:
      - name: value
        sql: value
        type: string

    measures:
      - name: value_workload_total
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE)/cast(value_workload_minutes_count as double), 3)
        type: sum

      - name: value_workload_total_completed
        sql: 1000 * ROUND(CAST(value_workload_minutes AS DOUBLE)/cast(value_workload_minutes_count as double), 3)
        type: sum
        filters:
          - sql: flag_enrollment_completed_pathway = TRUE

    pre_aggregations:
      - name: main
        dimensions:
          - value
        measures:
          - value_workload_total
          - value_workload_total_completed

Pre-aggregation does not build—here's the error:

Error: Error during create table: CREATE TABLE prod_pre_aggregations.test_main_srp5fcpr_isgxjz4g_1jl16fr (`test__value` unknown) LOCATION ?: Custom type 'unknown' is not supported
    at WebSocket.<anonymous> (/node_modules/@cubejs-backend/cubestore-driver/src/WebSocketConnection.ts:121:30)
    at WebSocket.emit (node:events:519:28)
    at Receiver.receiverOnMessage (/node_modules/ws/lib/websocket.js:1008:20)
    at Receiver.emit (node:events:519:28)
    at Receiver.dataMessage (/node_modules/ws/lib/receiver.js:502:14)
    at Receiver.getData (/node_modules/ws/lib/receiver.js:435:17)
    at Receiver.startLoop (/node_modules/ws/lib/receiver.js:143:22)
    at Receiver._write (/node_modules/ws/lib/receiver.js:78:10)
    at writeOrBuffer (node:internal/streams/writable:570:12)
    at _write (node:internal/streams/writable:499:10)
    at Receiver.Writable.write (node:internal/streams/writable:508:10)
    at TLSSocket.socketOnData (/node_modules/ws/lib/websocket.js:1102:35)
    at TLSSocket.emit (node:events:519:28)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at TLSSocket.Readable.push (node:internal/streams/readable:390:5)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:191:23)

My Cube version was v1.1.7.

That being said, I guess a possible workaround would be to safeguard against NULL values as much as possible: for example, wrap all potentially NULL values in COALESCE(..., 0) or something similar so that they evaluate to 0 instead of NULL. What to substitute NULLs with obviously depends on your domain; a sketch follows below.
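
For example, a sketch along these lines (untested; NULLIF guards against division by zero, the outer COALESCE turns a NULL result into 0, and the fallback value is only an assumption to adjust to your data):

      - name: value_workload_total
        sql: >
          COALESCE(
            1000 * ROUND(
              CAST(value_workload_minutes AS DOUBLE)
              / NULLIF(CAST(value_workload_minutes_count AS DOUBLE), 0),
              3),
            0)
        type: sum

The same wrapping would apply to value_workload_total_completed.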

igorlukanin added the bug label and removed the question label on Dec 4, 2024
Larissa-Rocha (Author) commented:
Hi @igorlukanin, after further investigation, I found that pre-aggregations on columns containing special characters like ", á, and â also fail with a "Can't parse column value" error. Do you have any tips on how to escape those values, or will this issue be addressed any time soon? I'm using Cube Cloud 1.0.5.

igorlukanin (Member) commented:
Hi @Larissa-Rocha 👋

I've quickly checked it with the following data model:

cubes:
  - name: test_2
    sql: >
      SELECT 'á' AS value UNION ALL
      SELECT 'â' AS value UNION ALL
      SELECT '"' AS value 

    dimensions:
      - name: value
        sql: value
        type: string

    pre_aggregations:
      - name: main
        dimensions:
          - value

And it worked as expected:
[Screenshot attachment: 2024-12-16 at 13:28:37]

So I guess something else must be causing that error. Could you please help narrow this down?

Larissa-Rocha (Author) commented:
Hi @igorlukanin!
We've identified an issue when using the schema below in Cube.js. A pre-aggregation configured to be partitioned by month runs into errors when Cube attempts to create the CSV file that stores the pre-aggregation data.
The problem occurs specifically when one of the columns contains more than two double quotes ("). This seems to cause an error during CSV creation with the csv-write-stream library, resulting in a malformed file. The generated CSV cannot be read correctly, which breaks queries that depend on the pre-aggregation.
Here’s the schema used for testing:

cubes:
  - name: fct_test
    public: false

    sql: |
      SELECT 
        title,
        value_sum,
        value_count,
        date_reference
      FROM (
        SELECT 
          '"Title" x "Subtitle"' AS title,
          97.8 AS value_sum,
          10 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT 
          '"Title" \ "Subtitle"' AS title,
          27.3 AS value_sum,
          20 AS value_count,
          CAST('2024-10-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT 
          '"Title 3"' AS title,
          30.9 AS value_sum,
          30 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT 
          '"Title\\4"' AS title,
          40.7 AS value_sum,
          40 AS value_count,
          CAST('2024-11-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT 
          '"Title"5"' AS title,
          58.1 AS value_sum,
          50 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
        UNION ALL
        SELECT 
          '"Title" 6' AS title,
          69.8 AS value_sum,
          60 AS value_count,
          CAST('2024-12-10' AS TIMESTAMP(3)) AS date_reference
      )
    
    dimensions:
      - name: title
        sql: title
        type: string
        
      - name: dateReference
        sql: date_reference
        type: time
        
    measures:
      - name: valueSum
        sql: value_sum
        type: sum
        
      - name: valueCount
        sql: value_count
        type: sum
        
      - name: valueAvg
        sql: ({valueSum} / {valueCount})
        type: number
        
    preAggregations:
      - name: main
        type: rollup
        
        dimensions:
          - title
        
        measures:
          - valueSum
          - valueCount          
          
        time_dimension: dateReference
        granularity: day
        partition_granularity: month

In the example above, titles like '"Title" x "Subtitle"' and '"Title" \ "Subtitle"' contain multiple double quotes, which appears to be causing the issue. When titles include only one or two double quotes, the CSV appears to be generated correctly.
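
A possible interim mitigation (just a sketch, untested, and it does change the stored values) could be to normalize the quotes in the dimension SQL until the underlying CSV escaping is fixed:

    dimensions:
      - name: title
        # Athena/Presto replace(): swap double quotes for single quotes before pre-aggregating
        sql: REPLACE(title, '"', '''')
        type: string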
Could you test this schema and verify if the problem occurs in your environment? It would be great to confirm whether the issue is directly related to csv-write-stream or if there’s something we can adjust in Cube.js to prevent these errors.
Thank you in advance for your attention and support!

Larissa-Rocha (Author) commented:

[Screenshot attachment]
