Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Streaming results: error appears as a result row #378

Open
dcheckoway opened this issue Jan 16, 2025 · 2 comments
Open

Streaming results: error appears as a result row #378

dcheckoway opened this issue Jan 16, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@dcheckoway
Copy link

dcheckoway commented Jan 16, 2025

Describe the bug

When using .stream() to stream rows as JSON, if an error occurs mid-query, the "exception" appears as a row in the results. This effectively means you need to scan the results for a row containing only "exception". This is costly (not much, but every bit counts) and silly.

If you have a scenario where the query legitimately returns a column named exception, then...well you get the point.

Initially I used StreamReadable.on('error', ...) expecting that listener to be invoked, but it wasn't. On one hand, I guess it makes sense since it wasn't a stream/read related error, but the fact that errors aren't surfaced in a first-class way is a design limitation.

Steps to reproduce

Run a query that will produce an error mid-way through, i.e.

SELECT 
    number, 
    if(number > 420000, throwIf(1, 'BOOM!'), number) AS value
FROM 
    numbers(1000000)

Stream that via ClickHouseClient.query().stream(). You'll see the row with "exception" appears among the results.

Expected behaviour

Errors should be surfaced in a more first-class way. Proposals:

  • Ideally each data chunk of results should have an envelope indicating whether an error occurred. That way you would only need to check for an error once per chunk, rather than once per row. Think millions+ of rows, think compression enabled -- it's only a handful of chunks. This would be a difference maker -- and much more elegant.
  • Consider having the SDK intercept the error (in a performant way, one way or another) and surface it via the StreamReadable.on('error', ...) listener.
  • Dirt simple: instead of "text" you could distinguish it as "error". This would still require checking each row, far from ideal, but a step in the right direction.

Thank you!!

@dcheckoway dcheckoway added the bug Something isn't working label Jan 16, 2025
@slvrtrn
Copy link
Contributor

slvrtrn commented Jan 16, 2025

I think this is the same issue as this one: #332. The complication there is that a mid-stream error appears as part of the results, and it is also format-dependant.

Unfortunately, the HTTP interface fixes in 24.11+ did not improve the situation in these scenarios.

We are currently discussing internally what could be the best approach there, as there are a few major complications:

  • The client needs to somehow parse the error from the response body, as due to HTTP protocol limitations, the exception can only go there if the error occurs after the response headers are sent, as it is not possible to send additional headers during the streaming. The way the error is printed by CH in the body stream is format dependent, e.g., JSONEachRow will have an object with an exception field, JSONCompactEachRow = just an exception as a single element in the array, etc., etc. The most complicated part is streaming into files with formats like Parquet.
  • There are awkward corner cases like this (if someone has a field named exception):
    curl "http://localhost:8123" --data-binary "select throwIf(number = 5, 'There was an error in the stream!') AS _, randomPrintableASCII(20) AS exception from system.numbers limit 30000 SETTINGS max_block_size=1 FORMAT JSONEachRow"
    
    {"_":0,"exception":"&vn]M\"\/B]c=Z1FFF\\u,Z"}
    {"_":0,"exception":"\"y}oIU\/ ~Q47B3hg>928"}
    {"exception": "Code: 395. DB::Exception: There was an error in the stream!: while executing 'FUNCTION throwIf(equals(__table1.number, 5_UInt8) :: 3, 'There was an error in the stream!'_String :: 2) -> throwIf(equals(__table1.number, 5_UInt8), 'There was an error in the stream!'_String) UInt8 : 0'. (FUNCTION_THROW_IF_VALUE_IS_NON_ZERO) (version 24.12.2.29 (official build))"}
    

Currently, the only reliable way to avoid this is to use the wait_end_of_query=1 setting, which causes the response to be buffered in memory or temp files on the server side before the streaming starts; in this case, the error will be thrown at the very beginning.

import { createClient } from '@clickhouse/client'

void (async () => {
  const client = createClient()
  try {
    let rowsCount = 0
    const rs = await client.query({
      query: `
        SELECT
            number,
            if(number > 420000, throwIf(1, 'Intentional Error Triggered'), number) AS value
        FROM
            numbers(1000000)
      `,
      format: 'JSONEachRow',
      clickhouse_settings: {
        max_block_size: '1000',
        wait_end_of_query: 1,
      },
    }) // <- the request will fail here due to `wait_end_of_query`
    const stream = rs.stream()
    for await (const rows of stream) {
      rowsCount += rows.length
    }
    console.log(`Received ${rowsCount} rows`)
  } catch (e) {
    console.error(e)
  }
})()

which will now print the following, as the server will immediately respond with 500 ISE instead of 200 OK:

ClickHouseError: Intentional Error Triggered: while executing 'FUNCTION if(greater(__table1.number, 420000_UInt32) :: 3, throwIf(1_UInt8, 'Intentional Error Triggered'_String) :: 0, __table1.number :: 4) -> if(greater(__table1.number, 420000_UInt32), throwIf(1_UInt8, 'Intentional Error Triggered'_String), __table1.number) UInt64 : 2'. 
    at parseError (/home/serge/work/clickhouse-js/examples/node_modules/packages/client-common/src/error/parse_error.ts:30:12)
    at ClientRequest.onResponse (/home/serge/work/clickhouse-js/examples/node_modules/packages/client-node/src/connection/node_base_connection.ts:474:28)
    at processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: '395',
  type: 'FUNCTION_THROW_IF_VALUE_IS_NON_ZERO'
}

However, this might not be an option if the result sets are too large.

@dcheckoway
Copy link
Author

Thanks for replying @slvrtrn . We don't really need the error to be thrown at the beginning of the query. It might process and return 1B rows before encountering the row with the error. We'd just like the error to be surfaced if and when it happens.

We have huge result sets, and I can't imagine buffering is practical -- not to mention, we really do want results to start flowing as soon as possible, since "time to first byte" is critical for our use case.

I completely understand the factors you mentioned. This isn't trivial. I'm glad you & others are already considering it. Thank you!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants