Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL Flock function #14

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

URL Flock function #14

wants to merge 5 commits into from

Conversation

lmangani
Copy link
Collaborator

@lmangani lmangani commented Oct 18, 2024

duckdb_writer_json-ezgif com-crop

url_flock

The url_flock function was inspired by an idea of @carlopi from DuckDB Labs 👋

When used in combination with the httpserver extension (or compatible endpoints such as ClickHouse HTTP using format JSONEachRow) the url_flock helper builds a multi-backend query with UNON ALL aggregation of results.

Limitations

Currently the function expects all backends to return the same columns (or no data).

Examples

D SELECT * FROM url_flock('SELECT ''hello'', version()', ['http://server1', 'http://server2']);
┌─────────┬─────────────┐
│ 'hello'"version"() │
│ varcharvarchar   │
├─────────┼─────────────┤
│ hello   │ v1.1.1      │
│ hello   │ v1.1.1      │
└─────────┴─────────────┘

The construction supports Token based authentication to the httpserver extension:

D CREATE SECRET extra_http_headers (
      TYPE HTTP,
      EXTRA_HTTP_HEADERS MAP{
          'X-API-Key': 'secret'
      }
  );

Known Issues

HTTP Auth

HEAD requests fail when Basic Auth user:password@host is included in the url,. Works with X-Tokens.

  • Debug Basic Authentication handler

@lmangani
Copy link
Collaborator Author

lmangani commented Oct 18, 2024

Online example mixing up DuckDB httpserver and ClickHouse API requests using JSONEachRow/NDJSON

--- Create a Backends table
CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);

--- Inserts some Backends
INSERT INTO backends VALUES 
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');

--- Run queries across all backends
SET variable __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends') );

@lmangani lmangani added test needed Needs testing feature New Feature labels Oct 20, 2024
@lmangani
Copy link
Collaborator Author

lmangani commented Oct 22, 2024

@carlopi we're almost there! do you perhaps know of a smarter way to achieve the following without passing via a SET?

--- Run queries across all backends
SET variable __backends = (SELECT ARRAY_AGG(url) AS urls_array FROM backends);
SELECT * FROM url_flock('SELECT ''hello'', version()', getvariable('__backends') );
 TableFunction f(
            "url_flock",
            {LogicalType::VARCHAR, LogicalType::LIST(LogicalType::VARCHAR)},
            DuckFlockImplementation,
            DuckFlockBind,
            nullptr,
            nullptr
        );

If we try to pass a ::VARCHAR[] as second parameter directly it complains the field can't be cast to the destination type LIST

@lmangani lmangani changed the title WIP: URL Flock function URL Flock function Oct 23, 2024
@carlopi
Copy link

carlopi commented Oct 23, 2024

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

One thing that I think could make sense, possibly optionally via parameter, is adding a column duck that for each rows gives the identifier of the endpoint that generated the row.
And possibly making so it's possible to query the local duckdb via the same interface (or an easy way to add it to the flock).

Other question I have is on which extensions this need to reside, I am not completely sure, possibly http_server could also make sense.
It's possible to have cross-extensions dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

@lmangani
Copy link
Collaborator Author

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

Thanks in advance! We originally tried making this with a regular MACRO but condition 1 above was preventing it....

One thing that I think could make sense, possibly optionally via parameter, is adding a column duck that for each rows gives the identifier of the endpoint that generated the row. And possibly making so it's possible to query the local duckdb via the same interface (or an easy way to add it to the flock).

That would be amazing but I cannot imagine how :) I also thought about passing a DB/schema as a setting to have the function do the host lookups directly but that brings a different set of challenges as well. Curious as of what you think.

Other question I have is on which extensions this need to reside, I am not completely sure, possibly http_server could also make sense. It's possible to have cross-extensions dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

This function was originally added to httpserver but i felt it would bloat an already complex role with an even more complex dependency so we decided to stash in the chsql extension as a superpowered url function just to get it on the field since by design this function should be compatible with the Clickhouse HTTP API using JSONEachRow as format. However, we're happy to revert direction with anything you would suggest to make sure its as logically viable as it can be.

Once again thanks in advance for helping us square this out and for the original inspiration :)

@carlopi
Copy link

carlopi commented Oct 23, 2024

I was thinking something like:

CREATE TABLE IF NOT EXISTS backends (who VARCHAR, url VARCHAR);

--- Inserts some Backends
INSERT INTO backends VALUES 
('httpserver', 'https://quackpy.fly.dev'),
('httpserver2', 'https://cowsdb.fly.dev');
SELECT * FROM backends, LATERAL (FROM url_flock('SELECT version()', backends.url));

could possibly work, making url_flock a table-in, table-out, that takes a single backend BUT it's table-in, table-out.
Unsure of the interface, but I think it makes sense to register backends (or deletes them) from a table, and using that as reference.

I will try to make some progress later, I though this was still interesting to share.

@lmangani
Copy link
Collaborator Author

lmangani commented Oct 23, 2024

Purely to provide further context, the following ClickHouse functions are in scope of this function emulation:

SELECT * FROM remote(127.0.0.1, view(SELECT a, b, c FROM table_name));
SELECT * FROM cluster(cluster_name, view(SELECT a, b, c FROM table_name));

Essentially querying a remote node or cluster or nodes as ours with parameters inverted - the above variations can be implemented as simple macro aliases on top of the final command we'll come up with. Also in context, the 'cluster_name' approach seems a good one to follow allowing users to group servers by cluster_name when selecting from a backends table.

@nicosuave
Copy link

It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction

@lmangani
Copy link
Collaborator Author

lmangani commented Nov 7, 2024

It's only tangentially related and I don't think the ergonomics in DuckDB afford this yet, but Arrow IPC over HTTP could be really interesting in conjunction

@nicosuave I think this might be possible using to_arrow_ipc and BLOB as type with the url() macro
EDIT: some other interesting example here

@lmangani
Copy link
Collaborator Author

I will take a look at the code, it's either hard (due to read_json not being table-in table-out) or simple.

I'm super curious about table-in-out as I do not believe I understand it yet and how to use it. Thanks in advance!

Other question I have is on which extensions this need to reside, I am not completely sure, possibly http_server could also make sense. It's possible to have cross-extensions dependencies (somewhat), say by doing LoadExtension('http_server') on the Load of chsql.

Ultimately you were right. we're going to have a flock function in http_server alongside the discovery API to fill it up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New Feature test needed Needs testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants