-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inconsistent query results with left joins #15
Comments
@yanmendes Update, I've successfully reproduced this on my own using a couple of test tables. It's basically stripping the LEFT JOIN syntax for my and turning it into an INNER JOIN using the syntax without the JOIN keywords. My test query:
Bigdawg is doing the following query:
I have to deep dive into the SQL parser to find out where this loss of fidelity is going on. |
I've made some modifications and I now have a version that now successfully handle's LEFT JOIN clauses. Remaining issue(s):
The overarching issue is that Bigdawg uses the Postgres planner to determine its SQL plan, which is in part a saving grace of not having to implement such a thing itself (caveats being it's done against the Schema tables which don't have rows, and hence no statistics so the plan will necessarily be optimized for an empty table case), however once the plan is generated internally to Bigdawg, it has to now (re)assemble a query from the plan. While we know it's an Anti Join, it would be useful in order to reverse generate the original query to know that it's a left join (which is not known from the EXPLAIN statement). Hence some further logic will probably need to be made to interpret the Anti Join. This is the explain run: EXPLAIN (VERBOSE ON, COSTS OFF, FORMAT XML) SELECT a.id, a.name, b.id, b.test_id, b.description FROM mm.test AS a LEFT JOIN mm.test2 AS b ON a.id = b.test_id where b.test_id is null; This is the query plan returned (notice Join-Type):
As an alternative a query plan generated from a pure Left join looks like this (notice the Join-Type node):
I think I can take an Anti join and possibly interpret it as a left join adding the appropriate where condition, but I'll have to do a little testing. The flip side is I'm not sure an Anti join works either with the current code base, actually I'm certain it doesn't. |
… and acting accordingly
Okay so I've punted on the Anti Join issue, and started taking a look at what happens when a distributed query is processed (across two different servers). Unfortunately it too drops the "LEFT" keyword off the ensuing join. I've done some limited debugging using log statements and it's deep in the query traversing and 'optimizing' section. It's going to be very useful to get a debugger running on this one, however it's tricky since it's a distributed join requiring a migration from the secondary server to be done first before the query is executed and running things locally everything shows up under a single IP (meaning I can't effectively do migrations from one server to another). Although I can get Bigdawg running locally fine, the problem I have is getting an independent host (or maybe 2) setup for the migration test. Since I'm doing this work on Windows I can't assign the docker containers IP addresses (unsupported), so I've resorted to creating at least one separate VM to run the other server on independently. I'm part way through the process now, but I've only been able to devote bits of time here and there to it. I could make Bigdawg more docker friendly by allowing the other copies of Bigdawg to have different ports, and then exposing those ports on docker, and storing them in the Catalog, but this is probably a bigger project than it's worth right now and only helps in this narrow debugging case for the moment (although perhaps a useful feature later). Alternatively I may reinstall linux on this machine which I've had on here in the past, which has helped, and then be able to debug a lot easier if I can assign the containers independent IPs. Anyway hopefully with in the next few days or so I'll have an update. |
Ok, I got my test environment fully setup tonight, finally, and was able to debug then fix at least part of the issue with distributed joins - the LEFT and RIGHT keywords should not be dropped anymore for simple distributed joins. I have to do more testing with complex joins as there appears to be additional edge cases in the code when joins are permuted where the type of join is ignored. |
…s are likewise flipped
…n child nodes are likewise flipped, fix for Anti joins.
…upport for Semi Joins
Ok, I've addressed now Semi and Anti joins. There are certainly edge cases that probably don't work right given how the code paths are, but the basics should be good. |
I have added my own data to BigDAWG, however, it's not processing
left joins
correctly. As a matter of fact, it's outputting the same result as aninner join
would. Evidence and how to reproduce below:Issuing query via BigDAWG's interface:
curl -X POST -d \ "bdrel(SELECT e.id, w.name, e.name FROM entity as e LEFT JOIN workflow as w ON w.id = e.id)" http://localhost:8080/bigdawg/query id name name 1 04-HelloWorld
Querying data directly (
LEFT JOIN
)Querying the data directly (
INNER JOIN
)If you need a reproducible environment, just follow the installation steps here
The text was updated successfully, but these errors were encountered: