-
Notifications
You must be signed in to change notification settings - Fork 273
Support Databases that don't have md5 or fast md5 (MSSQL) #51
Comments
Hello! Yes, we originally intended to include MSSQL, and even wrote the code for it. But the md5 function ended up being way too slow (x100 slower than postgres iirc). We considered it unusable, so we decided to put it aside for now. If we find a solution to the performance issue, I'll be happy to add MSSQL officially. |
Hey @pnelson-de, thanks for writing in :) In extension of Erez' reply, something he and I have discussed is that we will need to support a variety of databases where However, it's not a super safe default as it's not entirely unlikely you'll end up with the same number for segments even if the data is wrong... However, there's already plenty of types in play that need to hash correctly to the same values across databases. 😅 Once the testing infrastructure is in place for that, we would look at databases that don't have proper support for checksumming. |
Guessing https://orderbyselectnull.com/2018/05/31/hashbytes-scalability/ is related / similar problem. Was looking to do an Oracle MSSQL check with this tool myself so I'll +1 this. I don't have a better idea for you all though at this moment. Took a quick peak at https://docs.microsoft.com/en-us/sql/t-sql/functions/checksum-agg-transact-sql?view=sql-server-2017 but I'm not confident |
@visch are you looking to test Oracle MSSQL to another Oracle MSSQL? |
@visch the problem with If it's not a checksum we can reproduce https://github.com/datafold/data-diff/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc and add support for negotiating a common algorithm, in the case of MSSQL, it'll be |
Oracle DB to Microsoft SQL |
@visch if you're up for implementing what I suggested above, let us know. You can join us in #tools-data-diff in the Locally Optimistic Slack.. If you have a supported database pair and wiling to test and provide feedback, we'd be eager to hear it :) Even just Oracle -> Oracle on the same table would be really useful for us. |
Looks like |
I don't have the time right now to look at this but in case someone wants to dive in https://gitlab.com/vischous/oracle2mssql/-/blob/master/script.sh is a nice podman (replace that with docker if you want) script to spin up oracle / mssql dbs. Note there's a few extra commands you have to run with Oracle here https://gitlab.com/vischous/oracle2mssql/-/blob/master/oracle.sql |
Just adding my vote for SQL Server support! Best of luck solving the technical issues. |
+1 for MS SQL Server support |
Please vote by "reacting" on the post. That makes it easier to aggregate. |
Proposal for support for SQLServerThe hash: We take each column, turn its normalized form into a hex string, and then take the 16 left-most digits, 16 right-most digits, and possibly another 16 digits from its middle. We convert each of them into an int64, and together with the length of the string, we dot-product them with a constant vector of prime numbers. So Cons:
Pros:
To summarize, I think it might be good enough for actual use, but it's hard to measure exactly how likely it is to break down. |
My team would be very interested in having SQL Server support. I don't have background in hashing functions, so can't really speak to the best approach there, but a colleague and I would be willing to spend some time working on this. @erezsh should we chat first? Or should we just give it a go? |
Hi @Elliot2718 , If you have any questions about the proposal I wrote, or how to approach implementing it, we can set up a chat and get you up to speed. Keep in mind that the implementation should be done by creating a new Dialect Mixin that is compatible with https://github.com/datafold/sqeleton/ . For example, here is the MD5 mixin implementation for postgres: https://github.com/datafold/sqeleton/blob/master/sqeleton/databases/postgresql.py#L29 And here is how it's included in data-diff: https://github.com/datafold/data-diff/blob/master/data_diff/databases/postgresql.py#L5 The implementation itself can reside in either sqeleton or data-diff, though I think for consistency it's better to put it in Sqeleton. |
This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days. |
Commenting to move this out of being stale. This is still a highly relevant feature that needs to be added. |
Yeah this should definitely remain open. I'm unsure if/when we as maintainers would have the bandwidth to work on this, but very open to community contributions in the meantime. |
Hi all! I'm sorry for the delay in following up on this. We made a hard decision to sunset the data-diff package and won't provide further development or support. If that's of interest, over the past few months, we have rewritten the diffing engine in Datafold Cloud and solved many issues that existed in this package. We also added full support for diffing MS SQL Server 2019+ with other databases. -Gleb |
I was browsing through the list of supported databases on your readme and noticed MSSQL is not listed although it can be found in databases.py. Just wondering if this was intentional and also what the status of that connector is at the moment.
Cheers
The text was updated successfully, but these errors were encountered: