This repository has been archived by the owner on Feb 21, 2022. It is now read-only.

Merging Zoo with Databook #194

Open
maroshmka opened this issue Oct 7, 2019 · 3 comments

Comments

@maroshmka
Contributor

Hello guys,

We have a project called Databook. Conceptually it is the same thing as the Zoo, except that it manages metadata about our internal data world: databases, ETLs, and reports.

The architecture should be pretty similar. Imagine we're building a graph/map of the data at kiwi.com: the nodes are filled in by crawlers (I believe you call them scanners), and the edges are then created by ETL in our case. So we have crawlers for Postgres, BigQuery, etc. that collect metadata about tables, schemas, and settings; then we take that metadata, visualise it, and allow some interaction on the web.

I was wondering whether we should keep developing a separate system for this or whether we could merge it with the Zoo. That way we would have one system that is able to map and interconnect the overall infrastructure inside the company. You would just provide credentials (GitLab, Postgres, Google...) for the crawlers and you would get a data lineage visualisation from a "source system" (e.g. booking) to a "revenue report", plus a lot of other features as well, e.g. data-quality reports, REST API best practices, etc.
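To make the graph idea concrete, here is a minimal sketch of how crawler-discovered nodes and ETL-created edges could be stored and walked for lineage. It is not Databook or Zoo code: `Node`, `DataGraph`, `add_edge`, `downstream`, and the table names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A data asset a crawler might discover (hypothetical shape)."""
    system: str  # e.g. "postgres", "bigquery", "report"
    name: str    # e.g. "booking.orders"


@dataclass
class DataGraph:
    nodes: set = field(default_factory=set)
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add_edge(self, source, target):
        # An edge records that some ETL job reads `source` and writes `target`.
        self.nodes.update((source, target))
        self.edges[source].add(target)

    def downstream(self, start):
        """Return every node reachable from `start`, breadth-first."""
        seen, queue = set(), deque([start])
        while queue:
            current = queue.popleft()
            for nxt in self.edges.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Assumed example data: a source table, a staging table, and a report.
graph = DataGraph()
orders = Node("postgres", "booking.orders")
staging = Node("bigquery", "staging.orders")
report = Node("report", "revenue_report")
graph.add_edge(orders, staging)
graph.add_edge(staging, report)

lineage = graph.downstream(orders)  # everything fed by booking.orders
```

Walking the edges in the opposite direction would answer the reverse question (which sources feed a given report); that would need a second, reversed edge map, which is left out here.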

We might share some internal code for the logic, maybe only the frontend part, maybe the deployment part, or, yeah, maybe nothing.

What is your opinion on this cooperation? Would it be viable?

@maroshmka changed the title from "Merging this with Databook" to "Merging Zoo with Databook" on Oct 7, 2019
@aexvir
Member

aexvir commented Oct 7, 2019

Hey @maroshmka, thanks for reaching out about this!

So... if I understand correctly, you want to build a visualization of how each of our databases is defined (schemas) and what the status of each one is (quality, etc.)? But that would be done manually as of right now.

What would be the interactions you are mentioning?
What's the main scope of this? Just to have a listing of all our databases and their data quality?
What about monitoring the flow of the data somehow? Does it make sense?

There are some features that we could definitely share. We actually thought about some checks that would ensure that the different representations of the same object are consistent across all our services. For now we wanted to focus on the API level, but it might be interesting to do it at the DB level as well.

I see some overlap with what we are developing, and it could be interesting to join efforts on this, but I'd like to see a more formal specification, with the specific requirements for what you would need to develop.

@maroshmka
Contributor Author

maroshmka commented Oct 8, 2019

OK, I'm gonna try to answer your questions and then sum it up somehow.

> if I understand correctly, you want to build a visualization of how each of our databases is defined (schemas) and what the status of each one is (quality, etc.)?

More or less, yes. We want to add convenient search, unify the model across database types (BigQuery, Postgres, Redis...), show data quality, and show the owner (we must discuss what ownership means). We also want to add data lineage, which should be one of the main points.

> What would be the interactions you are mentioning?

I meant that the system doesn't need to be a strictly read-only web app. For example, it can let you set up Slack notifications for data-quality drops, create data-quality checks (if you have the permissions, of course), edit descriptions for business/dev clarification (again, if you're supposed to), etc.
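As a rough illustration of the data-quality notification idea, the sketch below compares per-table scores against thresholds and builds the messages that would be sent. Everything in it is assumed for the example: `QualityCheck`, `find_drops`, the 0.0 to 1.0 score scale, and the table names are hypothetical, and actual delivery to Slack is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class QualityCheck:
    """A user-defined check (hypothetical shape)."""
    table: str
    threshold: float  # minimum acceptable score, assumed to be 0.0 to 1.0


def find_drops(checks, scores):
    """Return one message per table whose score is below its threshold."""
    alerts = []
    for check in checks:
        score = scores.get(check.table)
        # Tables without a score yet are skipped rather than reported.
        if score is not None and score < check.threshold:
            alerts.append(
                f"{check.table}: quality {score:.2f} below {check.threshold:.2f}"
            )
    return alerts


# Assumed example data.
checks = [
    QualityCheck("booking.orders", 0.95),
    QualityCheck("staging.orders", 0.90),
]
scores = {"booking.orders": 0.91, "staging.orders": 0.97}

alerts = find_drops(checks, scores)
```

Keeping the check separate from the delivery step means the same messages could go to Slack, email, or a web page without changing the check logic.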

> What's the main scope of this?

I'm not sure what you mean. The main use case? Or how big it is? In terms of scope, it looks like it will be a bigger project. As for the use cases, some of them could be:

  • An analyst wants to start working on a report/model regarding X. Where should they start? They check Databook as a starting point.
  • New to the company? Let's check what data we have that you can use.
  • A bizdev person or the corresponding analyst wonders why a report is showing data that doesn't make sense. They go to Databook and check how the data got into the report, who was responsible for which part, and what the status of each part is. Then they can escalate a fix.

> What about monitoring the flow of the data somehow? Does it make sense?

Yes, the edges in the graph would be created by ETL. Right now we have Airflow, but in the future we may use another tool to connect the dots. That means we will see the overall data flow.

As for a more formal specification, I can't give you one right now. This discussion should be exactly about that: does it make sense to start developing this as one project? If yes, let's gather a more formal specification. For now it's about the vision. Vasek Dorazil is currently the Product Manager on this one, so maybe he has a more formal specification than I do that he can share.

Example of such projects:

@aexvir
Member

aexvir commented Oct 10, 2019

Hmm, I see. I definitely think that having this together in The Zoo would open up many possibilities. Even if we didn't merge it, I'd at least like to integrate it with that service somehow. In the end, it's about resources that our services are consuming.

https://github.com/lyft/amundsen This one actually looks pretty nice. Can I ask what would prevent us from just using it or building on top of it?

> I meant that the system doesn't need to be a strictly read-only web app. For example, it can let you set up Slack notifications for data-quality drops, create data-quality checks (if you have the permissions, of course), edit descriptions for business/dev clarification (again, if you're supposed to), etc.

Actually, AppSec is working on sending the results of the checks to Slack. I'm not sure how your data-quality checks would be defined, but I think they could be built on top of our code checks, although it seems more like an SLO type of metric than something more complex.

Overall, I must say that I love the idea. I just want to make sure that building this on top of The Zoo will be useful for you and won't compromise anything for us. So far I don't see that happening, as this will most probably be built as a new package inside our Django app, but we'll definitely need to modify some of our core features to allow you to extend it easily.

Have you taken a look at our code? If not, please do 🙂

Btw, let's have a quick call about this next week, maybe with Vasek too, and agree on a proposal?
