
Provide metrics for Hive tables and partitions #242

Open
pjeli opened this issue Jun 5, 2019 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@pjeli
Collaborator

pjeli commented Jun 5, 2019

Assuming a valid hive-site.xml, it will be possible to determine the active hive warehouse HDFS directory and HiveServer2 and Metastore URIs.

From there we should be able to perform a directory analysis on the hive warehouse parent directory and then all HDFS locations that represent tables / partitions.
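As a rough illustration of the first step, here is a minimal sketch of pulling the warehouse directory (and Metastore URI) out of a hive-site.xml. It assumes the standard Hadoop `<configuration><property>` layout; the sample values below are invented for demonstration, not taken from any real deployment.

```python
# Hypothetical sketch: read named properties from a hive-site.xml string.
# Assumes the standard Hadoop <configuration><property> XML layout.
import xml.etree.ElementTree as ET

def hive_property(xml_text, name):
    """Return the value of a named property from hive-site.xml, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Invented sample config for illustration only.
sample = """
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
"""

print(hive_property(sample, "hive.metastore.warehouse.dir"))  # /user/hive/warehouse
```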

@pjeli pjeli added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Jun 5, 2019
@kunalmulwani
Contributor

What are the different metrics?

@pjeli
Collaborator Author

pjeli commented Jun 6, 2019

Hey @kunalmulwani !

Apologies in advance, as Hive is one of the applications I still struggle with.

My current thinking is that it should be pretty easy to count the number of internal tables and sort them by simple total file counts and diskspace consumed per table, since each table is represented as just a directory. From there we can do age analysis and sort internal tables by last access / modification time.

That way folks can get an idea for which internal tables should be considered for possible HAR archival or deletion.
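A toy sketch of that ranking idea, using invented per-table directory stats (the paths, counts, and timestamps below are made up purely for illustration):

```python
# Illustrative sketch: rank internal tables by disk usage and by staleness
# to surface HAR-archival / deletion candidates. All numbers are invented.
tables = [  # (table_dir, file_count, bytes_consumed, last_mod_epoch_ms)
    ("/warehouse/db1.db/clicks",  120_000, 9 * 1024**4, 1546300800000),
    ("/warehouse/db1.db/users",     3_200, 2 * 1024**3, 1559692800000),
    ("/warehouse/db2.db/staging",  45_000, 1 * 1024**4, 1388534400000),
]

# Largest tables first; oldest (least recently modified) tables first.
by_size = sorted(tables, key=lambda t: t[2], reverse=True)
by_age = sorted(tables, key=lambda t: t[3])

print("biggest:", by_size[0][0])
print("stalest:", by_age[0][0])
```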

The above wasn't really possible before because of how long directory histograms used to take on large NNA instances. But ever since #224 was done I've felt comfortable with extending the analysis that NNA can perform into Hive, HBase, etc.

We can extend the same idea to external tables, but that will require some communication with the Metastore and the SQL database backing it, so I'd like to defer that part as it will be harder.

Make sense?

@kunalmulwani
Contributor

I understand what you're saying. I would like to work on this, but I might need help to achieve it.

@pjeli
Collaborator Author

pjeli commented Jun 17, 2019

Sure @kunalmulwani !

I think all that is required is:
(1) Parsing the hive-site.xml and determining where the warehouse directory is.
(2) Within the SuggestionsEngine, use the QueryEngine to get all directories directly underneath the warehouse directory, something like: http://SERVER:PORT/filter?set=dirs&filters=path:startsWith:/warehouse/dir/
(3) Each of these directories should be an internal hive database directory. Underneath that is each table for each database. So we should be able to get file count and diskspace consumed per DB and per table (again, for internal ones only).
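A rough sketch of steps (2) and (3): build the /filter query for directories under the warehouse, then group the returned paths into per-database and per-table buckets. The host name and the hard-coded directory list below are placeholders; a real run would fetch the list from a live NNA instance.

```python
# Sketch of the /filter query and the per-DB grouping. "nna-host" and the
# dirs list are invented placeholders, not a real NNA deployment.
from urllib.parse import quote
from collections import defaultdict

def filter_url(host, port, warehouse):
    """Build the NNA query for all directories under the warehouse dir."""
    return (f"http://{host}:{port}/filter?set=dirs"
            f"&filters=path:startsWith:{quote(warehouse, safe='/')}")

# A real call would fetch this from NNA; hard-coded here for illustration.
dirs = [
    "/warehouse/dir/db1.db",
    "/warehouse/dir/db1.db/table_a",
    "/warehouse/dir/db1.db/table_b",
    "/warehouse/dir/db2.db",
]

tables_per_db = defaultdict(list)
for d in dirs:
    parts = d.removeprefix("/warehouse/dir/").split("/")
    if len(parts) == 2:  # <db_dir>/<table_dir> => an internal table directory
        tables_per_db[parts[0]].append(parts[1])

print(filter_url("nna-host", 8080, "/warehouse/dir/"))
print(dict(tables_per_db))  # {'db1.db': ['table_a', 'table_b']}
```

From each table directory found this way, the file count and diskspace totals would come from further QueryEngine sums per path.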

@PoojaShekhar

We can also parse Hive Metastore logs to get the stats -- numFiles, uncompressed data size (rawDataSize), compressed data size (totalSize), numPartitions, and the table's HDFS location -- for each table, given that stats are collected every time an existing table is updated or a new table is created. The auto-gather setting needs to be turned on, or we can script stat collection later using the ANALYZE query.
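For illustration, the Metastore records these stats as string-valued table parameters. A minimal sketch of pulling them into usable numbers, with invented sample values:

```python
# Hedged sketch: the Metastore stores table stats (numFiles, rawDataSize,
# totalSize, numPartitions) as string key/values. Sample data is invented.
def extract_stats(params, location):
    """Convert the stat parameters to ints and attach the HDFS location."""
    keys = ("numFiles", "rawDataSize", "totalSize", "numPartitions")
    stats = {k: int(params[k]) for k in keys if k in params}
    stats["location"] = location
    return stats

sample_params = {"numFiles": "42", "rawDataSize": "1048576",
                 "totalSize": "262144", "numPartitions": "4"}
print(extract_stats(sample_params, "hdfs://nn/warehouse/db1.db/clicks"))
```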

@pjeli
Collaborator Author

pjeli commented Apr 6, 2022

I think it's finally time to revisit this. Sorry for getting back to you so late @PoojaShekhar. If you want to parse Hive MetaStore logs you can do so, but we should avoid doing it as part of NNA. NNA is, ideally, an isolated system, and we should minimize how many other services it has to talk to. In this case, all the metadata necessary to figure this out is already in the NameNode's memory, so I would rather exploit that here. By all means, if you wish to parse MetaStore logs, go ahead, but I would not like NNA to be the driver for that. The same applies on the HBase side.

Bit of a neat thing here: once the Hive stats are obtained, we can compare them against the rest of the cluster and report what percentage of the data is (managed) Hive tables, HBase tables, etc.
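The comparison itself is simple arithmetic once per-subsystem byte totals are known; a toy example with made-up numbers:

```python
# Toy arithmetic for the cluster-share comparison. All byte counts invented.
cluster_bytes = 100 * 1024**4       # total data in the cluster
hive_managed_bytes = 37 * 1024**4   # data under managed Hive tables

pct = 100.0 * hive_managed_bytes / cluster_bytes
print(f"{pct:.1f}% of cluster data is managed Hive tables")  # 37.0% ...
```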
