Provide metrics for Hive tables and partitions #242
What are the different metrics?
Hey @kunalmulwani ! Apologies in advance, as Hive is one of the applications I still struggle with. My current thinking is that it should be fairly easy to count the number of internal tables and sort them by simple totals of file count and diskspace consumed per table, since each table is represented as just a directory. From there we can do age analysis and sort internal tables by last access / modification time. That way folks can get an idea of which internal tables should be considered for possible HAR archival or deletion.

The above wasn't really possible before because of how long directory histograms used to take on large NNA instances, but ever since #224 was done I've felt comfortable extending the analysis that NNA can perform into Hive, HBase, etc.

We can extend the same idea to external tables, but that will require some communication with the Metastore and the SQL database backing it, so I'd like to focus on that part later as it will be harder. Make sense?
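The ranking described above could be sketched roughly as follows. The `dir_stats` shape (path mapped to file count, bytes, and last access time) is a hypothetical stand-in for whatever NNA's directory histograms return, not an actual NNA API:

```python
# Hypothetical sketch: rank internal-table directories by footprint and age.
# Assumes per-directory aggregates of (num_files, bytes, last_access_ms)
# are already available from a directory analysis.
from typing import Dict, List, Tuple

def rank_tables(dir_stats: Dict[str, Tuple[int, int, int]],
                now_ms: int) -> List[dict]:
    """Return table directories sorted by diskspace consumed, with age info."""
    rows = []
    for path, (num_files, num_bytes, last_access_ms) in dir_stats.items():
        rows.append({
            "path": path,
            "numFiles": num_files,
            "diskspaceConsumed": num_bytes,
            "daysSinceAccess": (now_ms - last_access_ms) // 86_400_000,
        })
    # Largest consumers first; large, stale tables are the archival candidates.
    return sorted(rows, key=lambda r: r["diskspaceConsumed"], reverse=True)

# Made-up example data for two warehouse directories.
stats = {
    "/user/hive/warehouse/events": (120_000, 9_000_000_000_000, 1_500_000_000_000),
    "/user/hive/warehouse/lookup": (12, 4_000_000, 1_700_000_000_000),
}
ranked = rank_tables(stats, now_ms=1_700_086_400_000)
```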
I understand what you're saying. I would like to work on this, but I might need help to achieve it.
Sure @kunalmulwani ! I think all that is required is:
We can also parse Hive Metastore logs to get the stats (numFiles, uncompressed data size (rawDataSize), compressed data size (totalSize), numPartitions, and the table's HDFS location) for each table, given that stats are collected every time an existing table is updated or a new table is created. The hive.stats.autogather setting needs to be turned on; alternatively, we can script collecting the stats later using an ANALYZE query.
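Once collected (via autogather or ANALYZE), these stats live as string-valued table parameters in the Metastore. A minimal sketch of turning such a parameter map into a summary is below; the `summarize_table` helper and the sample values are illustrative, not NNA code, though `numFiles`, `rawDataSize`, and `totalSize` are the standard Hive stat keys:

```python
# Hedged sketch: summarize a Hive table's Metastore stat parameters.
# The parameter names are the standard Hive ones; the helper is hypothetical.
def summarize_table(params: dict) -> dict:
    def as_int(key: str) -> int:
        return int(params.get(key, 0))

    total, raw = as_int("totalSize"), as_int("rawDataSize")
    return {
        "numFiles": as_int("numFiles"),
        "compressedBytes": total,
        "uncompressedBytes": raw,
        # Rough compression ratio; None when no size was recorded.
        "compressionRatio": round(raw / total, 2) if total else None,
    }

# Made-up parameter map as the Metastore might store it.
summary = summarize_table(
    {"numFiles": "42", "totalSize": "1048576", "rawDataSize": "4194304"}
)
```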
I think it's finally time to revisit this. Sorry for getting back to you so late @PoojaShekhar. If you want to parse Hive MetaStore logs you can do so, but we should avoid doing it as part of NNA. NNA is, ideally, an isolated system, and we should avoid talking to more external systems than we need to. In this case, we have all the metadata necessary to figure everything out within the NameNode's memory, so I would rather exploit that here. By all means, if you wish to parse MetaStore logs, go ahead, but I would not like NNA to be the driver for that. The same goes for the HBase side.

A neat side effect: once the Hive stats are obtained, we can compare against the rest of the cluster and say what % of data is (managed) Hive tables, HBase tables, etc.
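The "% of cluster data" comparison is simple arithmetic once per-category byte totals exist. A toy sketch, with entirely made-up numbers and category names:

```python
# Toy arithmetic for the cluster-share idea: given total bytes used on the
# cluster and bytes under the (managed) Hive warehouse, HBase root, etc.,
# report each category's percentage share. All values here are invented.
def share_of_cluster(category_bytes: dict, cluster_bytes: int) -> dict:
    shares = {
        name: round(100 * used / cluster_bytes, 1)
        for name, used in category_bytes.items()
    }
    # Whatever is left over is uncategorized data.
    shares["other"] = round(100 - sum(shares.values()), 1)
    return shares

breakdown = share_of_cluster(
    {"hive_managed": 600, "hbase": 250}, cluster_bytes=1000
)
```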
Assuming a valid hive-site.xml, it should be possible to determine the active Hive warehouse HDFS directory and the HiveServer2 and Metastore URIs.
From there we should be able to perform a directory analysis on the Hive warehouse parent directory and then on all HDFS locations that represent tables / partitions.
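Extracting those values from hive-site.xml is straightforward, since it is plain Hadoop-style configuration XML. A self-contained sketch, where the property names (`hive.metastore.warehouse.dir`, `hive.metastore.uris`) are the standard Hive ones but the sample file content is made up:

```python
# Sketch: pull the warehouse directory and Metastore URIs out of hive-site.xml.
import xml.etree.ElementTree as ET

# Made-up sample hive-site.xml content; in practice, read the real file.
HIVE_SITE = """<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>"""

def hive_conf(xml_text: str) -> dict:
    """Return a name -> value dict of all <property> entries."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}

conf = hive_conf(HIVE_SITE)
warehouse_dir = conf["hive.metastore.warehouse.dir"]
```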