
Provide metrics for Hive tables and partitions #242

Open
pjeli opened this issue Jun 5, 2019 · 6 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@pjeli
Collaborator

pjeli commented Jun 5, 2019

Assuming a valid hive-site.xml, it will be possible to determine the active hive warehouse HDFS directory and HiveServer2 and Metastore URIs.

From there we should be able to perform a directory analysis on the hive warehouse parent directory and then all HDFS locations that represent tables / partitions.
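As a rough illustration of the first step, here is a minimal sketch of pulling the warehouse directory (and Metastore URI) out of a hive-site.xml. It assumes the standard Hadoop `<configuration><property>` layout; the sample values below are invented for demonstration, not taken from any real deployment.

```python
# Hypothetical sketch: read named properties from a hive-site.xml string.
# Assumes the standard Hadoop <configuration><property> XML layout.
import xml.etree.ElementTree as ET

def hive_property(xml_text, name):
    """Return the value of a named property from hive-site.xml, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

# Invented sample config for illustration only.
sample = """
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
</configuration>
"""

print(hive_property(sample, "hive.metastore.warehouse.dir"))  # /user/hive/warehouse
```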

@pjeli pjeli added enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed labels Jun 5, 2019
@kunalmulwani
Contributor

What are the different metrics?

@pjeli
Collaborator Author

pjeli commented Jun 6, 2019

Hey @kunalmulwani !

Apologies in advance, as Hive is one of the applications I still struggle with.

My current thinking is that it should be pretty easy to count the number of internal tables and sort them by simple total file counts and diskspace consumed per table, since each table is represented as just a directory. From there we can do age analysis and sort internal tables by last access / modification time.

That way folks can get an idea for which internal tables should be considered for possible HAR archival or deletion.
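A toy sketch of that ranking idea, using invented per-table directory stats (the paths, counts, and timestamps below are made up purely for illustration):

```python
# Illustrative sketch: rank internal tables by disk usage and by staleness
# to surface HAR-archival / deletion candidates. All numbers are invented.
tables = [  # (table_dir, file_count, bytes_consumed, last_mod_epoch_ms)
    ("/warehouse/db1.db/clicks",  120_000, 9 * 1024**4, 1546300800000),
    ("/warehouse/db1.db/users",     3_200, 2 * 1024**3, 1559692800000),
    ("/warehouse/db2.db/staging",  45_000, 1 * 1024**4, 1388534400000),
]

# Largest tables first; oldest (least recently modified) tables first.
by_size = sorted(tables, key=lambda t: t[2], reverse=True)
by_age = sorted(tables, key=lambda t: t[3])

print("biggest:", by_size[0][0])
print("stalest:", by_age[0][0])
```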

The above wasn't really possible before because of how long directory histograms used to take on large NNA instances. But ever since #224 was done I've felt comfortable with extending the analysis that NNA can perform into Hive, HBase, etc.

We can extend the same idea to external tables, but that will require some communication with the Metastore and the SQL database backing it, so I'd like to defer that part as it will be harder.

Make sense?

@kunalmulwani
Contributor

I understand what you're saying. I would like to work on this, but I might need help to achieve it.

@pjeli
Collaborator Author

pjeli commented Jun 17, 2019

Sure @kunalmulwani !

I think all that is required is:
(1) Parsing the hive-site.xml and determining where the warehouse directory is.
(2) Within the SuggestionsEngine, use the QueryEngine to get all directories directly underneath the warehouse directory, something like: http://SERVER:PORT/filter?set=dirs&filters=path:startsWith:/warehouse/dir/
(3) Each of these directories should be an internal hive database directory. Underneath that is each table for each database. So we should be able to get file count and diskspace consumed per DB and per table (again, for internal ones only).
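A rough sketch of steps (2) and (3): build the /filter query for directories under the warehouse, then group the returned paths into per-database and per-table buckets. The host name and the hard-coded directory list below are placeholders; a real run would fetch the list from a live NNA instance.

```python
# Sketch of the /filter query and the per-DB grouping. "nna-host" and the
# dirs list are invented placeholders, not a real NNA deployment.
from urllib.parse import quote
from collections import defaultdict

def filter_url(host, port, warehouse):
    """Build the NNA query for all directories under the warehouse dir."""
    return (f"http://{host}:{port}/filter?set=dirs"
            f"&filters=path:startsWith:{quote(warehouse, safe='/')}")

# A real call would fetch this from NNA; hard-coded here for illustration.
dirs = [
    "/warehouse/dir/db1.db",
    "/warehouse/dir/db1.db/table_a",
    "/warehouse/dir/db1.db/table_b",
    "/warehouse/dir/db2.db",
]

tables_per_db = defaultdict(list)
for d in dirs:
    parts = d.removeprefix("/warehouse/dir/").split("/")
    if len(parts) == 2:  # <db_dir>/<table_dir> => an internal table directory
        tables_per_db[parts[0]].append(parts[1])

print(filter_url("nna-host", 8080, "/warehouse/dir/"))
print(dict(tables_per_db))  # {'db1.db': ['table_a', 'table_b']}
```

From each table directory found this way, the file count and diskspace totals would come from further QueryEngine sums per path.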

@PoojaShekhar

We can also parse Hive Metastore logs to get the stats -- numFiles, uncompressed data size (rawDataSize), compressed data size (totalSize), numPartitions, and the table's HDFS location -- for each table, given that stats are collected every time an existing table is updated or a new table is created. The auto-gather setting needs to be turned on, or we can script stat collection later using the ANALYZE query.
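For illustration, the Metastore records these stats as string-valued table parameters. A minimal sketch of pulling them into usable numbers, with invented sample values:

```python
# Hedged sketch: the Metastore stores table stats (numFiles, rawDataSize,
# totalSize, numPartitions) as string key/values. Sample data is invented.
def extract_stats(params, location):
    """Convert the stat parameters to ints and attach the HDFS location."""
    keys = ("numFiles", "rawDataSize", "totalSize", "numPartitions")
    stats = {k: int(params[k]) for k in keys if k in params}
    stats["location"] = location
    return stats

sample_params = {"numFiles": "42", "rawDataSize": "1048576",
                 "totalSize": "262144", "numPartitions": "4"}
print(extract_stats(sample_params, "hdfs://nn/warehouse/db1.db/clicks"))
```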

@pjeli
Collaborator Author

pjeli commented Apr 6, 2022

I think it's finally time to revisit this. Sorry for getting back to you so late @PoojaShekhar. If you want to parse Hive MetaStore logs you can do so, but we should avoid doing it as part of NNA. NNA is, ideally, an isolated system, and we should minimize how many other services it has to talk to. In this case, all the metadata necessary to figure this out is already in the NameNode's memory, so I would rather exploit that here. By all means, if you wish to parse MetaStore logs, go ahead, but I would not like NNA to be the driver for that. The same applies on the HBase side.

Bit of a neat thing here: once the Hive stats are obtained, we can compare them against the rest of the cluster and report what percentage of the data is (managed) Hive tables, HBase tables, etc.
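The comparison itself is simple arithmetic once per-subsystem byte totals are known; a toy example with made-up numbers:

```python
# Toy arithmetic for the cluster-share comparison. All byte counts invented.
cluster_bytes = 100 * 1024**4       # total data in the cluster
hive_managed_bytes = 37 * 1024**4   # data under managed Hive tables

pct = 100.0 * hive_managed_bytes / cluster_bytes
print(f"{pct:.1f}% of cluster data is managed Hive tables")  # 37.0% ...
```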
