Taking inspiration from Edward Capriolo, this project shows how to perform the rank() and dense_rank() functions within Hive, taking into account tied ranks. Also available is an implementation of first_value() and a UDTF of parse_key_val_tuple().
This project uses Maven. To build the software, simply use "mvn package".
To make the jar available on a temporary basis:
hive> add jar /home/paul/hive-udf-0.1-SNAPSHOT.jar;
To make the function available on a temporary basis:
hive> CREATE TEMPORARY FUNCTION rank AS 'com.cloudera.hive.examples.Rank';
The rank(), dense_rank() and first_value() functions take a minimum of one parameter - the column that is to be ranked / retrieved. Since the functions have no ability to sort data for themselves, we use a sub-query to appropriately distribute and sort data before passing to rank().
For example, imagine we have a table such as the following:
select * from items;
item | category | price |
---|---|---|
Orange | Fruit | 0.30 |
Apple | Fruit | 0.25 |
Banana | Fruit | 0.75 |
Carrot | Veg | 0.20 |
Sprout | Veg | 1.75 |
Kiwi | Fruit | 0.30 |
To use the rank function, prepare the data with an inner query:
select item, category, price from items distribute by category sort by category, price;
Then wrap this in another query that applies the rank function:
select item, category, price, rank(price, category) from ( select item, category, price from items distribute by category sort by category, price) inner;
item | category | price | rank |
---|---|---|---|
Apple | Fruit | 0.25 | 1 |
Orange | Fruit | 0.30 | 2 |
Kiwi | Fruit | 0.30 | 2 |
Banana | Fruit | 0.75 | 4 |
Carrot | Veg | 0.20 | 1 |
Sprout | Veg | 1.75 | 2 |
Notice that rank() takes the price column as the first parameter, the rest of the parameters are used to determine the row groupings.
The parse_key_val_tuple() function is a UDTF that takes a minimum of 4 parameters. The input string to be parsed, the delimiter between all the fields, the separator between key and value pairs, and 1 to many Keys that you would like to extract. Note that the parameters are case sensitive.
See the following links for Hive UDTF usage instructions
Below are two sample queries that show the functions basic capabilities and perhaps illustrate the outcome of some edge cases:
Demo Setup
A one row table called 'dual' is used for selecting demo values statically for demo purposes only.
echo "X" > dummy.txt
hive
CREATE TABLE dual(dummy STRING);
LOAD DATA LOCAL INPATH 'dummy.txt' OVERWRITE INTO TABLE dual;
ADD JAR /full/path/to/local/jar/hive-udf-0.1-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION parse_key_val_tuple AS 'com.cloudera.hive.udf.functions.ParseKeyValueTuple';
Static Parameter Sample:
SELECT data.label, t.*
-- Static data from 'dummy' table for demo
FROM(
-- String to show many edge cases
SELECT 'edge' AS label, 'foo=bar&extra=extra=separator&&empty=&bad&unused=string' AS text FROM dual LIMIT 1
UNION ALL
-- Good string
SELECT 'good' AS label, 'foo=bar&extra=not-extra&&empty=not-empty&bad=not-bad' AS text FROM dual LIMIT 1
UNION ALL
-- Empty string
SELECT 'empty' AS label, '' AS text FROM dual LIMIT 1
UNION ALL
-- NULL string
SELECT 'null' AS label, NULL AS text FROM dual LIMIT 1
) data
LATERAL VIEW parse_key_val_tuple(text, '&', '=', 'foo', 'extra', 'empty', 'bad') t AS foo, extra, empty, bad;
Dynamic Parameter Sample:
SELECT t.*
-- Static data from 'dummy' table for demo
FROM(
-- type = person
SELECT 'greeting=Hello\;person=Mr. Smith' AS text, 'person' AS type FROM dual LIMIT 1
UNION ALL
-- type = thing
SELECT 'greeting=Hi\;thing=World' AS text, 'thing' AS type FROM dual LIMIT 1
) data
-- lookup name by type
LATERAL VIEW parse_key_val_tuple(text, '\;', '=', 'greeting', data.type) t AS greeting, name;