Hive UDFs

Taking inspiration from Edward Capriolo, this project shows how to perform the rank() and dense_rank() functions within Hive, taking into account tied ranks. Also available is an implementation of first_value() and a UDTF of parse_key_val_tuple().

Building

This project uses Maven. To build the software, simply use "mvn package".

Deploying within Hive

To make the jar available on a temporary basis:

hive> add jar /home/paul/hive-udf-0.1-SNAPSHOT.jar;

To make the function available on a temporary basis:

hive> CREATE TEMPORARY FUNCTION rank AS 'com.cloudera.hive.examples.Rank';

Usage: rank(), dense_rank() and first_value()

The rank(), dense_rank() and first_value() functions take a minimum of one parameter - the column that is to be ranked / retrieved. Since the functions have no ability to sort data for themselves, we use a sub-query to appropriately distribute and sort data before passing to rank().

For example, imagine we have a table such as the following:

select * from items;

item	category	price
Orange	Fruit	0.30
Apple	Fruit	0.25
Banana	Fruit	0.75
Carrot	Veg	0.20
Sprout	Veg	1.75
Kiwi	Fruit	0.30

To use the rank function, prepare the data with an inner query:

select item, category, price from items distribute by category sort by category, price;

Then wrap this in another query that applies the rank function:

select item, category, price, rank(price, category) from ( select item, category, price from items distribute by category sort by category, price) inner;

item	category	price	rank
Apple	Fruit	0.25	1
Orange	Fruit	0.30	2
Kiwi	Fruit	0.30	2
Banana	Fruit	0.75	4
Carrot	Veg	0.20	1
Sprout	Veg	1.75	2

Notice that rank() takes the price column as the first parameter, the rest of the parameters are used to determine the row groupings.

Usage: parse_key_val_tuple()

The parse_key_val_tuple() function is a UDTF that takes a minimum of 4 parameters. The input string to be parsed, the delimiter between all the fields, the separator between key and value pairs, and 1 to many Keys that you would like to extract. Note that the parameters are case sensitive.

See the following links for Hive UDTF usage instructions

Below are two sample queries that show the functions basic capabilities and perhaps illustrate the outcome of some edge cases:

Demo Setup

A one row table called 'dual' is used for selecting demo values statically for demo purposes only.

echo "X" > dummy.txt
hive
CREATE TABLE dual(dummy STRING);
LOAD DATA LOCAL INPATH 'dummy.txt' OVERWRITE INTO TABLE dual;

ADD JAR /full/path/to/local/jar/hive-udf-0.1-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION parse_key_val_tuple AS 'com.cloudera.hive.udf.functions.ParseKeyValueTuple';

Static Parameter Sample:

SELECT data.label, t.*
-- Static data from 'dummy' table for demo
FROM(
    -- String to show many edge cases
    SELECT 'edge' AS label, 'foo=bar&extra=extra=separator&&empty=&bad&unused=string' AS text FROM dual LIMIT 1
    UNION ALL
    -- Good string
    SELECT 'good' AS label, 'foo=bar&extra=not-extra&&empty=not-empty&bad=not-bad' AS text FROM dual LIMIT 1
    UNION ALL
    -- Empty string
    SELECT 'empty' AS label, '' AS text FROM dual LIMIT 1
    UNION ALL
    -- NULL string
    SELECT 'null' AS label, NULL AS text FROM dual LIMIT 1
    ) data
LATERAL VIEW parse_key_val_tuple(text, '&', '=', 'foo', 'extra', 'empty', 'bad') t AS foo, extra, empty, bad;

Dynamic Parameter Sample:

SELECT t.*
-- Static data from 'dummy' table for demo
FROM(
    -- type = person
    SELECT 'greeting=Hello\;person=Mr. Smith' AS text, 'person' AS type FROM dual LIMIT 1
    UNION ALL
    -- type = thing
    SELECT 'greeting=Hi\;thing=World' AS text, 'thing' AS type FROM dual LIMIT 1
    ) data
-- lookup name by type
LATERAL VIEW parse_key_val_tuple(text, '\;', '=', 'greeting', data.type) t AS greeting, name;

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
sample		sample
src		src
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hive UDFs

Building

Deploying within Hive

Usage: rank(), dense_rank() and first_value()

Usage: parse_key_val_tuple()

About

Releases

Packages

Languages

License

coughman/hive-udf

Folders and files

Latest commit

History

Repository files navigation

Hive UDFs

Building

Deploying within Hive

Usage: rank(), dense_rank() and first_value()

Usage: parse_key_val_tuple()

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages