Home
Table Functions are a powerful mechanism to extend a database’s functionality. Further, a Partitioned Table Function capability combines extensibility with the ability to partition and parallelize how a function operates on the input data set. Many databases provide this capability, whether as Parallel Pipelined Functions in Oracle or as SQL/MR in Aster.
Hive enables this capability by allowing users to specify Map & Reduce scripts where a table could appear, but this lacks the simplicity of injecting a table function into SQL. SQL Windowing is a subarea of table functions where aggregations operate on partitions or windows of rows. We feel that both these use cases could benefit from a SQL-like interface for the user. SQLWindowing for Hive is an open source project that provides a layer on top of Hive that enables users to specify Windowing and Table Function based analysis in an HQL-like fashion. For a detailed technical writeup, please look at Windowing.pdf, and for details on the project please visit the project page. In this document we give a brief overview of the project from an End User’s perspective.
There are two forms of queries supported:

Windowing Queries: here the input to the windowing clauses can be any Hive table or query. The windowing functions support aggregation, lead, lag, linear regression, etc. See the examples below.
from <hive table or query> partition by ... order by with windowing expressions select cols where ...
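For instance, here is a minimal query of this form, lifted from the sample CLI session further below (census_q1 is one of the census tables used in the later examples):

```
from census_q1
partition by county
order by county, arealand desc
with rank() as r
select county, tract, arealand, r
into path='/tmp/wout';
```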
Table Function Queries: the difference here is that the partitions from the Hive query are handed to a table function, e.g. NPath (similar to Aster’s NPath function), which then processes a partition and outputs a partition.
from tableFunction(<hive table or query> partition by ... order by) select cols where ...
Table Functions may be chained together. It will also be possible to chain queries, so the input to a table function could be another SQLWindowing query.
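Schematically, a chained invocation could look like the sketch below; hypotheticalOuterFn and hypotheticalInnerFn are placeholder names used only to illustrate the shape of such a query (they are not functions shipped with the library), and any re-partitioning between the two functions is omitted:

```
from hypotheticalOuterFn(hypotheticalInnerFn(<hive table or query> partition by ... order by))
select cols
where ...
```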
- download com.sap.hadoop.windowing-0.0.2-SNAPSHOT.jar
- cp com.sap.hadoop.windowing-0.0.2-SNAPSHOT.jar to $HIVE_HOME/lib
- cp $HIVE_HOME/bin/ext/cli.sh $HIVE_HOME/bin/ext/windowCli.sh
- download groovy-all-1.8.0.jar and copy it to $HIVE_HOME/lib. If you want a more recent version of Groovy, download it from http://groovy.codehaus.org/Download
- edit windowCli.sh; change to:
```
THISSERVICE=windowingCli
export SERVICE_LIST="${SERVICE_LIST}${THISSERVICE} "

windowingCli () {
  CLASS=com.sap.hadoop.windowing.WindowingHiveCliDriver
  if $cygwin; then
    HIVE_LIB=`cygpath -w "$HIVE_LIB"`
  fi
  JAR=${HIVE_LIB}/com.sap.hadoop.windowing-0.0.2-SNAPSHOT.jar
  exec $HADOOP jar $JAR $CLASS "$@"
}

windowingCli_help () {
  windowingCli "--help"
}
```
The library is usable in 2 forms: as a Command Line Interface and as a Java library.
Start the windowingCli service, e.g.:
$hive --service windowingCli
In this mode you can intermix Hive & SQLWindowing queries from the shell. Use ‘wmode’ to switch between Hive & SQLWindowing mode. For example, here is a sample session:
```
SELECT * FROM bucketed_users TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);

wmode windowing;

from census_q1
partition by county
order by county, arealand desc
with rank() as r
select county, tract, arealand, r
into path='/tmp/wout';

wmode hive;

select count(*) from movieratings2;

wmode windowing;

from <select origin_city_name, year, month, day_of_month, dep_time
      from flightsdata
      where dest_city_name = 'New York' and dep_time != '' and day_of_week = 1>
partition by origin_city_name, year, month, day_of_month
order by dep_time
select origin_city_name, year, month, day_of_month, dep_time,
       <lag('dep_time', 1)> as lastdep[string]
where <((dep_time[0..1] as int) - (lag('dep_time', 1)[0..1] as int)) * 60 +
      ((dep_time[2..3] as int) - (lag('dep_time',1)[2..3] as int)) \> 60>
into path='/tmp/wout'
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties('field.delim'=',')
format 'org.apache.hadoop.mapred.TextOutputFormat';

from census_q2
partition by county
order by pop100 desc
with rank() as r,
     sum(pop100) as s,
     first_value(pop100) as fv
select county, name, pop100, r,
       <((double)pop100)/fv *100> as percentOfTopSubCounty[double],
       <lag('pop100', 1) - pop100> as diffFromNextLargestSubCounty[int]
into path='/tmp/wout';

wmode hive;

select /*+ mapjoin(bucketed_movies_rc2) */ bucketed_movieratings_rc2.userid
from bucketed_movieratings_rc2
join bucketed_movies_rc2 on (bucketed_movieratings_rc2.movieid = bucketed_movies_rc2.movie_id);

wmode windowing;

from <select county, tract, arealand from geo_header_sf1 where sumlev = 140>
partition by county
order by county, arealand desc
with rank() as r
select county, tract, arealand, r
into path='/tmp/wout'
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
format 'org.apache.hadoop.mapred.TextOutputFormat';
```
The library is packaged as a Java jar that can be embedded in your app. There is a simple API to interact with the windowing engine; essentially:
```
// setup your hadoop configuration
Configuration conf = WORK_LOCALMR();
// create WindowingShell
WindowingShell wshell = new WindowingShell(conf, new MRTranslator(), new MRExecutor());
// in API mode setup a Thrift Based HiveQuery Executor;
// embedded Hive Queries will be run using this Executor.
wshell.hiveQryExec = new ThriftBasedHiveQueryExecutor(conf);
// execute a Query.
wshell.execute("""
from <select origin_city_name, year, month, day_of_month, arr_delay, fl_num
      from flightsdata
      where dest_city_name = 'New York' and dep_time != ''>
partition by fl_num
order by year, month, day_of_month
with sum(<arr_delay < 0 ? 1 : 0>) over rows between 5 preceding and current row as delaycount[int]
select origin_city_name, fl_num, year, month, day_of_month, delaycount
where <delaycount \\>= 5>
into path='/tmp/wout'
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties('field.delim'=',')
format 'org.apache.hadoop.mapred.TextOutputFormat'""")
```
Currently there are 21 functions available. These are loosely divided into Ranking, Aggregation, Navigation and Statistics functions. Not all functions support a windowing clause. Functions are described by an annotation that specifies their details: name, description, windowing support, and args with their names, types and whether they are required or optional.
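As an illustration only, the sketch below shows the kind of information such an annotation carries. The annotation and attribute names here are hypothetical and may well differ from the ones actually used in the library:

```
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotations: illustrative only, not the library's actual API.
@Retention(RetentionPolicy.RUNTIME)
@interface ArgDef {
    String name();                     // argument name
    String type();                     // argument type
    boolean optional() default false;  // whether the argument may be omitted
}

@Retention(RetentionPolicy.RUNTIME)
@interface FunctionDef {
    String name();               // function name used in queries
    String description();        // short description of the function
    boolean supportsWindow();    // whether a windowing clause is allowed
    ArgDef[] args() default {};  // the function's arguments
}

// A function implementation could then be described like this:
@FunctionDef(
    name = "ntile",
    description = "divides an ordered partition into N buckets",
    supportsWindow = false,
    args = { @ArgDef(name = "numBuckets", type = "int") }
)
class NTileSketch { }
```

The query below exercises many of these functions: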
```
from part_rc
partition by p_mfgr
order by p_mfgr, p_name
with rank() as r,
     sum(p_size) over rows between unbounded preceding and current row as s,
     sum(p_size) over rows between current row and unbounded following as s1,
     denserank() as dr,
     cumedist() as cud,
     percentrank() as pr,
     ntile(<3>) as nt,
     count(<p_size>) as c,
     count(<p_size>, 'all') as ca,
     count(<p_size>, 'distinct') as cd,
     avg(<p_size>) as avg,
     stddev(p_size) as st,
     first_value(p_size) as fv,
     last_value(p_size) as lv,
     first_value(p_size, 'true') over rows between 2 preceding and 2 following as fv2
select p_mfgr, p_name, p_size, r, dr, cud, pr, nt, c, ca, cd, avg, st, fv, lv, fv2
```
We have loaded a subset of the 2011 Census data into our Hive instance.
Calculate the Top 3 Tracts (based on land area) by County.
```
from <select county, tract, arealand from geo_header_sf1 where sumlev = 140>
partition by county
order by county, arealand desc
with rank() as r,
     sum(arealand) over rows between unbounded preceding and current row as cum_area
select county, tract, arealand, r, cum_area
where <r <= 3>
```
Compute each subcounty's population as a percentage of the largest subcounty in its county, and the difference from the next largest subcounty.
```
from census_q2
partition by county
order by pop100 desc
with rank() as r,
     sum(pop100) as s,
     first_value(pop100) as fv
select county, name, pop100, r,
       <((double)pop100)/fv *100> as percentOfTopSubCounty[double],
       <lag('pop100', 1) - pop100> as diffFromNextLargestSubCounty[int]
```
These examples are based on the Flights on-time dataset from TSA.
List flights to NY on Mondays for which there was no other flight from the same origin within 1 hour before this flight.
```
from <select origin_city_name, year, month, day_of_month, dep_time
      from flightsdata
      where dest_city_name = 'New York' and dep_time != '' and day_of_week = 1>
partition by origin_city_name, year, month, day_of_month
order by dep_time
select origin_city_name, year, month, day_of_month, dep_time,
       <lag('dep_time', 1)> as lastdep[string]
where <((dep_time[0..1] as int) - (lag('dep_time', 1)[0..1] as int)) * 60 +
      ((dep_time[2..3] as int) - (lag('dep_time',1)[2..3] as int)) \\> 60>
```
List incidents by airline where a passenger would have missed a connecting flight (of the same airline) because of a delay.
- create a set of flights arriving or leaving NY by doing a union all on 2 queries on flightsdata
- tag arriving flights as 1, leaving as 2
- partition by carrier, because we are comparing flights from the same airline
- sort by carrier, year, month, day_of_month and time (arrival or departure)
- for any arriving flight, look forward to find any leaving flight within the window of delay + 30 mins. Possible solutions:
- sum(<1>) over range current row and <t + delay> 30.0 more as numflights; count the flights that fall within 30 minutes of the arrival time. Problem: this needs to look only at departing flights. (A rough sketch of this approach appears after the query below.)
- Below we only show the expression used to compare with the next row. Solution 1 is the better solution.
- Custom function.
```
from <select * from (
        select unique_carrier, fl_num, year, month, day_of_month,
               CRS_ARR_TIME as t, arr_delay as delay, 1 as flight
        from flightsdata
        where dest_city_name = 'New York' and dep_time != '' and arr_delay is not null
        union all
        select unique_carrier, fl_num, year, month, day_of_month,
               CRS_DEP_TIME as t, dep_delay as delay, 2 as flight
        from flightsdata
        where origin_city_name = 'New York' and dep_time != ''
      ) t>
partition by unique_carrier
order by year, month, day_of_month, t
select unique_carrier, fl_num, year, month, day_of_month, t
where <(flight == 1) && (delay + 30.0 \\> (((lead('t', 1)[0..1] as int) - (t[0..1] as int)) * 60 +
                                          ((lead('t', 1)[2..3] as int) - (t[2..3] as int))))>
```
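For reference, here is a rough, untested sketch of what Solution 1 from the list above might look like. The range-window clause is copied from that bullet and has not been validated against the library's actual syntax; the flight == 2 test inside sum() and the final where predicate are assumptions, added so that only departing flights are counted and only arriving flights with at least one such departure are returned:

```
from <select * from (
        select unique_carrier, fl_num, year, month, day_of_month,
               CRS_ARR_TIME as t, arr_delay as delay, 1 as flight
        from flightsdata
        where dest_city_name = 'New York' and dep_time != '' and arr_delay is not null
        union all
        select unique_carrier, fl_num, year, month, day_of_month,
               CRS_DEP_TIME as t, dep_delay as delay, 2 as flight
        from flightsdata
        where origin_city_name = 'New York' and dep_time != ''
      ) t>
partition by unique_carrier
order by year, month, day_of_month, t
with sum(<flight == 2 ? 1 : 0>) over range current row and <t + delay> 30.0 more as numflights[int]
select unique_carrier, fl_num, year, month, day_of_month, t, numflights
where <(flight == 1) && (numflights \\> 0)>
```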
NPath(String Pattern, GroovyExpr Symbols, GroovyExpr Results)
NPath returns rows that meet a specified pattern. Symbols specify a list of named expressions to match against; the Pattern specifies a path over these symbols. The Results list can contain expressions based on the input columns and also on the matched path.
List incidents where a Flight (to NY) has been more than 15 minutes late 5 or more times in a row.
```
from npath(<select origin_city_name, year, month, day_of_month, arr_delay, fl_num
            from flightsdata
            where dest_city_name = 'New York' and dep_time != ''>
           partition by fl_num
           order by year, month, day_of_month,
           'LATE.LATE.LATE.LATE.LATE+',
           <[LATE : "arr_delay \\> 15"]>,
           <["origin_city_name", "fl_num", "year", "month", "day_of_month",
             ["(path.sum() { it.arr_delay})/((double)count)", "double", "avgDelay"],
             ["count", "int", "numOfDelays"]
            ]>)
select origin_city_name, fl_num, year, month, day_of_month, avgDelay, numOfDelays
```
- LinearRegSlope
- computes the slope of the regression line fitted to non-null (x, y) pairs.
- LinearRegIntercept
- computes the intercept of the regression line fitted to non-null (x, y) pairs.
- RegCount
- returns the number of non-null (x, y) pairs.
```
from statisticsdataset
partition by p_mfgr
order by p_mfgr, p_name
with linearRegSlope(val1, val2) as b,
     linearRegIntercept(val1, val2) as a,
     regCount(val1, val2) as c
select a, b, c
```
For details on the functions we plan to support, look at Table Functions.