This guide takes you through the various interfaces in the Transport UDFs API that enable users to express data types, data objects, type signatures, and UDFs. For information about the project in general please refer to the documentation index
The StdType
interface is the parent class of all type objects that
are used to describe the schema of the data objects that can be
manipulated by StdUDFs
. Sub-interfaces of this interface include
StdIntegerType
, StdBooleanType
, StdLongType
, StdStringType
,
StdDoubleType
, StdFloatType
, StdBinaryType
, StdArrayType
,
StdMapType
, and StdStructType
. Each sub-interface is
defined by methods that are specific to the corresponding type. For
example, StdMapType
interface is defined by the two methods shown
below. The keyType()
and valueType()
methods can be used to obtain
the key and value types of a StdMapType
object.
public interface StdMapType extends StdType {
StdType keyType();
StdType valueType();
}
Similarly, the rest of the StdType
sub-types look like the following:
public interface StdArrayType extends StdType {
StdType elementType();
}
public interface StdStructType extends StdType {
List<? extends StdType> fieldTypes();
}
StdData
is a top-level interface for describing data that can be
manipulated by Transport UDFs. As a top-level interface, StdData
itself does not contain any methods. A number of type-specific
interfaces extend StdData
, such as StdInteger
, StdLong
,
StdBoolean
, StdString
, StdDouble
, StdFloat
, StdBinary
,
StdArray
, StdMap
, and StdStruct
to represent INTEGER
,
LONG
, BOOLEAN
, VARCHAR
, DOUBLE
, REAL
, VARBINARY
, ARRAY
, MAP
,
and STRUCT
SQL types respectively. Each of those interfaces exposes
operations that can manipulate that type of data. For example,
StdMap
interface is defined by the following methods:
public interface StdMap extends StdData {
int size();
StdData get(StdData key);
void put(StdData key, StdData value);
Set<StdData> keySet();
Collection<StdData> values();
boolean containsKey(StdData key);
}
The StdArray
interface is defined by the following methods:
public interface StdArray extends StdData, Iterable<StdData> {
int size();
StdData get(int idx);
void add(StdData e);
}
The StdStruct
interface is defined by the following methods:
public interface StdStruct extends StdData {
StdData getField(int index);
StdData getField(String name);
void setField(int index, StdData value);
void setField(String name, StdData value);
List<StdData> fields();
}
As an example of primitive types, the StdString
interface is defined
as follows. Other primitive types follow the same pattern:
public interface StdString extends StdData {
String get();
}
The Transport UDFs framework provides a way to declare what data types
a UDF expects through type signatures. Type signatures in Transport
UDFs support generic types, including type verification and
inference. For example, a user can state that a UDF expects two
arguments, one of type "array(K)"
and the other of type "array(V)"
and returns an object of type "map(K,V)"
. While type signatures are
exposed to the end user as strings, internally, TypeSignature
class
is used to represent type signature information. Type signatures can
represent types with type parameters, which can be both concrete or
generic. The following are type signature strings along with their
definition:
"varchar"
: to represent SQL Varchar type. The respective Standard Type is StdString."bigint"
: to represent SQL BigInt/Long types. The respective Standard Type is StdLong."integer"
: to represent SQL Int type. The respective Standard Type is StdInteger."boolean"
: to represent SQL Boolean type. The respective Standard Type is StdBoolean."double"
: to represent SQL Double type. The respective Standard Type is StdDouble."real"
: to represent SQL Real type. The respective Standard Type is StdFloat."varbinary"
: to represent SQL Binary type. The respective Standard Type is StdBinary."array(T)"
: to represent SQL Array type, with elements of type T. The respective Standard Type is StdArray."map(K,V)"
: to represent SQL Map type, with keys of type K and values of type V. The respective Standard Type is StdMap."row(f1 T1,.., fn Tn)"
: to represent SQL Struct type with field types T1.. Tn with names f1.. fn, respectively. The respective Standard Type is StdStruct.
StdFactory
is used to create new standard objects of different
types. Similar to StdData
and StdType
, StdFactory
implementations hide all the platforms-specific details from the user
and expose only standard methods. StdFactory
definition is shown
below. As we can see, it contains methods for creating primitive types
using their primitive values, creating container types (arrays, maps,
and structs) using their type information (encapsulated as
StdTypes
), and method to create StdTypes
from type signatures.
public interface StdFactory {
StdInteger createInteger(int value);
StdLong createLong(long value);
StdBoolean createBoolean(boolean value);
StdString createString(String value);
StdDouble createDouble(double value);
StdFloat createFloat(float value);
StdBinary createBinary(ByteBuffer value);
StdArray createArray(StdType stdType, int expectedSize);
StdArray createArray(StdType stdType);
StdMap createMap(StdType stdType);
StdStruct createStruct(List<String> fieldNames, List<StdType> fieldTypes);
StdStruct createStruct(List<StdType> fieldTypes);
StdStruct createStruct(StdType stdType);
StdType createStdType(String typeSignature);
}
All Transport UDF implementations (expressing UDF logic) extend the
StdUDF
abstract class. StdUDF
abstract class is the base class for
more specific StdUDF
abstract sub-classes that are specific to the
number of UDF arguments, i.e., StdUDF0
, StdUDF1
, StdUDF2
,
etc. StdUDF(i)
is an abstract class for UDFs expecting i
arguments. Similar to lambda expressions, StdUDF(i)
abstract classes
are type-parameterized by the input types and output type of the eval
function. Each class is type parameterized by (i + 1)
type
parameters: i
type parameters for the UDF input types, and one type
parameter for the output type. All types (both input and output
types) must extend the StdData
interface, i.e., StdLong
,
StdBoolean
, StdArray
, etc. Below we list the definition of
StdUDF
base class.
public abstract class StdUDF {
private StdFactory _stdFactory;
public abstract List<String> getInputParameterSignatures();
public abstract String getOutputParameterSignature();
public void init(StdFactory stdFactory) {
_stdFactory = stdFactory;
}
public void processRequiredFiles(String[] localFiles) {
}
public boolean[] getNullableArguments() {
return new boolean[numberOfArguments()];
}
protected abstract int numberOfArguments();
public StdFactory getStdFactory() {
return _stdFactory;
}
}
The init()
method is called at the UDF initialization time before
processing any records. It sets the StdFactory
to be used by the
StdUDF
and it can be used to perform necessary UDF
initializations. The methods getInputParameterSignatures()
and
getOutputParameterSignature()
are used to specify the input type
signatures and output type signature of the UDF, respectively. Input
type signatures are represented by a list of strings (that are parsed
into TypeSignature
objects), where each element of the list
represents the signature of a UDF argument. As mentioned above, type
signatures provide the ability to use generic types. The method
numberOfArguments()
has a default implementation in each StdUDF(i)
sub-interface based on the value of i
. We discuss
processRequiredFiles()
and getNullableArguments()
in the following
subsections.
As an example of StdUDF(i)
, we show StdUDF2
definition below. The
eval()
method is the main method of this API, and this is where all
UDF user logic should be expressed. As we can see, it takes two input
types, I1
, and I2
, and return an output type O
, all of which
extend StdData
. Return values of the eval()
method can be
instantiated using the StdFactory
if necessary.
public abstract class StdUDF2<I1 extends StdData, I2 extends StdData, O extends StdData> extends StdUDF {
public abstract O eval(I1 arg1, I2 arg2);
public String[] getRequiredFiles(I1 arg1, I2 arg2) {
return new String[]{};
}
protected final int numberOfArguments() {
return 2;
}
}
The StdUDF
API provides a standard way for accessing HDFS files and
processing them in UDFs. The API exposes two file handling methods:
getRequiredFiles()
, and
processRequiredFiles()
. getRequiredFiles()
is implemented by the
StdUDF
implementation and returns a list of files that are required
to be localized to the workers. processRequiredFiles()
expects a
list of paths of the localized files corresponding to the list of
files requested in getRequiredFiles()
, and hence this list is passed
to the method implementation and is available to the user to parse
those files. Therefore, processRequiredFiles()
provides a mechanism
for users to process the localized files before the execution of the
eval()
method. One common usage is to build hash tables (or bitmaps)
from the files that can be used as lookup tables in the eval()
method.
Nullable arguments are arguments that can receive a null value. When
an argument is declared nullable, the user can implement logic to
specify the behavior of the UDF once an argument takes a null
value. If an argument is non-nullable, the UDF returns null by default
if that argument is null. Users can specify that behavior by
implementing the getNullableArguments()
method. The default
implementation is to set all arguments to be non-nullable.
TopLevelStdUDF
API is an interface that has only two methods:
getFunctionName()
and getFunctionDescription()
. It is used as a
means to enable UDF overloading if necessary. Transport UDFs enable
overloading such that UDFs with the same name can have different type
input parameter type signatures. If a UDF does not require
overloading, it simply implements TopLevelStdUDF
in addition to
extending StdUDF(i)
. That way, the UDF will implement
getFunctionName()
and getFunctionDescription()
in addition to
other methods inherited from StdUDF(i)
, and hence the UDF definition
becomes "flat". On the other hand, if the UDF requires overloading,
then each overloading class implements an interface that extends
TopLevelStdUDF
providing the common name and description of the UDF,
in addition to extending its respective StdUDF(i)
. That way, the UDF
name and description is kept in one place (in the interface extending
TopLevelStdUDF
), but the overloading-specific information is kept in
separate classes.
The example below shows how it becomes so simple to express a UDF by combining all the APIs above together:
public class MapFromTwoArraysFunction extends StdUDF2<StdArray, StdArray, StdMap> implements TopLevelStdUDF {
private StdType _mapType;
@Override
public List<String> getInputParameterSignatures() {
return ImmutableList.of(
"array(K)",
"array(V)"
);
}
@Override
public String getOutputParameterSignature() {
return "map(K,V)";
}
@Override
public void init(StdFactory stdFactory) {
super.init(stdFactory);
_mapType = getStdFactory().createStdType(getOutputParameterSignature());
}
@Override
public StdMap eval(StdArray a1, StdArray a2) {
if (a1.size() != a2.size()) {
return null;
}
StdMap map = getStdFactory().createMap(_mapType);
for (int i = 0; i < a1.size(); i++) {
map.put(a1.get(i), a2.get(i));
}
return map;
}
@Override
public String getFunctionName() {
return "map_from_two_arrays";
}
@Override
public String getFunctionDescription() {
return "A function to create a map out of two arrays";
}
}
More examples can be found in the examples module.