Thoughts on Madlib API (Nov 2010)
** Thoughts on MADlib® API: **
Summary:
- Internal UDFs/UDAs:
- External UDFs.
- Instructions only.
NOTE: we should also add "view coercion" as a coding style, and iterative/recursive methods as a requirement.
Details:
** 1. Internal UDFs/UDAs: **
This is the most common approach to UDFs/UDAs. All of the built-in database functions use this methodology. Functions may be written in any built-in database language, but C is preferable for performance reasons. If performance is not an issue, PL/pgSQL is usually the next best choice.
General synopsis: output_data_type madlib.function_name( arg1, ..., argN), where arg1 through argN can be of any data type
* Example 1. (simple UDF): *
- Synopsis: numeric pg_catalog.round( numeric)
- Example call:
psql> select round(1.77);
round
-------
2
(1 row)
* Example 2. (more complex UDA): *
- Synopsis: float[] mregr_coef( float, float[])
- Source data:
psql> select * from regr_example;
y | x1 | x2 | x3
----+----+----+----
2 | 1 | 1 | 1
4 | 2 | 2 | 2
10 | 5 | 4 | 5
20 | 10 | 11 | 9
(4 rows)
- Example call:
psql> SELECT mregr_coef(y, array[x1, x2, x3]), mregr_r2(y, array[x1, x2, x3]) from regr_example;
mregr_coef | mregr_r2
--------------------------------------------------------------+-------------------
{1.99999999999272,1.81898940354586e-12,1.81898940354586e-12} | 0.999999999995081
(1 row)
** 2. External UDFs. **
If a given function requires a more complex input and output specification, or must access some additional data at runtime, it may be necessary to use a more complex API. In such cases there may be a large set of input parameters and/or some input and output tables/views.
Guidelines:
(1-1) A generic parameter table may be used to specify parameter sets for a given madlib function.
Table: madlib.parameters:
Columns:
- pset_id INT NOT NULL -- parameter set id
- func_name NAME NOT NULL -- madlib function name (no schema needed, "madlib" assumed)
- par_name NAME NOT NULL -- name of the function parameter
- par_type VARCHAR(10) NOT NULL -- data type of the function parameter (for proper conversion)
- par_desc TEXT -- description of the parameter
- par_value TEXT -- value of the parameter
Constraints:
- primary key ( pset_id, func_name, par_name)
- check constraint ( par_type in ('input', 'output', 'int', 'float', 'text', 'int[]', 'float[]', 'text[]'))
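The table described above can be sketched as DDL. This is an illustrative sketch only, not a shipped MADlib object; the column names, types, and constraints are taken directly from the guideline:

```sql
-- Sketch of the generic parameter table described above.
CREATE TABLE madlib.parameters (
    pset_id   INT         NOT NULL,  -- parameter set id
    func_name NAME        NOT NULL,  -- madlib function name ("madlib" schema assumed)
    par_name  NAME        NOT NULL,  -- name of the function parameter
    par_type  VARCHAR(10) NOT NULL,  -- data type of the parameter (for proper conversion)
    par_desc  TEXT,                  -- description of the parameter
    par_value TEXT,                  -- value of the parameter
    PRIMARY KEY (pset_id, func_name, par_name),
    CHECK (par_type IN ('input', 'output', 'int', 'float',
                        'text', 'int[]', 'float[]', 'text[]'))
);
```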
(1-2) Parameter validation:
A function (madlib._check_parameters) would be provided to test the correctness of the parameter values.
This function could then be called at the beginning of each user function's execution.
It would perform the following tests based on the value of the par_type field:
- input : value must be in the form of schema.table_view_name and must exist in the database;
- output : value must be in the form of schema.table_name; if the table exists it will be overwritten or appended to
- int/int[] : value must be an INT/INT[]
- float/float[] : value must be a FLOAT/FLOAT[]
- text/text[] : value must be a TEXT/TEXT[]
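A minimal PL/pgSQL sketch of such a validation routine, assuming the madlib.parameters layout described above. The function body and error messages are illustrative, not an actual MADlib implementation; only two par_type branches are shown:

```sql
-- Illustrative sketch of madlib._check_parameters for one pset_id/function.
-- The 'int' and 'input' branches are shown; the remaining par_type values
-- ('output', 'float', 'text', and the array types) would be handled analogously.
CREATE FUNCTION madlib._check_parameters(p_pset_id INT, p_func_name NAME)
RETURNS VOID AS $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT par_name, par_type, par_value
             FROM madlib.parameters
             WHERE pset_id = p_pset_id AND func_name = p_func_name
    LOOP
        IF r.par_type = 'int' THEN
            PERFORM r.par_value::INT;        -- raises an error on a bad cast
        ELSIF r.par_type = 'input' THEN
            PERFORM r.par_value::regclass;   -- raises if schema.table_view_name does not exist
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
```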
(1-3) Logging:
- Information:
INFO: TimeStamp : madlib.func_name : Message
- Error/Exception:
ERROR: TimeStamp : madlib.func_name : Message
TimeStamp: date_trunc( 'sec', clock_timestamp()::timestamp)
(1-4) Final Output:
INFO: TimeStamp : madlib.func_name : Finished OK. Total runtime HH:MM:SS
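In PL/pgSQL, the logging conventions above could be implemented with RAISE. The function name and messages below are placeholders:

```sql
-- Illustrative logging statements following the format above.
RAISE INFO '% : madlib.k_means_cluster : Starting execution',
    date_trunc('sec', clock_timestamp()::timestamp);

-- Error/Exception (aborts the current transaction):
RAISE EXCEPTION '% : madlib.k_means_cluster : Invalid number of clusters',
    date_trunc('sec', clock_timestamp()::timestamp);
```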
(1-5) Example:
K-means clustering function (madlib.k_means_cluster) developed according to the above guidelines:
* Assuming the following:
** Source table/view: source_schema.data_points ( id INT, dim1 FLOAT, dim2 FLOAT, dim3 FLOAT);
This table/view would have to follow a certain structure, described in the function specs.
For example, k_means_cluster may allow the following two types of input structures,
validation of which would occur at the beginning of the execution:
- Table/view with 2 columns:
point_id INT
coordinates FLOAT[]
- Table/view with N+1 columns where N is the number of dimensions:
point_id INT
dim1 FLOAT
...
dimN FLOAT
** And the fact that the function requires a target number of clusters as input.
* Example API design:
a) Option #1: when using madlib.parameters table:
- Synopsis: void madlib.k_means_cluster( pset_id INT)
- Parameters: select * from madlib.parameters;
pset_id | func_name | par_name | par_type | par_desc | par_value
---------+-----------------+----------+----------+-------------------------------------+--------------------
1 | k_means_cluster | p_source | input | Source table/view with data points. | source.data_points
1 | k_means_cluster | p_k | int | Number of clusters to generate. | 10
1 | k_means_cluster | p_out | output | Target table to save the output in. | output.clusters
- Example call: select madlib.k_means_cluster( 1);
Where '1' represents the parameter set id.
b) Option #2: using PostgreSQL function overloading feature:
- Synopsis: void madlib.k_means_cluster( p_source TEXT, p_k INT, p_out TEXT)
- Example call: select madlib.k_means_cluster( 'source.data_points', 10, 'output.clusters');
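A skeleton of the overloaded variant might look as follows. Only the signature comes from the synopsis above; the body is a placeholder sketch:

```sql
CREATE FUNCTION madlib.k_means_cluster(p_source TEXT, p_k INT, p_out TEXT)
RETURNS VOID AS $$
BEGIN
    -- With no madlib.parameters row involved, inputs are validated directly.
    PERFORM p_source::regclass;   -- source table/view must exist
    IF p_k < 1 THEN
        RAISE EXCEPTION 'p_k must be a positive integer';
    END IF;
    -- ... clustering logic, writing the results into the p_out table ...
END;
$$ LANGUAGE plpgsql;
```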
The advantage of option #1 is the ability to use a single function for parameter type/value validation.
** 3. Instructions only. **
This approach would apply to situations where there are no MADlib UDFs or UDAs, but a valid solution can still be provided in the form of a step-by-step instruction set.
Example: weighted random sampling without replacement.
psql> SELECT * FROM population_table ORDER BY log( random()^(1/weight)) DESC LIMIT <sample size>;
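A concrete instance of the query with a sample size of 10. Since log is monotone, ordering by log(random()^(1/weight)) is equivalent to ordering by the key random()^(1/weight) itself; writing 1.0/weight guards against integer division if weight happens to be an integer column:

```sql
-- Weighted random sample of 10 rows without replacement (assumes weight > 0).
SELECT *
FROM population_table
ORDER BY log( random() ^ (1.0 / weight) ) DESC
LIMIT 10;
```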