
Expression Optimization via Distributive Law

Ashar edited this page Aug 20, 2019 · 9 revisions

This documentation is no longer valid. The API and interfaces presented here have been removed from the source. This page is kept for record purposes only.


Distributive Law

From Britannica

Distributive law the law relating the operations of multiplication and addition, stated symbolically, a*( b + c ) = a*b + a*c ; that is, the monomial factor a is distributed, or separately applied, to each term of the binomial factor b + c, resulting in the product ab + ac. From this law, it is easy to show that the result of first adding several numbers and then multiplying the sum by some number is the same as first multiplying each separately by the number and then adding the products.

Flexibility of Optimization

Optimization at element-level

We optimize at the element level rather than at the tensor level. This approach has many benefits: it adapts well to other tensor types such as tensor_sparse, and it is what allows us to optimize even operations that involve scalars.

Consider this example:

tensor<int> s{shape{5,5}, 6}; // all elements 6
tensor<int> k{shape{5,5}, 3}; // all elements 3

auto expr = 3*s + k*7; // Still optimizes

In the evaluation of this expression, the tensor k is simply a tensor with all values 3. The optimizer treats this as a common operand with the scalar 3 and rewrites the expression as expr = 3*(s + 7). In this case, each element of expr evaluates to 3*(6+7) = 39.

If k did not consist entirely of 3s, or only some of its values were 3, the optimizer would still optimize the expression partially.

Proof of Optimization

Metric for comparison

Using execution time as a performance metric was not a good idea: depending on the scheduler and the number of running processes, the CPU may not do the program justice. Instead, I have relied on the number of instructions executed during the run. The more instructions, the poorer the performance.

Testing Code

We have deliberately chosen the expression a*b + a*c for evaluation. Each operand in the expression is a tensor of extent {500, 5000}.

#include <boost/numeric/ublas/tensor.hpp>
#include <numeric>
#include <vector>

int main() {
  using namespace boost::numeric::ublas;
  using tensor_type = tensor<int>;
  std::vector<int> sa(500*5000), sb(500*5000);
  std::iota(sa.begin(), sa.end(), 1);
  std::iota(sb.begin(), sb.end(), 1);
  tensor_type a{shape{500, 5000}, sa}, b{shape{500, 5000}, sb}, c{shape{500, 5000}, 1};

  auto expr = a*b + a*c;  // build the expression template
  tensor_type x = expr;   // evaluate the expression
}

Compilation with/without optimization

Optimization is enabled by default, and we compiled the program as follows:

g++ test/tensor/test_basic.cpp -I./include/ -std=c++17 -O3 -o a.out

For the unoptimized binary, we compile with the macro that disables the optimization:

g++ test/tensor/test_basic.cpp -I./include/ -std=c++17 -O3 -o b.out \
    -DBOOST_UBLAS_NO_OPTIMIZATION

Executing with Valgrind

Optimized

valgrind --tool=callgrind ./a.out # Optimized binary

==14545== Callgrind, a call-graph generating cache profiler
==14545== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==14545== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==14545== Command: ./a.out
==14545==
==14545== For interactive control, run 'callgrind_control -h'.
==14545==
==14545== Events    : Ir
==14545== Collected : 59792202
==14545==
==14545== I refs:      59,792,202

Unoptimized

valgrind --tool=callgrind ./b.out # Unoptimized binary

==14569== Callgrind, a call-graph generating cache profiler
==14569== Copyright (C) 2002-2017, and GNU GPL'd, by Josef Weidendorfer et al.
==14569== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==14569== Command: ./b.out
==14569==
==14569== For interactive control, run 'callgrind_control -h'.
==14569==
==14569== Events    : Ir
==14569== Collected : 102292217
==14569==
==14569== I refs:      102,292,217

I do not present a percentage improvement in execution time; still, evaluating a tensor with fewer instructions is certainly better than evaluating it after many more. When the optimization cannot take place, the check for it does not deteriorate performance by a large factor: in most of my runs, it added only about 2-5% more instructions.

Configuration Macros

We offer control over the optimization through the following macros.

  • BOOST_UBLAS_NO_EXPRESSION_OPTIMIZATION: If defined, no optimization is applied to expressions. If you do not want us to optimize your expressions, define this macro when compiling your sources.

  • BOOST_UBLAS_NO_RECURSIVE_OPTIMIZATION: If defined, the optimizer does not recurse into the left and right operands of an expression when they are separated by + or -.

    Informally

    tensor<int> a, b, c; // Assume each holds some values and that all have the same extent.
    #define BOOST_UBLAS_NO_RECURSIVE_OPTIMIZATION
    
    auto expr = (a*b - b*c) - (c*a - a*b);
    // No optimization: the top-level operator is -, and the optimizer does not
    // recurse into its left and right sides.
    
    auto expr2 = (a*b - b*c) + (c*a - a*b);
    // No optimization, for the same reason as above.
    
    auto expr3 = (a*b - b*c) * (c*a - a*b);
    // Optimizes after recursing even though the macro is defined, because this
    // expression is a single term. Optimizes to (b*(a-c))*(a*(c-b)).
    
    #undef BOOST_UBLAS_NO_RECURSIVE_OPTIMIZATION
    
    auto expr4 = (a*b - b*c) - (c*a - a*b);
    // Recurses and optimizes to: (b*(a-c)) - (a*(c-b))
    

WARNING: The data type of the tensor must provide an == overload for the optimization to take place. If your custom type does not provide this operator overload, disable the optimization using the configuration macro; otherwise you will get a compile-time error.
