diff --git a/docs/reference.md b/docs/reference.md
index a87c436..94d0a46 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -50,6 +50,17 @@ Heterogeneous layers (`H-` prefix):
 - `HAdd`: Element-wise addition.
 - `HDenseBatchNorm`: `HDense` with fused batch normalization. No resource overhead when converting to hls4ml.
 - `HConv*DBatchNorm`: `HConv*D` with fused batch normalization. No resource overhead when converting to hls4ml.
+- (New in 0.2) `HActivation` with an **arbitrary unary function**. (See the note below.)
+
+```{note}
+`HActivation` will be converted to a general `unary LUT` in `to_proxy_model` when
+ - the required table size is smaller than or equal to `unary_lut_max_table_size`, and
+ - the corresponding function is not `relu`.
+
+Here, the table size is determined by $2^{bw_{in}}$, where $bw_{in}$ is the bitwidth of the input.
+
+If the table-size condition is not met, already supported activations like `tanh` or `sigmoid` will fall back to the traditional implementation. However, if an arbitrary unary function is used, the conversion will fail. Thus, when using arbitrary unary functions, make sure that the table size is small enough.
+```
 
 ```{note}
 `H*BatchNorm` layers require both scaling and shifting parameters to be fused into the layer. Thus, when bias is set to `False`, shifting will not be available.
@@ -64,6 +75,14 @@ Passive layers (`P-` prefix):
 - `PFlatten`: Flatten layer.
 - `Signature`: Does nothing, but marks the input to the next layer as already quantized to specified bitwidth.
 
+```{note}
+Average pooling layers are now bit-accurate, with the requirement that **every** individual pool size is a power of 2. This includes all padded pools, which may have smaller sizes, if any.
+```
+
+```{warning}
+As of hls4ml v0.9.1, padding in pooling layers with `io_stream` is not supported. If you are using `io_stream`, please make sure that padding is set to `valid`. To be precise, merely setting `padding='same'` is fine as long as no actual padding is performed; otherwise, the generated firmware will fail at an assertion.
+```
+
 ## Commonly used functions
 
 - `trace_minmax`: Trace the min/max values of the model against a dataset, print computed `BOPs` per-layer, and return the accumulated `BOPs` of the model.
@@ -96,3 +115,7 @@ Though the proxy model is bit-accurate with hls4ml in general, exceptions exist:
 ```{tip}
 The proxy model can also be used to convert a `QKeras` model to a bit-accurate hls4ml-ready proxy model. See more details in the [Regarding QKeras](qkeras.md) section.
 ```
+
+```{warning}
+Experimental: Nested layer structures are now supported by `to_keras_model` in v0.2.0. If you pass a model with nested layers, the function will flatten the model. However, be aware that some information in the inner models (e.g., `parallelization_factor`) may be lost during the conversion.
+```
diff --git a/docs/tips.md b/docs/tips.md
index fd380ff..4770258 100644
--- a/docs/tips.md
+++ b/docs/tips.md
@@ -8,17 +8,19 @@ The BOPs generated by this framework can be used as a good estimator for the on-c
 - `reuse_factor` is set to 1.
 - `parallel_factor` is set to match the number of convolution kernel application count (Everything done in parallel).
 
-If `io_parallel` is used, resource consumption in terms of LUTs, can be estimated by: $$\#\mathrm{LUTs}\sim\#\mathrm{BOPs}$$
+If `io_parallel` is used, resource consumption can be estimated from a linear combination of LUTs and DSPs: $$\mathrm{LUTs}+55\cdot\mathrm{DSPs}\sim\mathrm{BOPs}$$
+
+The factor of 55 in front of DSPs is only a rough value, but the resulting order-of-magnitude estimate is still useful.
 
 If `io_stream` is used, you will need to add resources used for FIFOs, which cannot be directly estimated from BOPs and depend on the specific implementation (i.e., ShiftRegister vs. BRAM).
 
-## Regarding `#pragma HLS DATAFLOW`
+## Regarding `#pragma HLS DATAFLOW` in `vivado/vitis`
 
 If you are using `io_parallel` AND meet the above conditions AND have a convolution layer in your network, you may get much larger resource consumption than expected, together with terrible latency. In this case, please try changing the `#pragma HLS DATAFLOW` to `#pragma HLS PIPELINE`, or simply removing it, and re-synthesize the code.
 
-## Regarding `#pragma HLS INLINE RECURSIVE`
+## Regarding `#pragma HLS INLINE RECURSIVE` in `vivado`
 
-If you are using `io_parallel` with `latency` strategy, you may try adding `#pragma HLS INLINE RECURSIVE` to your top function. This **may** reduce the resource consumption for some networks. In many cases, resource consumption can be reduced by $\sim10$%, and latency may or may not be improved.
+If you are using `io_parallel` with the `latency` strategy in `vivado_hls`, you may try adding `#pragma HLS INLINE RECURSIVE` to your top function. This **may** reduce the resource consumption for some networks. In many cases, resource consumption can be reduced by $\sim10$%, and latency may or may not be improved.
 
 ## When using intra-layer heterogeneous quantization
 
@@ -28,6 +30,6 @@ For intra-layer heterogeneous activation quantization, if you are using `io_para
 
 ## When using only inter-layer heterogeneous quantization
 
-It is **recommended** one should use inter-layer heterogeneous quantization **if and only if** the model is planned to be deployed with the `resource` strategy in `hls4ml`. This is equivalent to optimizing bitwidths with approximated gradient, and the obtained resource may be better or worse than the `AutoQKeras` counterpart.
+It is **recommended** to disable intra-layer heterogeneous weight quantization **if and only if** the model is planned to be deployed with the `resource` strategy in `hls4ml`. When intra-layer heterogeneous quantization is disabled, this is equivalent to optimizing bitwidths with approximated gradients, and the resulting resource usage may be better or worse than the `AutoQKeras` counterpart.
 
 When doing this, it is **strongly recommended** to use only `L1` and/or `L2` regularization on weights and activations (i.e., set `beta=0`), as the training-time BOPs estimate is **not accurate at all and not relevant**.
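
For quick sanity checks, the two rules of thumb documented above (the unary-LUT table-size condition and the BOPs-based resource estimate for `io_parallel`) can be written out as plain arithmetic. The sketch below is illustrative only: it does not call the HGQ or hls4ml APIs, and the default value used for `unary_lut_max_table_size` is an assumed placeholder rather than the library default.

```python
# Back-of-the-envelope helpers for the rules quoted above. Illustrative only:
# no HGQ/hls4ml APIs are used, and the table-size limit default is a placeholder.

def unary_lut_table_size(input_bitwidth: int) -> int:
    """Table size required by a general unary LUT: 2 ** bw_in."""
    return 2 ** input_bitwidth


def fits_unary_lut(input_bitwidth: int,
                   is_relu: bool = False,
                   unary_lut_max_table_size: int = 1024) -> bool:
    """Whether an HActivation would be converted to a unary LUT.

    Mirrors the note above: the table must fit within the limit and the
    function must not be `relu`. The 1024 default is an assumption; use the
    value you actually pass to `to_proxy_model`.
    """
    return (not is_relu) and unary_lut_table_size(input_bitwidth) <= unary_lut_max_table_size


def estimate_luts(bops: float, dsps: int = 0) -> float:
    """Order-of-magnitude LUT estimate from LUTs + 55 * DSPs ~ BOPs (io_parallel only)."""
    return max(bops - 55 * dsps, 0.0)


if __name__ == "__main__":
    print(fits_unary_lut(8))    # 256-entry table -> True under the assumed limit
    print(fits_unary_lut(12))   # 4096-entry table -> False under the assumed limit
    print(f"{estimate_luts(2.0e5, dsps=100):.0f} LUTs (order of magnitude)")
```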