
Commit b50d38c: update docs
calad0i committed Feb 21, 2024 (1 parent: 3e8d084)

Showing 2 changed files with 30 additions and 5 deletions.
docs/reference.md (23 additions, 0 deletions)

Heterogeneous layers (`H-` prefix):
- `HAdd`: Element-wise addition.
- `HDenseBatchNorm`: `HDense` with fused batch normalization. No resource overhead when converting to hls4ml.
- `HConv*DBatchNorm`: `HConv*D` with fused batch normalization. No resource overhead when converting to hls4ml.
- (New in 0.2) `HActivation` with an **arbitrary unary function**. (See the note below.)

```{note}
`HActivation` will be converted to a general `unary LUT` in `to_proxy_model` when
- the required table size is smaller than or equal to `unary_lut_max_table_size`, and
- the corresponding function is not `relu`.
Here, the table size is $2^{bw_{in}}$, where $bw_{in}$ is the bitwidth of the input.
If these conditions are not met, already supported activations like `tanh` or `sigmoid` will be implemented in the traditional way, but the conversion of an arbitrary unary function will fail. Thus, when using arbitrary unary functions, make sure that the table size is small enough.
```
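
To illustrate the rule, here is a plain-Python sketch of the eligibility check; the helper function and the default table budget of 1024 are hypothetical, not part of HGQ's API:

```python
def uses_unary_lut(bw_in: int, fn_name: str, unary_lut_max_table_size: int = 1024) -> bool:
    table_size = 2 ** bw_in  # one LUT entry per representable input value
    return fn_name != 'relu' and table_size <= unary_lut_max_table_size

# An 8-bit input needs a 256-entry table: eligible for a unary LUT.
assert uses_unary_lut(8, 'gelu')
# A 16-bit input would need 65536 entries: falls back, and an arbitrary
# unary function would fail to convert.
assert not uses_unary_lut(16, 'gelu')
```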

```{note}
`H*BatchNorm` layers require both scaling and shifting parameters to be fused into the layer. Thus, when bias is set to `False`, shifting will not be available.
```

Passive layers (`P-` prefix):
- `PFlatten`: Flatten layer.
- `Signature`: Does nothing, but marks the input to the next layer as already quantized to the specified bitwidth.

```{note}
Average pooling layers are now bit-accurate, with the requirement that **all** individual pool sizes are powers of 2. This includes padded pools, which may have smaller sizes, if any.
```
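
To make this condition concrete, here is a small sketch with hypothetical helpers, assuming the common case of stride equal to pool size with padding applied on the right:

```python
import math

def pool_sizes_same_padding(length: int, pool: int) -> list[int]:
    # With 'same' padding, the last pool may be truncated at the edge.
    n_out = math.ceil(length / pool)
    return [pool] * (n_out - 1) + [length - (n_out - 1) * pool]

def is_bit_accurate(length: int, pool: int) -> bool:
    # Every individual pool size, including the truncated edge pool,
    # must be a power of 2.
    return all(s > 0 and s & (s - 1) == 0 for s in pool_sizes_same_padding(length, pool))

print(pool_sizes_same_padding(7, 2))  # [2, 2, 2, 1] -> all powers of 2: OK
print(is_bit_accurate(9, 3))          # False: pools of size 3
```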

```{warning}
As of hls4ml v0.9.1, padding in pooling layers is not supported with `io_stream`. If you are using `io_stream`, please make sure that padding is set to `valid`. To be precise, merely setting `padding='same'` is fine as long as no actual padding is performed; otherwise, the generated firmware will fail at an assertion.
```
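
For example, a pooling layer that is safe with `io_stream` (shown with the plain Keras layer; the HGQ `P-` pooling layers are assumed to mirror this signature):

```python
import keras

# 'valid' padding performs no implicit zero-padding, so it is safe with io_stream.
pool = keras.layers.AveragePooling2D(pool_size=(2, 2), padding='valid')
```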

## Commonly used functions

- `trace_minmax`: Trace the min/max values of the model against a dataset, print the computed per-layer `BOPs`, and return the accumulated `BOPs` of the model, as in the sketch below.
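
A minimal usage sketch; the import path and exact signature are assumptions based on HGQ's examples, and `model` and `x_calib` are placeholders:

```python
from HGQ import trace_minmax  # assumed import path

# Calibrates activation ranges on representative data and prints per-layer BOPs.
total_bops = trace_minmax(model, x_calib)
print(f'Accumulated BOPs: {total_bops}')
```
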
Though the proxy model is bit-accurate with hls4ml in general, exceptions exist:
```{tip}
The proxy model can also be used to convert a `QKeras` model to a bit-accurate hls4ml-ready proxy model. See more details in the [Regarding QKeras](qkeras.md) section.
```

```{warning}
Experimental: Nested layer structures are now supported by `to_keras_model` in v0.2.0. If you pass a model with nested layers, the function will flatten it. However, be aware that some information in the inner models (e.g., `parallelization_factor`) may be lost during the conversion.
```
docs/tips.md (7 additions, 5 deletions)

The BOPs generated by this framework can be used as a good estimator for the on-chip resource consumption, provided that:
- `reuse_factor` is set to 1.
- `parallel_factor` is set to match the number of times the convolution kernel is applied, i.e., everything is done in parallel (see the config sketch after this list).
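
A sketch of an hls4ml configuration matching these conditions; the config keys are standard hls4ml ones, but the layer name `conv1` and the factor value are hypothetical:

```python
import hls4ml

config = hls4ml.utils.config_from_keras_model(model, granularity='name')
config['Model']['ReuseFactor'] = 1
# One parallel unit per kernel application, e.g. for a 28x28 output feature map:
config['LayerName']['conv1']['ParallelizationFactor'] = 28 * 28
```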

If `io_parallel` is used, resource consumption can be estimated in terms of a linear combination of LUTs and DSPs: $$\mathrm{LUTs}+55\cdot\mathrm{DSPs}\sim\mathrm{BOPs}$$

The factor in front of DSPs is rough, but the final order-of-magnitude estimation is still useful.
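
For instance, with entirely hypothetical utilization numbers:

```python
bops = 250_000               # accumulated BOPs reported by trace_minmax (hypothetical)
luts, dsps = 180_000, 1_200  # post-synthesis utilization (hypothetical)
# The linear combination lands in the same order of magnitude as the BOPs:
print(luts + 55 * dsps)      # 246000
```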

If `io_stream` is used, you will need to add the resources used for FIFOs, which cannot be directly estimated from BOPs and depend on the specific implementation (e.g., shift registers vs. BRAM).

## Regarding `#pragma HLS DATAFLOW` in `vivado/vitis`

If you are using `io_parallel`, meet the above conditions, and have a convolution layer in your network, you may see much larger resource consumption than expected, together with terrible latency. In this case, try changing `#pragma HLS DATAFLOW` to `#pragma HLS PIPELINE`, or simply remove it, and re-synthesize the code.

## Regarding `#pragma HLS INLINE RECURSIVE` in `vivado`

If you are using `io_parallel` with the `latency` strategy in `vivado_hls`, you may try adding `#pragma HLS INLINE RECURSIVE` to your top function. This **may** reduce the resource consumption for some networks. In many cases, resource consumption can be reduced by $\sim10$%, and latency may or may not be improved.

## When using intra-layer heterogeneous quantization

For intra-layer heterogeneous activation quantization, if you are using `io_parallel` …

## When using only inter-layer heterogeneous quantization

It is **recommended** to disable intra-layer heterogeneous weight quantization **if and only if** the model is planned to be deployed with the `resource` strategy in `hls4ml`. When intra-layer heterogeneous quantization is disabled, this is equivalent to optimizing bitwidths with approximated gradients, and the obtained resource usage may be better or worse than the `AutoQKeras` counterpart.

When doing this, it is **strongly recommended** to use only `L1` and/or `L2` regularization on weights and activations (i.e., set `beta=0`), as the BOPs estimated at training time are **not accurate at all and not relevant**. A sketch of this setup follows.
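
A minimal sketch, where the `HGQ.layers` import path and the `beta` keyword are assumptions based on HGQ's examples, and the layer width and regularization strength are arbitrary:

```python
from keras.regularizers import L1
from HGQ.layers import HDense  # assumed import path

layer = HDense(
    64,
    beta=0,                       # disable the (inaccurate) training-time BOPs term
    kernel_regularizer=L1(1e-5),  # plain L1 regularization on weights instead
)
```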
