diff --git a/docs/reference.md b/docs/reference.md
index a87c436..94d0a46 100644
--- a/docs/reference.md
+++ b/docs/reference.md
@@ -50,6 +50,17 @@ Heterogeneous layers (`H-` prefix):
 - `HAdd`: Element-wise addition.
 - `HDenseBatchNorm`: `HDense` with fused batch normalization. No resource overhead when converting to hls4ml.
 - `HConv*DBatchNorm`: `HConv*D` with fused batch normalization. No resource overhead when converting to hls4ml.
+- (New in 0.2) `HActivation` with an **arbitrary unary function**. (See the note below.)
+
+```{note}
+`HActivation` will be converted to a general `unary LUT` in `to_proxy_model` when
+ - the required table size is smaller than or equal to `unary_lut_max_table_size`, and
+ - the corresponding function is not `relu`.
+
+Here, the table size is determined by $2^{bw_{in}}$, where $bw_{in}$ is the bitwidth of the input.
+
+If the table-size condition is not met, already supported activations like `tanh` or `sigmoid` will fall back to the traditional implementation. However, if an arbitrary unary function is used, the conversion will fail. Thus, when using arbitrary unary functions, make sure that the table size is small enough.
+```
 
 ```{note}
 `H*BatchNorm` layers require both scaling and shifting parameters to be fused into the layer. Thus, when bias is set to `False`, shifting will not be available.
@@ -64,6 +75,14 @@ Passive layers (`P-` prefix):
 - `PFlatten`: Flatten layer.
 - `Signature`: Does nothing, but marks the input to the next layer as already quantized to specified bitwidth.
 
+```{note}
+Average pooling layers are now bit-accurate, with the requirement that **every** individual pool size is a power of 2. This includes all padded pools, which may have smaller sizes, if any.
+```
+
+```{warning}
+As of hls4ml v0.9.1, padding in pooling layers with `io_stream` is not supported. If you are using `io_stream`, please make sure that padding is set to `valid`. To be precise, merely setting `padding='same'` is fine as long as no actual padding is performed; otherwise, the generated firmware will fail at an assertion.
+```
+
 ## Commonly used functions
 
 - `trace_minmax`: Trace the min/max values of the model against a dataset, print computed `BOPs` per-layer, and return the accumulated `BOPs` of the model.
@@ -96,3 +115,7 @@ Though the proxy model is bit-accurate with hls4ml in general, exceptions exist:
 ```{tip}
 The proxy model can also be used to convert a `QKeras` model to a bit-accurate hls4ml-ready proxy model. See more details in the [Regarding QKeras](qkeras.md) section.
 ```
+
+```{warning}
+Experimental: Nested layer structures are now supported by `to_keras_model` in v0.2.0. If you pass a model with nested layers, the function will flatten the model. However, be aware that some information in the inner models (e.g., `parallelization_factor`) may be lost during the conversion.
+```
diff --git a/docs/tips.md b/docs/tips.md
index fd380ff..4770258 100644
--- a/docs/tips.md
+++ b/docs/tips.md
@@ -8,17 +8,19 @@ The BOPs generated by this framework can be used as a good estimator for the on-c
 - `reuse_factor` is set to 1.
 - `parallel_factor` is set to match the number of convolution kernel application count (Everything done in parallel).
 
-If `io_parallel` is used, resource consumption in terms of LUTs, can be estimated by: $$\#\mathrm{LUTs}\sim\#\mathrm{BOPs}$$
+If `io_parallel` is used, resource consumption can be estimated from a linear combination of LUTs and DSPs: $$\mathrm{LUTs}+55\cdot\mathrm{DSPs}\sim\mathrm{BOPs}$$
+
+The factor of 55 in front of DSPs is only a rough value, but the resulting order-of-magnitude estimate is still useful.
 
 If `io_stream` is used, you will need to add resources used for FIFOs, which cannot be directly estimated from BOPs and depend on the specific implementation (i.e., ShiftRegister vs. BRAM).
 
-## Regarding `#pragma HLS DATAFLOW`
+## Regarding `#pragma HLS DATAFLOW` in `vivado/vitis`
 
 If you are using `io_parallel` AND meet the above conditions AND have a convolution layer in your network, you may get much larger resource consumption than expected, together with terrible latency. In this case, please try changing the `#pragma HLS DATAFLOW` to `#pragma HLS PIPELINE`, or simply removing it, and re-synthesize the code.
 
-## Regarding `#pragma HLS INLINE RECURSIVE`
+## Regarding `#pragma HLS INLINE RECURSIVE` in `vivado`
 
-If you are using `io_parallel` with `latency` strategy, you may try adding `#pragma HLS INLINE RECURSIVE` to your top function. This **may** reduce the resource consumption for some networks. In many cases, resource consumption can be reduced by $\sim10$%, and latency may or may not be improved.
+If you are using `io_parallel` with the `latency` strategy in `vivado_hls`, you may try adding `#pragma HLS INLINE RECURSIVE` to your top function. This **may** reduce the resource consumption for some networks. In many cases, resource consumption can be reduced by $\sim10$%, and latency may or may not be improved.
 
 ## When using intra-layer heterogeneous quantization
 
@@ -28,6 +30,6 @@ For intra-layer heterogeneous activation quantization, if you are using `io_para
 
 ## When using only inter-layer heterogeneous quantization
 
-It is **recommended** one should use inter-layer heterogeneous quantization **if and only if** the model is planned to be deployed with the `resource` strategy in `hls4ml`. This is equivalent to optimizing bitwidths with approximated gradient, and the obtained resource may be better or worse than the `AutoQKeras` counterpart.
+It is **recommended** to disable intra-layer heterogeneous weight quantization **if and only if** the model is planned to be deployed with the `resource` strategy in `hls4ml`. When intra-layer heterogeneous quantization is disabled, this is equivalent to optimizing bitwidths with approximated gradients, and the resulting resource usage may be better or worse than the `AutoQKeras` counterpart.
 
 When doing this, it is **strongly recommended** to use only `L1` and/or `L2` regularization on weights and activations (i.e., set `beta=0`), as the training-time BOPs estimate is **not accurate at all and not relevant**.
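
For quick sanity checks, the two rules of thumb documented above (the unary-LUT table-size condition and the BOPs-based resource estimate for `io_parallel`) can be written out as plain arithmetic. The sketch below is illustrative only: it does not call the HGQ or hls4ml APIs, and the default value used for `unary_lut_max_table_size` is an assumed placeholder rather than the library default.

```python
# Back-of-the-envelope helpers for the rules quoted above. Illustrative only:
# no HGQ/hls4ml APIs are used, and the table-size limit default is a placeholder.

def unary_lut_table_size(input_bitwidth: int) -> int:
    """Table size required by a general unary LUT: 2 ** bw_in."""
    return 2 ** input_bitwidth


def fits_unary_lut(input_bitwidth: int,
                   is_relu: bool = False,
                   unary_lut_max_table_size: int = 1024) -> bool:
    """Whether an HActivation would be converted to a unary LUT.

    Mirrors the note above: the table must fit within the limit and the
    function must not be `relu`. The 1024 default is an assumption; use the
    value you actually pass to `to_proxy_model`.
    """
    return (not is_relu) and unary_lut_table_size(input_bitwidth) <= unary_lut_max_table_size


def estimate_luts(bops: float, dsps: int = 0) -> float:
    """Order-of-magnitude LUT estimate from LUTs + 55 * DSPs ~ BOPs (io_parallel only)."""
    return max(bops - 55 * dsps, 0.0)


if __name__ == "__main__":
    print(fits_unary_lut(8))    # 256-entry table -> True under the assumed limit
    print(fits_unary_lut(12))   # 4096-entry table -> False under the assumed limit
    print(f"{estimate_luts(2.0e5, dsps=100):.0f} LUTs (order of magnitude)")
```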