Using activated not Main branch cause server error #203

mtroshyn · 2025-01-29T11:11:41Z

Plugin Version

0.5.2

NetBox Version

4.2.2

Python Version

3.12.3

Steps to Reproduce

Hi.
We see strange behaviour when using the NetBox Branching plugin, and I would appreciate any advice or help.

Our environment

We use the netbox-docker setup. It is deployed in a Docker Swarm cluster with 2 replicas of the NetBox UI container. Connections to the PostgreSQL v16.3 database and to KeyDB v6.3.3 (a Redis alternative used for caching) are proxied through 2 separate HAProxy services, also deployed in Docker Swarm. This setup has been stable for a long time.

Expected Behavior

I expect that working on a new branch created in the NetBox Branching plugin will not return a server error like the one described here. I am willing to provide more information and run tests if needed. I hope for help from the community.

Observed Behavior

Problem description

We installed the NetBox Branching plugin according to the instructions. After creating a new branch, activating it and adding test changes in it, the NetBox UI returns a server error after several minutes. Recreating the Docker container fixes it, but after several minutes the UI returns the server error again. Refreshing the page several times can help avoid the server error, but it comes back after a few minutes. Deactivating the new branch (that is, switching back to the main branch) seems to fix the error, because I can't reproduce it on the main branch.
I have enabled debug mode and can provide some information from the debug output.

The error looks like the one in the screenshot; the exception type, value and location are the same on any URI of the NetBox UI.

Here is the traceback log (copy-paste variant)
traceback.log

I also want to leave here a screenshot of a traceback with local vars that caught my eye (I highlighted the relevant part).

Could it be related to the described problem, and if yes, how can it be fixed?

Here is netbox.settings.txt
netbox.settings.txt

I have tried changing these Django settings from their default values to the following

 CONN_MAX_AGE: "60"
 CONN_HEALTH_CHECKS: "'True'"
 DJANGO_ALLOW_ASYNC_UNSAFE: "'True'"
 DISABLE_SERVER_SIDE_CURSORS: "'True'"

but it didn't help.
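
For reference, a minimal sketch of what these options mean at the Django level, assuming netbox-docker maps the values above into the database settings dict. The connection values (name, user, host, port) are placeholders and not taken from our setup. Note that DJANGO_ALLOW_ASYNC_UNSAFE is an environment variable read by Django, not a database option.

```python
import os

# Hypothetical Django-style database settings; connection values are placeholders.
DATABASE = {
    "NAME": "netbox",
    "USER": "netbox",
    "PASSWORD": "change-me",
    "HOST": "db-proxy.example",  # placeholder for the proxied database endpoint
    "PORT": 5432,
    # Keep a connection open for reuse for up to 60 seconds
    # (0 closes it at the end of each request, None keeps it indefinitely).
    "CONN_MAX_AGE": 60,
    # Health-check a persistent connection before it is reused in a request (Django 4.1+).
    "CONN_HEALTH_CHECKS": True,
    # Disable server-side cursors for QuerySet.iterator().
    "DISABLE_SERVER_SIDE_CURSORS": True,
}

# Unlike the options above, this is an environment variable that Django reads.
os.environ["DJANGO_ALLOW_ASYNC_UNSAFE"] = "true"
```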


mtroshyn · 2025-02-06T12:34:25Z

I have an update on the issue: I found the cause. The described error occurs when HAProxy sits between the Django application and the PostgreSQL database, with non-zero values for the

timeout client 
timeout server

settings in haproxy config.
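
To illustrate the mechanism, here is a minimal sketch using plain sockets, with no NetBox, PostgreSQL or HAProxy involved. A small echo server drops a connection that stays idle longer than `IDLE_TIMEOUT`, which stands in for the HAProxy client/server timeouts, and a client that keeps the connection open and reuses it later finds it closed. All names in the snippet are made up for the illustration.

```python
import socket
import threading
import time

IDLE_TIMEOUT = 0.2  # stands in for HAProxy's `timeout client` / `timeout server`


def idle_closing_server(listener):
    """Echo data back, but drop the connection once it has been idle too long."""
    conn, _ = listener.accept()
    conn.settimeout(IDLE_TIMEOUT)
    try:
        while True:
            data = conn.recv(1024)
            if not data:
                break
            conn.sendall(data)
    except socket.timeout:
        pass  # idle for too long: fall through and close, like the proxy does
    finally:
        conn.close()


def run_demo():
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=idle_closing_server, args=(listener,), daemon=True).start()

    client = socket.create_connection(listener.getsockname())
    client.settimeout(2)
    client.sendall(b"ping")
    first = client.recv(1024)  # fresh connection: the echo comes back

    time.sleep(IDLE_TIMEOUT * 3)  # stay idle longer than the timeout

    try:
        client.sendall(b"ping")
        second = client.recv(1024)  # b"" means the peer has closed the connection
    except OSError:
        second = b""  # a reset or broken pipe also means the connection is gone
    client.close()
    listener.close()
    return first, second


if __name__ == "__main__":
    print(run_demo())
```

The client has no way to know that the connection was dropped until it tries to use it again, which matches what we see when the application holds on to a database connection across an idle period.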

The error can be reproduced on a fresh netbox-docker setup with the NetBox Branching plugin. Steps to reproduce:

Create a netbox directory and place the test configs in it as in the example below

./netbox
├── .env
├── Dockerfile.netbox
├── docker-compose.yaml
├── haproxy.cfg

Dockerfile.netbox

FROM netboxcommunity/netbox:v4.2.2

# install git; it is required to install some pip packages
RUN apt update --allow-insecure-repositories && \
    apt install -y git && \
    apt clean && \
    rm -rf /var/lib/apt/lists/*

# copy extra configuration for netbox branching plugin
RUN cat <<EOF > /opt/netbox/netbox/netbox/local_settings.py
from netbox_branching.utilities import DynamicSchemaDict
from .configuration import DATABASE

# Wrap DATABASES with DynamicSchemaDict for dynamic schema support
DATABASES = DynamicSchemaDict({
    'default': DATABASE,
})

# Employ our custom database router
DATABASE_ROUTERS = [
    'netbox_branching.database.BranchAwareRouter',
]
EOF

# add netbox branching plugin to plugins config
RUN cat <<EOF > /etc/netbox/config/plugins.py
PLUGINS = [
  'netbox_branching'
]
EOF

# create plugin requirements file
RUN cat <<EOF > /opt/netbox/plugin_requirements.txt
netboxlabs-netbox-branching==0.5.2
EOF

# install plugin requirements
RUN /opt/netbox/venv/bin/pip install  --no-warn-script-location -r /opt/netbox/plugin_requirements.txt

# needed for plugins to work. This SECRET_KEY is only used during the installation. There's no need to change it.
RUN SECRET_KEY="dummydummydummydummydummydummydummydummydummydummy" /opt/netbox/venv/bin/python /opt/netbox/netbox/manage.py collectstatic --no-input

haproxy.cfg

global
    log stdout local0 debug
    maxconn 4096

defaults
    log global
    mode tcp
    retries 2
    timeout connect 60s
    timeout client 60s
    timeout server 60s
    timeout check 5s

### for updating DNS records in backends ###
resolvers ahresolvers
    nameserver dns1 1.1.1.1:53
    nameserver dns2 8.8.8.8:53
    resolve_retries       3
    timeout retry         1s
    hold other           30s
    hold refused         30s
    hold nx              30s
    hold timeout         30s
    hold valid           10s

### stats UI and prometheus exporter settings ###
listen stats
    mode http
    bind *:7000
    http-request use-service prometheus-exporter if { path /metrics }
    stats enable
    stats hide-version
    stats realm Haproxy\ Statistics
    stats uri /stats
    stats auth stats:stats
    option dontlog-normal

listen postgres
    bind *:5433
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server postgres postgres:5432 maxconn 4096 check

.env file

NETBOX_SECRET_KEY=XbdI@Nc-QaH&@JwDbL3q-%k=^^)&6LNFuQ+d6*zY^T77QpnNv=
REDIS_PASSWORD=CnFu5krpvYyVfLEuAj7Z
DB_PASSWORD=LSZdu9tHbqEMzwrfrjvN
DB_NAME=postgres
DB_USER=postgres

docker-compose.yaml (change the keydb image architecture if you use an x86 host)

services:
  netbox: &netbox
    build:
      context: .
      dockerfile: ./Dockerfile.netbox
    networks:
      - nbsnet
    ports:
      - "127.0.0.1:8040:8080"
    environment:
      ALLOWED_HOSTS: '*'
      DB_NAME: ${DB_NAME}
      DB_USER: ${DB_USER}
      DB_PASSWORD: ${DB_PASSWORD}
      DB_HOST: haproxy
      DB_PORT: 5433
      REDIS_HOST: keydb
      REDIS_PORT: 6379
      REDIS_PASSWORD: ${REDIS_PASSWORD}
      REDIS_DATABASE: 1
      REDIS_CACHE_HOST: keydb
      REDIS_CACHE_PORT: 6379
      REDIS_CACHE_PASSWORD: ${REDIS_PASSWORD}
      REDIS_CACHE_DATABASE: 2
      SECRET_KEY: ${NETBOX_SECRET_KEY}
      DEBUG: 'True'
    user: 'unit'
    healthcheck:
      start_period: 600s
      timeout: 30s
      interval: 30s
      test: "curl -f http://localhost:8080/login/ || exit 1"

  netbox-worker:
    <<: *netbox
    depends_on:
      - netbox
    command:
      - /opt/netbox/venv/bin/python
      - /opt/netbox/netbox/manage.py
      - rqworker
    ports:
      - 8080
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        order: start-first
        delay: 20s
    healthcheck:
      start_period: 90s
      timeout: 3s
      interval: 15s
      test: "ps -aux | grep -v grep | grep -q rqworker || exit 1"

  netbox-housekeeping:
    <<: *netbox
    depends_on:
      - netbox
    command:
      - /opt/netbox/housekeeping.sh
    deploy:
      mode: replicated
      replicas: 1
      update_config:
        parallelism: 1
        order: start-first
        delay: 20s
    ports:
      - 8080
    healthcheck:
      start_period: 90s
      timeout: 3s
      interval: 15s
      test: "ps -aux | grep -v grep | grep -q housekeeping || exit 1"

  haproxy:
    image: docker.io/haproxy:2.9.8
    networks:
      - nbsnet
    expose:
      - 5433
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg

  postgres:
    image: docker.io/postgres:16.3
    networks:
      - nbsnet
    expose:
      - 5432
    volumes:
      - postgres-data:/var/lib/postgresql/data
    environment:
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_USER: ${DB_USER}
      POSTGRES_DB: ${DB_NAME}
    healthcheck:
      test:
        - CMD-SHELL
        - pg_isready -q -t 2 -d $$POSTGRES_DB -U $$POSTGRES_USER
      timeout: 30s
      interval: 10s
      retries: 5
      start_period: 20s

  keydb:
    image: docker.io/eqalpha/keydb:arm64_v6.3.3
    networks:
      - nbsnet
    expose:
      - 6379
    volumes:
      - keydb-data:/data
    command:
      - sh
      - -c
      - keydb-server --appendonly yes --requirepass $$KEYDB_PASSWORD
    environment:
      KEYDB_PASSWORD: ${REDIS_PASSWORD}
    healthcheck:
      test:
        - CMD-SHELL
        - '[ $$(keydb-cli --pass "$${KEYDB_PASSWORD}" ping) = ''PONG'' ]'
      timeout: 3s
      interval: 1s
      retries: 5
      start_period: 5s

networks:
  nbsnet:
volumes:
  postgres-data:
  keydb-data:

Run these commands in the netbox directory

docker compose build && docker compose up -d

Wait until the image is built, the other images are pulled, and all containers have started and become healthy.

Log in to the netbox container and create a user

# login into container shell
docker compose exec netbox bash

# run the command and fill in the requested data
python3 manage.py createsuperuser

Open http://localhost:8040 in a browser and log in.
Then go to http://localhost:8040/plugins/branching/branches/ and create a branch for testing. Make sure it is in the 'Ready' state; if the state is 'New', check that the netbox-worker container is up (if not, run docker compose up -d once more). After creating the test branch, wait 1 minute (do nothing in the NetBox UI), then try to switch from main to the test branch. You will get a server error with an exception like "the connection is closed". The wait is 1 minute because that is the timeout value in the HAProxy config.

If you change the HAProxy settings to

    timeout client 0
    timeout server 0

and redeploy the haproxy and netbox containers, the problem is fixed and the error no longer occurs. This can serve as a workaround, but HAProxy does not recommend zero (disabled) timeouts and prints warnings

haproxy-1  | [WARNING]  (1) : config : missing timeouts for proxy 'postgres'.
haproxy-1  |    | While not properly invalid, you will certainly encounter various problems
haproxy-1  |    | with such a configuration. To fix this, please ensure that all following
haproxy-1  |    | timeouts are set to a non-zero value: 'client', 'connect', 'server'.
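
The warning names three timeouts that HAProxy expects to be non-zero. A rough sketch of the same check in Python, assuming a simplified config in which every `timeout` line applies to one proxy (real HAProxy resolves them per section). `missing_timeouts` is a hypothetical helper for illustration, not an HAProxy tool.

```python
import re

# HAProxy accepts the suffixes us, ms, s, m, h, d; a bare number means milliseconds.
_TIMEOUT_RE = re.compile(r"^\s*timeout\s+(\S+)\s+(\d+)(us|ms|s|m|h|d)?\s*$")

# The three timeouts listed in the warning message.
REQUIRED = ("client", "connect", "server")


def missing_timeouts(config_text):
    """Return the required timeouts that are absent or set to zero.

    Simplification: the whole text is treated as a single section.
    """
    seen = {}
    for line in config_text.splitlines():
        line = line.split("#", 1)[0]  # drop comments
        match = _TIMEOUT_RE.match(line)
        if match:
            seen[match.group(1)] = int(match.group(2))
    return [name for name in REQUIRED if seen.get(name, 0) == 0]
```

With the original config (all three set to 60s) the result is an empty list; with `timeout client 0` and `timeout server 0` it reports `client` and `server`.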

What I have tried (while keeping non-zero client and server timeouts in HAProxy):

using option tcpka in the defaults section of the HAProxy config
changing the CONN_MAX_AGE Django setting to 0 (in netbox-docker it defaults to 300)
using CONN_MAX_AGE with zero and non-zero values together with CONN_HEALTH_CHECKS: True
using connection pooling with CONN_HEALTH_CHECKS: True and CONN_HEALTH_CHECKS: False

None of these tests gave the expected result; the error still occurred.
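
The idea behind the CONN_MAX_AGE attempts can be sketched as a comparison of two lifetimes: how long Django is willing to keep a connection for reuse, and how long the proxy lets it sit idle. The helper below is a hypothetical illustration, not part of Django, NetBox or the plugin. In my tests the error persisted even with CONN_MAX_AGE set to 0, so the connection used for the branch apparently does not follow this simple rule.

```python
def proxy_may_close_first(conn_max_age, proxy_idle_timeout):
    """Return True if the proxy can drop an idle connection that Django still keeps.

    conn_max_age: Django CONN_MAX_AGE in seconds
                  (0 = close at the end of each request, None = keep indefinitely).
    proxy_idle_timeout: proxy idle timeout in seconds (0 = timeout disabled).
    """
    if proxy_idle_timeout == 0:
        return False  # the proxy never closes idle connections
    if conn_max_age is None:
        return True  # Django never retires the connection, the proxy eventually will
    return conn_max_age > proxy_idle_timeout
```

For example, the netbox-docker default of 300 seconds combined with the 60 second HAProxy timeouts gives True, while disabling the HAProxy timeouts gives False.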

So, in conclusion: disabling the HAProxy client and server timeouts can serve as a workaround.
Maybe this helps someone save time when investigating such behaviour.
I would also like to get feedback from the developers on possible concerns with using this workaround.

mtroshyn added the "type: bug" label · 2025-01-29
