Debugging and reporting bugs in Contiv-VPP

Bug report structure

Describe deployment

Since Contiv-VPP can be deployed with different configurations, it is helpful to attach the configuration that was applied: either the values.yaml passed to the helm chart, or the corresponding part of the deployment YAML file.

  contiv.yaml: |-
    TCPstackDisabled: true
    UseTAPInterfaces: true
    TAPInterfaceVersion: 2
    NatExternalTraffic: true
    MTUSize: 1500
    IPAMConfig:
      PodSubnetCIDR: 10.1.0.0/16
      PodSubnetOneNodePrefixLen: 24
      PodVPPSubnetCIDR: 10.2.1.0/24
      VPPHostSubnetCIDR: 172.30.0.0/16
      VPPHostSubnetOneNodePrefixLen: 24
      NodeInterconnectCIDR: 192.168.16.0/24
      VxlanCIDR: 192.168.30.0/24
      NodeInterconnectDHCP: False

Information that might be helpful:

  • whether node IPs are statically assigned or DHCP is used
  • whether STN is enabled
  • version of TAP interfaces used
  • output of kubectl get pods -o wide --all-namespaces

Collecting the logs

The most important step when debugging and reporting an issue in Contiv-VPP is collecting the logs from the contiv-vpp vswitch containers.

a) Collecting vswitch logs using kubectl

To collect the logs from individual vswitches in the cluster, connect to the master node and find the pod names of the individual vswitch containers:

$ kubectl get pods --all-namespaces | grep vswitch
kube-system   contiv-vswitch-lqxfp               2/2       Running   0          1h
kube-system   contiv-vswitch-q6kwt               2/2       Running   0          1h

Then run the following command, with <pod name> replaced by the actual pod name:

$ kubectl logs <pod name> -n kube-system -c contiv-vswitch

You can redirect the output to a file to save the logs, for example:

kubectl logs contiv-vswitch-lqxfp -n kube-system -c contiv-vswitch > logs-master.txt
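
To save the logs of every vswitch pod in one pass, the two steps above can be combined. The following is a minimal sketch and not part of Contiv-VPP: the function names `vswitch_pods` and `collect_vswitch_logs` and the `logs-<pod name>.txt` file naming are made up for illustration, and it assumes that vswitch pod names start with `contiv-vswitch-`, as in the listing above.

```shell
#!/bin/sh
# Print the names of vswitch pods found in `kubectl get pods --all-namespaces`
# output read from stdin (column 2 is the pod name).
vswitch_pods() {
    awk '$2 ~ /^contiv-vswitch-/ { print $2 }'
}

# Save the logs of every vswitch pod into logs-<pod name>.txt (hypothetical
# naming) in directory $1, defaulting to the current directory.
collect_vswitch_logs() {
    outdir="${1:-.}"
    kubectl get pods --all-namespaces | vswitch_pods | while read -r pod; do
        kubectl logs "$pod" -n kube-system -c contiv-vswitch > "$outdir/logs-$pod.txt"
    done
}
```

Calling `collect_vswitch_logs /tmp/contiv-logs` on the master node would then write one log file per vswitch pod into an existing `/tmp/contiv-logs` directory.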

b) Collecting vswitch logs using docker

If option a) does not work for some reason, you can still collect the same logs using plain docker commands. To do so, connect to each individual node in the k8s cluster and find the container ID of the vswitch container:

$ docker ps | grep contivvpp/vswitch
b682b5837e52        contivvpp/vswitch                                        "/usr/bin/supervisor…"   2 hours ago         Up 2 hours                              k8s_contiv-vswitch_contiv-vswitch-q6kwt_kube-system_d09b6210-2903-11e8-b6c9-08002723b076_0

Now use the ID from the first column to dump the logs into the logs-master.txt file:

$ docker logs b682b5837e52 > logs-master.txt
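
The lookup and the dump can also be scripted on each node. The sketch below is illustrative only: the function names and the `logs-<container ID>.txt` naming are assumptions, and it matches containers by the image name in the second column of the `docker ps` output, as in the listing above. Because `docker logs` replays the container's stderr stream on its own stderr, the sketch redirects both streams into the file.

```shell
#!/bin/sh
# Print the container IDs (first column) of vswitch containers found in
# `docker ps` output read from stdin (column 2 is the image name).
vswitch_container_ids() {
    awk '$2 ~ /^contivvpp\/vswitch/ { print $1 }'
}

# Dump the logs of each local vswitch container into logs-<id>.txt
# (hypothetical naming) in directory $1, defaulting to the current directory.
collect_vswitch_docker_logs() {
    outdir="${1:-.}"
    docker ps | vswitch_container_ids | while read -r id; do
        # 2>&1 also captures what the container wrote to stderr
        docker logs "$id" > "$outdir/logs-$id.txt" 2>&1
    done
}
```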

Reviewing the vswitch logs

To debug an issue, a good starting point is to grep the logs for the level=error string, e.g.:

$ cat logs-master.txt | grep level=error

Some bugs may also cause VPP or contiv-agent to crash. To check whether a process crashed, grep for the string exit, e.g.:

$ cat logs-master.txt | grep exit
2018-03-20 06:03:45,948 INFO exited: vpp (terminated by SIGABRT (core dumped); not expected)
2018-03-20 06:03:48,948 WARN received SIGTERM indicating exit request
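
Both checks can be wrapped into small helpers for a quick first pass over a saved log file. This is a sketch under assumptions: the function names are hypothetical, and the process name is extracted assuming the `exited: <name> (...)` line format shown in the sample output above.

```shell
#!/bin/sh
# Print the number of lines logged at error level in log file $1.
count_errors() {
    # grep -c prints 0 but returns non-zero when nothing matches,
    # so the exit status is neutralized
    grep -c 'level=error' "$1" || true
}

# Print the unique names of processes reported as exited in log file $1,
# assuming lines of the form "... exited: <name> (...)".
exited_processes() {
    sed -n 's/.* exited: \([^ ]*\) (.*/\1/p' "$1" | sort -u
}
```

For example, `count_errors logs-master.txt` prints the number of error-level lines, and `exited_processes logs-master.txt` prints `vpp` for the sample output above.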

Collecting the STN daemon logs

In STN (Steal The NIC) deployment scenarios, it is often necessary to collect and review the logs from the STN daemon. This needs to be done on each node:

$ docker logs contiv-stn > logs-stn-master.txt

Collecting the logs in case of crash loop

If the vswitch is crashing in a loop (indicated by an increasing number in the RESTARTS column of the kubectl get pods --all-namespaces output), kubectl logs or docker logs return only the logs of the latest incarnation of the vswitch. These may not show the root cause of the very first crash. To debug that, disable the k8s health check probes so that the vswitch is not restarted after the first crash. This can be done by commenting out the readinessProbe and livenessProbe in the contiv-vpp deployment YAML:

diff --git a/k8s/contiv-vpp.yaml b/k8s/contiv-vpp.yaml
index 3676047..ffa4473 100644
--- a/k8s/contiv-vpp.yaml
+++ b/k8s/contiv-vpp.yaml
@@ -224,18 +224,18 @@ spec:
           ports:
             # readiness + liveness probe
             - containerPort: 9999
-          readinessProbe:
-            httpGet:
-              path: /readiness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 15
-          livenessProbe:
-            httpGet:
-              path: /liveness
-              port: 9999
-            periodSeconds: 1
-            initialDelaySeconds: 60
+ #         readinessProbe:
+ #           httpGet:
+ #             path: /readiness
+ #             port: 9999
+ #           periodSeconds: 1
+ #           initialDelaySeconds: 15
+ #         livenessProbe:
+ #           httpGet:
+ #             path: /liveness
+ #             port: 9999
+ #           periodSeconds: 1
+ #           initialDelaySeconds: 60
           env:
             - name: MICROSERVICE_LABEL
               valueFrom:

If VPP is the crashing process, please follow the CORE_FILES guide and provide the coredump file.
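
To spot a crash loop in the first place, the RESTARTS column mentioned above can be filtered with a short helper. This is a sketch only: the function name `restarting_pods` is hypothetical, and it assumes the column layout NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE produced by `kubectl get pods --all-namespaces`.

```shell
#!/bin/sh
# Print "<pod name> <restarts>" for every pod with a non-zero RESTARTS count,
# reading `kubectl get pods --all-namespaces` output from stdin.
# Assumes RESTARTS is the fifth column; the header line is skipped because
# its fifth field is not a number.
restarting_pods() {
    awk '$5 ~ /^[0-9]+$/ && $5 + 0 > 0 { print $2, $5 }'
}
```

Running `kubectl get pods --all-namespaces | restarting_pods` twice, a few minutes apart, shows whether the count of a vswitch pod keeps growing.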

Inspect VPP config

  • Configured interfaces (issues related to basic node/pod connectivity):
vpp# sh int addr
GigabitEthernet0/9/0 (up):
  192.168.16.1/24
local0 (dn):
loop0 (up):
  l2 bridge bd_id 1 bvi shg 0
  192.168.30.1/24
tapcli-0 (up):
  172.30.1.1/24
  • IP forwarding table:
vpp# sh ip fib
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto ] locks:[src:(nil):2, src:adjacency:3, src:default-route:1, ]
0.0.0.0/0
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:1 buckets:1 uRPF:0 to:[7:552]]
    [0] [@0]: dpo-drop ip4
0.0.0.0/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:2 buckets:1 uRPF:1 to:[0:0]]
    [0] [@0]: dpo-drop ip4

... 
...

255.255.255.255/32
  unicast-ip4-chain
  [@0]: dpo-load-balance: [proto:ip4 index:5 buckets:1 uRPF:4 to:[0:0]]
    [0] [@0]: dpo-drop ip4
  • ARP Table
vpp# sh ip arp
    Time           IP4       Flags      Ethernet              Interface       
    728.6616  192.168.16.2     D    08:00:27:9c:0e:9f GigabitEthernet0/8/0
    542.7045  192.168.30.2     S    1a:2b:3c:4d:5e:02 loop0
      1.4241   172.30.1.2      D    86:41:d5:92:fd:24 tapcli-0
     15.2485    10.1.1.2      SN    00:00:00:00:00:02 tapcli-1
    739.2339    10.1.1.3      SN    00:00:00:00:00:02 tapcli-2
    739.4119    10.1.1.4      SN    00:00:00:00:00:02 tapcli-3
  • NAT configuration (issues related to services):
DBGvpp# sh nat44 addresses
NAT44 pool addresses:
192.168.16.10
  tenant VRF independent
  0 busy udp ports
  0 busy tcp ports
  0 busy icmp ports
NAT44 twice-nat pool addresses:
vpp# sh nat44 static mappings 
NAT44 static mappings:
 tcp local 192.168.42.1:6443 external 10.96.0.1:443 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.2:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.42.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 192.168.16.1:32379 vrf 0  out2in-only
 tcp local 192.168.42.1:12379 external 10.109.143.39:12379 vrf 0  out2in-only
 udp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
 tcp local 10.1.2.2:53 external 10.96.0.10:53 vrf 0  out2in-only
vpp# sh nat44 interfaces
NAT44 interfaces:
 loop0 in out
 GigabitEthernet0/9/0 out
 tapcli-0 in out
vpp# sh nat44 sessions
NAT44 sessions:
  192.168.20.2: 0 dynamic translations, 3 static translations
  10.1.1.3: 0 dynamic translations, 0 static translations
  10.1.1.4: 0 dynamic translations, 0 static translations
  10.1.1.2: 0 dynamic translations, 6 static translations
  10.1.2.18: 0 dynamic translations, 2 static translations
  • ACL config (issues related to policies):
vpp# sh acl-plugin acl
  • "Steal the NIC (STN)" config (issues related to host connectivity when STN is active):
vpp# sh stn rules 
- rule_index: 0
  address: 10.1.10.47
  iface: tapcli-0 (2)
  next_node: tapcli-0-output (410)
  • Errors:
vpp# sh errors
  • Vxlan tunnels:
vpp# sh vxlan tunnels
  • Hardware interface information:
vpp# sh hardware-interfaces
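
The commands listed above can be captured in one pass instead of being typed at the VPP prompt one by one. The sketch below rests on assumptions that this guide does not confirm: that the `vppctl` binary is available inside the contiv-vswitch container and that it can be reached with `kubectl exec`. The function name `dump_vpp_state` is hypothetical.

```shell
#!/bin/sh
# Run each VPP CLI command from the list above in the vswitch container of
# pod $1 and print every output under a header line.
# Assumption: `vppctl` is available inside the contiv-vswitch container.
dump_vpp_state() {
    pod="$1"
    for cmd in "sh int addr" "sh ip fib" "sh ip arp" "sh nat44 addresses" \
               "sh nat44 static mappings" "sh nat44 interfaces" \
               "sh nat44 sessions" "sh acl-plugin acl" "sh stn rules" \
               "sh errors" "sh vxlan tunnels" "sh hardware-interfaces"; do
        echo "=== vpp# $cmd ==="
        # $cmd is intentionally unquoted so that each word becomes
        # a separate vppctl argument
        kubectl exec "$pod" -n kube-system -c contiv-vswitch -- vppctl $cmd
    done
}
```

For example, `dump_vpp_state contiv-vswitch-lqxfp > vpp-state-master.txt` would store the whole dump in a single file that can be attached to the bug report.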

Basic Example

contiv-vpp-bug-report.sh is an example script that may be useful as a starting point for gathering the above information using kubectl.

Limitations:

  • the script does not include STN daemon logs nor does it handle the special case of a crash loop

Prerequisites:

  • The user specified in the script must have passwordless access to all nodes in the cluster; on each node in the cluster the user must have passwordless access to sudo.

Setting up prerequisites

To enable logging into a node without a password, copy your public key to the node:

ssh-copy-id <user-id>@<node-name-or-ip-address>

To enable running sudo without a password for a given user, do:

$ sudo visudo

Append the following entry to allow the given user to run all commands without a password:

<userid> ALL=(ALL) NOPASSWD:ALL

You can also add user <user-id> to group sudo and edit the sudo entry as follows:

# Allow members of group sudo to execute any command
%sudo	ALL=(ALL:ALL) NOPASSWD:ALL

Add user <user-id> to group <group-id> as follows:

sudo adduser <user-id> <group-id>

or as follows:

sudo usermod -a -G <group-id> <user-id>
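
Before running the script, both prerequisites can be verified for each node. The helper below is a sketch with a hypothetical name, `check_node_access`. It relies on the ssh option `BatchMode=yes`, which makes ssh fail instead of prompting for a password, and on `sudo -n`, which makes sudo fail instead of prompting.

```shell
#!/bin/sh
# Check that user $1 can log into node $2 without a password and can run
# sudo there without a password. Returns 0 on success, 1 if the login
# fails and 2 if sudo fails.
check_node_access() {
    user="$1"
    node="$2"
    # BatchMode=yes makes ssh fail instead of prompting for a password
    if ! ssh -o BatchMode=yes "$user@$node" true; then
        echo "$node: passwordless ssh login failed for $user" >&2
        return 1
    fi
    # sudo -n fails instead of prompting when a password would be required
    if ! ssh -o BatchMode=yes "$user@$node" sudo -n true; then
        echo "$node: passwordless sudo failed for $user" >&2
        return 2
    fi
    echo "$node: ok"
}
```

For example, `check_node_access <user-id> <node-name-or-ip-address>` prints `<node-name-or-ip-address>: ok` when both prerequisites are met.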

Working with the Contiv-vpp Vagrant test bed

The script can be used to collect data from the Contiv-vpp test bed created with Vagrant. To collect debug information from this Contiv-vpp test bed, perform the following steps:

  • In the directory where you created your vagrant test bed, do:
  vagrant ssh-config > vagrant-ssh.conf
  • To collect the debug information do:
  ./contiv-vpp-bug-report.sh -u vagrant -m k8s-master -f <path-to-your-vagrant-ssh-config-file>/vagrant-ssh.conf
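
Before invoking the script, it can help to confirm that the master name passed with -m is actually defined in the generated ssh configuration. The helpers below are a sketch with hypothetical names. They assume the usual ssh config layout, in which each node starts with a `Host <alias>` line.

```shell
#!/bin/sh
# Print the host aliases defined in ssh config file $1, such as the one
# written by `vagrant ssh-config`, one per line.
ssh_config_hosts() {
    awk '$1 == "Host" { print $2 }' "$1"
}

# Return 0 if host alias $2 is defined in ssh config file $1.
ssh_config_has_host() {
    ssh_config_hosts "$1" | grep -qxF -- "$2"
}
```

For example, `ssh_config_has_host vagrant-ssh.conf k8s-master || echo "k8s-master not found"` reports a missing entry before any data collection starts.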