RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

fenugrec · 2022-11-16T22:16:47Z

Scenario :

device is connected and communicating on bus
disconnect bus; this could leave some frames pending transmission in the software fifo and hardware mailboxes.
if anything but a hard reset is done, the remaining frames in the fifo and mboxes will be sent on the next bus that we connect to. That is, even ip link set ... down , or a usb-reset, will not clear mboxes etc. May be an undesirable behaviour.

It is not as simple as just modifying can_disable() since this func is also called on USB suspend . IMO a USB suspend+resume cycle should not empty buffers; current behaviour seems ok.

Comments, ideas, counter-examples ?

The text was updated successfully, but these errors were encountered:

marckleinebudde · 2022-11-16T22:33:09Z

A ip link set $IFACE down should shut down the CAN controller and leave the bus. A ip link set $IFACE up should reset the CAN controller, program it's registers and join the bus.

During down I would reset the TX FIFO and discard all frames that are queued for CAN TX. Probably the same for frames queued for going to the host via USB.

Shutting down properly down during high bus load is an interesting task...

marckleinebudde · 2022-11-16T22:41:32Z

https://github.com/torvalds/linux/blob/master/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c#L1648

For a Linux driver the (important) code during ifdown is:

static int mcp251xfd_stop(struct net_device *ndev)
{
	struct mcp251xfd_priv *priv = netdev_priv(ndev);

	netif_stop_queue(ndev);                          // Stops the TX queue from the kernel into the driver
	mcp251xfd_chip_interrupts_disable(priv);
	free_irq(ndev->irq, priv);
	mcp251xfd_chip_stop(priv, CAN_STATE_STOPPED);    // Shuts down CAN controller, as much as possible, reset
	mcp251xfd_transceiver_disable(priv);

	pm_runtime_put(ndev->dev.parent);                // power CAN controller down

	return 0;
}

fenugrec · 2022-11-16T22:47:13Z

Shutting down properly down during high bus load is an interesting task...

Agreed.
I think on the CAN side it's fairly "straightforward" on F0 :

make sure USB can't pump more frames into firmware fifo (can't just set EP to NAK since that would kill everything on multichannel)
disable CAN RX
set ABRQ bit on all mboxes (delay should be deterministic, a function of bitrate ; worst case is a frame that just started transmitting. bxCAN will complete the transmission then empty the mbx)
empty firmware fifos
option 1: poll all mboxes with a maximum timeout (either fixed, or determined by bitrate ) - not a huge fan, could disrupt USB comms ?
option 2: reset bxCAN peripheral, accept possible corrupt frames on the bus.

marckleinebudde · 2022-11-16T22:56:02Z

looks good - usually the CAN controllers have different modes, e.g. config mode where you can configure everything and they are not on the bus and a normal mode, where they CAN RX and TX on the bus. joining and leaving the bus takes some time, the controller takes care not to leave the bus during an ongoing TX, you usually have to poll some bit.

ryedwards · 2022-11-18T20:33:35Z

An example from the HAL...not sure if even they are doing it correctly:

/**
  * @brief  Stop the FDCAN module and enable access to configuration registers.
  * @param  hfdcan pointer to an FDCAN_HandleTypeDef structure that contains
  *         the configuration information for the specified FDCAN.
  * @retval HAL status
  */
HAL_StatusTypeDef HAL_FDCAN_Stop(FDCAN_HandleTypeDef *hfdcan)
{
  uint32_t Counter = 0U;

  if (hfdcan->State == HAL_FDCAN_STATE_BUSY)
  {
    /* Request initialisation */
    SET_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_INIT);

    /* Wait until the INIT bit into CCCR register is set */
    while ((hfdcan->Instance->CCCR & FDCAN_CCCR_INIT) == 0U)
    {
      /* Check for the Timeout */
      if (Counter > FDCAN_TIMEOUT_VALUE)
      {
        /* Update error code */
        hfdcan->ErrorCode |= HAL_FDCAN_ERROR_TIMEOUT;

        /* Change FDCAN state */
        hfdcan->State = HAL_FDCAN_STATE_ERROR;

        return HAL_ERROR;
      }

      /* Increment counter */
      Counter++;
    }

    /* Reset counter */
    Counter = 0U;

    /* Exit from Sleep mode */
    CLEAR_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_CSR);

    /* Wait until FDCAN exits sleep mode */
    while ((hfdcan->Instance->CCCR & FDCAN_CCCR_CSA) == FDCAN_CCCR_CSA)
    {
      /* Check for the Timeout */
      if (Counter > FDCAN_TIMEOUT_VALUE)
      {
        /* Update error code */
        hfdcan->ErrorCode |= HAL_FDCAN_ERROR_TIMEOUT;

        /* Change FDCAN state */
        hfdcan->State = HAL_FDCAN_STATE_ERROR;

        return HAL_ERROR;
      }

      /* Increment counter */
      Counter++;
    }

    /* Enable configuration change */
    SET_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_CCE);

    /* Reset Latest Tx FIFO/Queue Request Buffer Index */
    hfdcan->LatestTxFifoQRequest = 0U;

    /* Change FDCAN peripheral state */
    hfdcan->State = HAL_FDCAN_STATE_READY;

    /* Return function status */
    return HAL_OK;
  }
  else
  {
    /* Update error code */
    hfdcan->ErrorCode |= HAL_FDCAN_ERROR_NOT_STARTED;

    return HAL_ERROR;
  }
}

marckleinebudde · 2023-07-06T11:38:56Z

FYI: on the RX-side, we have the same problem and it even crashes the Linux driver if HW Timestamps are enabled.

On the Linux side we first shoot down all pending bulk transfers and then send the reset command to the candlelight. If the CAN bus is busy, this will lead to RX'ed CAN frames in the "to host" queue. All queued RX'ed frames will be send to the host as soon as the driver submits the new bulk urbs.

I'll fix the crash on the Linux side. A proper reset for the RX (and TX) queues should be implemented in the FW, too.

commit 5886e4d upstream. If the gs_usb device driver is unloaded (or unbound) before the interface is shut down, the USB stack first calls the struct usb_driver::disconnect and then the struct net_device_ops::ndo_stop callback. In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more RX'ed CAN frames are send from the USB device to the host. Later in gs_can_close() a reset control message is send to each CAN channel to remove the controller from the CAN bus. In this race window the USB device can still receive CAN frames from the bus and internally queue them to be send to the host. At least in the current version of the candlelight firmware, the queue of received CAN frames is not emptied during the reset command. After loading (or binding) the gs_usb driver, new URBs are submitted during the struct net_device_ops::ndo_open callback and the candlelight firmware starts sending its already queued CAN frames to the host. However, this scenario was not considered when implementing the hardware timestamp function. The cycle counter/time counter infrastructure is set up (gs_usb_timestamp_init()) after the USBs are submitted, resulting in a NULL pointer dereference if timecounter_cyc2time() (via the call chain: gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() -> gs_usb_skb_set_timestamp()) is called too early. Move the gs_usb_timestamp_init() function before the URBs are submitted to fix this problem. For a comprehensive solution, we need to consider gs_usb devices with more than 1 channel. The cycle counter/time counter infrastructure is setup per channel, but the RX URBs are per device. Once gs_can_open() of _a_ channel has been called, and URBs have been submitted, the gs_usb_receive_bulk_callback() can be called for _all_ available channels, even for channels that are not running, yet. As cycle counter/time counter has not set up, this will again lead to a NULL pointer dereference. Convert the cycle counter/time counter from a "per channel" to a "per device" functionality. Also set it up, before submitting any URBs to the device. Further in gs_usb_receive_bulk_callback(), don't process any URBs for not started CAN channels, only resubmit the URB. Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support") Closes: candle-usb/candleLight_fw#137 (comment) Cc: [email protected] Cc: John Whittington <[email protected]> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

If the gs_usb device driver is unloaded (or unbound) before the interface is shut down, the USB stack first calls the struct usb_driver::disconnect and then the struct net_device_ops::ndo_stop callback. In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more RX'ed CAN frames are send from the USB device to the host. Later in gs_can_close() a reset control message is send to each CAN channel to remove the controller from the CAN bus. In this race window the USB device can still receive CAN frames from the bus and internally queue them to be send to the host. At least in the current version of the candlelight firmware, the queue of received CAN frames is not emptied during the reset command. After loading (or binding) the gs_usb driver, new URBs are submitted during the struct net_device_ops::ndo_open callback and the candlelight firmware starts sending its already queued CAN frames to the host. However, this scenario was not considered when implementing the hardware timestamp function. The cycle counter/time counter infrastructure is set up (gs_usb_timestamp_init()) after the USBs are submitted, resulting in a NULL pointer dereference if timecounter_cyc2time() (via the call chain: gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() -> gs_usb_skb_set_timestamp()) is called too early. Move the gs_usb_timestamp_init() function before the URBs are submitted to fix this problem. For a comprehensive solution, we need to consider gs_usb devices with more than 1 channel. The cycle counter/time counter infrastructure is setup per channel, but the RX URBs are per device. Once gs_can_open() of _a_ channel has been called, and URBs have been submitted, the gs_usb_receive_bulk_callback() can be called for _all_ available channels, even for channels that are not running, yet. As cycle counter/time counter has not set up, this will again lead to a NULL pointer dereference. Convert the cycle counter/time counter from a "per channel" to a "per device" functionality. Also set it up, before submitting any URBs to the device. Further in gs_usb_receive_bulk_callback(), don't process any URBs for not started CAN channels, only resubmit the URB. Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support") Closes: candle-usb/candleLight_fw#137 (comment) Cc: [email protected] Cc: John Whittington <[email protected]> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <[email protected]>

commit 5886e4d upstream. If the gs_usb device driver is unloaded (or unbound) before the interface is shut down, the USB stack first calls the struct usb_driver::disconnect and then the struct net_device_ops::ndo_stop callback. In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more RX'ed CAN frames are send from the USB device to the host. Later in gs_can_close() a reset control message is send to each CAN channel to remove the controller from the CAN bus. In this race window the USB device can still receive CAN frames from the bus and internally queue them to be send to the host. At least in the current version of the candlelight firmware, the queue of received CAN frames is not emptied during the reset command. After loading (or binding) the gs_usb driver, new URBs are submitted during the struct net_device_ops::ndo_open callback and the candlelight firmware starts sending its already queued CAN frames to the host. However, this scenario was not considered when implementing the hardware timestamp function. The cycle counter/time counter infrastructure is set up (gs_usb_timestamp_init()) after the USBs are submitted, resulting in a NULL pointer dereference if timecounter_cyc2time() (via the call chain: gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() -> gs_usb_skb_set_timestamp()) is called too early. Move the gs_usb_timestamp_init() function before the URBs are submitted to fix this problem. For a comprehensive solution, we need to consider gs_usb devices with more than 1 channel. The cycle counter/time counter infrastructure is setup per channel, but the RX URBs are per device. Once gs_can_open() of _a_ channel has been called, and URBs have been submitted, the gs_usb_receive_bulk_callback() can be called for _all_ available channels, even for channels that are not running, yet. As cycle counter/time counter has not set up, this will again lead to a NULL pointer dereference. Convert the cycle counter/time counter from a "per channel" to a "per device" functionality. Also set it up, before submitting any URBs to the device. Further in gs_usb_receive_bulk_callback(), don't process any URBs for not started CAN channels, only resubmit the URB. Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support") Closes: candle-usb/candleLight_fw#137 (comment) Cc: [email protected] Cc: John Whittington <[email protected]> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2037005 commit 5886e4d upstream. If the gs_usb device driver is unloaded (or unbound) before the interface is shut down, the USB stack first calls the struct usb_driver::disconnect and then the struct net_device_ops::ndo_stop callback. In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more RX'ed CAN frames are send from the USB device to the host. Later in gs_can_close() a reset control message is send to each CAN channel to remove the controller from the CAN bus. In this race window the USB device can still receive CAN frames from the bus and internally queue them to be send to the host. At least in the current version of the candlelight firmware, the queue of received CAN frames is not emptied during the reset command. After loading (or binding) the gs_usb driver, new URBs are submitted during the struct net_device_ops::ndo_open callback and the candlelight firmware starts sending its already queued CAN frames to the host. However, this scenario was not considered when implementing the hardware timestamp function. The cycle counter/time counter infrastructure is set up (gs_usb_timestamp_init()) after the USBs are submitted, resulting in a NULL pointer dereference if timecounter_cyc2time() (via the call chain: gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() -> gs_usb_skb_set_timestamp()) is called too early. Move the gs_usb_timestamp_init() function before the URBs are submitted to fix this problem. For a comprehensive solution, we need to consider gs_usb devices with more than 1 channel. The cycle counter/time counter infrastructure is setup per channel, but the RX URBs are per device. Once gs_can_open() of _a_ channel has been called, and URBs have been submitted, the gs_usb_receive_bulk_callback() can be called for _all_ available channels, even for channels that are not running, yet. As cycle counter/time counter has not set up, this will again lead to a NULL pointer dereference. Convert the cycle counter/time counter from a "per channel" to a "per device" functionality. Also set it up, before submitting any URBs to the device. Further in gs_usb_receive_bulk_callback(), don't process any URBs for not started CAN channels, only resubmit the URB. Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support") Closes: candle-usb/candleLight_fw#137 (comment) Cc: [email protected] Cc: John Whittington <[email protected]> Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de Signed-off-by: Marc Kleine-Budde <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Kamal Mostafa <[email protected]> Signed-off-by: Roxana Nicolescu <[email protected]>

fenugrec mentioned this issue Aug 6, 2024

Why does it keep sensing the same frame endlessly o error or NAK? #185

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

fenugrec commented Nov 16, 2022

marckleinebudde commented Nov 16, 2022 •

edited

Loading

marckleinebudde commented Nov 16, 2022

fenugrec commented Nov 16, 2022

marckleinebudde commented Nov 16, 2022

ryedwards commented Nov 18, 2022

marckleinebudde commented Jul 6, 2023

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

Comments

fenugrec commented Nov 16, 2022

marckleinebudde commented Nov 16, 2022 • edited Loading

marckleinebudde commented Nov 16, 2022

fenugrec commented Nov 16, 2022

marckleinebudde commented Nov 16, 2022

ryedwards commented Nov 18, 2022

marckleinebudde commented Jul 6, 2023

marckleinebudde commented Nov 16, 2022 •

edited

Loading