[BUG]: vsomeip slow to restart with lots of EventGroup #690

joeyoravec · 2024-05-04T01:25:10Z

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

Android and QNX

Describe the bug

My automotive system has *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute into a unique EventGroup.

Especially when resuming from suspend-to-ram it's possible that UDP SOMEIP-SD will be operational but TCP socket will be broken. This leads to tce restart() but during this time any Subscribe will receive SubscribeNack in response:

4191	105.781314	10.6.0.10	10.6.0.3	SOME/IP-SD	1408	SOME/IP Service Discovery Protocol [Subscribe]
4192	105.790868	10.6.0.3	10.6.0.10	SOME/IP-SD	1396	SOME/IP Service Discovery Protocol [SubscribeNack]
4193	105.792094	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4194	105.801525	10.6.0.10	10.6.0.3	SOME/IP-SD	1410	SOME/IP Service Discovery Protocol [Subscribe]
4195	105.802118	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]
4196	105.819610	10.6.0.3	10.6.0.10	SOME/IP-SD	1398	SOME/IP Service Discovery Protocol [SubscribeNack]

as the number of EventGroup scales to a large number, this become catastrophic to performance.

In service_discovery_impl::handle_eventgroup_subscription_nack() each EventGroup calls restart():

vsomeip/implementation/service_discovery/src/service_discovery_impl.cpp

Lines 2517 to 2521 in cf49723

    
           if (!its_subscription->is_selective()) { 
        
               auto its_reliable = its_subscription->get_endpoint(true); 
        
               if (its_reliable) 
        
                   its_reliable->restart(); 
        
           }

and in tcp_client_endpoint_impl::restart() while ::CONNECTING the code will "early terminate" from maximum 5 restarts:

vsomeip/implementation/endpoints/src/tcp_client_endpoint_impl.cpp

Lines 77 to 85 in cf49723

    
           if (!_force && self->state_ == cei_state_e::CONNECTING) { 
        
               std::chrono::steady_clock::time_point its_current 
        
                   = std::chrono::steady_clock::now(); 
        
               std::int64_t its_connect_duration = std::chrono::duration_cast<std::chrono::milliseconds>( 
        
                       its_current - self->connect_timepoint_).count(); 
        
               if (self->aborted_restart_count_ < self->tcp_restart_aborts_max_ 
        
                       && its_connect_duration < self->tcp_connect_time_max_) { 
        
                   self->aborted_restart_count_++; 
        
                   return;

thereafter the code will fall through, calling shutdown_and_close_socket_unlocked() and perform the full restart even while a connection is in progress.

As the system continues processing 1000s of SubscribeNack this will be a tight loop of 100% cpu load and multiple seconds to plow-through the workload. This can easily exceed a 2s ServiceDiscovery interval and cascade to further problems.

Reproduction Steps

My reproduction was:

start with fully-established communication between tse and tce
tce enters suspend-to-ram with TCP socket established
allow tse to continue running, exceed TCP keepalive timeout, and close the TCP socket
tce resumes from suspend-to-ram thinking TCP socket is still established, then discovers it to be closed

but any use-case where tse closes the TCP socket but UDP is functional should be sufficient.

Expected behaviour

Performance should be better.

Logs and Screenshots

No response

The text was updated successfully, but these errors were encountered:

joeyoravec · 2024-05-04T01:32:56Z

We came up with 3 possible solutions;

eliminate the tce restart() call from service_discovery_impl::handle_eventgroup_subscription_nack(). It's not clear why this is required or how it would help
modify tce restart() to "early terminate" better, perhaps an unlimited number of times within the 5 second timeout
ensure that SOMEIP-SD gets inhibited around any event like suspend-to-ram where network communication will be lost. Try to prevent Subscribe until the TCP socket gets re-established

Interested in feedback on what would be most effective

joeyoravec added the bug label May 4, 2024

joeyoravec linked a pull request May 4, 2024 that will close this issue

elminate tcp_client_endpoint restart on SubscribeNack #691

Draft

joeyoravec mentioned this issue May 13, 2024

[BUG]: vsomeip slow to restart due to VSOMEIP_MAX_TCP_SENT_WAIT_TIME #701

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: vsomeip slow to restart with lots of EventGroup #690

[BUG]: vsomeip slow to restart with lots of EventGroup #690

joeyoravec commented May 4, 2024

joeyoravec commented May 4, 2024

[BUG]: vsomeip slow to restart with lots of EventGroup #690

[BUG]: vsomeip slow to restart with lots of EventGroup #690

Comments

joeyoravec commented May 4, 2024

vSomeip Version

Boost Version

Environment

Describe the bug

Reproduction Steps

Expected behaviour

Logs and Screenshots

joeyoravec commented May 4, 2024