Description
Describe the bug
Freeswitch randomly writes SSL data to files that it has open in r/w mode due to some sort of race condition.
To Reproduce
Steps to reproduce the behavior:
- Set up a freeswitch profile to use WSS webrtc
- Expose it to the internet and allow people to attempt to use it
- After some time, freeswitch will write SSL data over random files that it has open
- FreeSWITCH crashes with sqlite "file is not a database" messages in the majority of cases, because the file most often open in r/w mode is /dev/shm/core.db. This bug could potentially overwrite any file that is open in r/w mode; the more often a file is opened, the more likely it is to be a victim, which is probably why the bug seems to pick on /dev/shm/core.db
Expected behavior
freeswitch does not do random things!
Package version or git hash
1.10.10
This is the same issue previously reported in #420, which was mistakenly closed as a SQLite problem.
I am 100% certain that it is not a SQLite problem, for the following reasons:
- We have the patched version of SQLite installed and in use.
- The SQLite bug linked from #420 ("core db gets corrupted"), https://www.philipotoole.com/how-i-found-a-bug-in-sqlite/, is in a specific area of SQLite that needs special action to invoke, and freeswitch does not use it. This means the bug will probably never affect freeswitch, so a SQLite upgrade is unnecessary. The SQLite bug as reported in that link affects only databases using the specific setup described at https://www.sqlite.org/inmemorydb.html, and to the best of my knowledge freeswitch does not use SQLite in this way. It is definitely not used by our installation. And even if it were (which it isn't!), we have the patched version running.
- Our freeswitch instance is set up to use param name="core-db-name" value="/dev/shm/core.db", which is an ordinary file-based SQLite database. It just happens to be on a filesystem that is 'in memory', but that is not the same thing as a SQLite 'in memory' database as described above.
All these reasons rule out a SQLite problem.
In addition, I have added debug code to the ws_close() function in libsofia-sip-ua/tport/ws.c, immediately before the SSL_write() call that it makes. The debug code is:
/* check if no fatal error occurs on connection */
char procbuf[1024];
char fnbuf[1024];
ssize_t fnlen;
sprintf(procbuf,"/proc/self/fd/%d",wsh->sock);
if ((fnlen = readlink(procbuf,fnbuf,sizeof(fnbuf)-1)) != -1) {
fnbuf[fnlen] = '\0';
printf("WS ws_close fd %d target %s\n",wsh->sock,fnbuf);
}
else {
printf("WS ws_close fd %d readlink failed\n",wsh->sock);
}
code = SSL_write(wsh->ssl, buf, 1);
and when the bug is hit, it prints the following:
Oct 18 07:50:47 fstrtc01.voiceflex.com stdbuf[389441]: WS ws_close fd 97 target /dev/shm/core.db
Oct 18 12:20:39 fstrtc01.voiceflex.com stdbuf[395088]: WS ws_close fd 63 target /dev/shm/core.db
Oct 18 19:20:46 fstrtc01.voiceflex.com stdbuf[403552]: WS ws_close fd 58 target /dev/shm/core.db
The SSL_write() in ws_close() is issued immediately after that debug message, so it writes SSL data to whichever file now owns the descriptor. This is the same kind of problem that Facebook engineers found and debugged in their own code, described at https://engineering.fb.com/2014/08/12/ios/debugging-file-corruption-on-ios/
"The SSL layer was writing to a socket that was already closed and subsequently reassigned to our database file. "
and
"Using a hex analyzer, we found a common prefix across the attachments: 17 03 03 00 28"
Our overwrite is not identical but similar enough for it to be the same problem:
$ for dbfile in $(ls /dev/shm/core.db-*); do echo $dbfile; od -X $dbfile | head -1;done
/dev/shm/core.db-1697017847
0000000 00030317 2299b112 b8d06713 442802c5
/dev/shm/core.db-1697359841
0000000 00030317 10e68912 769c4a0c 587aac68
/dev/shm/core.db-1697457656
0000000 00030317 cdc06712 26b9d175 b91379ec
/dev/shm/core.db-1697474154
0000000 00030317 0298eb12 10f4a06d a38d8a70
/dev/shm/core.db-1697477136
0000000 00030317 39243212 f67342eb 4eedfd89
/dev/shm/core.db-1697537157
0000000 00030317 bbc1a212 0dd52644 ae8bd737
/dev/shm/core.db-1697547645
0000000 00030317 c42d2612 233c9109 1cb91ff7
/dev/shm/core.db-1697569857
0000000 00030317 9e3fe612 3b797957 df9c0fba
/dev/shm/core.db-1697604657
0000000 00030317 af282512 3c08939a ac980ecf
/dev/shm/core.db-1697615458
0000000 694c5153 03031774 dd7c1300 2223e3b0
/dev/shm/core.db-1697631649
0000000 00030317 21e31112 bad65ac4 f6609114
/dev/shm/core.db-1697656856
0000000 00030317 42088b12 b27bce2d f3ed235c
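Note that od -X prints 32-bit words in host byte order, so on this little-endian machine 00030317 followed by xxxxxx12 corresponds to the byte sequence 17 03 03 00 12. Only /dev/shm/core.db-1697615458 does not start with 17 03 03 00 12; it starts with the bytes "SQLit" from the SQLite header and has 17 03 03 00 13 at offset 5.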
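To make the comparison with the Facebook prefix concrete: in a TLS record header, 0x17 is the application-data content type, 0x03 0x03 is the TLS 1.2 record version (also used as the record-layer version by TLS 1.3), and the next two bytes are the big-endian record length. A small sketch of a scanner for that pattern is shown below. The function name find_tls_appdata_header is hypothetical, and it only recognises this one header pattern.

```c
#include <stddef.h>

/* Hypothetical helper: returns the offset of the first TLS application-data
 * record header (content type 0x17, version 0x03 0x03) found in buf, or -1
 * if there is none. On success, *record_len (if not NULL) receives the
 * 16-bit big-endian record length that follows the version bytes. */
long find_tls_appdata_header(const unsigned char *buf, size_t len,
                             unsigned *record_len)
{
    for (size_t i = 0; i + 5 <= len; i++) {
        if (buf[i] == 0x17 && buf[i + 1] == 0x03 && buf[i + 2] == 0x03) {
            if (record_len)
                *record_len = ((unsigned)buf[i + 3] << 8) | buf[i + 4];
            return (long)i;
        }
    }
    return -1;
}
```

Fed the first bytes of the files listed above, this reports a header at offset 0 with length 0x12 for most of them, and at offset 5 with length 0x13 for /dev/shm/core.db-1697615458.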
Within a few seconds of those debug messages being issued, we start to get the following errors in the freeswitch log:
2023-10-18 07:34:41.466490 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.179.208.228
2023-10-18 07:50:47.946480 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is not a database]
BEGIN EXCLUSIVE
2023-10-18 07:50:48.026471 99.60% [CRIT] switch_core_sqldb.c:2109 ERROR [file is not a database], [db="/dev/shm/core.db",type="core_db"]
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active]
COMMIT
2023-10-18 07:50:48.086485 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
There are identical messages in the logs for the core.db overwrites at 12:20:39 and 19:20:46.
Once this database has been overwritten in this way, freeswitch will refuse to start up because the core.db.dsn database is corrupted and cannot be opened. It has to be deleted or renamed before freeswitch will start up again.
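A damaged file of this kind can be recognised before startup by checking the 16-byte magic string that every non-empty SQLite 3 database file begins with, "SQLite format 3" followed by a NUL byte. The following is a minimal sketch of such a check; the function name has_sqlite_header is hypothetical and this is not something freeswitch itself does.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the file at path begins with the
 * SQLite 3 magic string, 0 if it does not (including files shorter than
 * 16 bytes), and -1 if the file cannot be opened. */
int has_sqlite_header(const char *path)
{
    /* The string literal supplies the trailing NUL, giving 16 bytes. */
    static const char magic[16] = "SQLite format 3";
    unsigned char head[16];

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    size_t got = fread(head, 1, sizeof head, fp);
    fclose(fp);

    if (got != sizeof head)
        return 0;
    return memcmp(head, magic, sizeof head) == 0 ? 1 : 0;
}
```

A result of 0 for /dev/shm/core.db would indicate that the file needs to be moved aside before freeswitch is restarted.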
I also see hangs in freeswitch where it stops responding to connection attempts. A gcore taken during one of these hangs shows that it is stuck in ws_close(), in the middle of the SSL_write() call. I suspect this is related: we are writing to a socket that is not expecting us to write to it (a wild guess!), and the thread can be stuck there for hours at a time. I suspect this is also involved in #1934, where people are reporting hangs when attempting to use WSS. The backtrace below is from the most recent gcore, taken for fd 49, which hung from Oct 19 15:23:25 until it woke up again at 16:26:43.
(gdb) thread apply 18 bt full
Thread 18 (Thread 0x7f078bfff640 (LWP 450637)):
#0 0x00007f07bd53e91c in read () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f07bccdd091 in sock_read () from /lib64/libcrypto.so.3
No symbol table info available.
#2 0x00007f07bcccd44b in bread_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3 0x00007f07bccd0445 in bio_read_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4 0x00007f07bccd05c7 in BIO_read () from /lib64/libcrypto.so.3
No symbol table info available.
#5 0x00007f07bd130e8c in ssl3_read_n.part () from /lib64/libssl.so.3
No symbol table info available.
#6 0x00007f07bd132af9 in ssl3_read_bytes () from /lib64/libssl.so.3
No symbol table info available.
#7 0x00007f07bd1425c0 in state_machine.part () from /lib64/libssl.so.3
No symbol table info available.
#8 0x00007f07bd13076e in ssl3_write_bytes () from /lib64/libssl.so.3
No symbol table info available.
#9 0x00007f07bd1136e7 in SSL_write () from /lib64/libssl.so.3
No symbol table info available.
#10 0x00007f07bd2a5340 in ws_close (wsh=wsh@entry=0x7f0781f71240, reason=reason@entry=0) at tport/ws.c:900
ssl_error = 0
buf = 0x7f07bd2d41b6 "0"
procbuf = "/proc/self/fd/49\000\000\316\201\a\177\000\000\320\347\377\213\a\177\000\000\310\347\377\213\a\177\000\000\204\236(\275\a\177\000\000\220y2\200\a\177\000\000\300\240\061\275\a\177\000\000P\303\316\201\a\177\000\000\200\350\377\213\a\177\000\000\000\000\000\000\000\000\000\000֢(\275\a\177\000\000\377\377\377\177\000\000\000\000\064Z)\275\a\177\000\000tag=ZNS5\337Q-\275\a\177\000\000\377\377\377\377\377\377\377\377?B\017\000\000\000\000\000P\303\316\201\a\177\000\000\000\326\354\367\376AN\245\377\377\377\177\000\000\000\000P\303\316\201\a\177\000\000\000\000\000\000\000\000\000\000\365\022*\275\a\177\000\000P\303\316\201\a\177\000\000\000\067a\202\a\177"...
fnbuf = "socket:[9955738]\000\000\000\000\000\000\000\000d\002\000\000\000\000\000\000\001", '\000' <repeats 15 times>, "\307\002\000\000\000\000\000\000\325\371J\275\a\177\000\000\236\304\333\350\000\000\000\000\273\360\b\000\000\000\000\000@\345\377\213\a\177\000\000@\345\377\213\a\177\000\000\340\vy\202\a\177\000\000\361\001*\275\a\177\000\000\001\000\000\000\000\000\000\000\240\344\377\213\a\177\000\000\060\067a\202\a\177\000\000\001\201\001\200\a\177\000\000\312\326,\275\a\177\000\000\020\202\001\200\a\177\000\000d\002\000\000\000\000\000\000`.Ё\000\000\000\004\000\nV\022C\000\000\000\000\003\000\n]_|\006", '\000' <repeats 25 times>...
fnlen = <optimized out>
code = 0
#11 0x00007f07bd2a5bd3 in ws_destroy (wsh=0x7f0781f71240) at tport/ws.c:836
No locals.
#12 0x00007f07bd2a5ca3 in tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:536
wstp = <optimized out>
wstp = <optimized out>
__func__ = {<optimized out> <repeats 26 times>}
#13 tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:530
wstp = 0x7f0781f71050
__func__ = "tport_ws_deinit_secondary"
#14 0x00007f07bd294df8 in tport_zap_secondary (self=0x7f0781f71050) at tport/tport.c:1101
mr = <optimized out>
__func__ = "tport_zap_secondary"
#15 0x00007f07bd287ed8 in su_timer_expire (timers=timers@entry=0x7f0780000ba8, timeout=timeout@entry=0x7f078bffec50, now=...) at su/su_timer.c:587
t = 0x7f07836cc9e0
f = 0x7f07bd296ed0 <tport_secondary_timer>
n = 42
__PRETTY_FUNCTION__ = "su_timer_expire"
#16 0x00007f07bd288165 in su_base_port_run (self=0x7f0780000b60) at su/su_base_port.c:339
now = {tv_sec = <optimized out>, tv_usec = <optimized out>}
tout = 15000
tout2 = 0
__PRETTY_FUNCTION__ = "su_base_port_run"
#17 0x00007f07bd28b1f3 in su_pthread_port_clone_main (varg=0x7f07b916b580) at su/su_pthread_port.c:343
arg = 0x0
task = {{sut_port = 0x7f0780000b60, sut_root = 0x7f07800013d0}}
zap = 1
#18 0x00007f07bd49f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#19 0x00007f07bd43f450 in clone3 () from /lib64/libc.so.6