freeswitch randomly overwrites files that it has open in r/w mode #2283

@themsley-voiceflex

Describe the bug
FreeSWITCH intermittently writes SSL data to files that it has open in r/w mode, apparently because of a race condition.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a freeswitch profile to use WSS webrtc
  2. Expose it to the internet and allow people to attempt to use it
  3. After some time, freeswitch will overwrite random filehandles that it has open with SSL data
  4. FreeSWITCH crashes with sqlite "file is not a database" messages in the majority of cases, because the filehandle that is most often in use in r/w mode is the one for /dev/shm/core.db. This bug could potentially overwrite any file that is open in r/w mode, but the more often a file is opened the more likely it is to be a victim, which is probably why it seems to pick on /dev/shm/core.db

Expected behavior
freeswitch does not do random things!

Package version or git hash
1.10.10

This is the same issue previously reported in #420 which was mistakenly closed as a SQLite problem.

I am 100% certain that it is not a SQLite problem, for the following reasons:

  1. we have the patched version of SQLite installed and in use

  2. the SQLite bug pointed to from "core db gets corrupted" #420 (https://www.philipotoole.com/how-i-found-a-bug-in-sqlite/) is in a specific SQLite area that needs special action to invoke, and freeswitch does not use it. This means the bug will probably never affect freeswitch, so a SQLite upgrade is unnecessary. The SQLite bug as reported in that link affects only databases using this specific setup https://www.sqlite.org/inmemorydb.html and, to the best of my knowledge, freeswitch does not use SQLite this way. It is 100% definitely not used by our installation. And even if it was (which it isn't!) we have the patched version running.

  3. our freeswitch instance is set up to use param name="core-db-name" value="/dev/shm/core.db", which is an ordinary file-based SQLite database. It just happens to be on a filesystem that is 'in memory', but that is not the same thing as a SQLite 'in memory' database as described above.

All these reasons rule out a SQLite problem.

In addition I have added debug code to libsofia-sip-ua/tport/ws.c in the ws_close() function, immediately prior to the SSL_write() that it uses. This debug code is

                /* check if no fatal error occurs on connection */
                char procbuf[1024];
                char fnbuf[1024];
                ssize_t fnlen;
                sprintf(procbuf,"/proc/self/fd/%d",wsh->sock);
                if ((fnlen = readlink(procbuf,fnbuf,sizeof(fnbuf)-1)) != -1) {
                        fnbuf[fnlen] = '\0';
                        printf("WS ws_close fd %d target %s\n",wsh->sock,fnbuf);
                        }
                else {
                        printf("WS ws_close fd %d readlink failed\n",wsh->sock);
                        }
                code = SSL_write(wsh->ssl, buf, 1);

and when the bug is hit it shows the following information

Oct 18 07:50:47 fstrtc01.voiceflex.com stdbuf[389441]: WS ws_close fd 97 target /dev/shm/core.db
Oct 18 12:20:39 fstrtc01.voiceflex.com stdbuf[395088]: WS ws_close fd 63 target /dev/shm/core.db
Oct 18 19:20:46 fstrtc01.voiceflex.com stdbuf[403552]: WS ws_close fd 58 target /dev/shm/core.db
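
The log lines above show a descriptor number that ws_close() still treats as its socket resolving to /dev/shm/core.db. The following is a minimal sketch of how that can happen on a POSIX system, assuming the socket was closed elsewhere and the same number was then handed out again by open(). The function name and the path passed to it are hypothetical, this is not FreeSWITCH or sofia-sip code, and it runs single-threaded, so the reuse is deterministic here rather than a race:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: write through a stale socket descriptor after its
 * number has been reused for a regular file. Returns the number of bytes
 * that ended up at the start of the file at `path`.
 */
static ssize_t stale_fd_write_demo(const char *path)
{
    static const unsigned char record[5] = { 0x17, 0x03, 0x03, 0x00, 0x12 };
    unsigned char readback[sizeof(record)];
    struct stat st;
    int sv[2];

    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);
    (void)rc;

    int stale = sv[0];  /* the number a slow code path still remembers */
    close(sv[0]);       /* the socket is gone; the number is free again */

    /* open() returns the lowest free descriptor, which is now `stale`. */
    int db = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    assert(db == stale);

    /* Same number, different object: it is a regular file, not a socket. */
    rc = fstat(stale, &st);
    assert(rc == 0 && S_ISREG(st.st_mode));

    /* The "socket" write succeeds and silently lands in the file. */
    ssize_t n = write(stale, record, sizeof(record));

    /* Read the file back from offset 0 to confirm where the bytes went. */
    off_t pos = lseek(db, 0, SEEK_SET);
    ssize_t got = read(db, readback, sizeof(readback));
    assert(pos == 0 && got == (ssize_t)sizeof(readback));
    assert(memcmp(readback, record, sizeof(record)) == 0);
    (void)pos;
    (void)got;

    close(db);
    close(sv[1]);
    unlink(path);
    return n;
}
```

Calling `stale_fd_write_demo("/tmp/stale_fd_demo.bin")` returns 5: nothing fails at write time, which is why this kind of corruption is only noticed later by whoever owns the file.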

The SSL_write() in ws_close() is issued immediately after the ws_close debug message above, so it writes SSL data to whatever file that descriptor now refers to. This is similar to the problem that Facebook engineers found and debugged in their own code, described at https://engineering.fb.com/2014/08/12/ios/debugging-file-corruption-on-ios/

"The SSL layer was writing to a socket that was already closed and subsequently reassigned to our database file. "
and
"Using a hex analyzer, we found a common prefix across the attachments: 17 03 03 00 28"

Our overwrite is not identical but similar enough for it to be the same problem:

$ for dbfile in $(ls /dev/shm/core.db-*); do echo $dbfile; od -X $dbfile | head -1;done
/dev/shm/core.db-1697017847
0000000 00030317 2299b112 b8d06713 442802c5
/dev/shm/core.db-1697359841
0000000 00030317 10e68912 769c4a0c 587aac68
/dev/shm/core.db-1697457656
0000000 00030317 cdc06712 26b9d175 b91379ec
/dev/shm/core.db-1697474154
0000000 00030317 0298eb12 10f4a06d a38d8a70
/dev/shm/core.db-1697477136
0000000 00030317 39243212 f67342eb 4eedfd89
/dev/shm/core.db-1697537157
0000000 00030317 bbc1a212 0dd52644 ae8bd737
/dev/shm/core.db-1697547645
0000000 00030317 c42d2612 233c9109 1cb91ff7
/dev/shm/core.db-1697569857
0000000 00030317 9e3fe612 3b797957 df9c0fba
/dev/shm/core.db-1697604657
0000000 00030317 af282512 3c08939a ac980ecf
/dev/shm/core.db-1697615458
0000000 694c5153 03031774 dd7c1300 2223e3b0
/dev/shm/core.db-1697631649
0000000 00030317 21e31112 bad65ac4 f6609114
/dev/shm/core.db-1697656856
0000000 00030317 42088b12 b27bce2d f3ed235c

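od -X prints little-endian 32-bit words, so 00030317 2299b112 corresponds to the byte sequence 17 03 03 00 12 b1 99 22. Only /dev/shm/core.db-1697615458 does not start with 0x1703030012; it still has the first five bytes of the SQLite header ("SQLit") and has 0x1703030013 at +5 into the file.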
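
For readers who want to decode those prefixes, here is a small sketch that parses the five-byte TLS record header (one byte content type, two bytes version, two bytes big-endian length). The struct and function names are hypothetical. A content type of 0x17 is application data and 0x0303 is the record version used by TLS 1.2 and TLS 1.3. A length of 0x12 (18) would be consistent with a one-byte TLS 1.3 application-data record (1 payload byte, 1 inner content-type byte, 16-byte AEAD tag), which would match the one-byte SSL_write() in ws_close(), but that part is an inference and has not been verified against a packet capture:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of a TLS record header. */
struct tls_record_header {
    uint8_t  content_type;  /* 0x17 = application_data */
    uint16_t version;       /* 0x0303 on TLS 1.2 and TLS 1.3 records */
    uint16_t length;        /* payload length, big-endian on the wire */
};

/* Returns 0 on success, -1 if fewer than five bytes are available. */
static int parse_tls_record_header(const uint8_t *buf, size_t len,
                                   struct tls_record_header *out)
{
    if (buf == NULL || out == NULL || len < 5)
        return -1;
    out->content_type = buf[0];
    out->version = (uint16_t)((buf[1] << 8) | buf[2]);
    out->length  = (uint16_t)((buf[3] << 8) | buf[4]);
    return 0;
}

/* Example: the first five bytes of most of the overwritten core.db copies. */
static void tls_header_example(void)
{
    const uint8_t bytes[5] = { 0x17, 0x03, 0x03, 0x00, 0x12 };
    struct tls_record_header h;
    int rc = parse_tls_record_header(bytes, sizeof(bytes), &h);
    assert(rc == 0);
    assert(h.content_type == 0x17 && h.version == 0x0303 && h.length == 18);
    (void)rc;
}
```

The same parser applied at offset +5 of /dev/shm/core.db-1697615458 would report a length of 19 instead of 18.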

Within a few seconds of those debug messages being issued we start to get

2023-10-18 07:34:41.466490 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.179.208.228
2023-10-18 07:50:47.946480 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [file is not a database]
BEGIN EXCLUSIVE
2023-10-18 07:50:48.026471 99.60% [CRIT] switch_core_sqldb.c:2109 ERROR [file is not a database], [db="/dev/shm/core.db",type="core_db"]
2023-10-18 07:50:48.026471 99.60% [ERR] switch_core_sqldb.c:728 [db="/dev/shm/core.db",type="core_db"] NATIVE SQL ERR [cannot commit - no transaction is active]
COMMIT
2023-10-18 07:50:48.086485 99.60% [WARNING] sofia_reg.c:1842 SIP auth challenge (REGISTER) on sofia profile 'internaltcp' for [[email protected]] from ip x.8.18.233

There are identical messages in the logs for the core.db overwrites at 12:20:39 and 19:20:46.

Once the database has been overwritten in this way, freeswitch refuses to start because the core.db.dsn database is corrupted and cannot be opened. The file has to be deleted or renamed before freeswitch will start again.

I also see hangs where freeswitch stops responding to connection attempts. A gcore taken during one of these hangs shows that it is stuck in ws_close(), in the middle of the SSL_write() call. I suspect this is related: we are sending to a socket that is not expecting us to write to it (wild abandoned guess!), and the call can sit there waiting, sometimes for hours. I suspect this is also involved in #1934, where people are reporting hangs when attempting to use WSS.

The backtrace below is from the most recent gcore, taken for fd 49, which hung from Oct 19 15:23:25 until 16:26:43, when it woke up again.

(gdb) thread apply 18 bt full

Thread 18 (Thread 0x7f078bfff640 (LWP 450637)):
#0  0x00007f07bd53e91c in read () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f07bccdd091 in sock_read () from /lib64/libcrypto.so.3
No symbol table info available.
#2  0x00007f07bcccd44b in bread_conv () from /lib64/libcrypto.so.3
No symbol table info available.
#3  0x00007f07bccd0445 in bio_read_intern () from /lib64/libcrypto.so.3
No symbol table info available.
#4  0x00007f07bccd05c7 in BIO_read () from /lib64/libcrypto.so.3
No symbol table info available.
#5  0x00007f07bd130e8c in ssl3_read_n.part () from /lib64/libssl.so.3
No symbol table info available.
#6  0x00007f07bd132af9 in ssl3_read_bytes () from /lib64/libssl.so.3
No symbol table info available.
#7  0x00007f07bd1425c0 in state_machine.part () from /lib64/libssl.so.3
No symbol table info available.
#8  0x00007f07bd13076e in ssl3_write_bytes () from /lib64/libssl.so.3
No symbol table info available.
#9  0x00007f07bd1136e7 in SSL_write () from /lib64/libssl.so.3
No symbol table info available.
#10 0x00007f07bd2a5340 in ws_close (wsh=wsh@entry=0x7f0781f71240, reason=reason@entry=0) at tport/ws.c:900
        ssl_error = 0
        buf = 0x7f07bd2d41b6 "0"
        procbuf = "/proc/self/fd/49\000\000\316\201\a\177\000\000\320\347\377\213\a\177\000\000\310\347\377\213\a\177\000\000\204\236(\275\a\177\000\000\220y2\200\a\177\000\000\300\240\061\275\a\177\000\000P\303\316\201\a\177\000\000\200\350\377\213\a\177\000\000\000\000\000\000\000\000\000\000֢(\275\a\177\000\000\377\377\377\177\000\000\000\000\064Z)\275\a\177\000\000tag=ZNS5\337Q-\275\a\177\000\000\377\377\377\377\377\377\377\377?B\017\000\000\000\000\000P\303\316\201\a\177\000\000\000\326\354\367\376AN\245\377\377\377\177\000\000\000\000P\303\316\201\a\177\000\000\000\000\000\000\000\000\000\000\365\022*\275\a\177\000\000P\303\316\201\a\177\000\000\000\067a\202\a\177"...
        fnbuf = "socket:[9955738]\000\000\000\000\000\000\000\000d\002\000\000\000\000\000\000\001", '\000' <repeats 15 times>, "\307\002\000\000\000\000\000\000\325\371J\275\a\177\000\000\236\304\333\350\000\000\000\000\273\360\b\000\000\000\000\000@\345\377\213\a\177\000\000@\345\377\213\a\177\000\000\340\vy\202\a\177\000\000\361\001*\275\a\177\000\000\001\000\000\000\000\000\000\000\240\344\377\213\a\177\000\000\060\067a\202\a\177\000\000\001\201\001\200\a\177\000\000\312\326,\275\a\177\000\000\020\202\001\200\a\177\000\000d\002\000\000\000\000\000\000`.Ё\000\000\000\004\000\nV\022C\000\000\000\000\003\000\n]_|\006", '\000' <repeats 25 times>...
        fnlen = <optimized out>
        code = 0
#11 0x00007f07bd2a5bd3 in ws_destroy (wsh=0x7f0781f71240) at tport/ws.c:836
No locals.
#12 0x00007f07bd2a5ca3 in tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:536
        wstp = <optimized out>
        wstp = <optimized out>
        __func__ = {<optimized out> <repeats 26 times>}
#13 tport_ws_deinit_secondary (self=0x7f0781f71050) at tport/tport_type_ws.c:530
        wstp = 0x7f0781f71050
        __func__ = "tport_ws_deinit_secondary"
#14 0x00007f07bd294df8 in tport_zap_secondary (self=0x7f0781f71050) at tport/tport.c:1101
        mr = <optimized out>
        __func__ = "tport_zap_secondary"
#15 0x00007f07bd287ed8 in su_timer_expire (timers=timers@entry=0x7f0780000ba8, timeout=timeout@entry=0x7f078bffec50, now=...) at su/su_timer.c:587
        t = 0x7f07836cc9e0
        f = 0x7f07bd296ed0 <tport_secondary_timer>
        n = 42
        __PRETTY_FUNCTION__ = "su_timer_expire"
#16 0x00007f07bd288165 in su_base_port_run (self=0x7f0780000b60) at su/su_base_port.c:339
        now = {tv_sec = <optimized out>, tv_usec = <optimized out>}
        tout = 15000
        tout2 = 0
        __PRETTY_FUNCTION__ = "su_base_port_run"
#17 0x00007f07bd28b1f3 in su_pthread_port_clone_main (varg=0x7f07b916b580) at su/su_pthread_port.c:343
        arg = 0x0
        task = {{sut_port = 0x7f0780000b60, sut_root = 0x7f07800013d0}}
        zap = 1
#18 0x00007f07bd49f812 in start_thread () from /lib64/libc.so.6
No symbol table info available.
#19 0x00007f07bd43f450 in clone3 () from /lib64/libc.so.6
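
The backtrace shows SSL_write() ending up in a blocking read() on the socket. One way to bound that wait, sketched here purely as an illustration and not as the fix for this issue, is to set a receive timeout on the descriptor before the close path writes to it. The helper name is hypothetical, and the caller would still have to handle the resulting SSL_write() error; none of that handling is shown:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: make blocking reads on `fd` give up after `ms` milliseconds. */
static int set_recv_timeout_ms(int fd, long ms)
{
    struct timeval tv;

    assert(ms >= 0);
    tv.tv_sec  = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Example: a read from a peer that never sends now fails instead of hanging. */
static void recv_timeout_example(void)
{
    int sv[2];
    char c;

    int rc = socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
    assert(rc == 0);

    rc = set_recv_timeout_ms(sv[0], 50);
    assert(rc == 0);

    ssize_t n = read(sv[0], &c, 1);  /* the peer never writes */
    assert(n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
    (void)rc;
    (void)n;

    close(sv[0]);
    close(sv[1]);
}
```

This only limits how long the thread can stay stuck; it does nothing about the descriptor pointing at the wrong object in the first place.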
