Bug in handling migrated symbols #120

Open
stefanct opened this issue Jul 14, 2024 · 2 comments

stefanct commented Jul 14, 2024

Describe the bug
I have an embedded project for an ARM microcontroller that I need to be able to compile with both the vendor's Eclipse environment (which generates GNU make makefiles) and a CMake environment. Both use the same GCC-based toolchain binaries (currently based on GCC 12.3.1). I am using elf_diff to ensure that the resulting binaries are equal "enough". This has worked fine so far with different configurations (e.g., different linker scripts, applications).

Eventually, I ran into a false positive (reporting the files to differ although they do not, AFAICT) that also shows a wrong location for the respective symbols in one of the ELF files.

To Reproduce
It's not exactly easy to provide an MWE including the sources (and, assuming this is ARM-specific, you would need the right toolchain too). The main culprit is this function. I could provide you with the ELF files, but I'd rather do that in private because this is work-related and the paths contain the customer and project name :).
I execute elf_diff with --skip_symbol_similarities, --bin_dir pointing to the ARM toolchain used for building, and --bin_prefix "arm-none-eabi-".

I found out some interesting and hopefully at least partially helpful facts:

  • Both ELF files work fine in practice as far as execution is concerned.
  • The order of the object files during linking is important. I can make the false positive go away by swapping two files around in the linker's command line(!).
  • Neither of the two object files involved in the swapping contains the symbols reported in the false positive.
  • The respective function is an interrupt handler function that is defined with __attribute__ ((weak, section(".after_vectors"))) and has a declared prototype with __attribute__ ((weak)) only. This function is then aliased to 134 other function names with __attribute__ ((weak, alias (...))).
  • The multipage html output correctly lists all of the function names at the same line where the actual definition is located for one of the ELF files and consistently on a wrong line and wrong file for the other ELF file.
  • Dumping the debug info (with arm-none-eabi-readelf -w) shows a lot of warnings for both sides, including numerous "readelf: Warning: There is a hole [... - ...] in .debug_loc section." messages and exactly 10 occurrences each of "readelf: Warning: Hole and overlap detection requires adjacent view lists and loclists." (I don't know why 10 times yet. There are 22 object files involved.)
  • This happens only for one linker script configuration where the function in question is mapped to a physical address near 0 (namely to 0x000002ee).

I couldn't find out what the wrong side is actually pointing to. As I mentioned, the file pointed to does not contain any of the affected symbols at all. The line number is also different, but I could not determine where it is coming from. From all of the above, I think this is either a bug in ld or in the ELF parsing (or both) that is triggered by some peculiar debug info of aliased functions.

Expected behavior
I am not sure exactly. Ideally, the expected behavior should be that it just works and shows the files to be equal. Alternatively, it could probably also try to detect the erroneous circumstance and report this as an error.

Screenshots
image

Desktop (please complete the following information):

  • OS: Debian stable
  • Version 0.7.1 from pip
@stefanct
Author

I have found another false positive. This one is misidentifying jump tables(?), e.g., symbols like CSWTCH.97. The names of the symbols seem to differ between the ELF files (so arguably this is a true positive). However, in one ELF file elf_diff identifies a C string literal as part of the symbol, while in the other one the detailed html output just shows the leading binary data (".........") without the C string literal. Since this is the first time I have noticed this kind of difference, I assume elf_diff handles the CSWTCH names properly and compares their content. Is this correct? If so, the question is why it thinks that the string is part of the CSWTCH in one instance and not in the other.

@stefanct
Author

Here is an objdump disassembly that superficially shows what the problem is:

elf1:

20203b62 <enetPllConfig_BOARD_BootClockRUN>:
20203b62:	0100 0100                                    .....

20203b67 <usb1PllConfig_BOARD_BootClockRUN>:
20203b67:	0000 6548 6c6c 206f 7266 6d6f 5220 4f54     ..Hello from RTO
20203b77:	2053 6174 6b73 0d2e 000a 7250 2065 5452     S task....Pre RT
20203b87:	534f 0a0d 4800 6c65 6f6c 745f 7361 006b     OS...Hello_task.
20203b97:	6154 6b73 6320 6572 7461 6f69 206e 6166     Task creation fa
20203ba7:	6c69 6465 2e21 0a0d                          iled!....

20203bb0 <CSWTCH.102>:
20203bb0:	0a02 0a0a 0a0a 0a0a 0a0a 0a0a 080a 4910     ...............I
20203bc0:	4c44 0045 6d54 5172 5400 726d 5320 6376     DLE.TmrQ.Tmr Svc
20203bd0:	0000 0000                                   ....

elf2:

20203b5a <CSWTCH.97>:
20203b5a:	0a02 0a0a 0a0a 0a0a 0a0a 0a0a 080a 4810     ...............H
20203b6a:	6c65 6f6c 6620 6f72 206d 5452 534f 7420     ello from RTOS t
20203b7a:	7361 2e6b 0a0d 5000 6572 5220 4f54 0d53     ask....Pre RTOS.
20203b8a:	000a 6548 6c6c 5f6f 6174 6b73 5400 7361     ..Hello_task.Tas
20203b9a:	206b 7263 6165 6974 6e6f 6620 6961 656c     k creation faile
20203baa:	2164 0d2e 000a 4449 454c 5400 726d 0051     d!....IDLE.TmrQ.
20203bba:	6d54 2072 7653 0063                         Tmr Svc.

20203bc2 <enetPllConfig_BOARD_BootClockRUN>:
20203bc2:	0100 0100                                    .....

20203bc7 <usb1PllConfig_BOARD_BootClockRUN>:
20203bc7:	0000 0000                                    .....

Any ideas how to debug this or what the culprit could be? Or to whom to report this? GCC? binutils?
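
One way to start narrowing this down is to undo objdump's 16-bit little-endian grouping and look at the raw bytes following each CSWTCH label. The sketch below is only an illustration built on the dump lines quoted above. The helper names (`decode_halfwords`, `split_table`) are made up for this example, and the table length of 15 bytes is an assumption inferred from where printable text starts in the dump, not a size read from the symbol table.

```python
def decode_halfwords(groups: str) -> bytes:
    """Turn objdump-style 16-bit little-endian hex groups back into raw bytes."""
    out = bytearray()
    for group in groups.split():
        out += int(group, 16).to_bytes(2, "little")
    return bytes(out)


def split_table(data: bytes, table_len: int):
    """Split the dumped bytes into the assumed table part and whatever follows."""
    return data[:table_len], data[table_len:]


# Hex groups copied from the disassembly above.
# elf1: both dump lines of CSWTCH.102, truncated after the "IDLE" string.
ELF1_CSWTCH = "0a02 0a0a 0a0a 0a0a 0a0a 0a0a 080a 4910 4c44 0045"
# elf2: the first two dump lines of CSWTCH.97.
ELF2_CSWTCH = (
    "0a02 0a0a 0a0a 0a0a 0a0a 0a0a 080a 4810 "
    "6c65 6f6c 6620 6f72 206d 5452 534f 7420"
)

# Assumption: the switch table itself is 15 bytes long (the non-text prefix).
TABLE_LEN = 15

elf1_table, elf1_tail = split_table(decode_halfwords(ELF1_CSWTCH), TABLE_LEN)
elf2_table, elf2_tail = split_table(decode_halfwords(ELF2_CSWTCH), TABLE_LEN)

print("tables identical:", elf1_table == elf2_table)
print("elf1 bytes after table:", elf1_tail)
print("elf2 bytes after table:", elf2_tail)
```

Under that assumption, the leading table bytes are the same in both files and only the data placed after them differs. Whether the recorded symbol size actually covers the strings would have to be checked in the symbol table, for example with arm-none-eabi-readelf -s or arm-none-eabi-nm -S.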
