SMP support #370
Comments
The per-cpu variables are also troublesome. They need a page-aligned section in the loaded program image at runtime. But in a shared library build, the GNU linker uses its default linker script for shared libraries, which ignores this requirement, so LKL triggers a SIGSEGV. The current solution is ugly: I manually combine the kernel linker script with that default linker script to generate the shared library. With this hack the shared library works, but the kallsyms subsystem is broken, because kallsyms uses compile-time address offsets (from lkl.o) to generate its internal lookup table, and the final shared library build changes them. Broken kallsyms means that dump_stack(), panic, and oops output become unreadable. BTW: even without SMP support, it seems that we still need a minor hack to make kallsyms work. |
I just uploaded an SMP prototype here: https://github.com/Rover-Yu/lkl-linux and wrote some documentation about it: https://github.com/Rover-Yu/lkl-linux/wiki Thanks |
Hi Rover, Thank you for your work, it sounds exciting! I am currently travelling so I did not get a chance to look at it, but I will do so during the weekend. Thanks, |
I took a quick look and it looks really nice! A couple of questions for now:
For the aarch64 support: it is not upstreamed yet, but there are two ARM-related PRs which may be helpful (cc: @mxi1). It would be nice if you could tell us which toolchain you used for your test. Having new ops entry Thanks for the great patchset; really looking forward to seeing it completed. |
@thehajime Sorry for my delay. If you need it, I can organize the descriptions, including the toolchain version, Makefile options, and how to customize binutils. They are actually already in the related issues, but I will organize the instructions, which should take only a few steps, so you can easily merge them into your branch if possible. |
I was asking @Rover-Yu about the toolchain; I just wanted to make you (@mxi1) aware of this thread. (off topic) We are also almost fine with the Android support: we tested with mptcp (https://twitter.com/thehajime/status/900596946120736770). With more cleanup of our code (and your patches), we can get that upstreamed. |
This is the information about my gcc:
EulerOS:~ # rpm -qi gcc
Name : gcc
Version : 4.9.3
Release : 154843.1
Architecture: aarch64
Install Date: Tue Jul 18 22:34:22 2017
Group : Development/Languages
Size : 25992206
License : GPLv3+ and GPLv3+ with exceptions and GPLv2+ with exceptions and LGPLv2+ and BSD
Signature : RSA/SHA1, Wed May 31 23:30:38 2017, Key ID 600317bc381d7ac3
Source RPM : gcc-4.9.3-154843.1.src.rpm
Build Date : Wed May 31 23:27:04 2017
Build Host : euler-armworker2
Relocations : (not relocatable)
Packager : http://bugs.euleros.org
Vendor : huawei
Summary : Various compilers (C, C++, Objective-C, Java, ...)
Description :
This is compiler for arm64.
EulerOS:~ # gcc --verbose
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/aarch64-linux-gnu/4.9.3/lto-wrapper
Target: aarch64-linux-gnu
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release -with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,lto --enable-plugin --enable-initfini-array --disable-libgcj --without-isl --without-cloog --enable-gnu-indirect-function --build=aarch64-linux-gnu --disable-multilib
Thread model: posix
gcc version 4.9.3 20160525 (prerelease) (GCC)
It seems to be a special build by Huawei for their ARM64 servers. In my view, there are so many kinds of ARM toolchain configurations that it is hard to list all possible items in a static list. A better solution may be to use something like regular expressions here? Anyway, I am not an expert on ARM systems ...
For performance, I am sorry that I have not tested LKL with file systems and NLS_* enabled; I guess that they should not change networking performance much. I disabled them just to reduce the complexity of adding SMP support and to get shorter build times ;) As for LKL performance on my testbed, iperf3 shows about 1.2 Gbps of bandwidth, which is not good. The testing steps are as in the wiki (https://github.com/Rover-Yu/lkl-linux/wiki). My hardware environment is below:
$ sudo lshw -short
H/W path Device Class Description
=====================================================
system Standard PC (i440FX + PIIX, 1996)
/0 bus Motherboard
/0/0 memory 96KiB BIOS
/0/400 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/401 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/402 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/403 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/404 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/405 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/406 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/407 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/408 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/409 processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/40a processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/40b processor Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
/0/1000 memory 10006MiB System Memory
/0/1000/0 memory 10006MiB DIMM RAM
...
The tc/veth/netns/packet sockets are not the performance bottleneck here: I ran some micro-benchmarks for them, and all of them can reach higher performance numbers. I don't have suitable hardware to run LKL on DPDK. The bad news is that adding batch operations didn't help performance much either. It seems that is another problem. That is all, thanks! |
@Rover-Yu thanks, the toolchain seems to include a fix which @mxi1 discussed (libos-nuse@5c5bd5c#commitcomment-22901895).
thanks. I see your point.
Thanks for sharing the information.
I don't think the current DPDK support helps much: packet sockets are enough for your test.
I haven't tried any of them and am not 100% sure, but the xmit_more flag in skb might help with this? |
For the ARM64 port, I think your patches are more complete than my simple hack :) For the batch operation improvement, I printf()ed the actual batch counts in the newly added batch interfaces, and they are only 1 most of the time, so I think your suggestion makes a lot of sense! After switching to another hardware environment, it seems that LKL/SMP crashes more easily ... which is good news :) But the highest performance is better on this machine, 1.5 Gbps now. I will focus on making SMP support more stable first; the next step is better performance. |
it's at least increasing :)
we may also consider extending it with packet_mmap to reduce the number of copies. |
The bug is fixed: lkl_start_kernel() assumed that the init process always runs on CPU0, which is no longer always true :) |
SMP support on ARM64 is added too. |
Hi, would you have any suggestions or concerns about this SMP prototype? ;) It seems that it can pass the basic tests now (I started the iperf client about 80K times without any error, on both x86_64 and ARM64). I also tried enabling file systems and NLS* support with it, and both compile without problems. But 'make -C tools/lkl tests' still fails; I think this is because the linker script or build system is not ready yet. |
Hi @Rover-Yu , I did take a quick look and my main concern is that the SMP implementation is duplicating stuff from the arch (x86, arm) layers. Would it be possible to implement the SMP required operations (locks, atomics, etc.) as native ops and rely on gcc atomics stuff? This would make the SMP implementation architecture independent. |
I guess we can't implement all of these operations with architecture-independent code, e.g. SMP barriers and cmpxchg operations are not provided by POSIX or even by GNU-extended libc. Something like spin locks could be replaced by new host operations as you said; however, I suspect that we may go back to the native operations once we start performance tuning later, since the kernel's own implementation is the better choice there, e.g. the queued spin lock in recent kernel releases. I also see that this indeed breaks portability; there may be a better solution that I don't know of yet :) |
I think we can implement almost all operations with gcc atomic built-ins: https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html Using the kernel implementation can be an option as well, I am not excluding, but I think for most usecases a generic implementation may be good enough. |
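For illustration only, here is a rough sketch of what architecture-independent helpers built on GCC built-ins might look like. It uses the newer `__atomic` built-ins (available since GCC 4.7) rather than the `__sync` ones linked above. All `host_*` names are hypothetical and are not existing LKL host operations, and sequentially consistent ordering is assumed as a conservative stand-in for the kernel's barrier semantics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers; none of these names exist in LKL. */

/* Full memory barrier, roughly the role smp_mb() plays in the kernel. */
static inline void host_smp_mb(void)
{
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/*
 * cmpxchg-style helper: returns the value that was found in *ptr.
 * The swap happened if and only if the return value equals old.
 */
static inline long host_cmpxchg(long *ptr, long old, long new_val)
{
	long expected = old;

	/* On failure the built-in stores the current value into expected. */
	__atomic_compare_exchange_n(ptr, &expected, new_val, false,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return expected;
}

/* Simple test-and-test-and-set spin lock; 0 = free, 1 = held. */
struct host_spinlock {
	int locked;
};

static inline bool host_spin_trylock(struct host_spinlock *l)
{
	return __atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE) == 0;
}

static inline void host_spin_lock(struct host_spinlock *l)
{
	while (!host_spin_trylock(l)) {
		/* Spin on plain loads until the lock looks free again. */
		while (__atomic_load_n(&l->locked, __ATOMIC_RELAXED))
			;
	}
}

static inline void host_spin_unlock(struct host_spinlock *l)
{
	assert(__atomic_load_n(&l->locked, __ATOMIC_RELAXED) == 1);
	__atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}
```

A lock this simple is unfair under contention, which is one reason the kernel's own queued spin lock may still be preferable once performance tuning starts.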
It seems that these "__sync_*" interfaces are marked as legacy :) I remember that the Linux kernel community once discussed whether they should use the new GCC built-in C11 atomics or memory barriers. The link is https://lwn.net/Articles/586838/ It seems that kernel atomics and current C11 atomics have some subtle differences. Anyway, your concerns are reasonable; I will look into this in more detail. BTW: I just tried replacing the timer_*() interfaces in the current POSIX host operations with the timerfd syscalls, but it didn't help performance much. However, it does avoid creating a lot of helper timer threads. |
Hi, so does the latest LKL support SMP? Or do we still need to have "One workaround is to shard the application thanks! |
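As a rough sketch of the timerfd idea mentioned above (not the code from the prototype), a one-shot timer can be armed and waited on through a file descriptor instead of a per-timer helper thread. The `host_timer_arm` and `host_timer_wait` names are hypothetical; only `timerfd_create`, `timerfd_settime`, `read`, and `close` are real Linux/POSIX calls.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/timerfd.h>

/*
 * Hypothetical helper: arm a one-shot timer that fires after ns
 * nanoseconds. ns must be non-zero, because an all-zero it_value
 * disarms a timerfd. Returns the timer fd, or -1 on error.
 */
int host_timer_arm(uint64_t ns)
{
	struct itimerspec its = { 0 };
	int fd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC);

	if (fd < 0)
		return -1;

	its.it_value.tv_sec = ns / 1000000000ULL;
	its.it_value.tv_nsec = ns % 1000000000ULL;
	/* it_interval stays zero, so the timer fires only once. */
	if (timerfd_settime(fd, 0, &its, NULL) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: block until the timer expires. Returns the
 * number of expirations read from the fd, or 0 on error.
 */
uint64_t host_timer_wait(int fd)
{
	uint64_t expirations = 0;

	if (read(fd, &expirations, sizeof(expirations)) != sizeof(expirations))
		return 0;
	return expirations;
}
```

Because the timer is just a file descriptor, a single thread could in principle wait on many of them with poll() or epoll, which is where the saving in helper threads would come from.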
Ping. Curious whether the latest LKL supports SMP? |
When CPUs A and B invoke smp_call_function_single() on each other, it deadlocks. Reason: LKL assumes lkl_cpu_get() is called with IRQs disabled. |
I have written a prototype of SMP support for LKL with the POSIX host backend. I will open it later, after cleaning up the ugly parts.
Below are some of my experiences:
Tree RCU is not a major problem for us, although making it work took me days. The key point is that we have to make sure the RCU core can identify idle processors and give them opportunities to run RCU bookkeeping work in time; otherwise, a grace period (GP) may take a long time to complete or even hang the whole LKL application.
The new_host_task() API may create a kernel thread that is not on the current processor, which breaks the preconditions of switch_to_host_task(). This problem is still open in the prototype; however, I think it is a minor problem.
I think the LKL interrupt mechanism is not an ideal solution for high-performance or low-latency use cases. The same goes for the timer: each timer interrupt creates a thread. I guess we may need some hard work here.
I use a variable in thread-local storage to save the current processor id of the current LKL task/thread. An LKL task may change its running processor due to task migration or a scheduler affinity change, so we have to update the variable at context switch time.
IPI and per-CPU local timer support are necessary too, in my opinion.
With SMP support, I encountered some other interesting bugs too ...
Lastly, thanks for your great work on LKL :)
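To make the thread-local processor id point above more concrete, here is a minimal sketch. It assumes each LKL task is backed by its own host thread; the `host_*` names, the `host_task` structure, and the CPU count are all hypothetical and are not taken from the prototype.

```c
#include <assert.h>

#define HOST_NR_CPUS 4	/* assumption: number of virtual CPUs */

/* Hypothetical per-task bookkeeping; not an LKL structure. */
struct host_task {
	int cpu;	/* CPU the scheduler assigned this task to */
};

/* Processor id of the task running on this host thread; -1 = not set yet. */
static __thread int host_current_cpu = -1;

static inline int host_smp_processor_id(void)
{
	return host_current_cpu;
}

/*
 * Called on the host thread backing next at context switch time,
 * so the thread-local id follows migration and affinity changes.
 */
static inline void host_switch_in(const struct host_task *next)
{
	assert(next->cpu >= 0 && next->cpu < HOST_NR_CPUS);
	host_current_cpu = next->cpu;
}

/* Migration only updates the task; the TLS copy changes on the next switch-in. */
static inline void host_migrate_task(struct host_task *t, int new_cpu)
{
	assert(new_cpu >= 0 && new_cpu < HOST_NR_CPUS);
	t->cpu = new_cpu;
}
```

In this sketch the id seen by a running task stays stable until its next switch-in, which is why the update has to happen at context switch time.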