guide to run TPCC benchmark ? #1
Comments
Currently, the documentation is far from complete. We are improving the coordination and partitioning, so usage will change significantly in the near future. To run TPC-C, the following points should get you started:
Keep in mind that Tell currently needs an InfiniBand network in order to work. If you do not have access to an InfiniBand cluster, you could try to get it running with SoftRoCE (a possible setup is sketched below). There is also a subproject helper_scripts under tellproject, containing several Python scripts that start up a cluster and run a workload. However, these scripts are for internal use, meaning you would have to adapt them; most of the changes should be confined to ServerConfig.py.
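If SoftRoCE is an option for you, a minimal sketch of bringing it up on a recent Linux kernel could look like the following; the rdma_rxe module, the rdma tool from iproute2, and the rdma-core utilities are assumptions about your environment, and eth0 is a placeholder interface name:

```sh
# Hypothetical SoftRoCE setup on a recent Linux kernel with iproute2 and rdma-core.
# eth0 is a placeholder for the Ethernet interface you want to expose as RDMA.
sudo modprobe rdma_rxe                          # load the software RoCE (rxe) driver
sudo rdma link add rxe0 type rxe netdev eth0    # create an rxe device on top of the NIC
rdma link show                                  # the new rxe0 link should be listed
ibv_devices                                     # rdma-core utility; should now show rxe0
```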
Thanks for your instructions. Now I can set up TPC-C and run it on my RDMA cluster. I find the helper_scripts very useful.
It really depends on what your cluster looks like. You should use 2-3 times as many tpcc_server instances as TellStore nodes. Then there are several things you can do/try:

- Are you 100% sure that you compile in release mode? Everything will be quite slow if you run in Debug mode (mostly because of logging).
- We usually used link-time optimization when we built binaries for benchmarks; this gives another ~20% performance boost. GCC seems to generate faster binaries than clang, and I think clang and the Intel compiler crash when you activate link-time optimization. To do so, call cmake with `-DCMAKE_AR=/usr/bin/gcc-ar -DCMAKE_RANLIB=/usr/bin/gcc-ranlib -DCMAKE_CXX_FLAGS="-march=native -flto -fuse-linker-plugin" -DCMAKE_BUILD_TYPE=Release` (see the sketch after this list).
- Do your machines have NUMA? In that case you should run one process per NUMA node (you can use numactl to pin them to a specific NUMA node). NUMA awareness is still something we should build, but it currently has low priority, as one process per NUMA node works quite well for us. Best practice is to have all TellStore processes on the NUMA node where your InfiniBand card is attached (probably node 0).
- We usually ran only one client process in total; the client does not really do much, it mostly sends the queries to the tpcc_server instances. The tpcc_server instances run TellDB, which does all the transaction processing and index processing, which is why they are quite heavy-weight. If my memory serves me right, we used 20 clients per machine, but you might need to play around with these numbers.
- Make sure you set the logging level to FATAL; this is especially important for TellStore.
- Make sure that you populate enough warehouses. 50 should be enough for your cluster. (Currently Tell has quite a large memory overhead; this is because our GC only frees memory once per cycle, another thing we should fix. As you see, there is still quite a lot to do.)

I hope these pointers help you out, otherwise I might need some more information. @kbocksrocker Did I forget something? Do you have something to add?
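For reference, the build and pinning steps above might look roughly like this in practice. The source path and NUMA node number are placeholders; the cmake flags are the ones quoted above, and the tellstored binary name is taken from later in this thread:

```sh
# Release build with GCC and link-time optimization, as suggested above.
# /path/to/tell is a placeholder for the source checkout.
cmake /path/to/tell \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_AR=/usr/bin/gcc-ar \
    -DCMAKE_RANLIB=/usr/bin/gcc-ranlib \
    -DCMAKE_CXX_FLAGS="-march=native -flto -fuse-linker-plugin"
make -j"$(nproc)"

# Pin a TellStore process to NUMA node 0, where the InfiniBand card is usually attached.
# The command-line flags of tellstored are omitted here on purpose.
numactl --cpunodebind=0 --membind=0 ./tellstored
```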
I don't think I have anything to add to this. Can you provide us with your setup, i.e. how many instances you started and their configuration? Building in Release with link-time optimization also helped quite a bit compared to a debug build.
Thanks for your instant feedback!
As for my setup: our cluster has 8 nodes; the machines have no NUMA; I run 1 commitmanager, 8 tellstore instances, 8 tpcc_server instances, and 32 clients in total (4 clients per server instance); I compile in Release mode. I also ran into the memory overhead problem, so I only use 32 warehouses overall, 20 GB of memory per node, and 10 seconds per GC cycle. As you suggested, I will try link-time optimization and more tpcc_server instances. Thanks for your replies 👍
You will definitely need more clients. The three stores are rowstore, columnmap, and logstructured.

For now, you could try logstructured. As TPC-C does not need scans, it will be the fastest and uses less memory than the others. You can also try to increase the number of get/put threads to two.

TellDB writes log entries to TellStore, and we do not provide an option to turn this off. You could comment out the code that does the logging, but in that case transactions are not correctly implemented. Eventually the logs will be needed to roll back transactions if a TellDB node fails; this is not yet implemented, but we still write the logs to make sure that we measure correctly. So currently nothing would change if you commented this logging code out, but the numbers would be too high and therefore dishonest; it really depends on what you need/want. For TPC-C it should not be too expensive anyway (one insert and one delete per transaction).

Try adding more clients, that should be the main bottleneck. You can also try to load fewer warehouses and have 4 storage nodes and 12 tpcc_servers. In that case the abort rate might increase, but we actually never saw a high abort rate.
If you are talking about the LogEntry class: that one does not belong to the "process logging" but to the logstructured-memory implementation that we have. It is one of our storage backends, where all information is written to an in-memory append-only structure (like a log). It is not related to the way we do transactional logging, which is part of the TellDB client library and not of the store. TellStore does not yet have support for error recovery and writes nothing at all to disk.
@kbocksrocker You get my point. Yeah, I was just wondering whether that part does the transaction logging. Now I understand: Tell maintains transaction logs during execution, which are only used for rollback when transactions abort; the logs are not written to disk, because the fault-tolerance part is not fully developed yet. But with these log entries, I think it would not be difficult to further support a basic recovery process.
Don't use rowstore; it is the least robust of our implementations. Use columnmap and logstructured. I would settle for logstructured for now.
Actually, I find rowstore is the most robust one... When running for too long, or when collocating storage nodes and processing nodes, the system experiences sudden crashes. I haven't explored the reason yet.
The fact that logstore is stalling indicates that you are not allocating enough memory for the hash table (logstructured has a different hash table). Try to allocate at least 10% of the memory for the hash table. The parameter is given in number of elements; if you look at the ServerConfig.py file you will see that we use a different memory config for logstructured, so try to take those numbers and scale them down. Logstructured should currently be the least memory-hungry storage. (A rough sizing calculation is sketched below.)
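As a back-of-the-envelope illustration of the 10% guideline, something like the following could be used to pick the element count; the per-slot byte size is purely a hypothetical number for the example, not taken from the thread or the code:

```sh
# Rough sizing of the logstructured hash table from the 10% guideline above.
# NODE_MEMORY_GB matches the 20 GB per node mentioned earlier in the thread;
# SLOT_BYTES is a hypothetical per-element size used only to make the math concrete.
NODE_MEMORY_GB=20
HASHMAP_BYTES=$(( NODE_MEMORY_GB * 1024 * 1024 * 1024 / 10 ))  # ~10% of node memory
SLOT_BYTES=16
echo "hash table capacity (elements): $(( HASHMAP_BYTES / SLOT_BYTES ))"
```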
Hi, I try to run with the logstructured store, but tpcc_server fails to launch because of "Error during socket operation [error = generic:111 Connection refused] (in handleSocketError at /user/wentian/programs/tell/crossbow/libs/infinio/include/crossbow/infinio/BatchingMessageSocket.hpp:407)".
The full error output message is as follows:
Sorry for the late response. Are you starting all the processes at once? The CommitManager and especially the TellStore server need a few seconds until they are ready, as they have to allocate a lot of memory. Can you start the client only after the tellstored process logs a "Storage ready" message? You might need to set the log level to INFO for tellstored to see the message. (A rough sketch of that startup order follows below.)
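A minimal sketch of that startup order might look like this; the binary names commitmanager, tellstored, and tpcc_server are taken from this thread, while the client binary name, all command-line flags, and the single-host layout are assumptions:

```sh
# Illustrative startup order only; flags and host distribution are omitted/assumed.
./commitmanager &                              # start the commit manager first
./tellstored > tellstore.log 2>&1 &            # storage node; needs time to allocate memory
# Wait until TellStore reports readiness (requires tellstored log level INFO or lower).
until grep -q "Storage ready" tellstore.log; do sleep 1; done
./tpcc_server &                                # TellDB / TPC-C processing node(s)
./tpcc_client                                  # hypothetical client binary driving the load
```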
Hi, I am interested in your project. I want to run the TPCC benchmark on Tell. It seems I need to run tellstore together with the commitmanager. Is there any guide on how to do this? An example would be even better. Thanks in advance!