Atlas System Agent / Atlas Titus Agent

⚠️ Experimental

An agent that reports metrics for EC2 instances or Titus containers.

Build Instructions

  • The build requires a C++11 compiler, the system libraries installed by the apt-get command below, and libatlasclient.

  • To build the titus-agent:

sudo apt-get update
sudo apt-get install -y zlib1g-dev uuid-dev libblkid-dev libpcre3-dev libcap-dev

rm -rf build && mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=RELWITHDEBINFO -DTITUS_AGENT=ON ..
make VERBOSE=1 -j4
./runtests && make DESTDIR=../root install

  • To build the system agent:

Repeat the commands above, but omit -DTITUS_AGENT=ON from the cmake invocation.

Titus Agent

CPU Metrics

cgroup.cpu.processingCapacity

Amount of processing time requested for the container. This value is computed from the number of shares allocated when the job was created. Note that this is not a hard limit: if there is no contention, a job can use more than the requested capacity. However, a user should not rely on getting more than requested.

Unit: seconds/second

cgroup.cpu.processingTime

Amount of time spent processing code in the container. This metric is typically used for one of two use cases:

  1. Utilization: to see how close the job is to saturating its requested resources, divide the processing time by the processing capacity.
  2. Performance regression: for comparative analysis, the sum can be used. Make sure that both systems being compared have the same amount of resources.

Unit: seconds/second
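
The utilization calculation described above can be sketched as follows. This is a minimal illustration, not part of the agent: the function name `cpu_utilization` and the sample values are hypothetical, and the inputs are assumed to be already-queried values of cgroup.cpu.processingTime and cgroup.cpu.processingCapacity for the same time window.

```python
def cpu_utilization(processing_time: float, processing_capacity: float) -> float:
    """Return the fraction of requested CPU in use.

    Both arguments are in seconds/second, matching the units of
    cgroup.cpu.processingTime and cgroup.cpu.processingCapacity.
    """
    if processing_capacity <= 0:
        raise ValueError("processing_capacity must be positive")
    return processing_time / processing_capacity


# Hypothetical sample: a job that requested 2 cores and is using
# 1.5 seconds of CPU time per second.
utilization = cpu_utilization(1.5, 2.0)
print(f"{utilization:.0%}")  # prints "75%"
```

Because the capacity is not a hard limit, the result can exceed 1.0 when the host has no contention.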

cgroup.cpu.shares

Number of shares configured for the job. The Titus scheduler treats each CPU core as 100 shares. The processing capacity is generally more relevant to users because it is normalized to the same unit as the measured processing time.

Unit: num shares
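
As a rough illustration of how shares relate to processing capacity, the sketch below assumes a simple linear normalization at 100 shares per core. The helper `shares_to_capacity` is hypothetical, and the agent's actual computation may differ.

```python
# From the text: the Titus scheduler treats each CPU core as 100 shares.
SHARES_PER_CORE = 100


def shares_to_capacity(shares: int) -> float:
    """Convert a share count to seconds/second of requested CPU time.

    Assumes a linear mapping; this is an illustration, not the agent's code.
    """
    return shares / SHARES_PER_CORE


# Hypothetical job configured with 250 shares, i.e. two and a half cores.
print(shares_to_capacity(250))  # prints 2.5
```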

cgroup.cpu.usageTime

Amount of time spent processing code in the container in either the system or user category.

Unit: seconds/second

Dimensions:

  • id: category of usage, either system or user
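
One way to use the id dimension is to look at how usage splits between the two categories. The sketch below assumes the per-category values have already been queried into a dictionary; the function name `usage_breakdown` and the sample numbers are hypothetical.

```python
def usage_breakdown(usage_by_id: dict) -> dict:
    """Return each category's share of total usage time.

    Keys are values of the id dimension ("system" or "user");
    values are in seconds/second.
    """
    total = sum(usage_by_id.values())
    if total == 0:
        # An idle container has no meaningful split.
        return {category: 0.0 for category in usage_by_id}
    return {category: value / total for category, value in usage_by_id.items()}


# Hypothetical sample: 0.5 s/s in system time, 1.5 s/s in user time.
print(usage_breakdown({"system": 0.5, "user": 1.5}))
# prints {'system': 0.25, 'user': 0.75}
```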

Memory Metrics

cgroup.mem.failures

Counter of memory allocation failures. These typically occur when the application hits its memory limit.

Unit: failures/second

cgroup.mem.limit

Memory limit for the cgroup.

Unit: bytes

cgroup.mem.used

Memory usage for the cgroup.

Unit: bytes
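
The used and limit values can be combined to estimate how close a container is to its memory limit. This ratio is not a metric the agent reports; the helper `memory_utilization` and the sample sizes below are assumptions for illustration.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def memory_utilization(used_bytes: int, limit_bytes: int) -> float:
    """Return cgroup.mem.used as a fraction of cgroup.mem.limit."""
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    return used_bytes / limit_bytes


# Hypothetical container using 3 GiB of a 4 GiB limit.
print(memory_utilization(3 * GIB, 4 * GIB))  # prints 0.75
```

A value approaching 1.0 suggests that allocation failures (cgroup.mem.failures) may follow.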

cgroup.mem.pageFaults

Description from kernel.org

Counter indicating the number of times that a process of the cgroup triggered a "page fault" and a "major fault", respectively. A page fault happens when a process accesses a part of its virtual memory space which is nonexistent or protected. The former can happen if the process is buggy and tries to access an invalid address (it will then be sent a SIGSEGV signal, typically killing it with the famous Segmentation fault message). The latter can happen when the process reads from a memory zone which has been swapped out, or which corresponds to a mapped file: in that case, the kernel will load the page from disk, and let the CPU complete the memory access. It can also happen when the process writes to a copy-on-write memory zone: likewise, the kernel will preempt the process, duplicate the memory page, and resume the write operation on the process's own copy of the page. "Major" faults happen when the kernel actually has to read the data from disk. When it just has to duplicate an existing page, or allocate an empty page, it is a regular (or "minor") fault.

Unit: faults/second

Dimensions:

  • id: either minor or major.

cgroup.mem.processUsage

Amount of memory used by processes running in the cgroup.

Unit: bytes

Dimensions:

  • id: how the processes are using the memory. Values are cache, rss, rss_huge, and mapped_file.