
RFC: All new logging/tracing system #259

Open
AndrewDryga opened this issue Aug 6, 2017 · 5 comments

AndrewDryga commented Aug 6, 2017

Logging has a few issues, and they are starting to worry me:

  1. We need a good way to sanitize sensitive data out of the logs;
  2. To log the request and response we need them as binary data, which works well for small JSON payloads that fit in the request context, but very poorly for all other use cases;
  3. PostgreSQL is a bad place to store logs;
  4. The real demand is "I want to trace everything that happened in my system", not "I want to see what the request and response looked like".

I think it is time to reconsider this part of Annon as a global, adapter-based tracing module. It could keep working as it does today (which is fine for small deployments) while gaining much more for production systems: sending data to SaaS services, to other data stores (Elasticsearch or Cassandra), and to tracing systems (Datadog APM, New Relic APM, AppSignal or Scout).

The adapter API would be large and demanding, so not every service could be integrated, but that is acceptable: most of the issues above would then have a clear path to a solution.
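As a rough sketch of what such an adapter contract could look like (every module and callback name below is hypothetical, not an existing Annon API):

```elixir
# Hypothetical adapter contract; names are illustrative only and
# not part of Annon's current codebase.
defmodule Annon.Tracing.Adapter do
  @moduledoc """
  Behaviour each tracing backend (Postgres, Elasticsearch,
  Datadog APM, ...) would implement.
  """

  # Scrub sensitive fields before the span leaves the gateway
  # (addresses issue 1 above).
  @callback sanitize(span :: map()) :: map()

  # Ship one collected span/trace to the backend.
  @callback write_span(span :: map()) :: :ok | {:error, term()}
end
```

Backends that cannot implement the full contract simply would not be supported, which matches the trade-off described above.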

@AndrewDryga

Might benefit from using: https://github.com/zachdaniel/spandex

@AndrewDryga

Related to: #209


@AndrewDryga

Writing logs to PostgreSQL under load is obviously inefficient. By default logs are written from an async process, but its message queue gets backed up pretty fast.

Even worse, because the queue is so long, processes that send to it are punished by the BEAM backpressure mechanism (the number of reductions charged for a message send is proportional to the receiver's queue length). So the latency deviation for incoming connections grows, because the processes handling them keep getting suspended.
I believe it is even planned to remove this backpressure mechanism in an upcoming OTP release.
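The cast-vs-call difference can be demonstrated with a toy writer process; everything below is an illustrative sketch, not Annon's actual log writer:

```elixir
# Toy stand-in for a slow Postgres log writer.
defmodule SlowWriter do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)
  def init(state), do: {:ok, state}

  # Async path: callers never wait, so under load the writer's
  # mailbox grows without bound.
  def write_async(entry), do: GenServer.cast(__MODULE__, {:write, entry})

  # Sync path: each caller blocks until its entry is written, so the
  # mailbox length is bounded by the number of concurrent callers.
  def write_sync(entry), do: GenServer.call(__MODULE__, {:write, entry})

  def handle_cast({:write, _entry}, state) do
    Process.sleep(1) # simulate a slow INSERT
    {:noreply, state}
  end

  def handle_call({:write, _entry}, _from, state) do
    Process.sleep(1)
    {:reply, :ok, state}
  end
end
```

Firing a burst of `write_async/1` calls and then checking `Process.info(pid, :message_queue_len)` shows the mailbox backing up, while `write_sync/1` keeps it near zero; that is the same effect as the benchmark numbers below.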

Some data (with async log writer):
(screenshot, 2018-03-24 15:17:07)

ab -p ./post_loc.txt -l -n 100000 -c 100 http://127.0.0.1:4000/httpbin/post 
This is ApacheBench, Version 2.3 <$Revision: 1796539 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        gunicorn/19.6.0
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /httpbin/post
Document Length:        Variable

Concurrency Level:      100
Time taken for tests:   475.071 seconds
Complete requests:      100000
Failed requests:        0
Non-2xx responses:      621
Total transferred:      1104451167 bytes
Total body sent:        406500000
HTML transferred:       1069272902 bytes
Requests per second:    210.49 [#/sec] (mean)
Time per request:       475.071 [ms] (mean)
Time per request:       4.751 [ms] (mean, across all concurrent requests)
Transfer rate:          2270.32 [Kbytes/sec] received
                        835.61 kb/s sent
                        3105.93 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   36 205.5      0    7569
Processing:     7  438 204.1    433    4249
Waiting:        6  437 204.0    432    4248
Total:          7  475 266.8    448    7604

Percentage of the requests served within a certain time (ms)
  50%    448
  66%    501
  75%    540
  80%    569
  90%    679
  95%    881
  98%   1186
  99%   1505
 100%   7604 (longest request)

With sync log writer:

config :annon_api, :plugin_pipeline,
  default_features: [:log_consistency]

Message queues are empty (almost always 0) and memory consumption stays under 100 MB.

(screenshot, 2018-03-24 15:36:46)

ab -p ./post_loc.txt -l -n 100000 -c 100 http://127.0.0.1:4000/httpbin/post 
This is ApacheBench, Version 2.3 <$Revision: 1796539 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 127.0.0.1 (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests


Server Software:        gunicorn/19.6.0
Server Hostname:        127.0.0.1
Server Port:            4000

Document Path:          /httpbin/post
Document Length:        Variable

Concurrency Level:      100
Time taken for tests:   487.047 seconds
Complete requests:      100000
Failed requests:        0
Non-2xx responses:      443
Total transferred:      1106356543 bytes
Total body sent:        406500000
HTML transferred:       1071172048 bytes
Requests per second:    205.32 [#/sec] (mean)
Time per request:       487.047 [ms] (mean)
Time per request:       4.870 [ms] (mean, across all concurrent requests)
Transfer rate:          2218.32 [Kbytes/sec] received
                        815.06 kb/s sent
                        3033.38 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   24 151.0      0    2265
Processing:    15  463 188.1    444    2339
Waiting:       15  460 187.4    442    2321
Total:         16  487 224.0    450    3230

Percentage of the requests served within a certain time (ms)
  50%    450
  66%    502
  75%    548
  80%    584
  90%    699
  95%    870
  98%   1158
  99%   1361
 100%   3230 (longest request)

Even tail latency is better now.

@AndrewDryga

https://opencensus.io/
