Does this instance have a privacy policy? #18

Closed
johndoe432 opened this issue Jan 6, 2022 · 10 comments

Comments

@johndoe432

Does it log anything and, if it does, for how long are these logs stored?

@mrpaulblack mrpaulblack self-assigned this Jan 8, 2022
@mrpaulblack
Member

mrpaulblack commented Jan 8, 2022

Hi @johndoe432 ,
Thanks for your questions.

TL;DR
Yes I am logging requests.

The long answer:
Currently my stack consists of traefik as a reverse proxy with filtron behind it, and behind filtron sits my SearXNG instance. I am not logging anything with filtron or with SearXNG itself. From traefik, on the other hand, I am collecting an access log and storing it indefinitely with loki so I can organize and display this data in grafana dashboards... The entire stack is OSS and I am NOT logging the referrer or the IP address. I also have regex filters in place to remove the search param q= from queries as well as the image_proxy params from those logs. These are the regex filters currently used on my instance:

          - replace:
              expression: '(?:[0-9]{1,3}\.){3}([0-9]{1,3})'
              replace: '***'
          - replace:
              expression: '(/search\?(q=|preferences=).*?\")'
              replace: '/search"'
          - replace:
              expression: '(/autocompleter\?(q=|preferences=).*?\")'
              replace: '/autocompleter"'
          - replace:
              expression: '(/image_proxy\?url=.*?\")'
              replace: '/image_proxy"'
          - replace:
              expression: '(/*\?(q=|preferences=).*?\")'
              replace: '/?q="'
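To make the effect of these filters concrete, here is a minimal Python sketch that applies the same expressions to made-up log fragments. It assumes the filters substitute only the first capture group of each match (which would explain why the sample log line still shows 172.18.0.***); the names FILTERS, replace_group and redact and the sample inputs are hypothetical, and the real log pipeline may treat nested groups differently.

```python
import re

# Expressions and replacements copied from the filters above.
FILTERS = [
    (r'(?:[0-9]{1,3}\.){3}([0-9]{1,3})', '***'),
    (r'(/search\?(q=|preferences=).*?\")', '/search"'),
    (r'(/autocompleter\?(q=|preferences=).*?\")', '/autocompleter"'),
    (r'(/image_proxy\?url=.*?\")', '/image_proxy"'),
    (r'(/*\?(q=|preferences=).*?\")', '/?q="'),
]


def replace_group(match, replacement):
    """Replace only the span of capture group 1, keeping the rest of the match.

    Assumption: this mimics a replace stage that rewrites captured groups
    rather than the whole match.
    """
    whole = match.group(0)
    start, end = match.span(1)
    offset = match.start(0)
    return whole[:start - offset] + replacement + whole[end - offset:]


def redact(line):
    """Run every filter over one log line, in the order listed above."""
    for pattern, replacement in FILTERS:
        line = re.sub(pattern, lambda m, r=replacement: replace_group(m, r), line)
    return line


if __name__ == "__main__":
    # Made-up fragments, not real log data.
    print(redact('"ServiceAddr":"172.18.0.5:8080"'))
    print(redact('"RequestPath":"/search?q=secret&language=en"'))
```

In this sketch the IP filter keeps the first three octets because only the last octet is captured, while the path filters capture the whole match and so replace it entirely.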

This is a typical log line for making a search with my instance:

{"DownstreamContentSize":9160,"DownstreamStatus":200,"Duration":850427573,"OriginContentSize":9160,"OriginDuration":850159970,"OriginStatus":200,"Overhead":267603,"RequestAddr":"paulgo.io","RequestContentSize":26,"RequestCount":15087,"RequestHost":"paulgo.io","RequestMethod":"POST","RequestPath":"/search","RequestPort":"-","RequestProtocol":"HTTP/2.0","RequestScheme":"https","RetryAttempts":0,"RouterName":"searxng@docker","ServiceAddr":"172.18.0.***:8080","ServiceName":"searxng-searxng@docker","ServiceURL":{"Scheme":"http","Opaque":"","User":null,"Host":"172.18.0.***:8080","Path":"","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""},"StartLocal":"2022-01-08T20:37:37.633408099Z","StartUTC":"2022-01-08T20:37:37.633408099Z","TLSCipher":"TLS_AES_128_GCM_SHA256","TLSVersion":"1.3","entryPointName":"https","level":"info","msg":"","request_Sec-Fetch-Dest":"document","request_Sec-Fetch-Mode":"navigate","request_Sec-Fetch-Site":"none","request_Sec-Fetch-User":"?1","time":"2022-01-08T20:37:38Z"}

So here is the reasoning for doing this basic logging: I want to know whether my site is actually healthy and works as expected for the end user. Since there are a lot of changes, I want to see whether these changes actually work correctly once they are integrated.

The metrics I am concerned with the most are:

  • The response time for a request. So for example, if I change the search engines that are enabled by default, I want to see that the response time stays the same or gets reduced...
  • The number of requests for each HTTP code; I basically want to know if my site works correctly or if it's broken and returns a lot of 50x HTTP codes, for example
  • The TLS protocol versions and ciphers used, to see what TLS config makes sense for the userbase of my site (so I can set the most secure ciphers that are still used by most people)
  • ...
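As a rough illustration of how these three metrics could be derived from access-log lines like the one shown earlier, here is a minimal Python sketch. The field names (DownstreamStatus, Duration, TLSVersion, TLSCipher) are taken from the sample log line; the function name summarize is hypothetical, and treating Duration as nanoseconds is an assumption based on the sample value.

```python
import json
from collections import Counter


def summarize(lines):
    """Aggregate response time, status codes and TLS usage from JSON log lines."""
    status_counts = Counter()
    tls_counts = Counter()
    durations_ms = []

    for raw in lines:
        entry = json.loads(raw)
        status_counts[entry["DownstreamStatus"]] += 1
        tls_counts[(entry.get("TLSVersion"), entry.get("TLSCipher"))] += 1
        # Assumption: Duration is in nanoseconds, so divide to get milliseconds.
        durations_ms.append(entry["Duration"] / 1e6)

    mean_ms = sum(durations_ms) / len(durations_ms) if durations_ms else 0.0
    server_errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return {
        "status_counts": dict(status_counts),
        "tls_counts": dict(tls_counts),
        "mean_ms": mean_ms,
        "server_errors": server_errors,
    }


if __name__ == "__main__":
    # Made-up entries that only carry the fields used above.
    sample = [
        json.dumps({"DownstreamStatus": 200, "Duration": 850427573,
                    "TLSVersion": "1.3", "TLSCipher": "TLS_AES_128_GCM_SHA256"}),
        json.dumps({"DownstreamStatus": 502, "Duration": 150000000,
                    "TLSVersion": "1.3", "TLSCipher": "TLS_AES_128_GCM_SHA256"}),
    ]
    print(summarize(sample))
```

None of the fields used here identify a user or a query, which is the point of keeping only this subset.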

So the alternative to doing this logging would be to use metrics, for example. I have been trying this alternative for this issue and experimented with prometheus metrics. That would mean no logging while still getting data like response time and so on...

The problem I am having with this is that these are metrics: compared to logs, they are inaccurate. (This is a problem for 50x errors, for example, since with metrics, errors from SearXNG can either get overblown or not be reported...)

This graph shows the number of requests per 2 minutes for the last 1h with logging:
image

This is the same graph with prometheus metrics over the same time period:
image
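One way to think about the difference between the two graphs is sketched below in Python: log lines can be counted exactly per 2-minute window, while a metrics counter only reveals how much the total grew between two scrapes. This is a simplified illustration, not how loki or prometheus compute their graphs; the function names and the sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime

WINDOW_SECONDS = 120  # the 2-minute window used in the graphs


def count_from_logs(timestamps, window=WINDOW_SECONDS):
    """Exact per-window request counts from ISO 8601 log timestamps."""
    buckets = Counter()
    for ts in timestamps:
        epoch = datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()
        buckets[int(epoch // window) * window] += 1
    return dict(buckets)


def increase_from_samples(samples):
    """Per-interval increases from (epoch_seconds, counter_value) scrapes.

    Only the totals at scrape time are known, so anything that happened
    between two scrapes is reduced to a single difference.
    """
    increases = []
    for (_, v0), (t1, v1) in zip(samples, samples[1:]):
        # A lower value means the counter was reset (e.g. after a restart).
        delta = v1 - v0 if v1 >= v0 else v1
        increases.append((t1, delta))
    return increases


if __name__ == "__main__":
    print(count_from_logs(["2022-01-08T20:36:05Z", "2022-01-08T20:37:38Z",
                           "2022-01-08T20:38:10Z"]))
    print(increase_from_samples([(0, 10), (120, 25), (240, 3)]))
```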

So in the end that means: I can see that a search was done at a specific time, but I cannot see who did it or what that person searched for.

I am open to suggestions to make my current stack better. So please leave a comment if you disagree or have any concerns with my setup.

@mrpaulblack mrpaulblack pinned this issue Jan 8, 2022
@mrpaulblack mrpaulblack removed their assignment Jan 8, 2022
@mrpaulblack
Member

Ok just to give a quick update I am still going to try out prometheus metrics as a replacement for loki logging and keep this ticket updated with my progress on that...

@johndoe432
Author

johndoe432 commented Jan 9, 2022

Thanks for your detailed answer. Just wanted to be sure that there is no personally identifiable information being logged.

I appreciate your work of sharing tools for saving privacy with people!

@mrpaulblack
Member

Ok just to give an update. I have been working on a prometheus dashboard and got it to a point where I no longer need the loki dashboard (which uses logging). I decided to drop the access log from my reverse proxy completely and fully rely on metrics, which are IMO good enough for performance data in production.

Meaning:

  • prod server is faster since metrics are lighter on resources and can more easily be offloaded to another server where prometheus is running
  • I am no longer logging anything with my reverse proxy or SearXNG itself on my prod server, and I also deleted all logs that I had already collected for SearXNG and every other service on my VPS
  • I still get perf data and can see if one of my services is slow to respond or throws a lot of HTTP 500 codes, for example

Since I am no longer logging anything I am going to close this issue 👍

@MuntashirAkon

MuntashirAkon commented Feb 14, 2022

You should still include a privacy-policy (with a single line in it, for example) or add a simple phrase at the footer (e.g. no logging) that you aren't collecting any PII. (ear in SearXNG has some free space at the top. You can include it there too, but I guess it might be too much.)

And thanks for your efforts. This is the best searX instance to my knowledge right now. (search.disroot.org used to be my daily driver, but sadly, it lost it last year.)

@silverwings15

silverwings15 commented Feb 14, 2022

> This is the best searX instance to my knowledge right now.

agreed, anon.sx was my go to for a good few months but paulgo edges it out slightly in speed

edit: i also tried out searx.be which was great as well, but decided to settle on paulgo

@MuntashirAkon

Reliability is a bigger issue I think. I've also tried a few other instances now and then but most of them become slow after a few months (or even days). Paulgo used to be slow last year (when I was trying random instances after disroot's failure in getting any sane results), but it's been much improved. The results are very quick now.

But I believe this is an off-topic discussion. So, I will stop.

@mrpaulblack
Member

@MuntashirAkon Yeah I think this is a good idea. What do you think of adding something like a motd underneath the search input field on the index page? Something like this maybe? (similar to https://www.qwant.com/)
desktop light theme:
image

mobile dark theme:
image

Otherwise I think a link to a privacy policy in the new about page would be the right step IMO.

@MuntashirAkon

Yes, this looks good. But I do not think this is enough. For example, I would expect it to say that it logs none of the personally identifiable information such as IP address, User Agent, queries etc. or any tracking cookies, scripts, etc. which you cannot put in one line.

@sankhababu

> Yes, this looks good. But I do not think this is enough. For example, I would expect it to say that it logs none of the personally identifiable information such as IP address, User Agent, queries etc. or any tracking cookies, scripts, etc. which you cannot put in one line.

That will be the right approach IMO.
