bug-1921849: support elasticsearch 8 #6741

Open. Wants to merge 7 commits into base: main.

Conversation

relud
Member

@relud relud commented Oct 3, 2024

Use ELASTICSEARCH_MODE=PREFER_NEW to make the webapp use ES 8 and the processor write to both ES 1.4 and ES 8.

@relud relud force-pushed the relud-es-8-crash-storage branch 11 times, most recently from d96b425 to 9ac223f Compare October 8, 2024 18:09
@relud relud requested a review from willkg October 8, 2024 18:19
@relud

This comment was marked as resolved.

@relud relud force-pushed the relud-es-8-crash-storage branch 3 times, most recently from 32912f9 to b360970 Compare October 9, 2024 23:58
@relud relud marked this pull request as ready for review October 9, 2024 23:59
@relud relud requested a review from a team as a code owner October 9, 2024 23:59
@willkg

This comment was marked as resolved.

@relud relud force-pushed the relud-es-8-crash-storage branch 4 times, most recently from 19b3ea4 to c0e2cfd Compare October 30, 2024 17:55
@relud

This comment was marked as resolved.

Contributor

@willkg willkg left a comment

We're going to do this PR in two passes. This is a code read pass.

While you're fixing things I brought up, I'll spend some time going through some manual testing for things I'm wondering about.

Then after you make changes, I'll read through those and add anything that came up in manual testing.

socorro/external/es/crashstorage.py (outdated, resolved)
socorro/external/es/crashstorage.py (outdated, resolved)
socorro/external/es/super_search_fields.py (outdated, resolved)
socorro/external/es/super_search_fields.py (resolved)
socorro/external/es/super_search_fields.py (outdated, resolved)
socorro/tests/external/es/test_supersearch.py (outdated, resolved)
"storage_mapping": {
    "analyzer": "semicolon_keywords",
    "type": "text",
    "fielddata": True,
Contributor

For my notes, fielddata docs are here:

https://www.elastic.co/guide/en/elasticsearch/reference/8.15/text.html#fielddata-mapping-param

This is a text field and in order to aggregate/sort on it, we need to set fielddata=True.

We don't want to do this:

"storage_mapping": {
    "analyzer": "semicolon_keywords",
    "fields": {"full": {"type": "keyword"}},
    "type": "text",
}

because we want to treat each token separately for aggregation.

@relud Is that right?

Member Author

that matches my understanding
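
To make the per-token point concrete, here is a small pure-Python sketch (not code from this PR, and not how Elasticsearch implements it). It assumes the `semicolon_keywords` analyzer behaves roughly like splitting on `;`; the helper names and sample values are hypothetical.

```python
from collections import Counter


def semicolon_tokens(value: str) -> list[str]:
    """Split a value on semicolons, approximating a semicolon-based analyzer."""
    return [tok.strip() for tok in value.split(";") if tok.strip()]


def aggregate_by_token(values: list[str]) -> Counter:
    """Count documents per token, like a terms aggregation on an analyzed text field."""
    counts: Counter = Counter()
    for value in values:
        # A document counts once per distinct token it contains.
        counts.update(set(semicolon_tokens(value)))
    return counts


def aggregate_whole(values: list[str]) -> Counter:
    """Count documents per whole value, like aggregating on a keyword sub-field."""
    return Counter(values)


docs = ["a.dll;b.dll", "b.dll;c.dll"]
print(aggregate_by_token(docs))
print(aggregate_whole(docs))
```

With the token-based version, `b.dll` is counted in both documents; with the whole-value version, each full string is its own bucket and `b.dll` never appears as a bucket on its own.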

# out which *key* had the bad input.
for key, value in kwargs.items():
    if value == bad_input:
        raise BadArgumentError(key) from exc
Contributor

Can you tell me more about the changes in this section and the below section? It seems like the original code handled two different kinds of errors and the new code only handles one of those here and the other one in a different block. Is that right?

Member Author

That is correct. This used to handle malformed query and invalid regex, but now it only handles malformed query, and bad regex is now considered a shard failure by ES, so it's handled below.
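
As a rough illustration of the malformed-query path only, the key lookup can be sketched as a standalone function. This is a sketch under assumptions, not the PR's code: `BadArgumentError` is a local stand-in for Socorro's exception, `raise_for_bad_input` is a hypothetical name, and re-raising the original exception when no key matches is an assumed fallback.

```python
class BadArgumentError(Exception):
    """Local stand-in for Socorro's exception so this sketch runs on its own."""


def raise_for_bad_input(bad_input, kwargs, exc):
    """Raise BadArgumentError naming the key whose value was the bad input."""
    for key, value in kwargs.items():
        if value == bad_input:
            # Chain the original error so the root cause stays visible.
            raise BadArgumentError(key) from exc
    # Assumed fallback: no key matched, so surface the original error.
    raise exc


try:
    raise_for_bad_input(
        "[", {"signature": "[", "product": "Firefox"}, ValueError("malformed query")
    )
except BadArgumentError as err:
    print(f"bad argument: {err}")
```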

socorro/external/es/supersearch.py (outdated, resolved)
socorro/tests/external/es/test_crashstorage.py (outdated, resolved)
@relud relud force-pushed the relud-es-8-crash-storage branch 2 times, most recently from 991c724 to b460f21 Compare November 7, 2024 00:23
@relud relud requested a review from willkg November 7, 2024 00:25
- except elasticsearch.exceptions.TransportError as e:
-     # If this is a TransportError, we try to figure out what the error
+ except elasticsearch.BadRequestError as e:
+     # If this is a BadRequestError, we try to figure out what the error
      # is and fix the document and try again
Member Author

This section removes fields that cause document_parsing_exception and retries the document. That seems like an odd choice, given that it runs after value fixing, which should already prevent the three types of failure we catch here. The only way I can think of to reach this code block in production is by writing to a field that is not in our mapping.

Contributor

It used to be the case that it wrote all the data into Elasticsearch even if it wasn't in the mapping. That way when they add new fields to the crash report, they'd get indexed even if Socorro didn't explicitly have support for it. While the intentions were good, that was terrible so I changed it such that it only indexes what's defined in super search fields and in the mapping.
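
The two behaviors discussed in this thread (only indexing fields that are in the mapping, and dropping a field that fails to parse before retrying) can be sketched like this. It is an illustration under assumptions, not Socorro's implementation: `FieldParseError`, `keep_mapped_fields`, `index_with_retry`, and the `index_fn` callable are all hypothetical, and the real code works from the Elasticsearch error response rather than a custom exception.

```python
class FieldParseError(Exception):
    """Hypothetical error carrying the name of the field that failed to parse."""

    def __init__(self, field):
        super().__init__(f"cannot parse field {field!r}")
        self.field = field


def keep_mapped_fields(document: dict, mapped_fields: set) -> dict:
    """Keep only the fields that are defined in the mapping."""
    return {key: value for key, value in document.items() if key in mapped_fields}


def index_with_retry(document: dict, index_fn, max_attempts: int = 5) -> dict:
    """Try to index; on a parse error, drop the offending field and retry."""
    doc = dict(document)  # work on a copy so the caller's dict is untouched
    for _ in range(max_attempts):
        try:
            index_fn(doc)
            return doc
        except FieldParseError as exc:
            if exc.field not in doc:
                raise  # nothing left to remove, so retrying cannot help
            del doc[exc.field]
    raise RuntimeError("gave up after removing too many fields")
```

If every field is already restricted to the mapping and values are fixed beforehand, the retry branch should rarely run, which is the point raised above.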

@willkg willkg self-assigned this Nov 18, 2024
Contributor

@willkg willkg left a comment

Because crashstorage metrics keys are composed in the processor for crash storage destinations, when we switch to PREFER_NEW, a new metrics key is emitted which isn't documented in socorro/statsd_metrics.yml.

Running the local dev environment and processing crashes kicks this up:

socorro-processor-1   | Traceback (most recent call last):
socorro-processor-1   |   File "/app/socorro/lib/threaded_task_manager.py", line 250, in run
socorro-processor-1   |     function(*args, **kwargs)  # execute the task
socorro-processor-1   |     ^^^^^^^^^^^^^^^^^^^^^^^^^
socorro-processor-1   |   File "/app/socorro/processor/processor_app.py", line 144, in transform
socorro-processor-1   |     self.process_crash(
socorro-processor-1   |   File "/app/socorro/processor/processor_app.py", line 215, in process_crash
socorro-processor-1   |     with METRICS.timer(
socorro-processor-1   |   File "/usr/local/lib/python3.11/contextlib.py", line 144, in __exit__
socorro-processor-1   |     next(self.gen)
socorro-processor-1   |   File "/usr/local/lib/python3.11/site-packages/markus/main.py", line 509, in timer
socorro-processor-1   |     self.timing(stat, value=delta * 1000.0, tags=tags)
socorro-processor-1   |   File "/usr/local/lib/python3.11/site-packages/markus/main.py", line 420, in timing
socorro-processor-1   |     self._publish(
socorro-processor-1   |   File "/usr/local/lib/python3.11/site-packages/markus/main.py", line 280, in _publish
socorro-processor-1   |     record = metrics_filter.filter(record)
socorro-processor-1   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socorro-processor-1   |   File "/usr/local/lib/python3.11/site-packages/markus/filters.py", line 122, in filter
socorro-processor-1   |     raise MetricsUnknownKey(f"metrics key {record.key!r} is unknown")
socorro-processor-1   | markus.filters.MetricsUnknownKey: metrics key 'socorro.processor.legacy_es.save_processed_crash' is unknown

Can you add that to statsd_metrics.yml?
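
To illustrate why a new destination name produces an undocumented key, here is a minimal sketch. It assumes the key is built as `socorro.processor.<destination>.save_processed_crash` (matching the key in the traceback above); the helper names, the `es` destination name, and the set of documented keys are hypothetical.

```python
def compose_key(destination_name: str, prefix: str = "socorro") -> str:
    """Build the timer key for a crash storage destination (assumed format)."""
    return f"{prefix}.processor.{destination_name}.save_processed_crash"


def undocumented_keys(destination_names, documented) -> list:
    """Return composed keys that are missing from the documented set."""
    keys = (compose_key(name) for name in destination_names)
    return sorted(key for key in keys if key not in documented)


# Hypothetical contents of the documented metrics list.
documented = {"socorro.processor.es.save_processed_crash"}
print(undocumented_keys(["es", "legacy_es"], documented))
```

A check like this could run in a test so that a newly configured destination fails fast instead of raising in the processor.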

This kicks up an error because I think ES 8 hasn't finished starting up yet:

  1. set ELASTICSEARCH_MODE=PREFER_NEW in .env (I rebased against main to pick up the just and .env changes)
  2. do docker compose stop to stop everything so nothing is running
  3. do just build
  4. do just setup

Error:

Traceback (most recent call last):
  File "/app/socorro-cmd", line 202, in <module>
    cmd_main()
  File "/app/socorro-cmd", line 198, in cmd_main
    import_and_run(runner)
  File "/app/socorro-cmd", line 129, in import_and_run
    sys.exit(app(sys.argv[1:]))
             ^^^^^^^^^^^^^^^^^
  File "/app/bin/es_cli.py", line 156, in main
    es_group(argv)
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/bin/es_cli.py", line 143, in cmd_delete
    indices_to_delete = crashstorage.get_indices()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/socorro/external/es/crashstorage.py", line 409, in get_indices
    indices = self.client.get_indices()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/socorro/external/es/connection_context.py", line 86, in get_indices
    return self.indices_client().get_alias().keys()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elasticsearch/_sync/client/indices.py", line 1901, in get_alias
    return self.perform_request(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 423, in perform_request
    return self._client.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 316, in _perform_request
    meta, resp_body = self.transport.perform_request(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elastic_transport/_transport.py", line 342, in perform_request
    resp = node.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/elastic_transport/_node/_http_urllib3.py", line 202, in perform_request
    raise err from None
elastic_transport.ConnectionError: Connection error caused by: ConnectionError(Connection error caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x745b18446cd0>: Failed to establish a new connection: [Errno 111] Connection refused))

If I pause and then run just setup a second time, it works fine.

Can you add a depends_on or a waitfor or whatever it is that's needed, please?
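
One generic way to "wait for" a service, independent of how this PR ends up solving it, is to poll the TCP port until it accepts connections. This is a standard-library sketch; `wait_for_port`, the host name, and the port in the usage note are assumptions, and an open port does not guarantee that Elasticsearch is ready to serve requests.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll until host:port accepts a TCP connection or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, a setup script could call `wait_for_port("elasticsearch", 9200)` (hypothetical host and port) before running the index commands and exit with an error if it returns `False`.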

I went through and tested these things with both the default and PREFER_NEW settings:

  1. processing crash reports, super search, signature report -- work fine

  2. top crashers report -- kicks up error:

    socorro-webapp-1      | Traceback (most recent call last):
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django/core/handlers/exception.py", line 55, in inner
    socorro-webapp-1      |     response = get_response(request)
    socorro-webapp-1      |                ^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django/core/handlers/base.py", line 197, in _get_response
    socorro-webapp-1      |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
    socorro-webapp-1      |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/django/views.py", line 90, in sentry_wrapped_callback
    socorro-webapp-1      |     return callback(request, *args, **kwargs)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/crashstats/decorators.py", line 149, in inner
    socorro-webapp-1      |     response = view(request, *args, **kwargs)
    socorro-webapp-1      |                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/crashstats/decorators.py", line 101, in inner
    socorro-webapp-1      |     return view(request, *args, **kwargs)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/crashstats/decorators.py", line 68, in inner
    socorro-webapp-1      |     return view(request, *args, **kwargs)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/topcrashers/views.py", line 298, in topcrashers
    socorro-webapp-1      |     return render(request, "topcrashers/topcrashers.html", context)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/sentry_sdk/utils.py", line 1788, in runner
    socorro-webapp-1      |     return sentry_patched_function(*args, **kwargs)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/sentry_sdk/integrations/django/templates.py", line 105, in render
    socorro-webapp-1      |     return real_render(request, template_name, context, *args, **kwargs)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django/shortcuts.py", line 24, in render
    socorro-webapp-1      |     content = loader.render_to_string(template_name, context, request, using=using)
    socorro-webapp-1      |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django/template/loader.py", line 62, in render_to_string
    socorro-webapp-1      |     return template.render(context, request)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django_jinja/backend.py", line 59, in render
    socorro-webapp-1      |     return mark_safe(self._process_template(self.template.render, context, request))
    socorro-webapp-1      |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django_jinja/backend.py", line 105, in _process_template
    socorro-webapp-1      |     return handler(context)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 1304, in render
    socorro-webapp-1      |     self.environment.handle_exception()
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 939, in handle_exception
    socorro-webapp-1      |     raise rewrite_traceback_stack(source=source)
    socorro-webapp-1      |   File "/app/webapp/crashstats/topcrashers/jinja2/topcrashers/topcrashers.html", line 7, in top-level template code
    socorro-webapp-1      |     {% extends "crashstats_base.html" %}
    socorro-webapp-1      |     ^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/crashstats/jinja2/crashstats_base.html", line 150, in top-level template code
    socorro-webapp-1      |     {% block content %}{% endblock %}
    socorro-webapp-1      |     ^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/topcrashers/jinja2/topcrashers/topcrashers.html", line 193, in block 'content'
    socorro-webapp-1      |     {% if topcrashers_stats_item.is_startup_window_crash %}
    socorro-webapp-1      |     ^^^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/jinja2/environment.py", line 487, in getattr
    socorro-webapp-1      |     return getattr(obj, attribute)
    socorro-webapp-1      |            ^^^^^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/usr/local/lib/python3.11/site-packages/django/utils/functional.py", line 57, in __get__
    socorro-webapp-1      |     res = instance.__dict__[self.name] = self.func(instance)
    socorro-webapp-1      |                                          ^^^^^^^^^^^^^^^^^^^
    socorro-webapp-1      |   File "/app/webapp/crashstats/crashstats/utils.py", line 177, in is_startup_window_crash
    socorro-webapp-1      |     if row["term"] < 60:
    socorro-webapp-1      |        ^^^^^^^^^^^^^^^^
    socorro-webapp-1      | TypeError: '<' not supported between instances of 'str' and 'int'
    
  3. custom queries (only available to obs team) work fine

  4. I did aggregations with missing_symbols, modules_in_stack, topmost_filename, and useragent_locale and they all work fine -- aggregation is done on tokens and not the whole value

  5. abort_message works with both matches and is-exact operators

  6. startup_crash has T and F values

  7. _return_query=1 with local dev environment looks fine; it's different than what we see in prod, but that's expected due to ES API changes

  8. I went through all the supersearchfacet examples in the crashstats-tools docs and they all worked fine

This is looking good. That TopCrashers issue needs to be looked at.
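
For reference, the TypeError in item 2 comes from comparing a string aggregation term to an integer. A defensive version of that comparison might look like the sketch below. It is only an illustration with a hypothetical helper name, not the fix made in this PR, and coercing numeric strings is an assumption about what the terms should mean.

```python
def is_within_startup_window(term, window_seconds: int = 60) -> bool:
    """Return True only when term is a number of seconds below the window."""
    if isinstance(term, bool):
        # bool is a subclass of int; a boolean term is not an uptime value.
        return False
    if isinstance(term, (int, float)):
        return term < window_seconds
    if isinstance(term, str):
        try:
            return float(term) < window_seconds
        except ValueError:
            # Non-numeric strings such as "T" or "F" are not uptime values.
            return False
    return False
```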

socorro/external/es/super_search_fields.py (resolved)
@relud relud force-pushed the relud-es-8-crash-storage branch 4 times, most recently from 9988801 to 0e51573 Compare November 20, 2024 23:29

startup_crash_msg = 'title="Startup Crash"'
potential_startup_crash_msg = 'title="Potential Startup Crash"'
potential_startup_window_crash_msg = 'title="Potential Startup Crash, more than '
Member Author

I modified this test to also check is_startup_window_crash to cover the failure case you manually observed, and confirm that my solution for identifying boolean aggregation terms is working as expected.

Contributor

Nicely done--thank you!

@willkg
Contributor

willkg commented Nov 22, 2024

@relud Can you pull in the fix in PR #6813? Then I can go through this PR again.

@relud
Member Author

relud commented Nov 22, 2024

> @relud Can you pull in the fix in PR #6813? Then I can go through this PR again.

done

Contributor

@willkg willkg left a comment

I went through and tested the issues I raised in the previous review.

  1. bin/process_crashes.sh works now.
  2. Things correctly wait for both es and legacy_es containers to start up.
  3. TopCrashers works now.

I wrote up bug 1933824 about a curiosity I hit when uploading some crash report dump files to gcs emulator. That's not related to these changes.

My only issue is that I'm not sure why you added four new metrics to statsd_metrics.yml rather than just the one I was hitting issues with.

Everything else looks fine as far as I can tell.

r+wc

socorro.processor.legacy_es.save_processed_crash:
  type: "timing"
  description: |
    Timer for how long it takes to save the processed crash to Elasticsearch.
Contributor

I'm pretty sure we only needed to add socorro.processor.legacy_es.save_processed_crash because that's the only one that's composed and emitted by the processor:

with METRICS.timer(
    f"processor.{dest.crash_destination_name}.save_processed_crash"
):
    dest.save_processed_crash(raw_crash, processed_crash)

When you tested this, did you hit errors that caused you to add the other three metrics?
