Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve timer #5672

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Improve timer #5672

wants to merge 3 commits into from

Conversation

holmanb
Copy link
Member

@holmanb holmanb commented Sep 6, 2024

Note to reviewers

This looks big, but it is broken into three logically separate changes. I would recommend reviewing a commit at a time.

Motivation

The recent addition of subp execution time was a huge benefit to exposing slow external commands. As we try to push performance more, more data is vital to making the biggest impact that we can. Comparing this branch against the tip of main, this change makes the total log size 1 line longer - but this will vary depending per-platform.

  • builtin timing instrumentation in production allows diagnosing everywhere we can get a log
  • reduce noise by only logging slow events
  • avoid the need for manual instrumentation
  • cloud-init analyze is neat, but it is low granularity, relies log timestamps (so is only an approximation), and it requires extra steps
  • standardize logging to in shared code

Description of changes

  • Improve on the existing timer helper by simplifying its usage and implementation
  • Remove obsolete functionality in the existing timer function
  • Eliminate irrelevant logs by only logging if a threshold is exceeded.
  • Update existing callsites to use context manager
  • Refactor logging helpers out of util.py
  • Replace duplicate timer implementations with timer helper
    • socket.py
    • subp.py
  • Add timer helper to shared callsites
    • DataSource.get_data()
    • each cloud-config module

Todo

  • fix unittests
  • bikeshed the name (I originally implemented the context handler as a function, but now it's a class so it's suddenly un-pythonic)

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

@holmanb holmanb force-pushed the holmanb/timer branch 4 times, most recently from 3169a13 to 86daea4 Compare September 7, 2024 00:08
@holmanb holmanb marked this pull request as ready for review September 7, 2024 00:25
@TheRealFalcon TheRealFalcon self-assigned this Sep 9, 2024
Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! Overall a really good change. I left some comments inline.

cloudinit/log/__init__.py Outdated Show resolved Hide resolved
cloudinit/log/log_util.py Outdated Show resolved Hide resolved
cloudinit/log/log_util.py Outdated Show resolved Hide resolved
cloudinit/log/log_util.py Outdated Show resolved Hide resolved
cloudinit/log/log_util.py Outdated Show resolved Hide resolved
cloudinit/sources/DataSourceEc2.py Outdated Show resolved Hide resolved
cloudinit/ssh_util.py Outdated Show resolved Hide resolved
assert should_auto_attach_value == _should_auto_attach(ua_section)
assert (
Copy link
Member

@TheRealFalcon TheRealFalcon Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm...I think this highlights a gap in functionality where we might wind up losing valuable logging messages that no longer take as much time to execute. Looking at the Pro module now, there's now no log anywhere to indicate that we're attempting to attach to Pro unless it is taking too long, and I'm assuming that's not what we want. I'm not sure I have a great solution other than looking through all of the changed call sites and using a threshold of 0 for those we think we want to log regardless of the time taken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I reworked the context manager to support a parameter called log_mode, with three options: "always", "threshold", and "skip" (default is "threshold"). I audited the modified callsites and set "always" to those that I think need it. Let me know if there are any others that you think should also be modified.

@holmanb holmanb force-pushed the holmanb/timer branch 4 times, most recently from dfa55df to 446fee4 Compare September 19, 2024 18:07
The logic in Timed() has defaults which are designed to avoid low-value log
messages. A log is only emitted when the threshold is exceeded. Additionally,
the log_mode parameter allows the caller to prevent Timed() logs, so that the
caller may coalesce the message into another log.

- Make timer helper into a context manager.
- Add new decorator for timing functions.
- Simplify timer helper function, eliminate obsolete logic.
- Eliminate redundant logs and timer calls.
refactors existing code to utilize timer codepaths:

- replace manual timer implementations in subp.py and sockets.py
- replace open() / read() calls with util.load_text_file() where appropriate
@holmanb
Copy link
Member Author

holmanb commented Sep 19, 2024

@TheRealFalcon thanks for the review. I think that I've addressed all of your comments. Ready for re-review.

Copy link
Member

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments, but everything is mostly minor.

I really like the new decorator.

@@ -1273,15 +1277,15 @@ def sub_main(args):
report_on = True
if name == "init":
if args.local:
rname, rdesc = ("init-local", "searching for local datasources")
rname, rdesc = ("local", "searching for local datasources")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impetus for these next 3 changes? Their replacements are less greppable, the first log message of each stage still uses the original name, and we have other tons of references to the original names all over the code base. I would prefer to keep the originals.

args=(name, args),
)
with performance.Timed(f"cloud-init stage: '{rname}'"):
retval = functor(name, args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_uptime was kind of an odd kwarg to have in the old function, and it looks like here is the only place it was used. I'm not sure if/how it was useful, but do you think it makes sense to throw an uptime log as early as possible in init-local as a replacement?

msg="backgrounded Resizing",
func=do_resize,
args=(resize_cmd,),
do_resize,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the removal here intentional?

def __init__(
self,
msg: str,
/,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a good reason to enforce this?

run_name, mod.handle, func_args, freq=freq
)

with performance.Timed("and", log_mode="skip") as timer:
Copy link
Member

@TheRealFalcon TheRealFalcon Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still feels a bit hard to understand at first. What do you think about having this message be "" and then later f"{run_name} ran successfully and took {timer.delta:.3f} seconds"?

Not a blocking comment if you prefer it as-is.

# open filename in mode 'omode', write content, set permissions to 'mode'
"""open filename in mode omode, write content, set permissions to mode"""

with performance.Timed(f"Writing {filename}"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the decorator be used here now instead?

default_flow_style=False,
Dumper=(NoAliasSafeDumper if noalias else yaml.dumper.SafeDumper),
)
with performance.Timed("Dumping yaml"):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the new decorator work here?

@@ -475,6 +483,7 @@ def clear_cached_attrs(self, attr_defaults=()):
if not attr_defaults:
self._dirty_cache = False

@performance.timed("Getting metadata")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably log always

/,
*,
threshold: float = 0.01,
log_mode: str = "threshold",
Copy link
Member

@TheRealFalcon TheRealFalcon Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love this parameter being a magic string, especially when it only takes two values. Is there a reason setting threshold to 0 isn't a valid way to signal to log always? If so, could this field be a boolean like always_log: bool = False or something similar?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants