
built-in measurements #3

Open
jesteria opened this issue May 11, 2022 · 10 comments

@jesteria
Member

As a user of Netrics – either a "legacy" user or "new" – I expect the software to ship with a collection of built-in measurements (at least those available in the previous software). As such, existing measurement code, of potential use to us or to others, must be ported to the new Netrics framework.

This work is largely one of deletion rather than addition – measurements should be boiled down to their essential functionality, (without retry logic, etc.).


Under the Netrics framework, all measurements are simple executables.

Builtins will be installed with names prefixed with netrics-, e.g. netrics-ping. This convention will make their purpose clear, have the effect of "bundling" them despite their separate existence, and grant these names special treatment in configuration: references to the measurement ping will imply the executable netrics-ping, whether or not netrics-ping is a built-in.

However, builtins needn't be named this way in the codebase – (it is trivial to apply this convention upon installation) – within the codebase, ping is fine.

The "ideal" or simplest measurement is a standalone script: Python or otherwise. (It must handle its own execution via shebang, etc.) Installation will copy/link each built-in measurement script to the appropriate path and under the appropriate name.

That said, Python-based measurements are free to be more sophisticated than standalone scripts. (So long as there is a module with a single function to invoke without arguments – e.g. main() – installation may trivially construct a script to invoke it.) This might allow our builtins to share code (to be "DRY"), etc. Such a package of built-in measurements might be contained within the netrics package (at src/netrics/measurement/builtin/ or otherwise). (How this looks remains to be seen and is up to the implementer.)
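
As a rough illustration only (the module path and the generated wrapper below are hypothetical, not a settled layout), such an in-package builtin could be as thin as a module exposing a no-argument entry point:

# hypothetical module: src/netrics/measurement/builtin/ping.py

def main():
    """Entry point invoked without arguments; performs the actual measurement."""
    ...

# Installation could then generate a trivial executable named netrics-ping, e.g.:
#
#   #!/usr/bin/env python3
#   from netrics.measurement.builtin.ping import main
#   main()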

Upon execution, the measurement's contract with the framework is as follows:

  • Upon execution/invocation, run the measurement! Just the measurement. No need for retries, etc. (If it's ping, then run ping.) If there's an exception, the return code should reflect this.
  • Result(s) to be persisted are printed to standard output (stdout) (the default for print, echo, etc.) – presumably in JSON format, but the framework does not (at this point) care.
  • If you expect interesting measurement-specific configuration, we might set that in the framework's new measurements configuration file, but simply deliver it via standard input (stdin) – there's no need for the measurement executable to know where the framework configuration is or how it works ***.

*** (I'm really split on the best data format for this. JSON input and output make sense; and, if it's JSON then you'd just: json.load(sys.stdin). Obviously this is more natural if the main measurements config is itself JSON or YAML. But, I think TOML is friendlier than YAML for such files. Regardless, the framework can send the measurement config in whatever format, and even let it be configurable. To the framework code, there's nothing exceptional about reading TOML and writing JSON; though, to human beings involved on either end, it might be a little odd.)
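
To make that contract concrete, here is a minimal sketch of a standalone built-in honoring it (JSON over stdin/stdout is assumed per the caveat above; the parameter names and targets are purely illustrative):

#!/usr/bin/env python3
"""Sketch of a netrics-ping-style measurement honoring the stdin/stdout contract."""
import json
import subprocess
import sys

PARAM_DEFAULTS = {'targets': ['google.com'], 'count': 3}

def main():
    # Measurement-specific configuration arrives on stdin (format TBD; JSON assumed here).
    params = dict(PARAM_DEFAULTS, **json.load(sys.stdin))

    results = {}
    for target in params['targets']:
        proc = subprocess.run(
            ['ping', '-c', str(params['count']), target],
            capture_output=True,
            text=True,
        )
        if proc.returncode != 0:
            # Failure: report on stderr and reflect it in the exit code.
            print(proc.stderr, file=sys.stderr)
            sys.exit(proc.returncode)
        results[target] = proc.stdout

    # Results to be persisted go to stdout (presumably JSON).
    print(json.dumps(results))

if __name__ == '__main__':
    main()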

@kyle-macmillan
Collaborator

I will work on porting the current Netrics tests to individual Python executables. I've ported the basic ping command (I still need to fix how it loads in the custom command; the config is currently hard-coded). The Python executable can be found under /src/netrics/measurements/builtin/ as netrics-ping.py. @jesteria can you take a look at it and let me know if this is roughly what you had in mind? Thanks.

@kyle-macmillan
Collaborator

The following list of tests should be included in the /builtin/ (or stdlib) measurements:

  • netrics-ping (simple ping to specified destinations)
  • netrics-traceroute (traceroute to specified destinations)
  • netrics-ookla (runs the Ookla speed test)
  • netrics-ndt7 (runs the NDT7 speed test)
  • netrics-lml (measures the last mile latency)
  • netrics-dns-latency (measures DNS response time)
  • netrics-devs (counts the number of connected devices)

Many of these will likely get their own issue as we develop / port them from Netrics v1.

@jesteria
Member Author

jesteria commented May 13, 2022 via email

@jesteria
Member Author

@kyle-macmillan – Yes looks totally reasonable.

Brief comments on the ping module that might be helpful:

  • [ref] I'd suggest looking at subprocess.run. You could perhaps even eliminate this helper (since it's doing essentially the same thing). This would also grant you a higher-level API, complete with a CompletedProcess return object and exceptions for command failures with CalledProcessError. (This might even satisfy what you were hoping to have vis-a-vis exceptions underlying the ultimate communication with the framework via exit codes.) And it just might be easier (at least when reproduced across all the modules.)

  • [ref] Not immediately important, but these will want to at least consider the case that the configuration is unspecified, missing or malformed. I'd suggest, at least, and if appropriate to the module, having a set of defaults. This can be added very simply:

    PARAM_DEFAULTS = {…}
    
    …
    
    def main():
      params = dict(PARAM_DEFAULTS, **json.load(sys.stdin))
  • [ref] Certainly no big deal, but if you specify this as an argument list (list or simply tuple) then there's no need for the subprocess to launch a shell – (you're not scripting but just executing ping with those arguments). So say ('ping', '-i', …).

  • [ref] Nothing wrong with debugging output. Just keep in mind that ultimately, any such thing will need to be printed to stderr. Even with print this is trivial: print(…, file=sys.stderr).

  • [ref] Nothing bad here. Just keep in mind that on error we want to communicate it via exit code and whatever combination of stdout and stderr makes sense. I understand that here you're running ping multiple times and so perhaps you don't want to error out when ping fails once; though, I'm curious if an exit code of 2 can be a network-level error. (Perhaps it can/should be treated as a fatal error?)
    Anyway, with the higher-level subprocess API this could read a bit differently:

    try:
      ping_result = subprocess.run(['ping', '-i', …], capture_output=True, check=True)
    except subprocess.CalledProcessError:
      …

    And I'll keep the above for reference, though admittedly in the case of ping, error codes are more meaningful than with other commands, so we might instead "check" it ourselves:

    ping_result = subprocess.run(['ping', '-i', …], capture_output=True)
    if ping_result.returncode == 2:
      # we're done d00d!
      # tho we captured it in this case let's pass on the stderr ... maybe dress it up a little
      print("command 'ping' failed with exit code 2 and stderr <", ping_result.stderr, ">", file=sys.stderr)
      # and let's communicate the failure via exit code ... say with ping's exit code:
      sys.exit(2)
    
    # otherwise let's send back our ping results
    # This could certainly be much the same as is, though I wouldn't strictly call this
    # an error handler or an error that we're reporting; it just looks like a friendly
    # message we're sticking in the results JSON.
    res[dst]['msg'] = error_handler(ping_result.returncode, ping_result.stderr)
    
    # Extract results from ping output
    res[dst]['results'] = parse_ping_output(ping_result.returncode, ping_result.stdout)
  • Do bear in mind that these don't have to be entirely stand-alone scripts. If you find that there are useful utility functions you want across them, etc., they can import from the netrics package, (or even be written as modules in the package – so long as they expose some main function – as here, just no need in that case for the top-level if __name__ == '__main__'). But it would be good to establish what really is non-trivial shared functionality or boilerplate before deciding to go that route. (I can imagine some standard helper for reporting errors perhaps? Or even via a logging module. But I'm not really sure.)

  • I really don't have any preference for how results look, errors look, etc. Here you have errors in the proper results that are collected by the system. I'm suggesting an alternative since these aren't measurements; but if there's a desire for these to be included as "null results" that's certainly fine. (Though of course if we're retrying by default I'm not sure the null result is useful. In that case perhaps we really do want to consider what a fatal error that should be retried is … but I'm not yet sure.) We could also structure errors sent to stderr in much the same way (as JSON). Currently I believe we only collect results. Perhaps we should collect errors separately … or perhaps that's not useful 🤷

  • Final thought, not important, and back to ping specifically – any reason (not) to parallelize the pings? (This is more code but not difficult.) Just a question of whether this would be a problem for the measurements and whether it would be useful.
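
(If we do parallelize, a thread pool is probably the lightest-weight option, since each worker just waits on a subprocess. A rough sketch, with placeholder targets and ping arguments:)

from concurrent.futures import ThreadPoolExecutor
import subprocess

def ping_one(dst):
    # Each worker simply shells out to ping and returns the completed process.
    return dst, subprocess.run(['ping', '-c', '3', dst], capture_output=True, text=True)

targets = ['google.com', 'wikipedia.org']  # placeholder destinations

with ThreadPoolExecutor(max_workers=len(targets)) as executor:
    results = dict(executor.map(ping_one, targets))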

@jesteria
Member Author

(Re: retries: Perhaps we'll want to require measurements to explicitly request a retry rather than doing it by default. That likely makes more sense, and better matches what we were doing before. I'm not sure ping has any need of retries. But e.g. a speedtest might get a "busy" response, and request an immediate-to-soon retry. Say, by default – sys.exit(42) – and the framework will put the measurement into the retry queue. This could also be made configurable for any measurement with the setting retry_on: [42].)
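
(From the measurement's side that protocol would be trivial. A sketch, where the command name, the "busy" check, and the code 42 are all merely illustrative:)

import subprocess
import sys

RETRY_EXIT_CODE = 42  # proposed convention: ask the framework to re-queue this measurement

proc = subprocess.run(['speedtest'], capture_output=True, text=True)  # placeholder command

if proc.returncode != 0 and 'busy' in proc.stderr.lower():
    # Server too busy right now; request an immediate-to-soon retry.
    print('speed test endpoint busy; requesting retry', file=sys.stderr)
    sys.exit(RETRY_EXIT_CODE)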

@kyle-macmillan
Collaborator

Thanks for this great feedback! I will look into the suggestions / basically agree with you on all points. A few comments/questions:

  • I will plan to implement the first three points. All good and reasonable suggestions.
  • I think having separate JSON objects for results and errors is a good suggestion. Will integrate that as well.
  • I think parallelizing ping should be totally fine. I can work on that as well. I have used multiprocessing before. Do you have any suggestions?
  • Want to make sure I have these things straight: Upon completion, the module prints results to stdout and errors to stderr (both as JSON objects), and exits with the appropriate exit code.

@jesteria
Member Author

jesteria commented May 19, 2022 via email

@kyle-macmillan
Collaborator

kyle-macmillan commented May 24, 2022

Thanks for the feedback, I've opened a separate issue (#5) for netrics-ping discussion.

What exit code should be raised for measurements like netrics-ping and netrics-traceroute that effectively run multiple tests? One idea is that netrics-ping should ping a single destination and not take a list of destinations to ping. This effectively shifts responsibility for concurrency to the controller...

kyle-macmillan added a commit that referenced this issue May 24, 2022
@jesteria
Member Author

jesteria commented Jun 2, 2022

Right:

  • So if a measurement runs without any issues, then exit code 0 makes sense
  • (If the measurement would like to be put back onto the queue for any reason, then whatever special exit code we like should be returned, say 42)
  • If any of its sub-tests have an error, then it's up to the test – I'd say it probably makes sense for the module itself to also exit with a non-zero exit code like 1.

Presumably, the results for all succeeding sub-tests will be written to stdout, and persisted. (This is up in the air but it seems like that's the direction we're headed, regardless of exit code.) But if the exit code is non-zero, the framework will also flag this exceptional result.


Separately, indeed, we could add support for concurrency to the controller 😄 I dunno! Should we?

(We might have a configuration key like parameters: […] and an optional key like concurrency: N? …For compatibility with generic measurement configuration, the parameters value could just be merged into the measurement configuration, and the framework would look for this.)

@jesteria
Member Author

jesteria commented Jun 2, 2022

I guess one important question regarding the possibility of pushing concurrency onto the framework: how should the results of concurrently-executed, parameterized tests be persisted?

If they're simply persisted as unrelated tests run around the same microsecond, I'm not sure users would be happy. (Though I'm not sure – would that suffice for you?)

On the other hand, is there a generic way for results to be joined that would make everyone happy? (And I say everyone because this might be a tough thing to make configurable.)

Say with the following measurements configuration:

ping:
  parameters:
    - google.com
    - wikipedia.org
  concurrency: 2

For the above, the framework might run the netrics-ping command concurrently for the arguments google.com and wikipedia.org, perhaps by setting their stdin to something like (for example with google.com):

{
  "parameter": "google.com",
  "parameters": [
    "google.com",
    "wikipedia.org"
  ]
}

And then perhaps the result persisted for the collection of test executions could be generically:

{
  "google.com": {
    
  },
  "wikipedia.org": {
    
  }
}

That might well work and it might well be worthwhile. But I'm unsure that it's sufficiently generic for all conceivable modules.


Another possibility perhaps, rather than asking people to somehow write a proper reducer function into their configuration, would be to again leverage Jinja templating, to construct the results (e.g. to recreate the above proposed default):

ping:
  persist: |
    {
      {% for name, result in results %}
      "{{ name }}": {{ result }}{{ "" if loop.last else "," }}
      {% endfor %}
    }

(…But doesn't templating out JSON kind of stink?)
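
(For comparison, the generic default sketched above needs no templating at all on the framework side. Something like the following would do, where the function and names are purely illustrative:)

import json

def merge_results(results):
    """Combine per-parameter results into one record to persist.

    `results` is assumed to be an iterable of (parameter, result) pairs,
    e.g. [('google.com', {...}), ('wikipedia.org', {...})].
    """
    return json.dumps(dict(results), indent=2)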

@jesteria jesteria self-assigned this Jan 17, 2023
jesteria added a commit that referenced this issue Jan 24, 2023
…isting

…Plus additional functionality:

* complete measurement configurability (via fate/netrics measurements file)

* validation of structured configuration via schema

* pre-measurement checks of execution environment (localhost) and LAN
  (gateway)

* parallelization of internet ping requests against configured targets

Future work:

* "near parity" is defined by the deferral of the command-debugging zip
  archive to future work

Issue tracking:

* part of #3: "built-in measurements"

* resolves #12: "…common/global folder for stdlib netrics tests"

* resolves #17: "netrics-ping returns errors if one address of the list is unreachable…"
jesteria added a commit that referenced this issue Jan 24, 2023
jesteria added a commit that referenced this issue Jan 25, 2023
jesteria added a commit that referenced this issue Feb 28, 2023
Adds both task `lml` (executable `netrics-lml`) and task alias
`lml-scamper` (`netrics-lml-scamper`) for clarity.

This version of the measurement is added to the default configuration
but commented out as it may not be enabled by default.

For speed and niceness, this lml test will randomly select an endpoint
to trace, from a default set of Google DNS as well as CloudFlare DNS.
(The former is generally speedier to respond; regardless, it might be
"nice" to round-robin, and for that matter to fall back in the event of
response failure.)

Otherwise:

Measurements' schedules in default configuration now use hashed cron
expressions to *discourage* their collision (though any which *may not*
run concurrently will be configured as such).

completes #13

part of #3
jesteria added a commit that referenced this issue Feb 28, 2023
jesteria added a commit that referenced this issue Feb 28, 2023
jesteria added a commit that referenced this issue Mar 2, 2023
Adds task `lml-traceroute` (executable `netrics-lml-traceroute`).

This task mirrors the `lml-scamper` version already published. It
differs in that the endpoint is traced with `traceroute` and then
`ping`'d; (and, traceroute output must be specially parsed, which is
relatively laborious).

As with `lml-scamper`, for speed and niceness, `lml-traceroute` will
randomly select an endpoint to trace, from a default set of Google DNS
as well as CloudFlare DNS. (The former is generally speedier to respond;
regardless, it might be "nice" to round-robin, and for that matter to
fall back in the event of response failure.)

completes #25

part of #11

part of #3
jesteria added a commit that referenced this issue Mar 2, 2023
jesteria added a commit that referenced this issue Mar 7, 2023
Adds task `hops_traceroute` (with executable `netrics-hops-traceroute`).

This version is added to default configuration as `hops`. (An
experimental scamper version is forthcoming.)

part of #28

part of #3
jesteria added a commit that referenced this issue Mar 8, 2023
Adds task `hops` (with executable `netrics-hops`) and task alias
`hops-scamper` (`netrics-hops-scamper`) for clarity.

This version of the measurement is added to the default configuration
but commented out as it may not be enabled by default.

resolves #28

part of #3
jesteria added a commit that referenced this issue Mar 8, 2023
jesteria added a commit that referenced this issue Mar 8, 2023
jesteria added a commit that referenced this issue Mar 9, 2023
Adds `ip` measurement (executable `netrics-ip`) to retrieve public IP
address and write as a result.

Measurement is added to default configuration, but easily disableable
by setting device environment variable `NETRICS_IP_QUERY=0`.

resolves #24

part of #3
jesteria added a commit that referenced this issue Mar 9, 2023
jesteria added a commit that referenced this issue Mar 14, 2023
Completes task `dns-latency` (with executable `netrics-dns-latency`).

This task is added to the built-in default measurements with currently
in-use hosts parameters.

part of #3

resolves #27
jesteria added a commit that referenced this issue Mar 14, 2023
jesteria added a commit that referenced this issue Mar 14, 2023
@marcwitasee marcwitasee added this to the Beta milestone Mar 24, 2023
jesteria added a commit that referenced this issue Apr 12, 2023
Adds measurement `dev` (executable `netrics-dev`) to scan local network
and report on the number of connected devices.

Notably, this change implements Fate task state persistence, applied to
this measurement.

part of #3

part of #4

resolves #29
jesteria added a commit that referenced this issue Apr 12, 2023