
Proposal: epoch / randomness tracking or auditing, and callback API #85

Open

cjacoby opened this issue Apr 18, 2017 · 11 comments


@cjacoby (Collaborator) commented Apr 18, 2017

  • Sometimes it's useful to know when an "epoch" has been completed. Certain libraries (*cough* keras *cough*) have this notion explicitly.
  • It might also be useful, when using a Mux, to know after the fact how often certain streams have been sampled.

I suspect that some sort of callback or query on the Mux object is the right solution here, but I figured it would be best to initiate a discussion first.

@bmcfee (Collaborator) commented Apr 21, 2017

> I suspect that some sort of callback or query on the Mux object is the right solution here

I like this idea in the abstract. I'm not sure exactly how it should look, though -- a full keras-style callback infrastructure seems like overkill, since we don't really have a top-level controller to trigger events.

Thinking a little more about it, this seems like two issues to me.

  1. Book-keeping in mux. We should definitely add this, and provide accessor methods to report statistics (# samples drawn, # times active, etc.).
  2. Epoch callbacks. I think this might be best implemented as a separate controller class that you can stick in front of any iterable/streamer, and which triggers a callback after every n steps. Something like:
def my_callback(*args):
    ...  # do some stuff

epoch = EpochCallback(n=1000, callbacks=[my_callback])

for item in epoch(my_streamer):
    ...  # do some other stuff

then it's just a matter of specifying the interface for callback functions.
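
For concreteness, here is a minimal sketch of what such a controller could look like. The EpochCallback name and constructor come from the snippet above, but the callback signature (here, callback(step)) is purely an assumption, not a settled interface:

class EpochCallback:
    """Wrap any iterable/streamer and fire callbacks every n items."""

    def __init__(self, n, callbacks):
        self.n = n
        self.callbacks = callbacks

    def __call__(self, streamer):
        # Pass items through unchanged; count steps as they go by.
        for step, item in enumerate(streamer, start=1):
            yield item
            if step % self.n == 0:
                # One "epoch" has elapsed; notify every registered callback.
                for callback in self.callbacks:
                    callback(step)

Because the wrapper is itself a generator, it composes with any streamer or mux the same way any other iterable does.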

bmcfee modified the milestone: 1.1.0 Apr 21, 2017
bmcfee mentioned this issue May 5, 2017
@cjacoby (Collaborator, Author) commented Jun 29, 2017

Thoughts from the peanut gallery: is this still in for 1.1.0, or is this really a 2.0 feature? #92 is labelled as 2.0. I'm trying to clean up what really needs to be done for 1.1 so I can prioritize.

@ejhumphrey (Collaborator) commented

I'd say ... if the right design is a callback, then 2.0; if it's setting some counters in the object, then maybe 1.1. Maybe. Thoughts?

@cjacoby (Collaborator, Author) commented Jun 29, 2017

I think the callback is probably better / more future-proof.

@ejhumphrey (Collaborator) commented

On the auditing front, I was just wrestling with my design of "how" I wanted to sample some data for training, and decided I wanted to log my samplers so I could parse things out later and check my statistics. The important parts look like this:

But first! This is entirely proof-of-concept, though I am curious to subsequently discuss better designs, impact on efficiency, etc.

When I'm writing research packages, I like to have my data stream machinery defined in the same submodule. After my import preamble (which includes logging), I set a global stream_logger for all "samplers" (the generator that produces observations from a bag of observations) and a method to create file handlers later:

stream_logger = logging.getLogger("stream_logger")

# Nothing crazy here.
def init_stream_logging(log_file):
    hdlr = logging.FileHandler(log_file)
    formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
    hdlr.setFormatter(formatter)
    stream_logger.addHandler(hdlr)
    stream_logger.setLevel(logging.INFO)

Then, say I have a sampler that plucks observations from an NPZ file ... I add a logging statement on index selection, before yielding data:

def my_sampler(feature_dir, key):
    with np.load(os.path.join(feature_dir, "{}.npz".format(key))) as data:
        X = data['x_in']
        Y = data['y_true']
    N = len(X)
    while True:
        n = np.random.randint(0, N)
        # .item() converts the numpy scalar to a native Python type for json.
        stream_logger.info(json.dumps(dict(key=key, n=n, y_true=Y[n].item())))
        yield dict(x_in=X[n], y_true=Y[n])

Then, for completeness, I'll init a file handler and mux a stream. Assume I've got a bunch of files that are like '/path/to/features/a.npz', etc...

init_stream_logging("samples.log")
stream = pescador.Mux(
    [pescador.Streamer(my_sampler, "/path/to/features", key) for key in 'abcdefg'],
    k=5, rate=10, revive=True, with_replacement=False, prune_empty_streams=True)
list(stream.iterate(max_iter=1000))

Now we can go pop open that log file and look at some stats!

from collections import Counter
samples = [json.loads(l.strip().split("INFO ")[-1]) for l in open("samples.log")]
Counter([x['y_true'] for x in samples])
# Produces something like...
Counter({0: 110, 1: 99, 2: 101, 3: 92, 4: 100, 5: 101, 6: 111, 7: 92, 8: 94, 9: 100})

Counter([x['key'] for x in samples])
# Produces something like...
Counter({'a': 166, 'b': 127, 'c': 154, 'd': 136, 'e': 152, 'f': 122, 'g': 143})

or whatever.

I haven't had much time to mull this over, so I'm sure I'll have more ideas / opinions later, but I thought this was worth sharing given the discussion in #104 (assuming @stefan-balke and @cjacoby may care). In particular, though, I'm somewhat worried about the time this would lose to JSON serialization (there will be so many samples...), and the log parsing after the fact is quite gross. I'm not sure whether atomic logging or building up a cache would be worth background threading, but ... I'm guessing.
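
For what it's worth, the stdlib already has machinery for pushing log I/O onto a background thread. A minimal sketch on Python 3, reusing the stream_logger setup from above (note the json.dumps call itself still runs in the sampler -- only the file write moves off-thread):

import logging
import logging.handlers
import queue

def init_stream_logging_threaded(log_file):
    # The sampler only pays for enqueueing each record; a background
    # listener thread owns the FileHandler and does the actual file I/O.
    log_queue = queue.Queue(-1)  # maxsize <= 0 means unbounded
    stream_logger.addHandler(logging.handlers.QueueHandler(log_queue))
    stream_logger.setLevel(logging.INFO)

    hdlr = logging.FileHandler(log_file)
    hdlr.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    listener = logging.handlers.QueueListener(log_queue, hdlr)
    listener.start()
    return listener  # call listener.stop() to flush before exit

Whether this actually wins anything here would need measuring, since the serialization cost stays on the producer side.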

Also, I kinda like the idea of using logging (rather than keras history) to track training loss / error, but maybe this is out of scope. I'd also be keen to "type" the logs so I could filter on, say, samples versus other events, but I didn't come across any docs on this (nor did I look very hard).
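
One stdlib way to get that "typing" (a sketch of the idea, not anything pescador provides): hang child loggers off stream_logger, one per event type, and put %(name)s in the format string so every line is tagged with its type and can be filtered at parse time:

import json
import logging

# Child loggers propagate to stream_logger's handlers by default, and
# %(name)s in the formatter records which "type" emitted each line.
formatter = logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
# Assumes init_stream_logging already attached the FileHandler.
stream_logger.handlers[0].setFormatter(formatter)

sample_logger = logging.getLogger("stream_logger.samples")
loss_logger = logging.getLogger("stream_logger.loss")

sample_logger.info(json.dumps(dict(key='a', n=3)))       # a sample event
loss_logger.info(json.dumps(dict(step=100, loss=0.25)))  # a training event

# Filtering at parse time is then a string match on the logger name:
samples = [l for l in open("samples.log") if "stream_logger.samples" in l]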

Also also, I looked around, and logging still seems to be the best logging library out there ... a few things wrap it (daiquiri, pygogo), but nothing replaces it.

@bmcfee (Collaborator) commented Jan 24, 2018

Can this be a 2.1 feature? It should only add to the API, not change existing functionality.

@cjacoby (Collaborator, Author) commented Jan 24, 2018 via email

bmcfee modified the milestones: 2.0.0, 2.1.0 Jan 24, 2018
bmcfee changed the title from "Proposal: epoch / randomness tracking or auditing" to "Proposal: epoch / randomness tracking or auditing, and callback API" Jan 25, 2018
@bmcfee (Collaborator) commented Aug 12, 2019

@cjacoby got any cycles to look into this? I'd like to get 2.1 off my stack in the short term.

@bmcfee (Collaborator) commented Aug 22, 2019

Given the radio silence on this, and the lack of a clear picture of what exactly the API should be, how do you all feel about dropping this feature, @ejhumphrey @cjacoby? I can see its utility, but it also makes things much more complicated.

@cjacoby (Collaborator, Author) commented Aug 25, 2019 via email

@bmcfee (Collaborator) commented Aug 28, 2019

> I think it might make sense to punt to next version just to at least clarify the API/approach. Or, we make it "provisional"?

Ok. How about we punt it to some yet-to-be-determined 3.x release then?

bmcfee removed this from the 2.1.0 milestone Aug 28, 2019