
Proposal: epoch / randomness tracking or auditing, and callback API #85

Open

cjacoby opened this issue Apr 18, 2017 · 11 comments


@cjacoby (Collaborator) commented Apr 18, 2017

  • Sometimes it's useful to know when an "epoch" has been completed. Certain libraries (*cough* keras *cough*) have this notion explicitly.
  • It might also be useful, when using a Mux, to know after the fact how often certain streams have been sampled.

I suspect that some sort of callback or query on the Mux object is the right solution here, but I figured it would be best to initiate a discussion first.

@bmcfee (Collaborator) commented Apr 21, 2017

> I suspect that some sort of callback or query on the Mux object is the right solution here

I like this idea in the abstract. I'm not sure exactly how it should look, though -- a full keras-style callback infrastructure seems like overkill, since we don't really have a top-level controller to trigger events.

Thinking a little more about it, this seems like two issues to me.

  1. Book-keeping in mux. We should definitely add this, and provide accessor methods to report statistics (# samples drawn, # times active, etc.).
  2. Epoch callbacks. I think this might be best implemented as a separate controller class that you can stick in front of any iterable/streamer, and which triggers a callback after every n steps. Something like:
def my_callback(*args):
    ...  # do some stuff

epoch = EpochCallback(n=1000, callbacks=[my_callback])

for item in epoch(my_streamer):
    ...  # do some other stuff

then it's just a matter of specifying the interface for callback functions.
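
For concreteness, here is a minimal sketch of what such a controller could look like. The EpochCallback name and constructor come from the snippet above, but the callback signature (here, callback(step)) is purely an assumption, not a settled interface:

class EpochCallback:
    """Wrap any iterable/streamer and fire callbacks every n items."""

    def __init__(self, n, callbacks):
        self.n = n
        self.callbacks = callbacks

    def __call__(self, streamer):
        # Pass items through unchanged; count steps as they go by.
        for step, item in enumerate(streamer, start=1):
            yield item
            if step % self.n == 0:
                # One "epoch" has elapsed; notify every registered callback.
                for callback in self.callbacks:
                    callback(step)

Because the wrapper is itself a generator, it composes with any streamer or mux the same way any other iterable does.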

bmcfee modified the milestone: 1.1.0 Apr 21, 2017
bmcfee mentioned this issue May 5, 2017
@cjacoby (Collaborator, Author) commented Jun 29, 2017

Thoughts from the peanut gallery: is this still in for 1.1.0, or is this really a 2.0 feature? #92 is labelled as 2.0. I'm trying to clean up what really needs to be done for 1.1 so I can prioritize.

@ejhumphrey (Collaborator) commented

I'd say ... if the right design is a callback, then 2.0; if it's setting some counters in the object, then maybe 1.1. Maybe. Thoughts?

@cjacoby (Collaborator, Author) commented Jun 29, 2017

I think the callback is probably better / more future-proof.

@ejhumphrey (Collaborator) commented

On the auditing front, I was just wrestling with my design of "how" I wanted to sample some data for training, and decided I wanted to log my samplers so I could parse things out later and check my statistics. The important parts look like this:

But first! This is entirely proof-of-concept, though I am curious to subsequently discuss better designs, impact on efficiency, etc.

When I'm writing research packages, I like to have my data stream machinery defined in the same submodule. After my import preamble (which includes logging), I set a global stream_logger for all "samplers" (the generator that produces observations from a bag of observations) and a method to create file handlers later:

stream_logger = logging.getLogger("stream_logger")

# Nothing crazy here.
def init_stream_logging(log_file):
    hdlr = logging.FileHandler(log_file)
    formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
    hdlr.setFormatter(formatter)
    stream_logger.addHandler(hdlr)
    stream_logger.setLevel(logging.INFO)

Then, say I have a sampler that plucks observations from an NPZ file ... I add a logging statement on index selection, before yielding data:

def my_sampler(feature_dir, key):
    with np.load(os.path.join(feature_dir, "{}.npz".format(key))) as data:
        X = data['x_in']
        Y = data['y_true']
    N = len(X)
    while True:
        n = np.random.randint(0, N)
        # .item() converts the numpy scalar to a native Python type for json.
        stream_logger.info(json.dumps(dict(key=key, n=n, y_true=Y[n].item())))
        yield dict(x_in=X[n], y_true=Y[n])

Then, for completeness, I'll init a file handler and mux a stream. Assume I've got a bunch of files that are like '/path/to/features/a.npz', etc...

init_stream_logging("samples.log")
stream = pescador.Mux(
    [pescador.Streamer(my_sampler, "/path/to/features", key) for key in 'abcdefg'],
    k=5, rate=10, revive=True, with_replacement=False, prune_empty_streams=True)
list(stream.iterate(max_iter=1000))

Now we can go pop open that log file and look at some stats!

from collections import Counter
samples = [json.loads(l.strip().split("INFO ")[-1]) for l in open("samples.log")]
Counter([x['y_true'] for x in samples])
# Produces something like...
Counter({0: 110, 1: 99, 2: 101, 3: 92, 4: 100, 5: 101, 6: 111, 7: 92, 8: 94, 9: 100})

Counter([x['key'] for x in samples])
# Produces something like...
Counter({'a': 166, 'b': 127, 'c': 154, 'd': 136, 'e': 152, 'f': 122, 'g': 143})

or whatever.

I haven't had much time to mull this over, so I'm sure I'll have more ideas / opinions later, but I thought this was worth sharing given the discussion in #104 (assuming @stefan-balke and @cjacoby may care). In particular, though, I'm somewhat worried about the time this would lose to JSON serialization (there will be so many samples...), and the log parsing after the fact is quite gross. I'm not sure whether atomic logging or building up a cache would be worth background threading, but ... I'm guessing.
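
For what it's worth, the stdlib already has machinery for pushing log I/O onto a background thread. A minimal sketch on Python 3, reusing the stream_logger setup from above (note the json.dumps call itself still runs in the sampler -- only the file write moves off-thread):

import logging
import logging.handlers
import queue

def init_stream_logging_threaded(log_file):
    # The sampler only pays for enqueueing each record; a background
    # listener thread owns the FileHandler and does the actual file I/O.
    log_queue = queue.Queue(-1)  # maxsize <= 0 means unbounded
    stream_logger.addHandler(logging.handlers.QueueHandler(log_queue))
    stream_logger.setLevel(logging.INFO)

    hdlr = logging.FileHandler(log_file)
    hdlr.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
    listener = logging.handlers.QueueListener(log_queue, hdlr)
    listener.start()
    return listener  # call listener.stop() to flush before exit

Whether this actually wins anything here would need measuring, since the serialization cost stays on the producer side.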

Also, I kinda like the idea of using logging (rather than keras history) to track training loss / error, but maybe this is out of scope. I'd also be keen to "type" the logs so I could filter on, say, samples versus other events, but I didn't come across any docs on this (nor did I look very hard).
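
One stdlib way to get that "typing" (a sketch of the idea, not anything pescador provides): hang child loggers off stream_logger, one per event type, and put %(name)s in the format string so every line is tagged with its type and can be filtered at parse time:

import json
import logging

# Child loggers propagate to stream_logger's handlers by default, and
# %(name)s in the formatter records which "type" emitted each line.
formatter = logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
# Assumes init_stream_logging already attached the FileHandler.
stream_logger.handlers[0].setFormatter(formatter)

sample_logger = logging.getLogger("stream_logger.samples")
loss_logger = logging.getLogger("stream_logger.loss")

sample_logger.info(json.dumps(dict(key='a', n=3)))       # a sample event
loss_logger.info(json.dumps(dict(step=100, loss=0.25)))  # a training event

# Filtering at parse time is then a string match on the logger name:
samples = [l for l in open("samples.log") if "stream_logger.samples" in l]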

Also also, I looked around, and logging still seems to be the best logging library out there ... a few things wrap it (daiquiri, pygogo), but nothing replaces it.

@bmcfee (Collaborator) commented Jan 24, 2018

Can this be a 2.1 feature? It should only add to the API, not change existing functionality.

@cjacoby (Collaborator, Author) commented Jan 24, 2018 via email

bmcfee modified the milestones: 2.0.0, 2.1.0 Jan 24, 2018
bmcfee changed the title from "Proposal: epoch / randomness tracking or auditing" to "Proposal: epoch / randomness tracking or auditing, and callback API" Jan 25, 2018
@bmcfee (Collaborator) commented Aug 12, 2019

@cjacoby got any cycles to look into this? I'd like to get 2.1 off my stack in the short term.

@bmcfee (Collaborator) commented Aug 22, 2019

Given the radio silence on this, and the lack of a clear picture of what exactly the API should be, how do you all feel about dropping this feature, @ejhumphrey @cjacoby? I can see its utility, but it also makes things much more complicated.

@cjacoby (Collaborator, Author) commented Aug 25, 2019 via email

@bmcfee (Collaborator) commented Aug 28, 2019

> I think it might make sense to punt to next version just to at least clarify the API/approach. Or, we make it "provisional"?

Ok. How about we punt it to some yet-to-be-determined 3.x release then?

bmcfee removed this from the 2.1.0 milestone Aug 28, 2019