add timestamps for json representation for GC #3

reiddraper · 2012-01-26T03:09:24Z

I don't have a suggestion (yet), but wanted to put this issue down while I was thinking about it. Currently the JSON representation for these types, like the or-set, don't have the timestamp information that is needed to perform garbage collection.

aphyr · 2012-01-26T03:14:57Z

I was thinking one could choose logically or time-ordered tags for
observed-remove which are amenable to GC.

On 01/25/2012 07:09 PM, Reid Draper wrote:

I don't have a suggestion (yet), but wanted to put this issue down while I was thinking about it. Currently the JSON representation for these types, like the or-set, don't have the timestamp information that is needed to perform garbage collection.

Reply to this email directly or view it on GitHub:
#3

reiddraper · 2012-01-26T03:18:32Z

Not sure I follow, could you give an example?

aphyr · 2012-01-26T03:37:33Z

Let's say you use snowflake for unique ID generation. Snowflake is
k-ordered and has a time component, so each ID can be roughly located in
time. Use snowflake IDs for the tags in your observed-removed set. For
GC, remove any tag which has a tag with a time component older than your
threshold. Or just order all the tags and keep the highest n.

On 01/25/2012 07:18 PM, Reid Draper wrote:

Not sure I follow, could you give an example?

Reply to this email directly or view it on GitHub:
#3 (comment)

reiddraper · 2012-01-26T03:43:38Z

Ah, I see, cool idea. What about the two-phase set? I've almost thought about removing it in knockbox, to be honest.

aphyr · 2012-01-26T03:53:38Z

It makes sense for certain ephemeral structures (say, purchase orders),
and I'm inclined to keep it for completeness' sake. Not really married
to the concept, though.

On 01/25/2012 07:43 PM, Reid Draper wrote:

Ah, I see, cool idea. What about the two-phase set? I've almost thought about removing it in knockbox, to be honest.

Reply to this email directly or view it on GitHub:
#3 (comment)

reiddraper · 2012-01-26T04:29:03Z

Regarding using k-ordered timestamps for tag, I'm afraid it's a bit of a large dependency. Suppose you want to use CRDTs in javascript on the browserside, or you just don't want your CRDT library to depend on an external ID generation service. It seems there should also be an additional "normal" timestamp field, that can be used across all CRDT libs without assumptions about infrastructure such as snowflake. Unfortunately logically ordered timestamps don't help for what I think is going to be the most common pruning predicate, keeping garbage around longer than your longest expected partition.

aphyr · 2012-01-26T04:35:08Z

Or you could use unix epoch seconds, or use iso8601 timestamps, etc,
plus hostnames... there are many possible strategies here.

What bothers me is that you might use timestamps for your tags. Then
the timestamp information would be redundant but necessary. LWW-sets
imply a clock of some kind too, though it could be logical and not
wallclock time... hmmmmmm.

I don't want to define the tags to be a specific kind of timestamp,
because you might want to use a logical timestamp or a tag handed to you
by some other system. Maybe we should establish special variants of the
LWW and OR set types that include a predefined timestamp and GC scheme?

On 01/25/2012 08:29 PM, Reid Draper wrote:

Regarding using k-ordered timestamps for tag, I'm afraid it's a bit of a large dependency. Suppose you want to use CRDTs in javascript on the browserside, or you just don't want your CRDT library to depend on an external ID generation service. It seems there should also be an additional "normal" timestamp field, that can be used across all CRDT libs without assumptions about infrastructure such as snowflake. Unfortunately logically ordered timestamps don't help for what I think is going to be the most common pruning predicate, keeping garbage around longer than your longest expected partition.

Reply to this email directly or view it on GitHub:
#3 (comment)

reiddraper · 2012-01-26T04:44:59Z

Timestamp + hostname might be good enough for the cases I'm thinking, at least if it's at millisecond resolution. I suppose it also wouldn't be hard to have an incrementing counter for the library, so tags are time + hostname + counter. I definitely like the idea of predefined timestamp/GC schemes. I'm (hopefully) going to have some time this weekend to dive back into knockbox, and GC is high on my TODO list.

aphyr · 2012-01-26T04:58:57Z

I like that idea. I'm cool with implementing a flake-like at the library
level. Say default unique tags are

[unix_time, host, pid, unique]

and for lww-sets, just unix-time. Floats OK for times?

Note, unix time can jump backwards, stop, hiccup, etc... we expose
people to all of that but at the timescales we're talking (days, right?)
it shouldn't be an issue.

Obviously there will be a (gc obj) function to force garbage collection,
probably with age or size constraints. If you just go by "preserve n
tags" it's easy to implement regardless of tag type: just sort the tags
and take the biggest n. If you want to go by time then we filter (<
threshold (first tag)).

For users who want to use their own tag strategy, they can easily extend
this by just using [unix-time, custom-tag]. The GC culling won't know
the difference. Sound good?

--Kyle

On 01/25/2012 08:44 PM, Reid Draper wrote:

Timestamp + hostname might be good enough for the cases I'm thinking, at least if it's at millisecond resolution. I suppose it also wouldn't be hard to have an incrementing counter for the library, so tags are time + hostname + counter. I definitely like the idea of predefined timestamp/GC schemes. I'm (hopefully) going to have some time this weekend to dive back into knockbox, and GC is high on my TODO list.

Reply to this email directly or view it on GitHub:
#3 (comment)

reiddraper · 2012-02-02T04:13:40Z

What are your thoughts on where to store pruning parameters, ie. # of items to keep and/or # of seconds to keep? My gut says it's fine to have the library just decide this, but I could imagine some use-cases where you'd want to be able to have the object carry around this information, as it might be specific for the object. Thoughts?

aphyr · 2012-02-02T04:16:54Z

I would do both; if provided, you use, say "gc_max_items" and
"gc_max_seconds" for history pruning. If not provided, it's library
dependent.

--Kyle

On 02/01/2012 08:13 PM, Reid Draper wrote:

What are your thoughts on where to store pruning parameters, ie. # of items to keep and/or # of seconds to keep? My gut says it's fine to have the library just decide this, but I could imagine some use-cases where you'd want to be able to have the object carry around this information, as it might be specific for the object. Thoughts?

Reply to this email directly or view it on GitHub:
#3 (comment)

reiddraper · 2012-02-02T14:53:50Z

+1 to that idea

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add timestamps for json representation for GC #3

add timestamps for json representation for GC #3

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Feb 2, 2012

aphyr commented Feb 2, 2012

reiddraper commented Feb 2, 2012

add timestamps for json representation for GC #3

add timestamps for json representation for GC #3

Comments

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Jan 26, 2012

aphyr commented Jan 26, 2012

reiddraper commented Feb 2, 2012

aphyr commented Feb 2, 2012

reiddraper commented Feb 2, 2012