
Need efficient diff-based storage for archives #8

Closed
dcwalk opened this issue Mar 9, 2017 · 11 comments

Comments

@dcwalk
Contributor

dcwalk commented Mar 9, 2017

From @titaniumbones on February 2, 2017 15:25

It's not immediately obvious how to turn our giant stores of files into diffs so that we can minimize file storage/transfer issues. At present we have lots of data duplication. Looking for a data engineer here I think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#4
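As a rough illustration of the idea (a sketch only, not a proposed implementation), Python's standard-library difflib can round-trip a delta between two page snapshots, so in principle only one full copy plus deltas needs to be kept. Note that ndiff deltas still contain unchanged lines, so a real deployment would want a binary delta tool like xdelta or bsdiff; the snapshot strings below are made up:

```python
import difflib

def make_delta(old: str, new: str) -> list[str]:
    # The ndiff delta encodes both versions line by line; storing it
    # alongside one full snapshot lets us drop the other full copy.
    return list(difflib.ndiff(old.splitlines(keepends=True),
                              new.splitlines(keepends=True)))

def restore_new(delta: list[str]) -> str:
    # difflib.restore(delta, 2) recovers the "new" side of the delta.
    return "".join(difflib.restore(delta, 2))

# Invented example snapshots of an archived page:
v1 = "<h1>Climate Data</h1>\n<p>Updated daily.</p>\n"
v2 = "<h1>Climate Data</h1>\n<p>Updated weekly.</p>\n"
delta = make_delta(v1, v2)
assert restore_new(delta) == v2  # the new snapshot round-trips exactly
```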

@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @atesgoral on February 2, 2017 16:39

I'm no data storage expert, but a layered/union file system could potentially help.

https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
https://en.wikipedia.org/wiki/Aufs
https://en.wikipedia.org/wiki/OverlayFS

@dcwalk dcwalk added the storage label Mar 9, 2017
@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @titaniumbones on February 3, 2017 5:58

agreed. do you know what the performance penalty for using such a file system is?

On 02/02/2017 11:39 AM, Ates Goral wrote:

I'm no data storage expert, but a layered/union file system could potentially help.

https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
https://en.wikipedia.org/wiki/Aufs
https://en.wikipedia.org/wiki/OverlayFS



@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @lh00000000 on February 6, 2017 4:10

is the problem limited storage capacity? since it seems like diffs between a snapshot a long time ago and a snapshot a long time + a day ago won't be as relevant as diffs between the latest snapshot and the second-to-latest snapshot, maybe some kind of strategy that archives snapshots to cheaper places like aws glacier (or s3 'Infrequent Access") in an automated way might be an alternative? keeping full snapshots is expensive storage-wise, but it's easier to grok / try new data-transformation ideas on later.
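For what it's worth, the tiering idea above maps directly onto S3 lifecycle rules, which can demote objects to Infrequent Access and then Glacier automatically. A minimal sketch, where the prefix and day thresholds are invented and the boto3 call is shown commented out because it needs real AWS credentials:

```python
# Hypothetical lifecycle policy: prefix, rule ID, and thresholds are made up.
lifecycle = {
    "Rules": [{
        "ID": "tier-old-snapshots",
        "Filter": {"Prefix": "snapshots/"},
        "Status": "Enabled",
        "Transitions": [
            # After 30 days, move to Infrequent Access (cheaper per GB,
            # but with a per-GB retrieval fee).
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            # After 180 days, move to Glacier (cheapest, slow retrieval).
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
    }]
}

# Applying it would look like this (not run here; bucket name is invented):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="archive-snapshots", LifecycleConfiguration=lifecycle)
```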

@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @titaniumbones on February 11, 2017 0:07

also written on the plane and a little out of it, but, well, still posting:


@lh00000000 Sorry I missed this in a flood of notifications. Interesting idea. hmm. I'm wondering how common the situation is, in which an analyst will want to consult a long time series of diffs, and how expensive that would be to access.

I think time series data will be of interest to social scientists eventually, but for now maybe not so much. Maybe @trinberg or @ambergman have thoughts?

@dcwalk
Contributor Author

dcwalk commented Mar 12, 2017

@Mr0grog and @danielballan are you able to speak to where we are currently at in considering storage -- especially given that the 'local' pagefreezer deploy option I think postdates this conversation...?

@dcwalk
Contributor Author

dcwalk commented Mar 12, 2017

I'm also gonna paste over this comment from the similarly old #9:

Also, in terms of long-term storage, has Internet Archive's S3 API been considered? https://github.com/vmbrasseur/IAS3API

@dcwalk dcwalk changed the title efficient diff-based storage for archives Need efficient diff-based storage for archives Mar 12, 2017
@danielballan
Contributor

It's good to know about IAS3API. It sounds like we will be storing:

  • A one-time grab from Versionista, comprising legacy snapshots. This stash won't grow over time, so IMO it's not especially important how we store it.
  • PageFreezer data stored in a way determined by PageFreezer behind an API they provide. (We can revisit this once we know the details.)
  • A cache of any computed diffs that we want fast access to. Since these can always be regenerated, we will only store the ones of current interest and the cache should not grow unbounded.
  • Possibly a cache of frequently-referenced pages from IA, but these we can always delete and re-request later.

So, in summary, I don't think this project will end up in charge of any large-scale long-term storage. This is probably a good thing. I think this issue can be closed, unless I am confused about any of the above.
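On the "cache of computed diffs" point: because a diff is a pure function of two stored versions, a bounded cache is safe to evict from; dropped entries are simply recomputed on demand. A toy sketch, where the in-memory version store and IDs are invented stand-ins for real archive storage:

```python
import difflib
from functools import lru_cache

# Toy version store standing in for real archive storage (made-up data).
VERSIONS = {
    "v1": "<p>Budget: $10M</p>",
    "v2": "<p>Budget: $8M</p>",
}

@lru_cache(maxsize=1024)  # bounded cache: evicted diffs are just recomputed
def diff_versions(old_id: str, new_id: str) -> str:
    old = VERSIONS[old_id].splitlines()
    new = VERSIONS[new_id].splitlines()
    return "\n".join(difflib.unified_diff(old, new, lineterm=""))

d = diff_versions("v1", "v2")
assert "-<p>Budget: $10M</p>" in d
assert "+<p>Budget: $8M</p>" in d
```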

@Mr0grog
Member

Mr0grog commented Mar 12, 2017

I would pretty much echo @danielballan here.

For now, my naive thinking was to store every page version as an object on S3 or Google Cloud Storage—we are not remotely pushing the limits in terms of cost and capacity there. If we assumed the average size of a page was 50 kB (it's more likely 15–20 kB) and we assumed every page had a new version every day (not remotely true), the most expensive of those storage methods would still cost only about $1 a month (for roughly 45 GB/month of new data). And that is almost certainly way overestimating.

50 kB × 30,000 pages × 30 days = 45 GB; 45 GB × $0.023/GB/mo ≈ $1.04
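A quick check of that arithmetic (all inputs are the rough assumptions from the comment above, not measured values):

```python
# Back-of-envelope storage cost estimate.
page_kb = 50            # assumed average page size in kB (overestimate)
pages = 30_000          # assumed number of monitored pages
days = 30               # assume a new version of every page every day
usd_per_gb_month = 0.023  # assumed S3 Standard price per GB-month

total_gb = page_kb * pages * days / 1_000_000  # kB -> GB (decimal)
monthly_cost = total_gb * usd_per_gb_month
print(total_gb, monthly_cost)  # about 45 GB and roughly a dollar a month
```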

For context, I was just talking yesterday with a friend who manages storage and analysis of API logs at Mapbox, and they do hundreds of GB per day, all stored on S3. Even as we scale up, we are unlikely to push much more than that.

It's also worth noting that, if this storage is on S3 and people are running large analyses over it in EC2 (in the same data center, at least), transfer is free. I assume (but haven't checked) the same is true between Google Storage and Google Compute.

Also, do we have our own stores of data already out there somewhere? I've been focused on the Versionista situation, so am not aware of what we might already have stored ourselves.

PageFreezer data stored in a way determined by PageFreezer behind an API they provide. (We can revisit this once we know the details.)

@danielballan Are we actually going to have data stored at PageFreezer? My [very limited] understanding was that they were sending us archives, that we had to manage.

@danielballan
Contributor

Are we actually going to have data stored at PageFreezer?

That's not clear to me. At first they were sending us archives, but now it sounds like they might be giving us some custom-baked API. Or maybe that API only regards diffing, not version retrieval. We'll find out next week.

@ambergman

Unless I'm really confused, I'm pretty sure the cloud environment that PageFreezer is setting up for us has separate storage instances where our data will live. Definitely correct me if I'm wrong, @danielballan.

To answer your question, though, @Mr0grog: we don't quite have them set up yet, so none of the PageFreezer data is yet accessible on our own cloud storage. DongWoo from PageFreezer was doing some testing and suggested that we were burning through storage space more quickly than expected because of bigger files like videos. We may want (or need) to decide not to store those for now and come up with another place to dump them, or maybe just decide not to store them until we have a relationship with IA and don't have to worry about space (which would be a lovely, brave new world).

@dcwalk
Contributor Author

dcwalk commented Mar 18, 2017

Wonderful, thanks everyone for chiming in. I'm closing for now!

@dcwalk dcwalk closed this as completed Mar 18, 2017