
Need efficient diff-based storage for archives #8

Closed
dcwalk opened this issue Mar 9, 2017 · 11 comments

Comments

@dcwalk
Contributor

dcwalk commented Mar 9, 2017

From @titaniumbones on February 2, 2017 15:25

It's not immediately obvious how to turn our giant stores of files into diffs so that we can minimize file storage/transfer issues. At present we have lots of data duplication. Looking for a data engineer here I think.

Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#4
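As a rough illustration of the idea (a sketch only, not a proposed implementation), Python's standard-library difflib can round-trip a delta between two page snapshots, so in principle only one full copy plus deltas needs to be kept. Note that ndiff deltas still contain unchanged lines, so a real deployment would want a binary delta tool like xdelta or bsdiff; the snapshot strings below are made up:

```python
import difflib

def make_delta(old: str, new: str) -> list[str]:
    # The ndiff delta encodes both versions line by line; storing it
    # alongside one full snapshot lets us drop the other full copy.
    return list(difflib.ndiff(old.splitlines(keepends=True),
                              new.splitlines(keepends=True)))

def restore_new(delta: list[str]) -> str:
    # difflib.restore(delta, 2) recovers the "new" side of the delta.
    return "".join(difflib.restore(delta, 2))

# Invented example snapshots of an archived page:
v1 = "<h1>Climate Data</h1>\n<p>Updated daily.</p>\n"
v2 = "<h1>Climate Data</h1>\n<p>Updated weekly.</p>\n"
delta = make_delta(v1, v2)
assert restore_new(delta) == v2  # the new snapshot round-trips exactly
```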

@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @atesgoral on February 2, 2017 16:39

I'm no data storage expert, but a layered/union file system could potentially help.

https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
https://en.wikipedia.org/wiki/Aufs
https://en.wikipedia.org/wiki/OverlayFS

@dcwalk dcwalk added the storage label Mar 9, 2017
@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @titaniumbones on February 3, 2017 5:58

agreed. do you know what the performance penalty for using such a file system is?

On 02/02/2017 11:39 AM, Ates Goral wrote:

I'm no data storage expert, but a layered/union file system could potentially help.

https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
https://en.wikipedia.org/wiki/Aufs
https://en.wikipedia.org/wiki/OverlayFS



@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @lh00000000 on February 6, 2017 4:10

is the problem limited storage capacity? since it seems like diffs between a snapshot a long time ago and a snapshot a long time + a day ago won't be as relevant as diffs between the latest snapshot and the second-to-latest snapshot, maybe some kind of strategy that archives snapshots to cheaper places like aws glacier (or s3 'Infrequent Access") in an automated way might be an alternative? keeping full snapshots is expensive storage-wise, but it's easier to grok / try new data-transformation ideas on later.
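For what it's worth, the tiering idea above maps directly onto S3 lifecycle rules, which can demote objects to Infrequent Access and then Glacier automatically. A minimal sketch, where the prefix and day thresholds are invented and the boto3 call is shown commented out because it needs real AWS credentials:

```python
# Hypothetical lifecycle policy: prefix, rule ID, and thresholds are made up.
lifecycle = {
    "Rules": [{
        "ID": "tier-old-snapshots",
        "Filter": {"Prefix": "snapshots/"},
        "Status": "Enabled",
        "Transitions": [
            # After 30 days, move to Infrequent Access (cheaper per GB,
            # but with a per-GB retrieval fee).
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            # After 180 days, move to Glacier (cheapest, slow retrieval).
            {"Days": 180, "StorageClass": "GLACIER"},
        ],
    }]
}

# Applying it would look like this (not run here; bucket name is invented):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="archive-snapshots", LifecycleConfiguration=lifecycle)
```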

@dcwalk
Contributor Author

dcwalk commented Mar 9, 2017

From @titaniumbones on February 11, 2017 0:07

also written on the plane and a little out of it, but, well, still posting:


@lh00000000 Sorry I missed this in a flood of notifications. Interesting idea. hmm. I'm wondering how common the situation is, in which an analyst will want to consult a long time series of diffs, and how expensive that would be to access.

I think time series data will be of interest to social scientists eventually, but for now maybe not so much. Maybe @trinberg or @ambergman have thoughts?

@dcwalk
Contributor Author

dcwalk commented Mar 12, 2017

@Mr0grog and @danielballan are you able to speak to where we are currently at in considering storage -- especially given that the 'local' pagefreezer deploy option I think postdates this conversation...?

@dcwalk
Contributor Author

dcwalk commented Mar 12, 2017

I'm also gonna paste over this comment from the similarly old #9:

Also, in terms of long-term storage, has Internet Archive's S3 API been considered? https://github.com/vmbrasseur/IAS3API

@dcwalk dcwalk changed the title efficient diff-based storage for archives Need efficient diff-based storage for archives Mar 12, 2017
@danielballan
Contributor

It's good to know about IAS3API. It sounds like we will be storing:

  • A one-time grab from Versionista, comprising legacy snapshots. This stash won't grow over time, so IMO it's not especially important how we store it.
  • PageFreezer data stored in a way determined by PageFreezer behind an API they provide. (We can revisit this once we know the details.)
  • A cache of any computed diffs that we want fast access to. Since these can always be regenerated, we will only store the ones of current interest and the cache should not grow unbounded.
  • Possibly a cache of frequently-referenced pages from IA, but these we can always delete and re-request later.

So, in summary, I don't think this project will end up in charge of any large-scale long-term storage. This is probably a good thing. I think this issue can be closed, unless I am confused about any of the above.
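On the "cache of computed diffs" point: because a diff is a pure function of two stored versions, a bounded cache is safe to evict from; dropped entries are simply recomputed on demand. A toy sketch, where the in-memory version store and IDs are invented stand-ins for real archive storage:

```python
import difflib
from functools import lru_cache

# Toy version store standing in for real archive storage (made-up data).
VERSIONS = {
    "v1": "<p>Budget: $10M</p>",
    "v2": "<p>Budget: $8M</p>",
}

@lru_cache(maxsize=1024)  # bounded cache: evicted diffs are just recomputed
def diff_versions(old_id: str, new_id: str) -> str:
    old = VERSIONS[old_id].splitlines()
    new = VERSIONS[new_id].splitlines()
    return "\n".join(difflib.unified_diff(old, new, lineterm=""))

d = diff_versions("v1", "v2")
assert "-<p>Budget: $10M</p>" in d
assert "+<p>Budget: $8M</p>" in d
```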

@Mr0grog
Member

Mr0grog commented Mar 12, 2017

I would pretty much echo @danielballan here.

For now, my naive thinking was to store every page version as an object on S3 or Google Cloud Storage—we are not remotely pushing the limits in terms of cost and capacity there. If we assumed the average size of a page was 50 kB (it's more likely 15–20 kB) and we assumed every page had a new version every day (not remotely true), the most expensive of those storage methods would still cost only about $1 a month (for roughly 45 GB/month of new data). And that is almost certainly way overestimating.

50 kB × 30,000 pages × 30 days = 45 GB; 45 GB × $0.023/GB/mo ≈ $1.04
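A quick check of that arithmetic (all inputs are the rough assumptions from the comment above, not measured values):

```python
# Back-of-envelope storage cost estimate.
page_kb = 50            # assumed average page size in kB (overestimate)
pages = 30_000          # assumed number of monitored pages
days = 30               # assume a new version of every page every day
usd_per_gb_month = 0.023  # assumed S3 Standard price per GB-month

total_gb = page_kb * pages * days / 1_000_000  # kB -> GB (decimal)
monthly_cost = total_gb * usd_per_gb_month
print(total_gb, monthly_cost)  # about 45 GB and roughly a dollar a month
```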

For context, I was just talking yesterday with a friend who manages storage and analysis of API logs at Mapbox, and they do hundreds of GB per day, all stored on S3. Even as we scale up, we are unlikely to push much more than that.

It's also worth noting that, if this storage is on S3 and people are running large analyses over it in EC2 (in the same data center, at least), transfer is free. I assume (but haven't checked) the same is true between Google Storage and Google Compute.

Also, do we have our own stores of data already out there somewhere? I've been focused on the Versionista situation, so am not aware of what we might already have stored ourselves.

PageFreezer data stored in a way determined by PageFreezer behind an API they provide. (We can revisit this once we know the details.)

@danielballan Are we actually going to have data stored at PageFreezer? My [very limited] understanding was that they were sending us archives, that we had to manage.

@danielballan
Contributor

Are we actually going to have data stored at PageFreezer?

That's not clear to me. At first they were sending us archives, but now it sounds like they might be giving us some custom-baked API. Or maybe that API only regards diffing, not version retrieval. We'll find out next week.

@ambergman

Unless I'm really confused, I'm pretty sure the cloud environment that PageFreezer is setting up for us has separate storage instances where our data will live. Definitely correct me if I'm wrong, @danielballan.

To answer your question, though, @Mr0grog: we don't quite have them set up yet, so none of the PageFreezer data is yet accessible on our own cloud storage. DongWoo from PageFreezer was doing some testing and suggested that we were burning through storage space more quickly than expected because of bigger files like videos. We may want (or need) to decide not to store those for now and come up with another place to dump them, or maybe just decide not to store them until we have a relationship with IA and don't have to worry about space (which would be a lovely, brave new world).

@dcwalk
Contributor Author

dcwalk commented Mar 18, 2017

Wonderful, thanks everyone for chiming in. I'm closing for now!

@dcwalk dcwalk closed this as completed Mar 18, 2017