Need efficient diff-based storage for archives #8
From @atesgoral on February 2, 2017 16:39 I'm no data storage expert, but a layered/union file system could potentially help. https://docs.docker.com/engine/userguide/storagedriver/aufs-driver/
From @titaniumbones on February 3, 2017 5:58 agreed. do you know what the performance penalty for using such a file system would be?
From @lh00000000 on February 6, 2017 4:10 is the problem limited storage capacity? since diffs between a snapshot from a long time ago and the snapshot from a day later won't be as relevant as diffs between the latest snapshot and the second-to-latest, maybe some kind of strategy that automatically archives old snapshots to cheaper places like AWS Glacier (or S3 "Infrequent Access") might be an alternative? keeping full snapshots is expensive storage-wise, but it's easier to grok / try new data-transformation ideas on later.
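The tiering idea above can be sketched as an S3 lifecycle configuration. This is a minimal sketch, not anything the project has decided: the bucket name, the `snapshots/` prefix, and the 30/90-day cutoffs are all hypothetical.

```python
# Hypothetical lifecycle rule: move snapshots to Infrequent Access after
# 30 days and to Glacier after 90. Bucket, prefix, and cutoffs are made up.
lifecycle_config = {
    "Rules": [
        {
            "ID": "archive-old-snapshots",
            "Filter": {"Prefix": "snapshots/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With boto3, this would be applied roughly like so (bucket name hypothetical):
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="our-archive-bucket",
#     LifecycleConfiguration=lifecycle_config,
# )
```

The tradeoff @lh00000000 notes still applies: Glacier retrieval is slow and billed separately, so this only makes sense for snapshots analysts rarely touch.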
From @titaniumbones on February 11, 2017 0:07 also written on the plane and a little out of it, but, well, still posting: @lh00000000 Sorry I missed this in a flood of notifications. Interesting idea. hmm. I'm wondering how common the situation is, in which an analyst will want to consult a long time series of diffs, and how expensive that would be to access. I think time series data will be of interest to social scientists eventually, but for now maybe not so much. Maybe @trinberg or @ambergman have thoughts?
@Mr0grog and @danielballan are you able to speak to where we currently stand on storage -- especially given that the 'local' pagefreezer deploy option I think postdates this conversation...?
I'm also gonna paste over this comment from (similarly old #9):
It's good to know about IAS3API. It sounds like we will be storing:
So, in summary, I don't think this project will end up in charge of any large-scale long-term storage. This is probably a good thing. I think this issue can be closed, unless I am confused about any of the above.
I would pretty much echo @danielballan here. For now, my naive thinking was to store every page version as an object on S3 or Google Cloud Storage—we are not remotely pushing the limits in terms of cost and capacity there. If we assumed the average size of a page was 50 kB (it's more likely 15–20 kB) and we assumed every page had a new version every day (not remotely true), the most expensive of those storage methods would still be only about $1 a month (or 45 GB/month). And that is almost certainly way overestimating.
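As a back-of-envelope check of the figure above: the page count below is a hypothetical input chosen to reproduce the quoted 45 GB/month, and the per-GB price is the approximate 2017 S3 Standard rate, not a quote.

```python
# Back-of-envelope S3 cost estimate. Every input here is an assumption.
pages = 30_000            # hypothetical number of monitored pages
page_size_kb = 50         # assumed average page size (likely an overestimate)
versions_per_day = 1      # assumes every page changes daily (it doesn't)
days_per_month = 30

monthly_kb = pages * page_size_kb * versions_per_day * days_per_month
monthly_gb = monthly_kb / 1e6            # new data accumulated per month

s3_standard_per_gb_month = 0.023         # approximate 2017 S3 Standard price
monthly_cost = monthly_gb * s3_standard_per_gb_month

print(f"{monthly_gb:.0f} GB/month ~= ${monthly_cost:.2f}/month")
```

Since both the page size and the change frequency are deliberate overestimates, the real bill would land well under this.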
For context, I was just talking with a friend yesterday who manages storage and analysis of API logs at Mapbox, and they do hundreds of GB per day, all stored on S3. Even as we scale up, we are unlikely to push much more than that. It's also worth noting that, if this storage is on S3 and people are running large analyses over it in EC2 (in the same data center, at least), transfer is free. I assume (but haven't checked) the same is true between Google Storage and Google Compute. Also, do we have our own stores of data already out there somewhere? I've been focused on the Versionista situation, so am not aware of what we might already have stored ourselves.
@danielballan Are we actually going to have data stored at PageFreezer? My [very limited] understanding was that they were sending us archives, that we had to manage.
That's not clear to me. At first they were sending us archives, but now it sounds like they might be giving us some custom-baked API. Or maybe that API only concerns diffing, not version retrieval. We'll find out next week.
Unless I'm really confused, I'm pretty sure the cloud environment that PageFreezer is setting up for us has separate storage instances where our data will live - definitely correct me if I'm wrong @danielballan. To answer your question, though, @Mr0grog - we don't quite have them set up yet, so none of the PageFreezer data is yet accessible on our own cloud storage. DongWoo from PageFreezer did some testing and suggested that we were burning through storage space more quickly than expected because of bigger files like videos. We may want (or need) to decide not to store those for now and come up with another place to dump them - or maybe just decide not to store them until we have a relationship with IA and don't have to worry about space (which would be a lovely, brave new world)
Wonderful, thanks everyone for chiming in. I'm closing for now!
From @titaniumbones on February 2, 2017 15:25
It's not immediately obvious how to turn our giant stores of files into diffs so that we can minimize file storage/transfer issues. At present we have lots of data duplication. Looking for a data engineer here I think.
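One minimal sketch of what diff-based storage could look like, using only the Python standard library. This is line-oriented and illustrative only; a real system would more likely use binary deltas (e.g. xdelta or git-style packfiles), and the snapshot strings here are made up.

```python
import difflib

def make_delta(old: str, new: str) -> list[str]:
    """Store a line-level delta instead of the full new snapshot."""
    return list(difflib.ndiff(old.splitlines(keepends=True),
                              new.splitlines(keepends=True)))

def apply_delta(delta: list[str]) -> str:
    """Reconstruct the newer snapshot from a stored delta."""
    return "".join(difflib.restore(delta, 2))

# Hypothetical pair of page snapshots, one day apart.
old = "<h1>Climate Data</h1>\n<p>Updated daily.</p>\n"
new = "<h1>Climate Data</h1>\n<p>Updated weekly.</p>\n"

delta = make_delta(old, new)
assert apply_delta(delta) == new  # lossless round-trip
```

The catch, as the thread discusses, is the access pattern: reconstructing an old version means replaying a chain of deltas, which is exactly the cost/latency tradeoff raised above.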
Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#4