Build PageFreezer-Outputter that fits into current Versionista workflow #9
From @allanpichardo on February 10, 2017 15:07 From what I've observed with the PageFreezer API, taking a diff of two pages takes an average of 5 seconds. If there are ~30,000 pages to monitor, then PageFreezer is probably not the most appropriate diff service for this task. I think the main bottleneck in PageFreezer is that they transcode the diff information into HTML for every request. I have run similar diffs on my machine using git diff, and it usually takes one second or less. Here's what I suggest: […]
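For concreteness, here is a minimal sketch of the kind of git-based diff Allan describes, in Python (the snapshot paths are hypothetical; `--no-index` lets git compare files that aren't in a repository):

```python
import subprocess
import time

def git_diff(old_path, new_path):
    """Diff two HTML snapshots with git and return the unified-diff text.

    git exits with status 1 when the files differ, so we deliberately
    don't raise on a nonzero exit code.
    """
    result = subprocess.run(
        ["git", "diff", "--no-index", "--", old_path, new_path],
        capture_output=True, text=True,
    )
    return result.stdout

# Hypothetical snapshot paths, purely for illustration.
start = time.perf_counter()
diff = git_diff("archive/epa.gov/climate.html", "live/epa.gov/climate.html")
print(f"diffed in {time.perf_counter() - start:.2f}s ({len(diff)} chars of diff)")
```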
From @titaniumbones on February 10, 2017 15:14 I love the idea. Question: what if the HTML (DOM) context is what tells us whether a diff is significant?
From @allanpichardo on February 10, 2017 15:29 @titaniumbones Yeah, I suspect that it will. If we do this with Node.js, then we have the option of using jQuery to parse the HTML archives. We can determine that certain HTML nodes are insignificant, such as […]. The diff would look like a regular git diff, but then the visualizer would have some logic that could convert the […]. When viewing the page in the visualizer, an analyst would have the option of selecting a DOM element and saying that it's insignificant, thus adding it to an ongoing list that would be fed back to the CLI on the next cycle.
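The thread imagines doing this with jQuery under Node.js; as an illustration only, here is an equivalent sketch in Python with BeautifulSoup, where the selector list stands in for the analyst-maintained "insignificant" list described above:

```python
from bs4 import BeautifulSoup

# Hypothetical analyst-maintained list of CSS selectors to ignore,
# fed back from the visualizer to the CLI on each cycle.
INSIGNIFICANT_SELECTORS = ["script", "style", ".ad-banner", "#last-updated"]

def strip_insignificant(html: str) -> str:
    """Remove DOM nodes that analysts have marked as insignificant,
    so they never show up in the diff at all."""
    soup = BeautifulSoup(html, "html.parser")
    for selector in INSIGNIFICANT_SELECTORS:
        for node in soup.select(selector):
            node.decompose()
    return soup.prettify()
```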
From @ambergman on February 10, 2017 16:28 @allanpichardo Really disappointing to hear the PageFreezer API moves so slowly but, as you've described, it seems like we have plenty of other options (and, of course, we always knew we had git as a backup). Your 1-3 above sound really great, and I think it makes perfect sense, as you said in point 3, to only parse the diff in the visualizer when it's called back up by an analyst. Regarding your last comment - I think it sounds great to have the option in the visualizer, maybe in some simple dropdown form, of marking a diff as insignificant. That'll mean everything lives just in that simple visualizer, and everything else will run in the CLI. The only other thing to add, then, would be to perhaps have a couple of different visualization options, perhaps a "side-by-side" or "in-line" page view for changes, but then also a "changes only" view (very useful for huge pages with only a few changes). I'll write something about that in issue #19, the visualization issue, as well.
From @titaniumbones on February 10, 2017 17:00 I think this is great. In your opinion, are there pieces of this I should ask folks to work on in SF tomorrow?
From @allanpichardo on February 10, 2017 17:15 @ambergman Yes, here's what I see overall at a high level. I understand that the 30,000 URLs are kept in the spreadsheet. Those are live URLs. Where are the previous versions archived? Are they held remotely on another server? I ask because if it's possible to have a structure such that: […]
@titaniumbones If this architecture is something we can work with, then maybe the SF group can set up the directory structure, put some test files in it, and start a CLI utility that would read the files, use git to diff them against the live URLs, and save the diffs into another directory.
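Because the archive's directory structure mirrors the live site (as @allanpichardo confirms below), the core of such a CLI is a path-to-URL mapping. A sketch, with a hypothetical archive root:

```python
from pathlib import Path
from urllib.request import urlopen

ARCHIVE_ROOT = Path("archive")  # hypothetical root of the unzipped snapshot

def path_to_url(path: Path) -> str:
    """archive/www.epa.gov/climate/index.html -> https://www.epa.gov/climate/index.html
    Assumes the archive tree mirrors the remote directory structure exactly."""
    return "https://" + "/".join(path.relative_to(ARCHIVE_ROOT).parts)

for page in ARCHIVE_ROOT.rglob("*.html"):
    live_html = urlopen(path_to_url(page)).read().decode("utf-8", errors="replace")
    # ...save live_html to a temp file and git-diff it against `page`,
    # writing any nonempty diff into a separate diffs/ directory.
```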
From @titaniumbones on February 10, 2017 17:40 Allan Pichardo [email protected] writes: […]

We don't know where the long-term storage will be. We have talked to a […] http://edgistorage.hackinghistory.ca/ You can download it yourself there, but the large zipfile is about 7 GB. I've also unzipped the zipfile in […]. I think there's something about this in the docs in […].

Take a look at the zipfile structure. Probably we could do that, but it […].

Yeah, sounds great. I think Ruby and Python also have DOM-aware diff […].

Whatever solution we come up with, we will need to make all this stuff […].

Yup, sounds great.
From @allanpichardo on February 10, 2017 20:30 @titaniumbones The directory structure from the zip files will work because the URL is preserved in the file structure. So it seems that the directories per domain mirror the exact directory structure that is remote, which is good for knowing what compares to what. The only issue is that the archives come with a lot of other files that we don't need, but that's OK: for this purpose, we can just traverse the tree, take the HTML files, and ignore the others. So I suppose, wherever this service runs, it could download an archive zip, extract it, and run through it creating the diffs and updating the spreadsheet. When the process is done, it can delete the downloaded archive. Then rinse and repeat.
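A sketch of that cycle (the zip filename is hypothetical, and the two helpers are placeholders: `diff_against_live` would be the git-diff step sketched earlier, and `append_spreadsheet_row` the spreadsheet update):

```python
import shutil
import tempfile
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

ARCHIVE_URL = "http://edgistorage.hackinghistory.ca/archive.zip"  # hypothetical filename

def run_cycle(diff_against_live, append_spreadsheet_row):
    workdir = Path(tempfile.mkdtemp())
    try:
        zip_path = workdir / "archive.zip"
        urlretrieve(ARCHIVE_URL, str(zip_path))              # download an archive zip
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(workdir / "archive")               # extract it
        for page in (workdir / "archive").rglob("*.html"):   # HTML only; ignore the rest
            diff = diff_against_live(page)
            if diff:
                append_spreadsheet_row(page, diff)           # update the spreadsheet
    finally:
        shutil.rmtree(workdir)                               # delete the downloaded archive
# ...then rinse and repeat on the next cycle.
```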
From @lh00000000 on February 10, 2017 23:41 Concerns: […]

I've been kind of playing around with the idea of an architecture that relies on S3 and AWS Lambdas (to avoid running-server costs). I wanted to get more details on the current situation to avoid proposing solutions to problems already solved, but maybe it's better to just spit it out. Two pieces:

A. the archiving / diff-emitter system […]

B. the diff alerting/pre-filtering/exploration system […]

(If you're not familiar with AWS Lambda, it's an AWS service that lets you upload code for single Node or Python functions, which are invoked according to "triggers" you define (such as a new object being put in S3 or an HTTP request to some static URL). They're kind of a pain in the butt, but the payoff is that you only pay $0.000000208-$0.000002501 per call, and AWS will take care of handling scale-outs for unexpected bursts of load.)
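To make piece A concrete, here is a minimal sketch of what such a Lambda might look like (the bucket names, key layout, and stand-in diff function are all assumptions, not a worked-out design):

```python
import difflib

import boto3

s3 = boto3.client("s3")

def compute_diff(old_page: bytes, new_page: bytes) -> str:
    """Stand-in diff; the real system might shell out to git or call a
    DOM-aware differ instead."""
    return "\n".join(difflib.unified_diff(
        old_page.decode("utf-8", errors="replace").splitlines(),
        new_page.decode("utf-8", errors="replace").splitlines(),
    ))

def handler(event, context):
    """Lambda entry point wired to an S3 ObjectCreated trigger: when a new
    snapshot object lands, diff it against the prior one and emit the result."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        new_page = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # Hypothetical key layout: the previous snapshot lives under previous/.
        old_page = s3.get_object(Bucket=bucket, Key="previous/" + key)["Body"].read()
        diff = compute_diff(old_page, new_page)
        if diff:
            s3.put_object(Bucket="diff-results", Key=key + ".diff",
                          Body=diff.encode("utf-8"))  # hypothetical results bucket
```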
From @leepro on February 13, 2017 18:23 FYI, I am from PageFreezer. To clarify the slowness of our Diff API, I would like to give some idea of what is behind it. As @allanpichardo mentioned, one API call takes around 5 seconds on average. Actually, that is due to the network latency of AWS Lambda / API Gateway. For the purposes of this project, we took the diff service from our production system and made an AWS Lambda version of it. Our internal benchmark is as follows: […]

So, to use the diff API with a large number of files, I recommend using multi-threaded (or event-driven) client code to make the API calls; AWS Lambda will serve them at scale. Comparing 30k files with 100 threads should take about 41 minutes. The multi-threaded client doesn't do anything except make a connection to the API and wait for its result. As a note, whatever tool or algorithm you use to compute the diff, the time it takes will be roughly proportional to the source's HTML structure (number of DOM nodes and text size). EOF
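A sketch of such a client in Python (the endpoint URL and payload fields are placeholders, not PageFreezer's actual API):

```python
import concurrent.futures

import requests

DIFF_API = "https://example.com/diff"  # placeholder endpoint, not the real API

def request_diff(pair):
    old_url, new_url = pair
    # Each worker just opens a connection and waits on the result, so 100
    # threads keep roughly 100 Lambda invocations in flight at once.
    resp = requests.post(DIFF_API, json={"url1": old_url, "url2": new_url}, timeout=60)
    resp.raise_for_status()
    return resp.json()

# One (old, new) pair per monitored page; ~30k of these in practice.
pairs = [("https://archive.example/a.html", "https://live.example/a.html")]

with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    results = list(pool.map(request_diff, pairs))
```

At roughly 5 seconds per call and 100 requests in flight, 30k comparisons complete in well under an hour, consistent with the estimate above.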
From @mriedijk on February 14, 2017 0:50 In addition: consider bulk-uploading the HTML pages to AWS first, before you use the PageFreezer Diff service; it may decrease the network latency.
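A minimal sketch of that bulk upload with boto3 (the bucket name and key prefix are placeholders):

```python
from pathlib import Path

import boto3

s3 = boto3.client("s3")
for page in Path("archive").rglob("*.html"):
    # With the snapshots already in S3, each diff call stays inside AWS
    # instead of shipping page bodies from the client on every request.
    s3.upload_file(str(page), "page-snapshots", f"snapshots/{page.as_posix()}")
```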
From @mekarpeles on February 17, 2017 23:34 Have you folks talked to @markjohngraham about how some of these HTML diff problems are (or could be) addressed in the Wayback Machine -- web.archive.org?
From @mekarpeles on February 17, 2017 23:37 Also, in terms of long-term storage, has Internet Archive's S3 API been considered? https://github.com/vmbrasseur/IAS3API
From @titaniumbones on February 18, 2017 1:58 @mekarpeles Have not talked to @markjohngraham, but we've been talking to Jefferson a little about IA as the end-game for this effort. Haven't gotten into the nitty-gritty, but clearly IA seems like the best home for this effort in the long run! Sorry to have missed you in SF! I was looking forward to meeting you but missed the connection somehow.
From @mekarpeles on February 18, 2017 2:05 As long as everything gets backed up and the community is able to find a way to produce a registry so other institutions can cross-check and participate, I'm a super happy camper! Very thankful for your and your team's efforts. Sorry to have missed you in SF as well! @flyingzumwalt had great things to say about you :)
This is such a great conversation about decisions in the project! I'm having a hard time knowing what the next step from here is with our current architecture/implementation. I'm seeing a few TODOs, which I have moved into different issues: […]

But what else from this is left as an 'issue' to address? @titaniumbones or @danielballan, your guidance here would be appreciated.
I think this can be closed. |
Closing per @danielballan. |
From @ambergman on February 10, 2017 8:05
To replicate the Versionista workflow with the new PageFreezer archives, we need a little module that takes as input a diff already returned (either by PageFreezer's server or another diff service we build), and simply outputs a row in a CSV, as our versionista-outputter already does. If a particular URL has not been altered, then the diff being returned as input to this module should be null, and no row should be output. Please see @danielballan's issue summarizing the Versionista workflow for reference.
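A minimal sketch of that module's contract in Python (the column names are invented placeholders; the real versionista-outputter columns are promised in the follow-up below):

```python
import csv

FIELDS = ["url", "checked_at", "diff"]  # placeholder columns

def maybe_write_row(writer: csv.DictWriter, url: str, checked_at: str, diff):
    """Emit one CSV row per changed page. A null diff means the URL was
    unaltered, so nothing is written for it."""
    if diff is None:
        return
    writer.writerow({"url": url, "checked_at": checked_at, "diff": diff})

with open("changes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    maybe_write_row(writer, "https://www.epa.gov/climate", "2017-02-10", None)  # no row
```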
I'll follow up soon with a list of the current columns being output to the CSV by the versionista-outputter, a short description of how the analysts use the CSV they're working with, and some screenshots of what everything looks like, for clarity.
Copied from original issue: edgi-govdata-archiving/web-monitoring-ui#17