-
-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse attachments from docket when available #718
Conversation
8764cfb
to
f5b9e0b
Compare
2a24450
to
94e2cd0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks. A couple little comments. Two things I'm thinking about though:
-
Do we need to upgrade the make_doc1_url functions? If not, I'd say let's not bother.
-
I think it'd be better to get the court ID numbers from the prior item rather than from a lookup table. It feels tidier that way, to rely on the internal data rather than the external. Could you tweak it to do that and then remove the two big look ups? Sorry I crashed last night before realizing what you were up to.
Thanks again for this. Nice to have scraper updates.
So I did this mostly as a safeguard to ensure the mapping tables are accurate and to add additional sanity checks for any code using these functions(ie for example the test code with bad/mismatched court_id's). Making court_id optional may also allow for simplifying some logic elsewhere in the codebase.
My previous approach of extracting from a previous item was making the code a good bit more complex and I suspect a bit slower as well due to the extra complexity, note that the merge function will error out if there's a court ID number mismatch so we're still effectively validating the court ID numbers are correct against previous entries. Actually having the ability to compute the full doc id entirely independently from the docket entry doc id here I think makes the merge more robust in terms of being able to validate that they match. |
OK, let's clean up the other comments and we can land this. Please request my review when you think it's ready again. |
94e2cd0
to
715a5f6
Compare
I saw the review request, but there are still a few outstanding tweaks. Could you do those, please, and do another request? |
715a5f6
to
0ebd7e1
Compare
I think I missed an example but I did add docstrings with explanations, the docstrings may not have been obvious with how github was showing the diffs in the review comment threads. |
Great. Merging. Juriscraper doesn't get auto-released, so to get this live we need to cut a new version and then update it in CourtListener. If you're thinking of doing other parsing work (which we desperately need), I'd suggest landing those PRs first, then doing a release. |
I don't have any immediate plans for more parsing work here, planning to look at integrating this into courtlistener next(after cleaning up some of the docket insertion code) but would be good to have a release to make it easier to test the integration. |
Could you try to use a git install for a bit so we can stay focused on Elastic Search? |
Sure, probably going to wait until some preliminary refactoring like #3120 is merged first anyways. |
Needs some cleanup/testing but seems to work.