Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse attachments from docket when available #718

Merged
merged 1 commit into from
Sep 8, 2023

Conversation

ttys0dev
Copy link
Contributor

@ttys0dev ttys0dev commented Sep 6, 2023

Needs some cleanup/testing but seems to work.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. A couple little comments. Two things I'm thinking about though:

  1. Do we need to upgrade the make_doc1_url functions? If not, I'd say let's not bother.

  2. I think it'd be better to get the court ID numbers from the prior item rather than from a lookup table. It feels tidier that way, to rely on the internal data rather than the external. Could you tweak it to do that and then remove the two big look ups? Sorry I crashed last night before realizing what you were up to.

Thanks again for this. Nice to have scraper updates.

juriscraper/pacer/docket_report.py Outdated Show resolved Hide resolved
juriscraper/pacer/docket_report.py Outdated Show resolved Hide resolved
juriscraper/pacer/docket_report.py Show resolved Hide resolved
@ttys0dev
Copy link
Contributor Author

ttys0dev commented Sep 7, 2023

  1. Do we need to upgrade the make_doc1_url functions? If not, I'd say let's not bother.

So I did this mostly as a safeguard to ensure the mapping tables are accurate and to add additional sanity checks for any code using these functions(ie for example the test code with bad/mismatched court_id's). Making court_id optional may also allow for simplifying some logic elsewhere in the codebase.

I think it'd be better to get the court ID numbers from the prior item rather than from a lookup table. It feels tidier that way, to rely on the internal data rather than the external. Could you tweak it to do that and then remove the two big look ups? Sorry I crashed last night before realizing what you were up to.

My previous approach of extracting from a previous item was making the code a good bit more complex and I suspect a bit slower as well due to the extra complexity, note that the merge function will error out if there's a court ID number mismatch so we're still effectively validating the court ID numbers are correct against previous entries. Actually having the ability to compute the full doc id entirely independently from the docket entry doc id here I think makes the merge more robust in terms of being able to validate that they match.

@mlissner
Copy link
Member

mlissner commented Sep 7, 2023

OK, let's clean up the other comments and we can land this. Please request my review when you think it's ready again.

@ttys0dev ttys0dev requested a review from mlissner September 7, 2023 22:19
@mlissner
Copy link
Member

mlissner commented Sep 8, 2023

I saw the review request, but there are still a few outstanding tweaks. Could you do those, please, and do another request?

@ttys0dev
Copy link
Contributor Author

ttys0dev commented Sep 8, 2023

there are still a few outstanding tweaks

I think I missed an example but I did add docstrings with explanations, the docstrings may not have been obvious with how github was showing the diffs in the review comment threads.

@mlissner
Copy link
Member

mlissner commented Sep 8, 2023

Great. Merging. Juriscraper doesn't get auto-released, so to get this live we need to cut a new version and then update it in CourtListener. If you're thinking of doing other parsing work (which we desperately need), I'd suggest landing those PRs first, then doing a release.

@mlissner mlissner merged commit 8f45e61 into freelawproject:main Sep 8, 2023
@ttys0dev ttys0dev deleted the docket-attachments branch September 8, 2023 06:32
@ttys0dev
Copy link
Contributor Author

ttys0dev commented Sep 8, 2023

I don't have any immediate plans for more parsing work here, planning to look at integrating this into courtlistener next(after cleaning up some of the docket insertion code) but would be good to have a release to make it easier to test the integration.

@mlissner
Copy link
Member

mlissner commented Sep 8, 2023

Could you try to use a git install for a bit so we can stay focused on Elastic Search?

@ttys0dev
Copy link
Contributor Author

ttys0dev commented Sep 8, 2023

Could you try to use a git install for a bit so we can stay focused on Elastic Search?

Sure, probably going to wait until some preliminary refactoring like #3120 is merged first anyways.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants