add first draft of CL search #2051
Conversation
I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂
I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂
web/main/legal_document_sources.py
Outdated
@@ -13,8 +13,13 @@
from django.conf import settings
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
from pyquery import PyQuery
from .case_xml_converter import xml_to_html
This doesn't really matter, but FWIW: I think this project's convention is to stick with absolute imports (e.g. from main.case_xml_converter import xml_to_html, as with from main.utils import ...) rather than relative imports like from .case_xml_converter. But yeah, totally doesn't matter.
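For quick reference, the two spellings side by side; both import paths are taken verbatim from the comment above, nothing new is introduced:

# Relative import, as currently written in the diff:
from .case_xml_converter import xml_to_html

# Absolute import, matching the project convention described above:
from main.case_xml_converter import xml_to_html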
web/main/legal_document_sources.py
Outdated
    if looks_like_citation(search_params.q)
    else {"q": search_params.q}
)
params = CourtListener.cl_params(search_params)
Tiny suggestion: what would you think of renaming CourtListener.cl_params to CourtListener.format_search_params or CourtListener.get_search_params?
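Concretely, the call site in the diff would just change like this (either suggested name works; nothing beyond the rename is implied):

# Current call, as in the diff:
params = CourtListener.cl_params(search_params)

# After one of the suggested renames:
params = CourtListener.get_search_params(search_params)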
Most definitely, I will update.
Thank you, Becky. I addressed your suggestions in my last commit. And I will work on any changes that Jack might suggest, especially those around the questions I had, as you mentioned.
We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out [...]. So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.
I haven't looked at whether you're doing this yet -- I think we'll want to record which courtlistener field was used to populate the case. For example, I'm pretty sure that if we do need footnote_regexes, we only need it if xml_harvard was the source.
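A minimal sketch of what "record which field was used" could look like, extending the html_info dict that appears later in this PR; the "field" key and the footnote check are illustrative assumptions, not existing code, and content_type is the loop variable from the later diff:

# Hypothetical: remember which CourtListener text field populated the case.
cluster["html_info"] = {"source": "court listener", "field": content_type}

# footnote_regexes would then only come into play when Harvard XML was the source,
# per the comment above:
if cluster["html_info"].get("field") == "xml_harvard":
    pass  # apply footnote_regexes here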
web/main/legal_document_sources.py
Outdated
if cluster["filepath_json_harvard"]: | ||
harvard_xml_data = "" | ||
for sub_opinion in cluster["sub_opinions"]: | ||
opinion = CourtListener.get_opinion_body(sub_opinion) | ||
if opinion["xml_harvard"]: | ||
opinion_xml = opinion["xml_harvard"].replace( | ||
'<?xml version="1.0" encoding="utf-8"?>', "" | ||
) | ||
harvard_xml_data += f"{opinion_xml}\n" | ||
case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data) | ||
else: | ||
opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0]) | ||
case_html = opinion["html"] if opinion["html"] else opinion["plain_text"] |
The logic here wants to be something like "use all xml_harvard if they exist, else all html if they exist, else all plain_text." So what if you do something like this?
if cluster["filepath_json_harvard"]: | |
harvard_xml_data = "" | |
for sub_opinion in cluster["sub_opinions"]: | |
opinion = CourtListener.get_opinion_body(sub_opinion) | |
if opinion["xml_harvard"]: | |
opinion_xml = opinion["xml_harvard"].replace( | |
'<?xml version="1.0" encoding="utf-8"?>', "" | |
) | |
harvard_xml_data += f"{opinion_xml}\n" | |
case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data) | |
else: | |
opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0]) | |
case_html = opinion["html"] if opinion["html"] else opinion["plain_text"] | |
for cl_type in ('xml_harvard', 'html', 'plain_text'): | |
case_text = ''.join(sub_opinion[cl_type] for sub_opinion in cluster['sub_opinions']) | |
if case_text: | |
break | |
else: | |
# failed to find anything ... | |
# do stuff based on cl_type and case_text ... |
I think this will also address your question:
I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?
Are there clusters where none of the opinions have any of those fields? I think it might be just that some opinions have some of the fields and others have other fields.
I remember running into this scenario where neither of those fields was populated for an opinion, but darn, it looks like I didn't save its id. Also there is some info here about the available text fields, but it doesn't mention whether any of them will always be populated.
Here's an example where the existing logic doesn't work: https://www.courtlistener.com/api/rest/v3/clusters/86480/
This has the full opinion text in the first opinion as "html", and then the full opinion text split into three parts in "xml_harvard." Checking type by type instead of opinion by opinion will handle this.
Jack, can you clarify the "checking type by type instead of opinion by opinion" piece? To be able to see the content type ('xml_harvard', 'html', 'plain_text', etc.) of an opinion, I still need to query the sub-opinion first, as there is no indicator at the cluster level.
My big thought here is that it is never correct to assemble multiple types -- op[0]['plain_text'] + op[1]['html'] is always wrong. Logically what we're trying to do is find the chosen type and then glue it together. So I think this algorithm will turn out to be more robust against weird edge cases, and easy to get correct, if the logic is:
- fetch all subopinions
- from most to least preferred type, concatenate that type together from all subopinions. when you find one that isn't empty when concatenated, break, that's the chosen type.
- process the concatenation as appropriate to the type
That's what I was trying to gesture at with my code sketch, though I skipped the step of pre-fetching all the subopinions.
By the way! I noticed that opinions aren't always sorted in the correct order: https://www.courtlistener.com/api/rest/v3/clusters/3390160/. From CL slack it sounds like the best thing for now is to sort them by ID.
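Here's a compact sketch of that flow, including the sort by ID. It assumes cluster["sub_opinions"] is a list of opinion URLs and reuses CourtListener.get_opinion_body from the diff above; the function name and the empty-string fallback are illustrative only:

def pick_cluster_text(cluster):
    # Sort sub-opinion URLs by their numeric ID, since the API order isn't reliable.
    urls = sorted(cluster["sub_opinions"], key=lambda url: int(url.split("/")[-2]))
    # Pre-fetch every sub-opinion once.
    opinions = [CourtListener.get_opinion_body(url) for url in urls]
    # From most to least preferred field, concatenate that field across all
    # sub-opinions; the first non-empty concatenation is the chosen type.
    for cl_type in ("xml_harvard", "html", "plain_text"):
        case_text = "".join(op.get(cl_type) or "" for op in opinions)
        if case_text:
            return cl_type, case_text
    return None, ""  # nothing usable found in any sub-opinion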
I implemented this and also added the sorting. My initial thought was to prevent querying the other opinions if the source was Harvard. Also, I had chosen the opinion with index 0 in the otherwise block because all of the clusters that weren't from Harvard had only 1 opinion (those that I used in testing). But it makes sense to account for the existence of multiples since I didn't check all of the existing clusters. Let me know how it looks now.
Yeah, avoiding extra queries totally makes sense! I ended up seeing enough variation in the CL API that I like the idea of being more robust to edge cases, as well as having a clean way to add stuff later like "turns out we prefer the html_columbia field but it needs special processing." Thanks for switching it around.
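One way that "add special handling later" idea could stay tidy is a single per-field dispatch point. This is only a sketch: the function name is hypothetical, only prepare_case_html exists in this PR, and the html_columbia branch is purely a placeholder:

def process_case_text(cluster, cl_type, case_text):
    # Harvard XML goes through the converter path that already exists in this PR.
    if cl_type == "xml_harvard":
        return CourtListener.prepare_case_html(cluster, case_text)
    # Placeholder for fields that may need their own handling later,
    # e.g. html_columbia per the comment above.
    return case_text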
web/main/legal_document_sources.py
Outdated
case_name = "" | ||
if cluster["case_name"]: | ||
case_name = cluster["case_name"] | ||
elif cluster["case_name_full"]: | ||
case_name = cluster["case_name_full"][:10000] |
Style note: I find this clearer as case_name = cluster["case_name"] or cluster["case_name_full"][:10000].
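Side by side, that's just collapsing the diff above into one line; the two behave the same as long as both fields come back as strings, which is an assumption here rather than something the API guarantees:

# As in the diff:
case_name = ""
if cluster["case_name"]:
    case_name = cluster["case_name"]
elif cluster["case_name_full"]:
    case_name = cluster["case_name_full"][:10000]

# Suggested one-liner:
case_name = cluster["case_name"] or cluster["case_name_full"][:10000]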
This looks great -- I think with updates it'll be good to test on stage.
... but we might want a feature flag, since xml conversion isn't ready yet.
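A feature flag could be as simple as a settings toggle checked before exposing the CourtListener source; the flag name below is hypothetical, and h2o may already have its own flag mechanism:

from django.conf import settings

# Hypothetical flag; only enable the CourtListener legal document source when it's on.
if getattr(settings, "COURTLISTENER_SEARCH_ENABLED", False):
    pass  # register/enable the CourtListener source here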
I added a template for CourtListener, modeling it after cap_header.html. One change I made to both was to remove the div with legal_doc.get_title, as the get_title method didn't exist and so it wasn't rendering anything.
Cool! Just one request to switch the order we check courtlistener case types, then I think we're good to try this on stage.
)
resp.raise_for_status()
cluster["html_info"] = {"source": "court listener"}
cluster["sub_opinions"].sort(key=lambda url: int(url.split("/")[-2]))
Nice!
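For anyone reading the lambda later: it pulls the numeric opinion ID out of the API URL, so the sort is effectively by ID. The URL shape below is assumed from the expression itself, not verified against the API:

url = "https://www.courtlistener.com/api/rest/v3/opinions/12345/"
int(url.split("/")[-2])  # -> 12345, the opinion ID used as the sort key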
web/main/legal_document_sources.py
Outdated
sub_opinion_jsons.append(CourtListener.get_opinion_body(opinion))

text_source = ""
for content_type in ("xml_harvard", "html", "plain_text"):
Looking at https://www.courtlistener.com/help/api/rest/v3/case-law/#opinion-endpoint, I'm realizing that "html" and "plain_text" are the least preferred fields, and html_with_citations is the most preferred field. What if we use this preference:
- First prefer xml_harvard, because we already know how to render it nicely on h2o.
- Then fields in the preferred order at that link starting with html_with_citations.
It looks to me like the citations markup [that they add when creating html_with_citations] is harmless and we could just pass it through.
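Under that suggestion, the content_type tuple from the diff above would expand to something like the following. This is a sketch only: it lists just the field names already mentioned in this thread, and the exact tail order should come from the CL docs linked above rather than from here:

# xml_harvard stays first, then CL's own preference starting with html_with_citations.
for content_type in ("xml_harvard", "html_with_citations", "html_columbia", "html", "plain_text"):
    pass  # same per-field handling as in the diff above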
Cool, I just added all text field options in the order CL specifies, with the exception of xml_harvard.
Awesome, let's try it!
This is a WIP PR for the CL case XML -> HTML conversion integration.

Things that were done:
- The xml_harvard fields from the opinions endpoint are used if the cluster has a filepath_json_harvard; otherwise the html field will be used (with plain_text as the worst-case scenario).
- [...] with source U, 2000 with source CU). Source descriptions here.
- [...] type and id attributes in the source xml.

A few bug fixes were made:
- Fixed effectiveDate handling to prevent errors that are thrown if the CL API returns a date string longer than 25 chars; I saw that was the case for some clusters with the time offset including seconds.
- Replaced the ids with the cluster_ids. Because the ids and the cluster_ids do not match in the search endpoint results, the subsequent clusters endpoint call with id was erroring out.
- Switched to the opinion ids (grabbed from the cluster endpoint response sub_opinions field), since we need to look at all sub_opinions to construct the Harvard xml. Previously the search result id was being used.
- Some opinions had no xml_harvard data and no html, so I defaulted to using the plain_text field of the opinion.

Things to consider:

Sample converted legal doc (chopped):

This is how it would look if the elements I mentioned above weren't set to display: none.

A case that both CAP and CL return, and how each looks when imported (both chopped):

CAP (with display: none removed from .case-text .syllabus):

CourtListener (with display: none removed from elements in headmatter):