Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add first draft of CL search #2051

Merged
merged 20 commits into from
Jun 26, 2024
Merged

Conversation

teovin
Copy link
Contributor

@teovin teovin commented Jun 4, 2024

This is a WIP PR for the CL case XML -> HTML conversion integration.

Things that were done:

  • Activated the CourtListener legal document source in Django admin.
  • Added logic to grab xml_harvard fields from the opinions endpoint if the cluster has a filepath_json_harvard, otherwise the html field will be used (with plain_text as worst case scenario).
  • Enabled advanced search for CourtListener source by adding additional params logic to the source calls.
  • Added Jack's xml to html conversion script.
  • Ran the script against 4000 clusters/cases grabbed from CL API (2000 with source U, 2000 with source CU). Source descriptions here.
  • Made a change to the conversion script to handle cases where elements might be missing type and id attributes in the source xml.

A few bug fixes were made:

  • Handle cases where clusters might not have any citations. This was causing the search to error.
  • Update the format citations are being added to the legal doc. Previously they were being added in a json format instead of a list which didn't match how we display citations for other legal doc sources. Sample diff:

image

  • Made an update to the effectiveDate to prevent errors that are thrown if the CL API returns a date string longer than 25 chars, I saw that was the case for some clusters with the time offset including seconds.
  • Updated the search result ids with the cluster_ids. Because the ids and the cluster_ids do not match in the search endpoint results, the subsequent clusters endpoint call with id was erroring out.
  • Updated the opinions endpoint calls to use opinion ids (grabbed from cluster endpoint response sub_opinions field) since we need to look at all sub_opinions to construct the Harvard xml. Previously the search result id was being used.
  • I saw some cases where there wasn't any xml_harvard data, and no html, so I defaulted to use the plain_text field of the opinion.

Things to consider:

  • I mapped the cluster call response json to the metadata field like we do for other search sources. One thing I see Cap does is add the footnote regexes to metadata. Is this needed for CL?
  • I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?
  • I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?
  • Any edge cases that I should consider? Do some more testing around those.

Sample converted legal doc (chopped):

Screenshot 2024-06-12 at 10 26 44 AM

This is how it would look like if the elements I mentioned above weren't set to display: none.

Screenshot 2024-06-12 at 10 43 33 AM

A case that both CAP and CL return, and this is how they look like when imported (both chopped):

CAP (with display: none removed from .case-text .syllabus):

CAP

CourtListener (with display: none removed from elements in headmatter):

CL

@teovin teovin marked this pull request as ready for review June 17, 2024 17:51
@teovin teovin requested a review from a team as a code owner June 17, 2024 17:51
@teovin teovin requested review from rebeccacremona and removed request for a team June 17, 2024 17:51
@rebeccacremona rebeccacremona requested a review from jcushman June 18, 2024 14:59
Copy link
Contributor

@rebeccacremona rebeccacremona left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂

I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂

@@ -13,8 +13,13 @@
from django.conf import settings
from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
from pyquery import PyQuery
from .case_xml_converter import xml_to_html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really matter, but FWIW: I think this project's convention is to stick with absolute imports (e.g. from main.case_xml_converter import xml_to_html, as with from main.utils import ...) rather than relative imports like from .case_xml_converter. But yeah, totally doesn't matter.

if looks_like_citation(search_params.q)
else {"q": search_params.q}
)
params = CourtListener.cl_params(search_params)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny suggestion: what would you think of renaming CourtListener.cl_params to CourtListener.format_search_params or CourtListener.get_search_params?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most definitely, I will update

@teovin
Copy link
Contributor Author

teovin commented Jun 18, 2024

I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂

I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂

Thank you Becky, I addressed your suggestions in my last commit. And I will work on any changes that Jack might suggest, especially those around the questions I had as you mentioned.

@jcushman
Copy link
Contributor

I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?

We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.

So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

@jcushman
Copy link
Contributor

I haven't looked if you're doing this yet -- I think we'll want to record which courtlistener field was used to populate the case. For example I'm pretty sure if we do need footnote_regexes, we only need it if xml_harvard was the source.

Comment on lines 584 to 596
if cluster["filepath_json_harvard"]:
harvard_xml_data = ""
for sub_opinion in cluster["sub_opinions"]:
opinion = CourtListener.get_opinion_body(sub_opinion)
if opinion["xml_harvard"]:
opinion_xml = opinion["xml_harvard"].replace(
'<?xml version="1.0" encoding="utf-8"?>', ""
)
harvard_xml_data += f"{opinion_xml}\n"
case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data)
else:
opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0])
case_html = opinion["html"] if opinion["html"] else opinion["plain_text"]
Copy link
Contributor

@jcushman jcushman Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic here wants to be something like "use all xml_harvard if they exist, else all html if they exist, else all plain_text." So what if you do something like this?

Suggested change
if cluster["filepath_json_harvard"]:
harvard_xml_data = ""
for sub_opinion in cluster["sub_opinions"]:
opinion = CourtListener.get_opinion_body(sub_opinion)
if opinion["xml_harvard"]:
opinion_xml = opinion["xml_harvard"].replace(
'<?xml version="1.0" encoding="utf-8"?>', ""
)
harvard_xml_data += f"{opinion_xml}\n"
case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data)
else:
opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0])
case_html = opinion["html"] if opinion["html"] else opinion["plain_text"]
for cl_type in ('xml_harvard', 'html', 'plain_text'):
case_text = ''.join(sub_opinion[cl_type] for sub_opinion in cluster['sub_opinions'])
if case_text:
break
else:
# failed to find anything ...
# do stuff based on cl_type and case_text ...

I think this will also address your question

I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?

Are there clusters where none of the opinions have any of those fields? I think it might be just that some opinions have some of the fields and others have other fields.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember running into this scenario where neither of those fields were populated for an opinion, but darn, it looks like I didn't save its id. Also there is some info here about the available text fields, but it doesn't mention whether any of them will always be populated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example where the existing logic doesn't work: https://www.courtlistener.com/api/rest/v3/clusters/86480/

This has the full opinion text in the first opinion as "html", and then the full opinion text split into three parts in "xml_harvard." Checking type by type instead of opinion by opinion will handle this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jack, can you clarify the checking type by type a instead of opinion by opinion piece? To be able to see the content type ('xml_harvard', 'html', 'plain_text' etc) of the opinion, I still need to query the sub-opinion first as there is not an indicator on the cluster level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My big thought here is that it is never correct to assemble multiple types -- op[0]['plain_text'] + op[1]['html'] is always wrong. Logically what we're trying to do is find the chosen type and then glue it together. So I think this algorithm will turn out to be more robust against weird edge cases, and easy to get correct, if the logic is:

  • fetch all subopinions
  • from most to least preferred type, concatenate that type together from all subopinions. when you find one that isn't empty when concatenated, break, that's the chosen type.
  • process the concatenation as appropriate to the type

That's what I was trying to gesture at with my code sketch, though I skipped the step of pre-fetching all the subopinions.

By the way! I noticed that opinions aren't always sorted in the correct order: https://www.courtlistener.com/api/rest/v3/clusters/3390160/ . From CL slack it sounds like the best thing for now is to sort them by ID.

Copy link
Contributor Author

@teovin teovin Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented this and also added the sorting. My initial thought was to prevent querying the other opinions if the source was Harvard. Also, I had chosen the opinion with index 0 in the otherwise block because all of the clusters that weren't from Harvard had only 1 opinion (those that I used in testing). But it makes sense to account for the existence of multiples since I didn't check all of the existing clusters. Let me know how it looks now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, avoiding extra queries totally makes sense! I ended up seeing enough variation in the CL API that I like the idea of being more robust to edge cases, as well as having a clean way to add stuff later like "turns out we prefer the html_columbia field but it needs special processing." Thanks for switching it around.

Comment on lines 608 to 612
case_name = ""
if cluster["case_name"]:
case_name = cluster["case_name"]
elif cluster["case_name_full"]:
case_name = cluster["case_name_full"][:10000]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style note, I find this clearer as case_name = cluster["case_name"] or cluster["case_name_full"][:10000]

@jcushman
Copy link
Contributor

This looks great -- I think with updates it'll be good to test on stage.

@jcushman
Copy link
Contributor

... but we might want a feature flag since xml conversion isn't ready yet.

@teovin
Copy link
Contributor Author

teovin commented Jun 26, 2024

We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.

So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

I added a template for court listener modeling it after cap_header.html. One change I made to both was to remove the div with legal_doc.get_title as get_title method didn't exist, and so it wasn't rendering anything.

@teovin teovin requested a review from jcushman June 26, 2024 13:30
Copy link
Contributor

@jcushman jcushman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Just one request to switch the order we check courtlistener case types, then I think we're good to try this on stage.

)
resp.raise_for_status()
cluster["html_info"] = {"source": "court listener"}
cluster["sub_opinions"].sort(key=lambda url: int(url.split("/")[-2]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

sub_opinion_jsons.append(CourtListener.get_opinion_body(opinion))

text_source = ""
for content_type in ("xml_harvard", "html", "plain_text"):
Copy link
Contributor

@jcushman jcushman Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at https://www.courtlistener.com/help/api/rest/v3/case-law/#opinion-endpoint , I'm realizing that "html" and "plain_text" are the least preferred fields, and html_with_citations is the most preferred field. What if we use this preference:

  • First prefer xml_harvard, because we already know how to render it nicely on h2o.
  • Then fields in the preferred order at that link starting with html_with_citations.

It looks to me like the citations markup [that they add when creating html_with_citations] is harmless and we could just pass it through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I just added all text field options in the order CL specifies with the exception of xml_harvard.

Copy link
Contributor

@jcushman jcushman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, let's try it!

@teovin teovin merged commit 3d7b319 into harvard-lil:develop Jun 26, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants