add first draft of CL search #2051

teovin · 2024-06-04T15:39:36Z

This is a WIP PR for the CL case XML -> HTML conversion integration.

Things that were done:

Activated the CourtListener legal document source in Django admin.
Added logic to use the xml_harvard field from the opinions endpoint when the cluster has a filepath_json_harvard; otherwise the html field is used, with plain_text as the last resort.
Enabled advanced search for CourtListener source by adding additional params logic to the source calls.
Added Jack's xml to html conversion script.
Ran the script against 4000 clusters/cases grabbed from CL API (2000 with source U, 2000 with source CU). Source descriptions here.
Made a change to the conversion script to handle cases where elements might be missing type and id attributes in the source xml.
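
The last item, tolerating missing attributes, can be illustrated with the standard library. This is only a sketch and not the conversion script from this PR: the sample XML and the `describe` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second element lacks both `type` and `id`.
SAMPLE = """<casebody>
  <opinion type="majority" id="p1">First opinion</opinion>
  <opinion>Second opinion</opinion>
</casebody>"""


def describe(element):
    """Return (type, id) for an element, using None for absent attributes."""
    # Element.get returns None for a missing attribute instead of raising
    # KeyError, which element.attrib["type"] would do.
    return element.get("type"), element.get("id")


root = ET.fromstring(SAMPLE)
described = [describe(el) for el in root.iter("opinion")]
print(described)  # [('majority', 'p1'), (None, None)]
```

Defaulting to None keeps the decision about what to do with an unlabeled element in one place, downstream of parsing.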

A few bug fixes were made:

Handled cases where clusters might not have any citations. This was causing the search to error.
Updated the format in which citations are added to the legal doc. Previously they were added as JSON rather than as a list, which didn't match how we display citations for other legal doc sources. Sample diff:

Updated the effectiveDate handling to prevent errors thrown when the CL API returns a date string longer than 25 characters. I saw this for some clusters whose time offset includes seconds.
Updated the search result ids with the cluster_ids. Because the ids and the cluster_ids do not match in the search endpoint results, the subsequent clusters endpoint call with id was erroring out.
Updated the opinions endpoint calls to use opinion ids (grabbed from cluster endpoint response sub_opinions field) since we need to look at all sub_opinions to construct the Harvard xml. Previously the search result id was being used.
I saw some cases where there wasn't any xml_harvard data, and no html, so I defaulted to use the plain_text field of the opinion.
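
Two of the fixes above, the citation handling and the long date strings, might look roughly like the sketch below. It is not the code from this PR. Both helper names are hypothetical, and the `volume`/`reporter`/`page` keys and the example timestamp are assumptions about the shape of the CL response.

```python
from datetime import datetime


def normalize_effective_date(raw):
    """Reduce an ISO 8601 timestamp to a plain YYYY-MM-DD string."""
    if not raw:
        return None
    # datetime.fromisoformat accepts UTC offsets that include seconds,
    # such as "-07:52:58", so the longer strings parse without special casing.
    return datetime.fromisoformat(raw).date().isoformat()


def format_citations(cluster):
    """Return citations as a list of strings; an empty list when there are none."""
    # `or []` covers both a missing key and an explicit None or empty list.
    citations = cluster.get("citations") or []
    # Assumed keys: volume, reporter, page.
    return [f'{c["volume"]} {c["reporter"]} {c["page"]}' for c in citations]


# Hypothetical 28-character timestamp whose offset includes seconds.
long_date = "1850-01-01T00:00:00-07:52:58"
print(len(long_date), normalize_effective_date(long_date))
print(format_citations({"citations": []}))
```

Parsing the timestamp instead of slicing it to a fixed width avoids depending on the exact length of the offset.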

Things to consider:

I mapped the cluster call response json to the metadata field like we do for other search sources. One thing I see CAP does is add the footnote regexes to metadata. Is this needed for CL?
I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?
I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?
Any edge cases that I should consider? Do some more testing around those.

Sample converted legal doc (chopped):

This is what it would look like if the elements I mentioned above weren't set to `display: none`.

Here is a case that both CAP and CL return, and how each looks when imported (both chopped):

CAP (with display: none removed from .case-text .syllabus):

CourtListener (with display: none removed from elements in headmatter):

rebeccacremona

I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂

I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂

rebeccacremona · 2024-06-18T15:08:53Z

web/main/legal_document_sources.py

@@ -13,8 +13,13 @@
 from django.conf import settings
 from django.contrib.postgres.search import SearchQuery, SearchRank, SearchVector
 from pyquery import PyQuery
+from .case_xml_converter import xml_to_html


This doesn't really matter, but FWIW: I think this project's convention is to stick with absolute imports (e.g. from main.case_xml_converter import xml_to_html, as with from main.utils import ...) rather than relative imports like from .case_xml_converter. But yeah, totally doesn't matter.

rebeccacremona · 2024-06-18T15:14:50Z

web/main/legal_document_sources.py

-                if looks_like_citation(search_params.q)
-                else {"q": search_params.q}
-            )
+            params = CourtListener.cl_params(search_params)


Tiny suggestion: what would you think of renaming CourtListener.cl_params to CourtListener.format_search_params or CourtListener.get_search_params?

teovin · 2024-06-18

Most definitely, I will update.

teovin · 2024-06-18T15:40:28Z

> I just did a quick pass for code style, and LGTM! Left two tiny suggestions 🙂
>
> I also took the liberty of adding Jack as a reviewer, who I expect might be more equipped than me to address your more detailed questions 🙂

Thank you Becky, I addressed your suggestions in my last commit. And I will work on any changes that Jack might suggest, especially those around the questions I had as you mentioned.

jcushman · 2024-06-18T17:43:34Z

> I noticed we are hiding some elements like .parties, .decisiondate and .docketnumber in case-text class. What's the reasoning behind this?

We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.

So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

jcushman · 2024-06-18T17:48:05Z

I haven't looked if you're doing this yet -- I think we'll want to record which courtlistener field was used to populate the case. For example I'm pretty sure if we do need footnote_regexes, we only need it if xml_harvard was the source.
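
One possible shape for that record is sketched below. It builds on the `html_info` dict that appears in this PR's diff, but the `field` and `needs_footnote_regexes` keys and the `build_html_info` helper are hypothetical, and whether footnote regexes are needed at all is still an open question in this thread.

```python
def build_html_info(text_source):
    """Describe where the case text came from (hypothetical helper)."""
    return {
        "source": "court listener",
        # Name of the opinion field whose content was used, e.g. "xml_harvard".
        "field": text_source,
        # Assumption from the review comment: footnote regexes would only
        # matter when the text came from the Harvard XML.
        "needs_footnote_regexes": text_source == "xml_harvard",
    }


print(build_html_info("xml_harvard"))
print(build_html_info("plain_text"))
```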

jcushman · 2024-06-18T18:02:15Z

web/main/legal_document_sources.py

+            if cluster["filepath_json_harvard"]:
+                harvard_xml_data = ""
+                for sub_opinion in cluster["sub_opinions"]:
+                    opinion = CourtListener.get_opinion_body(sub_opinion)
+                    if opinion["xml_harvard"]:
+                        opinion_xml = opinion["xml_harvard"].replace(
+                            '<?xml version="1.0" encoding="utf-8"?>', ""
+                        )
+                        harvard_xml_data += f"{opinion_xml}\n"
+                case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data)
+            else:
+                opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0])
+                case_html = opinion["html"] if opinion["html"] else opinion["plain_text"]


The logic here wants to be something like "use all xml_harvard if they exist, else all html if they exist, else all plain_text." So what if you do something like this?

Suggested change

-            if cluster["filepath_json_harvard"]:
-                harvard_xml_data = ""
-                for sub_opinion in cluster["sub_opinions"]:
-                    opinion = CourtListener.get_opinion_body(sub_opinion)
-                    if opinion["xml_harvard"]:
-                        opinion_xml = opinion["xml_harvard"].replace(
-                            '<?xml version="1.0" encoding="utf-8"?>', ""
-                        )
-                        harvard_xml_data += f"{opinion_xml}\n"
-                case_html = CourtListener.prepare_case_html(cluster, harvard_xml_data)
-            else:
-                opinion = CourtListener.get_opinion_body(cluster["sub_opinions"][0])
-                case_html = opinion["html"] if opinion["html"] else opinion["plain_text"]
+            for cl_type in ('xml_harvard', 'html', 'plain_text'):
+                case_text = ''.join(sub_opinion[cl_type] for sub_opinion in cluster['sub_opinions'])
+                if case_text:
+                    break
+            else:
+                # failed to find anything ...
+            # do stuff based on cl_type and case_text ...

I think this will also address your question

> I saw that some opinions don't have either of the content fields (xml_harvard, plain_text). Think about what to do in this case. Can we fall back on other html fields? Or disable importing for those documents?

Are there clusters where none of the opinions have any of those fields? I think it might be just that some opinions have some of the fields and others have other fields.

teovin · 2024-06-18

I remember running into a case where none of those fields were populated for an opinion, but darn, it looks like I didn't save its id. Also, there is some info here about the available text fields, but it doesn't say whether any of them will always be populated.

jcushman · 2024-06-20

Here's an example where the existing logic doesn't work: https://www.courtlistener.com/api/rest/v3/clusters/86480/

This has the full opinion text in the first opinion as "html", and then the full opinion text split into three parts in "xml_harvard." Checking type by type instead of opinion by opinion will handle this.

teovin · 2024-06-24

Jack, can you clarify the piece about checking type by type instead of opinion by opinion? To see the content type ('xml_harvard', 'html', 'plain_text' etc.) of an opinion, I still need to query the sub-opinion first, as there is no indicator at the cluster level.

jcushman · 2024-06-24

My big thought here is that it is never correct to assemble multiple types -- op[0]['plain_text'] + op[1]['html'] is always wrong. Logically what we're trying to do is find the chosen type and then glue it together. So I think this algorithm will turn out to be more robust against weird edge cases, and easy to get correct, if the logic is:

1. Fetch all subopinions.
2. From most to least preferred type, concatenate that type together from all subopinions. When you find one that isn't empty when concatenated, break; that's the chosen type.
3. Process the concatenation as appropriate to the type.

That's what I was trying to gesture at with my code sketch, though I skipped the step of pre-fetching all the subopinions.
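
The three steps above can be sketched as a small function over opinions that have already been fetched. This is an illustration under assumptions, not the code that was merged: `choose_case_text` is a hypothetical name, and the sample dicts only mimic the shape of the cluster linked above.

```python
PREFERRED_TYPES = ("xml_harvard", "html", "plain_text")


def choose_case_text(sub_opinions, preferred_types=PREFERRED_TYPES):
    """Return (content_type, text) for the first type that has any content.

    `sub_opinions` is a list of already-fetched opinion dicts.
    Returns (None, "") when every field is empty in every opinion.
    """
    for content_type in preferred_types:
        # Missing or None fields count as empty, so types are never mixed.
        text = "".join(op.get(content_type) or "" for op in sub_opinions)
        if text:
            return content_type, text
    return None, ""


# Hypothetical cluster: full text as html on the first opinion only,
# and the same text split across three xml_harvard fields.
opinions = [
    {"html": "<p>full text</p>", "xml_harvard": "<a/>"},
    {"html": "", "xml_harvard": "<b/>"},
    {"xml_harvard": "<c/>"},
]
print(choose_case_text(opinions))
```

Because the loop runs over types rather than over opinions, adding or reordering a preferred field is a change to the tuple only.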

By the way! I noticed that opinions aren't always sorted in the correct order: https://www.courtlistener.com/api/rest/v3/clusters/3390160/ . From CL slack it sounds like the best thing for now is to sort them by ID.
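
A minimal sketch of sorting by ID, assuming each sub-opinion is referenced by a URL that ends in its numeric ID. The opinion IDs below are made up for illustration, and `opinion_id` is a hypothetical helper.

```python
def opinion_id(url):
    """Extract the numeric ID from a URL whose last path segment is the ID."""
    # rstrip makes this work with or without a trailing slash.
    return int(url.rstrip("/").split("/")[-1])


# Hypothetical sub_opinions list, deliberately out of order.
urls = [
    "https://www.courtlistener.com/api/rest/v3/opinions/1000/",
    "https://www.courtlistener.com/api/rest/v3/opinions/999/",
]
sorted_urls = sorted(urls, key=opinion_id)
print(sorted_urls)
```

Converting to `int` matters: a plain string sort would keep `1000` ahead of `999`.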

teovin · 2024-06-26

I implemented this and also added the sorting. My initial thought was to prevent querying the other opinions if the source was Harvard. Also, I had chosen the opinion with index 0 in the otherwise block because all of the clusters that weren't from Harvard had only 1 opinion (those that I used in testing). But it makes sense to account for the existence of multiples since I didn't check all of the existing clusters. Let me know how it looks now.

jcushman · 2024-06-26

Yeah, avoiding extra queries totally makes sense! I ended up seeing enough variation in the CL API that I like the idea of being more robust to edge cases, as well as having a clean way to add stuff later like "turns out we prefer the html_columbia field but it needs special processing." Thanks for switching it around.

jcushman · 2024-06-18T18:03:21Z

web/main/legal_document_sources.py

+        case_name = ""
+        if cluster["case_name"]:
+            case_name = cluster["case_name"]
+        elif cluster["case_name_full"]:
+            case_name = cluster["case_name_full"][:10000]


Style note, I find this clearer as case_name = cluster["case_name"] or cluster["case_name_full"][:10000]
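
For comparison, here is a sketch of the `or` chain with one extra fallback. It is an assumption-laden variant, not the suggested line itself: `pick_case_name` is a hypothetical helper, it truncates whichever name is chosen, and the trailing `or ""` preserves the original empty-string default when both fields are empty or None.

```python
def pick_case_name(cluster, max_length=10000):
    """Prefer case_name, fall back to case_name_full, then to an empty string."""
    # Without the final `or ""`, slicing would raise TypeError
    # when both fields are None.
    name = cluster.get("case_name") or cluster.get("case_name_full") or ""
    return name[:max_length]


print(pick_case_name({"case_name": "A v. B", "case_name_full": "Alpha versus Beta"}))
print(repr(pick_case_name({"case_name": None, "case_name_full": None})))
```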

jcushman · 2024-06-18T18:06:59Z

This looks great -- I think with updates it'll be good to test on stage.

jcushman · 2024-06-18T18:07:21Z

... but we might want a feature flag since xml conversion isn't ready yet.

teovin · 2024-06-26T13:21:25Z

> We want to render the top part of the head matter ourselves, rather than use the info printed in the book -- that lets us provide more consistent formatting between cases published in different books. Check out cap_header.html for where that's done. My guess is you have to adapt that business logic to also work with CL.
>
> So some fields are hidden because the custom header makes them redundant. I wasn't part of this, but I'm guessing we're hiding other fields like syllabus and parties simply for user preference. As long as we're rendering the same as cases fetched from the CAP API, let's not revisit that decision for now.

I added a template for CourtListener, modeling it after cap_header.html. One change I made to both was to remove the div with legal_doc.get_title: the get_title method didn't exist, so the div wasn't rendering anything.

jcushman

Cool! Just one request to switch the order we check courtlistener case types, then I think we're good to try this on stage.

jcushman · 2024-06-26T13:25:23Z

web/main/legal_document_sources.py

-            )
-            resp.raise_for_status()
+            cluster["html_info"] = {"source": "court listener"}
+            cluster["sub_opinions"].sort(key=lambda url: int(url.split("/")[-2]))


jcushman · 2024-06-26T13:31:29Z

web/main/legal_document_sources.py

+                sub_opinion_jsons.append(CourtListener.get_opinion_body(opinion))
+
+            text_source = ""
+            for content_type in ("xml_harvard", "html", "plain_text"):


Looking at https://www.courtlistener.com/help/api/rest/v3/case-law/#opinion-endpoint , I'm realizing that "html" and "plain_text" are the least preferred fields, and html_with_citations is the most preferred field. What if we use this preference:

First prefer xml_harvard, because we already know how to render it nicely on h2o.

Then fields in the preferred order at that link starting with html_with_citations.

It looks to me like the citations markup [that they add when creating html_with_citations] is harmless and we could just pass it through.

teovin · 2024-06-26

Cool, I just added all text field options in the order CL specifies with the exception of xml_harvard.
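
The resulting preference can be expressed as "CL's order, with xml_harvard moved to the front". The sketch below is illustrative only: `with_first` is a hypothetical helper, and `CL_ORDER` is an abbreviated stand-in that uses only field names mentioned in this thread, so the authoritative list and order should be taken from the linked CL documentation.

```python
def with_first(order, first):
    """Return `order` as a tuple with `first` moved to the front."""
    return (first, *(name for name in order if name != first))


# Abbreviated, assumed order; see the CL docs for the full list.
CL_ORDER = (
    "html_with_citations",
    "html_columbia",
    "xml_harvard",
    "html",
    "plain_text",
)

CONTENT_TYPES = with_first(CL_ORDER, "xml_harvard")
print(CONTENT_TYPES)
```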

jcushman

Awesome, let's try it!

teovin added 13 commits June 4, 2024 10:57

add first draft of CL search

e97acd6

truncate name like other sources

42575a5

map additional fields to legal doc

776bb60

enable advanced search for cl source

a040698

clean up headmatter html

c22eadc

fix tag

1d7be58

feed headmatter into xml as well, add jack's script

041007a

remove placeholder fn

7bc82ff

linting

1a1f7d2

default non existing attrs to none, remove a bad attr

5aaf88c

account for cases without case_name

1cb9e47

use empty string

09d79c0

remove namespace

7c20b0a

teovin marked this pull request as ready for review June 17, 2024 17:51

teovin requested a review from a team as a code owner June 17, 2024 17:51

teovin requested review from rebeccacremona and removed request for a team June 17, 2024 17:51

rebeccacremona requested a review from jcushman June 18, 2024 14:59

rebeccacremona reviewed Jun 18, 2024

update fn name, update import

ef85d09

jcushman requested changes Jun 18, 2024

teovin added 4 commits June 18, 2024 14:43

store html source, use shorthand

4393e2e

add cl template

c26f0a9

grab additional fields to map to cl header template

f522cf2

remove div with unexisting method from template, remove spaces, sort opinions

6562520

update the way case texts are aggregated

6afd2c2

teovin requested a review from jcushman June 26, 2024 13:30

jcushman requested changes Jun 26, 2024

add all case text types to the check

e05079e

jcushman approved these changes Jun 26, 2024

teovin merged commit 3d7b319 into harvard-lil:develop Jun 26, 2024
2 checks passed
