keyphrases in sustained complaints #1

baileyb0t · 2024-07-22T16:54:41Z

Zac asked last week whether we could check for common phrases in the allegations that contribute to whether an allegation is sustained.

There's an open-source Python library, pke, that could be useful for identifying common phrases. I took a pass at it with the TopicRank extractor, but found that we may be incorrectly separating allegations that trail onto the next page. I'll need to fix that issue before continuing to process phrases from the allegations.
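For reference, a minimal sketch of what a TopicRank pass with pke looks like, following the usage pattern in the pke README. The function name `top_keyphrases` is hypothetical, it assumes pke and a spaCy English model are installed, and it is not necessarily the exact code used here.

```python
def top_keyphrases(text, n=10):
    """Return up to n (phrase, score) pairs ranked by TopicRank."""
    # Imported lazily so this module still loads where pke is not installed.
    import pke

    extractor = pke.unsupervised.TopicRank()
    # Load the raw text; pke runs its spaCy preprocessing here.
    extractor.load_document(input=text, language="en")
    # Select candidate phrases (by default, noun/adjective sequences).
    extractor.candidate_selection()
    # Cluster candidates into topics and rank the topics.
    extractor.candidate_weighting()
    return extractor.get_n_best(n=n)
```

Calling `top_keyphrases(allegation_text)` on one allegation string would return the ranked phrases for that allegation; running it per allegation (rather than per page) only makes sense once the page-break splitting is fixed.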

baileyb0t · 2024-07-22T17:10:29Z

We want to ignore field headers that are likely to congest the common string detection, so my thinking was to combine allegations (the cleaned string that follows "SUMMARY OF ALLEGATION(S)") with the text from findings_of_fact.
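A rough sketch of that combination step, using only the standard library. The function names and the sample strings are hypothetical, and it assumes the input has already been trimmed so that nothing but the allegation narrative follows the header.

```python
import re

# Matches the field header named above, with an optional trailing colon.
HEADER = re.compile(r"SUMMARY OF ALLEGATION\(S\)\s*:?\s*", re.IGNORECASE)


def allegation_text(page_text):
    """Return the text that follows the allegation header, or '' if absent."""
    match = HEADER.search(page_text)
    if match is None:
        return ""
    # Collapse whitespace so line breaks from the PDF do not split phrases.
    return " ".join(page_text[match.end():].split())


def combined_text(page_text, findings_of_fact):
    """Join the cleaned allegation text with the findings_of_fact text."""
    parts = [allegation_text(page_text), " ".join(findings_of_fact.split())]
    return " ".join(part for part in parts if part)


# Made-up sample text, for illustration only.
sample = "DPA COMPLAINT\nSUMMARY OF ALLEGATION(S): The officer\nfailed to take a report."
print(allegation_text(sample))
print(combined_text(sample, "The officer  denied it."))
```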

Both are technically written by the DPA, but the allegations reiterate the complainant's narrative broken out into individual allegations of misconduct. The findings_of_fact discuss the collected evidence and interviews conducted by DPA associated with a given allegation.

I think the findings_of_fact are more likely to be formulaic and contain common phrases (enough that we already have a few indicators set up to detect these phrases, e.g. jlp for "justified, lawful, and proper"), but they may be important to consider when the summary of allegations is extremely brief.

Open to suggestions! Maybe it's more reasonable to use only the allegations, since they're more closely connected to the language of the original complaint? Zac was looking for ways to improve the language of submitted complaints based on what has historically received more traction in terms of sustained allegations.

baileyb0t · 2024-07-22T17:13:13Z

May also be worth noting here that when allegations are added by the DPA (or OCC, since we also have those older records), they tend to be sustained more often because of how they come to exist.

It might be best to exclude these if our interest is in features of the original complaint/allegation language, though I'm not sure and will process them all together for now.

tarakc02 · 2024-07-22T17:31:05Z

My initial thought is we should treat allegations and findings of fact separately rather than combining them. Although both are written by DPA, in terms of strategizing how to word complaints we submit in the future, we'll want to focus on language in allegations. That said, it might be the case that there is language in the findings that points to specific ways of presenting important evidence that we can adopt when we submit complaints. I just think it's worth trying it with just the allegations for this purpose (in addition to the combined version, which we can also set up; the code should not be much different).

Ideally, we structure this as a classification problem, where we classify whether the complaint is sustained or not, and we use extracted keyphrases as features, along with: type of alleged misconduct, OCC vs. DPA, and whether the complaint was original or added by the agency. Those are the three major features that I imagine we'd want to be able to "control" for in some way (maybe there are others?). From there we can examine variable importance for specific keyphrases, and given your observation here, we should stratify those by whether the complaint was added or not (and perhaps focus on DPA vs. OCC). In addition to auto-extracted keyphrases, make sure we include those that Zac has already come up with as features.
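To make the setup concrete, here is a small standard-library sketch of the feature rows, plus a stratified sustained-rate check. The second function is a simple descriptive tabulation, not a classifier or a variable-importance measure. All field names (`allegation_text`, `misconduct_type`, `agency`, `agency_added`, `sustained`) and the records are made up for illustration.

```python
def feature_row(record, keyphrases):
    """Turn one allegation record into a flat dict of model features."""
    text = record["allegation_text"].lower()
    row = {
        # The three controls, kept as categorical/boolean values.
        "misconduct_type": record["misconduct_type"],
        "agency": record["agency"],              # "OCC" or "DPA"
        "agency_added": record["agency_added"],  # True if added by the agency
    }
    # One boolean indicator per keyphrase (auto-extracted or hand-picked).
    for phrase in keyphrases:
        row[f"kp:{phrase}"] = phrase.lower() in text
    return row


def sustained_rate_by_stratum(records, phrase):
    """Share sustained among allegations containing phrase, split by agency_added."""
    counts = {}
    for rec in records:
        if phrase.lower() not in rec["allegation_text"].lower():
            continue
        total, sustained = counts.get(rec["agency_added"], (0, 0))
        counts[rec["agency_added"]] = (total + 1, sustained + int(rec["sustained"]))
    return {added: s / t for added, (t, s) in counts.items()}


# Hypothetical records, not real data.
records = [
    {"allegation_text": "The officer failed to activate a body-worn camera.",
     "misconduct_type": "Neglect of Duty", "agency": "DPA",
     "agency_added": True, "sustained": True},
    {"allegation_text": "The officer failed to take a report.",
     "misconduct_type": "Neglect of Duty", "agency": "DPA",
     "agency_added": False, "sustained": False},
    {"allegation_text": "The officer spoke in a rude manner.",
     "misconduct_type": "Conduct Unbecoming", "agency": "OCC",
     "agency_added": False, "sustained": False},
    {"allegation_text": "The officer failed to write an incident report.",
     "misconduct_type": "Neglect of Duty", "agency": "OCC",
     "agency_added": False, "sustained": True},
]

print(feature_row(records[0], ["failed to", "rude"]))
print(sustained_rate_by_stratum(records, "failed to"))
```

Rows in this dict form can be handed to something like scikit-learn's `DictVectorizer` before fitting a model, though any encoding that keeps the controls alongside the keyphrase indicators would work.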

baileyb0t · 2024-07-22T17:58:27Z

That makes sense, I'll work with just the allegations text for now.

I did remember the goal to make it a classification problem and agree that category_of_conduct, report_type, and dpa_added|occ_added plus the extracted keyphrases should be the starting set of features.

I'm fiddling with the model parameters to make sure the extracted keyphrases are useful (the top two results are consistently "officer" and "complainant"), and I'll go back to confirm which of our existing phrase indicators were suggested by Zac so we can pull those in.
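One option, separate from tuning the extractor itself, is to post-filter its output so that phrases made up entirely of generic terms are dropped. A minimal sketch, where `GENERIC_TERMS`, `drop_generic`, and the sample scores are all hypothetical:

```python
# Terms that show up in nearly every allegation and carry little signal.
GENERIC_TERMS = {"officer", "officers", "complainant", "complainants"}


def drop_generic(keyphrases, generic=GENERIC_TERMS):
    """Keep (phrase, score) pairs that contain at least one non-generic word."""
    kept = []
    for phrase, score in keyphrases:
        words = phrase.lower().split()
        if any(word not in generic for word in words):
            kept.append((phrase, score))
    return kept


# Made-up extractor output, for illustration only.
extracted = [
    ("officer", 0.21),
    ("complainant", 0.19),
    ("body-worn camera", 0.08),
    ("officer complainant", 0.05),
]
print(drop_generic(extracted))
```

This keeps the extractor's original scores, so the remaining phrases stay in their ranked order.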

Thanks for the feedback!

baileyb0t mentioned this issue Jul 22, 2024

Allegations continued on the next page sometimes appear as separate allegations #2

Open

baileyb0t added a commit that referenced this issue Jul 23, 2024

adding initial test of pke for keyphrase detection from `allegation…

790201f

…s` data; progress on #1
