keyphrases in sustained complaints #1

Open
baileyb0t opened this issue Jul 22, 2024 · 4 comments

Comments

@baileyb0t
Contributor

Zac asked last week if we could check whether there are common phrases present in the allegations that contribute to whether an allegation is sustained.

There's an open source Python library, pke, that could be useful for identifying common phrases. I took a pass at it with the TopicRank extractor but found that we may be incorrectly splitting allegations that continue onto the next page, so I'll need to fix that before processing more phrases from the allegations.

@baileyb0t
Contributor Author

We want to ignore field headers that are likely to congest the common string detection, so my thinking was to combine allegations (the cleaned string that follows "SUMMARY OF ALLEGATION(S)") with the text from findings_of_fact.
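A rough sketch of that combining step, using only the standard library. The header string "SUMMARY OF ALLEGATION(S)" comes from the reports; the function names, the optional colon after the header, and the whitespace handling are assumptions for illustration, not the project's actual cleaning code.

```python
import re

# Only the header text itself is taken from the thread; the optional
# colon and case-insensitivity are assumptions about the raw pages.
ALLEGATION_HEADER = re.compile(r"SUMMARY OF ALLEGATION\(S\)\s*:?\s*", re.IGNORECASE)


def clean_allegation(raw: str) -> str:
    """Keep only the text after the allegation header and collapse whitespace."""
    parts = ALLEGATION_HEADER.split(raw, maxsplit=1)
    body = parts[1] if len(parts) == 2 else raw
    return " ".join(body.split())


def combine_text(allegation: str, findings_of_fact: str) -> str:
    """Join the cleaned allegation and the findings into one string."""
    pieces = [clean_allegation(allegation), " ".join(findings_of_fact.split())]
    return " ".join(p for p in pieces if p)
```

Keeping `clean_allegation` separate from `combine_text` means the same cleaning can be reused if we end up running the extractor on allegations alone.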

Both are technically written by the DPA, but the allegations reiterate the complainant's narrative broken out into individual allegations of misconduct. The findings_of_fact discuss the collected evidence and interviews conducted by DPA associated with a given allegation.

I think the findings_of_fact are more likely to be formulaic and contain common phrases (enough that we already have a few indicators set up to detect these phrases, e.g. jlp for "justified, lawful, and proper"), but they may be important to consider when the summary of allegations is extremely brief.
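For reference, an indicator like jlp could look roughly like the sketch below. This is an assumed implementation, not the existing one: the pattern and the function name `has_jlp` are made up here, and the tolerance for a missing Oxford comma or an ampersand is a guess at how the phrase varies across reports.

```python
import re

# Allows an optional Oxford comma, "&" in place of "and", and uneven
# whitespace (including line breaks) between the words.
JLP = re.compile(
    r"\bjustified\s*,\s*lawful\s*,?\s*(?:and|&)\s*proper\b",
    re.IGNORECASE,
)


def has_jlp(findings_of_fact: str) -> bool:
    """True if the findings contain a 'justified, lawful, and proper' variant."""
    return bool(JLP.search(findings_of_fact))
```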

Open to suggestions! Maybe it's more reasonable to use the allegations only, since that's more connected to the language of the original complaint? Zac was looking for ways to improve the language of submitted complaints based on what has historically received more traction in terms of sustained allegations.

@baileyb0t
Contributor Author

May also be worth noting here that when allegations are added by the DPA (or OCC, since we also have those older records), these tended to be sustained more often due to the nature of how they come to exist.

It might be best to exclude these if our interest is in features of the original complaint/allegation language, though I'm not sure, so I'll process them all together for now.
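One quick way to check how much the added allegations differ before deciding whether to exclude them is to compare sustained rates by added status. A minimal sketch, assuming each record is a dict with the dpa_added / occ_added flags and a boolean `sustained` field (the `sustained` name is an assumption):

```python
from collections import defaultdict


def sustained_rate_by_added(rows):
    """Return {agency_added: share of allegations sustained}."""
    counts = defaultdict(lambda: [0, 0])  # flag -> [sustained, total]
    for row in rows:
        # Treat an allegation as agency-added if either flag is set.
        added = bool(row.get("dpa_added") or row.get("occ_added"))
        counts[added][0] += int(bool(row["sustained"]))
        counts[added][1] += 1
    return {flag: s / n for flag, (s, n) in counts.items()}
```

The result maps `True` (added by DPA or OCC) and `False` (original) to a rate, so the same records can later be filtered on that flag if we do decide to drop the added ones.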

@tarakc02

My initial thought is we should treat allegations and findings of fact separately rather than combining them. Although both are written by DPA, for strategizing how to word complaints we submit in the future, we'll want to focus on the language in allegations. That said, the findings might contain language that points to specific ways of presenting important evidence, which we could adopt when we submit complaints. I just think it's worth trying with only the allegations for this purpose (we can also set up the combined version, since the code should not be much different).

Ideally, we structure this as a classification problem, where we classify whether the complaint is sustained or not, using extracted keyphrases as features along with: type of alleged misconduct, OCC vs. DPA, and whether the complaint was original or added by the agency. Those are the three major features that I imagine we'd want to be able to "control" for in some way (maybe there are others?). From there we can examine variable importance for specific keyphrases, and given your observation here, we should stratify those by whether the complaint was added or not (and perhaps focus on DPA vs OCC). In addition to auto-extracted keyphrases, let's make sure we include those that Zac has already come up with as features.
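To make the setup concrete, here is a minimal sketch of the feature rows and a stratified descriptive check. It is not a classifier and not a measure of variable importance. The field names `allegation_text` and `sustained` and both function names are assumptions; the three control fields are the ones described above.

```python
def build_features(row, keyphrases):
    """One feature dict per allegation: controls plus keyphrase indicators."""
    text = row["allegation_text"].lower()
    features = {
        "category_of_conduct": row["category_of_conduct"],
        "report_type": row["report_type"],
        "agency_added": bool(row.get("dpa_added") or row.get("occ_added")),
    }
    for phrase in keyphrases:
        # Plain substring match; stemming or fuzzy matching is left out.
        features[f"kp:{phrase}"] = phrase.lower() in text
    return features


def stratified_rates(rows, phrase):
    """Sustained rate keyed by (agency_added, has_phrase)."""
    tallies = {}  # (agency_added, has_phrase) -> [sustained, total]
    for row in rows:
        feats = build_features(row, [phrase])
        key = (feats["agency_added"], feats[f"kp:{phrase}"])
        tally = tallies.setdefault(key, [0, 0])
        tally[0] += int(bool(row["sustained"]))
        tally[1] += 1
    return {key: s / n for key, (s, n) in tallies.items()}
```

Feature dicts in this shape could then be handed to something like scikit-learn's `DictVectorizer` ahead of whichever classifier we settle on. The stratified rates are only a sanity check to read alongside the model's importances.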

@baileyb0t
Contributor Author

That makes sense, I'll work with just the allegations text for now.

I did remember the goal to make it a classification problem and agree that category_of_conduct, report_type, and dpa_added|occ_added plus the extracted keyphrases should be the starting set of features.

I'm fiddling with the model parameters to make sure the extracted keyphrases are useful (right now the top two results are consistently "officer" and "complainant"). I'll also go back and confirm which of our existing phrase indicators were suggested by Zac so we can pull those in.
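Independent of the extractor's own parameters, one option for phrases like "officer" and "complainant" is to drop candidates that show up in nearly every allegation. A minimal sketch, assuming `candidates_per_doc` holds the list of phrases returned for each allegation; the function name and the 0.8 threshold are arbitrary placeholders.

```python
from collections import Counter


def drop_ubiquitous(candidates_per_doc, max_doc_share=0.8):
    """Drop phrases found in more than max_doc_share of the documents."""
    n_docs = len(candidates_per_doc)
    doc_freq = Counter()
    for candidates in candidates_per_doc:
        # Count each phrase once per document, case-insensitively.
        doc_freq.update({c.lower() for c in candidates})
    too_common = {c for c, k in doc_freq.items() if k / n_docs > max_doc_share}
    kept = [
        [c for c in candidates if c.lower() not in too_common]
        for candidates in candidates_per_doc
    ]
    return kept, too_common
```

Returning `too_common` alongside the filtered lists makes it easy to eyeball what was dropped and adjust the threshold.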

Thanks for the feedback!

baileyb0t added a commit that referenced this issue Jul 23, 2024
2 participants