conformed cleaner to new parser json structure + unit test updates #163

nicolassaw · 2024-11-02T21:42:56Z

After the separate updates to the parser and the cleaner, I updated the cleaner so that it correctly accesses fields in the new JSON structure produced by the parser module.

I also updated some unit tests to match the new structure, consolidated the test and mock files, and fixed mapping issues in some unit tests. I commented out certain unit tests that were too long or needed configuration.

newswim · 2024-11-03T20:18:01Z

resources/test_files/cleaned_test_json/test_123456.json

+    "Case Metadata": {
+        "county": "hays"
+    },
+    "Defendant Information": {


🤔 we should use snake_case on these keys.
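A minimal sketch of what converting these keys could look like, assuming every key is a string. The helper names `to_snake_case` and `snake_case_keys` are hypothetical and not part of the cleaner module.

```python
import re


def to_snake_case(key: str) -> str:
    """Convert a key such as 'Case Metadata' to 'case_metadata'."""
    # Collapse each run of non-alphanumeric characters into one underscore,
    # then drop leading/trailing underscores and lowercase the result.
    return re.sub(r"[^0-9a-zA-Z]+", "_", key).strip("_").lower()


def snake_case_keys(data):
    """Return a copy of `data` with every dict key converted to snake_case."""
    if isinstance(data, dict):
        return {to_snake_case(k): snake_case_keys(v) for k, v in data.items()}
    if isinstance(data, list):
        return [snake_case_keys(item) for item in data]
    return data


record = {"Case Metadata": {"county": "hays"}, "Defendant Information": {}}
converted = snake_case_keys(record)
print(converted)
```

Keys that are already snake_case, such as `earliest_charge_date`, pass through unchanged, so the conversion could be applied to a whole record at once.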

newswim · 2024-11-04T02:31:53Z

src/cleaner/__init__.py

+        if isinstance(data, dict):
+            # Remove 'judicial officer' if it exists in this dictionary
+            if "judicial officer" in data:
+                del data["judicial officer"]


🔧 we should add a doc comment for this method. I can't recall why we need to delete the JO information.
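One possible shape for the documented method, as a sketch only. The function name `remove_judicial_officer`, the recursion into nested values, and the docstring wording are assumptions. The reason for dropping the field is not stated in this thread, so that line is left for the author to fill in.

```python
def remove_judicial_officer(data):
    """Recursively delete every 'judicial officer' key from `data`, in place.

    Rationale: (to be filled in by the author; not stated in the review).

    Dicts and lists are walked at any depth; other values are left untouched.
    Returns the same object for convenience.
    """
    if isinstance(data, dict):
        # pop() with a default avoids a separate membership check.
        data.pop("judicial officer", None)
        for value in data.values():
            remove_judicial_officer(value)
    elif isinstance(data, list):
        for item in data:
            remove_judicial_officer(item)
    return data


sample = {
    "judicial officer": "A",
    "events": [{"judicial officer": "B", "date": "2020-01-01"}],
    "nested": {"judicial officer": "C"},
}
cleaned = remove_judicial_officer(sample)
```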

newswim · 2024-11-04T02:32:41Z

src/cleaner/__init__.py

            "parsing_date": dt.datetime.today().strftime("%Y-%m-%d"),
+            "html_hash": input_dict["html_hash"],
+            "Case Metadata": {


🔧 just a reminder here to update the casing of these keys.

newswim · 2024-11-04T02:35:09Z

src/cleaner/__init__.py

+            "Charge Information": [],
+            "Case Details": {
+                "earliest_charge_date": "",
+                "has_evidence_of_representation": False,


I wonder if we should just omit case details and charge information from this initializing record, then append the values later. I wonder if there's a chance to get accidental false negatives here.
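A rough sketch of the "append later" idea, assuming charge dates are ISO-formatted strings. The helpers `new_record` and `attach_case_details` are hypothetical; only the key names come from the diff above.

```python
def new_record(html_hash: str) -> dict:
    """Start a record with only the fields known up front."""
    return {"html_hash": html_hash}


def attach_case_details(record: dict, charge_dates: list, has_representation: bool) -> dict:
    """Add 'Case Details' only once the values have actually been computed."""
    record["Case Details"] = {
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        "earliest_charge_date": min(charge_dates) if charge_dates else "",
        "has_evidence_of_representation": has_representation,
    }
    return record


record = new_record("abc123")
# Before processing, the key is absent, so "not computed yet" cannot be
# mistaken for a genuine False.
details_present_before = "Case Details" in record

attach_case_details(record, ["2020-03-01", "2019-11-15"], True)
```

With this layout, a missing `"Case Details"` key means the step never ran, while `False` means it ran and found no evidence of representation.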

newswim · 2024-11-04T02:36:39Z

src/parser/hays.py

-                    elif SECTION == "disposition":
+                print(f'printing row: {row}')
+                if len(row) >= 2:
+                    if row[1] in ["Disposition", "Disposition:", "Amended Disposition"]:


Should the colon be there in "Disposition:"?
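If both spellings do occur in the source HTML, one alternative is to normalize the label before comparing, so the list does not need a separate entry for each variant. This is a sketch, assuming `row[1]` is a string; `is_disposition_row` and `DISPOSITION_LABELS` are hypothetical names. Note that it is slightly more permissive than the original check: it also ignores case and accepts "Amended Disposition:".

```python
DISPOSITION_LABELS = {"disposition", "amended disposition"}


def is_disposition_row(row) -> bool:
    """Return True when the second cell is a disposition label."""
    if len(row) < 2:
        return False
    # Drop surrounding whitespace and any trailing colon, then lowercase.
    label = row[1].strip().rstrip(":").strip().lower()
    return label in DISPOSITION_LABELS


rows = [
    ["", "Disposition"],
    ["", "Disposition:"],
    ["", "Amended Disposition"],
    ["", "Plea"],
    ["only one cell"],
]
matches = [is_disposition_row(row) for row in rows]
```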

newswim · 2024-11-04T02:44:48Z

src/tester/test_unittest.py

@@ -8,11 +8,16 @@
 import tempfile
 from bs4 import BeautifulSoup

+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..','..')))


I was curious about these lines and ran some questions through Claude.ai -- it suggested we use the pythonpath setting in pyproject.toml to accomplish the same thing (modifying the way Python resolves module paths).

It seems like there are also a few other ways (like using setup.py and a config file for our test suite).

I assume this is fine for now, but we should try to figure out the discrepancies between how these modules are getting resolved, because right now it seems to be OS-dependent.

newswim · 2024-11-04T02:47:30Z

src/tester/test_unittest.py

-
-    @patch("os.listdir", return_value=["case1.json", "case2.json"])
+    def test_process_single_case(self):
+        county = "hays"


Seems good that we were able to avoid the @patch-ing happening for this test 👍

newswim · 2024-11-04T02:50:58Z

src/updater/__init__.py

+
+    def get_database_container(self):
+        #This loads the environment for interacting with CosmosDB #Dan: Should this be moved to the .env file?
+        load_dotenv()


Hm, just looking at the docs and it seems like you may want to use dotenv_values(".env") here instead of loading the env, just to make this a little more isolated.

https://pypi.org/project/python-dotenv/#:~:text=Other%20Use%20Cases-,Load%20configuration%20without%20altering%20the%20environment,-The%20function%20dotenv_values

So the following lines could read:

env = dotenv_values(".env")  # I assume `env` isn't a reserved word
URL = env["URL"]
KEY = env["KEY"]
DATA_BASE_NAME = env["DATA_BASE_NAME"]
CONTAINER_NAME_CLEANED = env["CONTAINER_NAME_CLEANED"]

You could also just move load_dotenv() to the top of this module, I think.
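A small sketch of how the isolated-settings approach could be wrapped, under stated assumptions. `dotenv_values(".env")` from python-dotenv returns a plain mapping of names to values without touching `os.environ`, so the helper below takes any such mapping. The helper `read_cosmos_settings` and the example values are hypothetical; the four setting names come from the comment above.

```python
from typing import Mapping, Optional

REQUIRED_KEYS = ("URL", "KEY", "DATA_BASE_NAME", "CONTAINER_NAME_CLEANED")


def read_cosmos_settings(env: Mapping[str, Optional[str]]) -> dict:
    """Pick the required settings out of `env`, failing loudly if any is unset."""
    # A name counts as missing when it is absent, None, or an empty string.
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_KEYS}


# In the updater, the mapping would come from: dotenv_values(".env")
example_env = {
    "URL": "https://example.invalid",
    "KEY": "placeholder-key",
    "DATA_BASE_NAME": "example-db",
    "CONTAINER_NAME_CLEANED": "example-container",
}
settings = read_cosmos_settings(example_env)
```

Passing the mapping in as an argument also means unit tests can supply a plain dict instead of patching environment variables.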

nicolassaw · 2024-11-09T21:58:59Z

I agree across the board on these updates. Thanks for reading through and giving comments. I'm going to leave a note to come back to this PR and make tickets for these.

conformed cleaner to new parser json structure

d068863

nicolassaw requested review from Matt343 and newswim November 2, 2024 21:43

newswim approved these changes Nov 4, 2024

nicolassaw merged commit cc08b6a into main Nov 9, 2024
2 checks passed

nicolassaw deleted the fix-cleaner-to-parser-mismatch branch November 9, 2024 21:59

nicolassaw mentioned this pull request Nov 9, 2024

Create tickets to implement the ideas in #163 #166

Open
