SQLite Utils: clean_like_nulls and More! #203

d33bs · 2022-06-06T20:56:43Z

This change adds a new series of utilities under pycytominer/cyto_utils/sqlite.py for detecting the need for, and making, changes to SQLite databases used by this project. The work seeks to improve performance, especially around the memory resource challenges originally cited in #195 and further investigated as part of #198. Minimal logging has been added to the utilities to assist those making use of the functions, and also as work towards #5.

Special notes:

Python >= 3.7 is required due to the use of sqlite3.Connection.backup (a safety measure to avoid database corruption during changes within update_columns_to_nullable).
The functions within are provided to help users investigate SQLite source databases. They are intended to be used optionally within other classes and methods, or on their own.
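
For readers unfamiliar with sqlite3.Connection.backup, here is a minimal sketch of the kind of safety copy described above. The helper name `backup_database`, the file paths, and the table are hypothetical and are not taken from this PR; the actual handling inside update_columns_to_nullable may differ.

```python
import os
import sqlite3
import tempfile


def backup_database(source_path: str, backup_path: str) -> None:
    """Copy a SQLite database to backup_path before modifying the source.

    Hypothetical helper for illustration only.
    """
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        # Connection.backup was added in Python 3.7
        source.backup(target)
    finally:
        target.close()
        source.close()


# illustrative usage with throwaway files
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "example.sqlite")
backup_path = os.path.join(workdir, "example_backup.sqlite")

conn = sqlite3.connect(source_path)
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER NOT NULL)")
conn.execute("INSERT INTO tbl_a VALUES (1)")
conn.commit()
conn.close()

backup_database(source_path, backup_path)

# the backup holds the same rows as the source
check = sqlite3.connect(backup_path)
backup_rows = check.execute("SELECT col_integer FROM tbl_a").fetchall()
check.close()
```

With a copy in place, a failed schema change can be recovered by restoring the backup file rather than repairing the original.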

Open questions:

Will moving to SQLite NULL and the related numpy.nan within subsequent Pandas dataframes on read present any challenges (or other opportunities) for using data elsewhere within this library? Would we need any additional changes as a result?
Would it be beneficial to add clean_like_nulls as a "pre-step" to functions like merge_single_cells?

Thank you for any feedback, suggestions, and thoughts you may have!

What is the nature of your change?

Bug fix (fixes an issue).
Enhancement (adds functionality).
Breaking change (fix or feature that would cause existing functionality to not work as expected).
This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

I have read the CONTRIBUTING.md guidelines.
My code follows the style guidelines of this project.
I have performed a self-review of my own code.
I have commented my code, particularly in hard-to-understand areas.
I have made corresponding changes to the documentation.
My changes generate no new warnings.
New and existing unit tests pass locally with my changes.
I have added tests that prove my fix is effective or that my feature works.
I have deleted all non-relevant text in this pull request template.

work towards #198

gwaybio

This is a big contribution, thanks @d33bs ! I've made several inline comments and discussion items.

A couple general comments:

We typically use f-strings where possible in Python >= 3.7. I do think some of the SQL statement .format calls are super clean, so you may decide not to update all instances. I made some direct suggestions.
I understood nearly all of this, with the exception of some of the advanced SQL statements (given your extensive testing, I don't think you need to get these reviewed by a SQL expert). One area where I think understanding could substantially increase is in adding some brief comments about the format of some of the variables you're indexing into. I made some in-line comments in this regard.

Thanks again!

gwaybio · 2022-06-07T16:31:33Z

pycytominer/cyto_utils/sqlite.py

+from sqlalchemy import create_engine
+from sqlalchemy.engine.base import Engine
+
+# pylint: disable=consider-using-f-string


Please delete comments of this sort


gwaybio · 2022-06-07T16:49:01Z

pycytominer/cyto_utils/sqlite.py

+                sql_stmt = """
+                SELECT :table_name, name, type, [notnull] 
+                FROM pragma_table_info(:table_name);
+                """
+            else:
+                # otherwise we will focus on only the column name provided
+                sql_stmt = """
+                SELECT :table_name, name, type, [notnull]
+                FROM pragma_table_info(:table_name) WHERE name = :col_name;
+                """


Perhaps you can reduce redundancy in sql_stmt by defining the common text and using f-strings to append what's different within the if statement.

Great point, I'll make a change to this effect.
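
A rough sketch of what that refactor could look like, using the standard library sqlite3 module rather than the SQLAlchemy connection used in the PR. The function name `collect_column_info` and the example table are hypothetical.

```python
import sqlite3
from typing import Optional


def collect_column_info(
    connection: sqlite3.Connection,
    table_name: str,
    column_name: Optional[str] = None,
) -> list:
    """Return (table, name, type, notnull) rows for one or all columns.

    Hypothetical helper for illustration only.
    """
    # common portion of the statement shared by both cases
    sql_stmt = """
    SELECT :table_name AS table_name, name, type, [notnull]
    FROM pragma_table_info(:table_name)
    """
    params = {"table_name": table_name}

    if column_name is not None:
        # append only the part which differs
        sql_stmt = f"{sql_stmt} WHERE name = :col_name"
        params["col_name"] = column_name

    return connection.execute(sql_stmt, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER NOT NULL, col_text TEXT)")

all_columns = collect_column_info(conn, "tbl_a")
one_column = collect_column_info(conn, "tbl_a", "col_text")
```

Only the filter is interpolated by the f-string; the table and column names are still passed as bound parameters.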

gwaybio · 2022-06-07T16:49:39Z

pycytominer/cyto_utils/sqlite.py

+            # append to column list the results
+            column_list += connection.execute(
+                sql_stmt,
+                {"table_name": str(table[0]), "col_name": str(column_name)},


why is table zero indexed?

oh I see, line 128 shows a list of tuples

On further reading, we can reference SQLAlchemy's Row object by key name here (in addition to other areas). I'll make a change to increase clarity here. Thank you for calling this forward.
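
To illustrate the difference in readability, here is a small sketch using sqlite3.Row from the standard library as a stand-in for SQLAlchemy's Row object (an assumption made to keep the example dependency-free; the table name is hypothetical).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sqlite3.Row allows access by column name as well as by position
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER)")

tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# positional access: the reader has to know what index 0 holds
names_by_index = [str(table[0]) for table in tables]

# keyed access: the column being read is stated explicitly
names_by_key = [str(table["name"]) for table in tables]
```

Both forms return the same values; the keyed form documents itself.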

gwaybio · 2022-06-07T16:50:54Z

pycytominer/cyto_utils/sqlite.py

+    return column_list
+
+
+def contains_conflicting_aff_strg_class(


does strg mean string? If so, is this SQL convention? Otherwise, I think updating the function name to str might be less confusing

I intended strg as a shortened version of storage (stated more fully, something like: contains_conflicting_column_affinity_type_and_data_storage_class), trying to reference the SQLite docs' phrasing. Storage class is, I feel, roughly equivalent to "value actual type" (as opposed to "column preferred type"), but we may lose understandability if we use those words 🙂. What do you think about expanding strg to the full word, making this: contains_conflicting_aff_storage_class?

yes! I think contains_conflicting_aff_storage_class is much better. Please update throughout
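
For context on the affinity versus storage class distinction, a minimal sketch using SQLite's typeof() function. The table and values are made up for illustration, and this is not the query used by contains_conflicting_aff_storage_class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# the declared column types set the column affinity (the "preferred" type)
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER, col_real REAL)")

# 'nan' cannot be converted to an integer, so SQLite stores it as text
conn.execute("INSERT INTO tbl_a VALUES ('nan', 0.5)")
conn.execute("INSERT INTO tbl_a VALUES (1, 0.25)")

# typeof() reports the storage class of each stored value
storage_classes = conn.execute(
    "SELECT typeof(col_integer), typeof(col_real) FROM tbl_a ORDER BY rowid"
).fetchall()

# count values whose storage class differs from the column affinity
# (note: NULL values report 'null' and would also be counted here)
conflicts = conn.execute(
    "SELECT COUNT(*) FROM tbl_a WHERE typeof(col_integer) != 'integer'"
).fetchone()[0]
```

The first row shows the conflict: an INTEGER column holding a value with the text storage class.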

gwaybio · 2022-06-07T17:31:05Z

pycytominer/cyto_utils/sqlite.py

+        # if we have a table name provided, target only that table for the modifications
+        sql_stmt = (
+            "SELECT name, sql FROM sqlite_master "
+            "WHERE type = 'table' and UPPER(name) = UPPER(:table_name)"


Should this be a formatted string? (table_name)

Where I could, I tried to use SQLAlchemy execute arguments as a best practice when it comes to avoiding SQL injection (as per Bandit: B608). This proved to be challenging using variables in the way we need to for this work (column or table names), so I ended up intermixing formatted strings as well. I'm uncertain of the exact risk/attack vector in this case.

Do you feel we should move towards formatted strings for uniformity?
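
One possible middle ground, sketched here with the standard library sqlite3 module: bind values as parameters, and interpolate identifiers only after confirming them against the schema. The helper name `count_rows` and the table are hypothetical; this is one option under those assumptions, not the approach taken in the PR.

```python
import sqlite3


def count_rows(connection: sqlite3.Connection, table_name: str) -> int:
    """Count rows in a table whose name is checked against the schema.

    Hypothetical helper for illustration only.
    """
    # values (here, the name being looked up) can be bound as parameters
    found = connection.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND UPPER(name) = UPPER(:table_name)",
        {"table_name": table_name},
    ).fetchone()
    if found is None:
        raise ValueError(f"unknown table: {table_name}")

    # identifiers cannot be bound, so use the name as stored in the schema,
    # quoted, when formatting the statement
    quoted = '"' + found[0].replace('"', '""') + '"'
    return connection.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER)")
conn.executemany("INSERT INTO tbl_a VALUES (?)", [(1,), (2,)])

row_count = count_rows(conn, "TBL_A")
```

A name which does not match an existing table never reaches the formatted statement, which narrows the injection surface even though a formatted string is still used.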

gwaybio · 2022-06-07T17:35:25Z

pycytominer/cyto_utils/sqlite.py

+    sqlalchemy.engine.base.Engine
+        A SQLAlchemy engine for the changed database
+    """
+    logger.info("Updating column values with str 'nan' to NULL values.")


Suggested change

logger.info("Updating column values with str 'nan' to NULL values.")

logger.info(f"Updating column values with str {LIKE_NULLS} to NULL values.")

True?

Yes, thank you. I'll add this change in as a lazy-formatted string logger.info("... %s ...", like_nulls) (as suggested by Pylint: W1201).
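
A short sketch of the lazy-formatted form, where the logging module performs the substitution only if the record is actually emitted. The LIKE_NULLS values, the logger name, and the list-collecting handler below are assumptions for illustration.

```python
import logging

# assumed values for illustration; the project's LIKE_NULLS may differ
LIKE_NULLS = ("null", "none", "nan")


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("sqlite_example")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

# the format string and its argument are passed separately
logger.info("Updating column values with str %s to NULL values.", LIKE_NULLS)

# below the configured level: the message is never formatted
logger.debug("Not emitted: %s", LIKE_NULLS)
```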

gwaybio · 2022-06-07T17:38:57Z

pycytominer/tests/test_cyto_utils/test_sqlite.py

+    engine = create_engine(sql_path)
+
+    # statements for creating database with simple structure
+    create_stmts = [


can you use black on this? It looks a bit wonky to me

Sure thing - thank you for calling this out. I'll change this around in the next commit. black, multi-line strings, and lists don't seem to get along well.

gwaybio · 2022-06-07T17:39:17Z

pycytominer/tests/test_cyto_utils/test_sqlite.py

+    """,
+    ]
+
+    insert_vals = [1, "sample", b"sample_blob", 0.5]


what is b?

This notation is used for bytes literals. It turned out to be a simple way to insert a blob into the test database tables and retain the blob storage class (or type) without needing to create or load a file object or similar. Inserting a string without the b notation results in a python string being inserted as a text storage class value. While this isn't crucial for testing with the current set, I felt it might be helpful to have in the future.
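
A minimal sketch of that difference, using a throwaway in-memory table (the names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_a (col_blob BLOB)")

# a bytes literal is stored with the blob storage class
conn.execute("INSERT INTO tbl_a VALUES (?)", (b"sample_blob",))
# a plain Python string is stored with the text storage class
conn.execute("INSERT INTO tbl_a VALUES (?)", ("sample_blob",))

blob_types = [
    row[0]
    for row in conn.execute("SELECT typeof(col_blob) FROM tbl_a ORDER BY rowid")
]
```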

gwaybio · 2022-06-07T17:41:27Z

pycytominer/tests/test_cyto_utils/test_sqlite.py

+        connection.execute(
+            """
+        INSERT INTO tbl_a (col_integer, col_text, col_blob, col_real)
+        VALUES ('nan', 'nan', 'example', 0.5);


Do you think it is worth testing values other than nan - others in LIKE_NULLS i mean. it might not be!

oh i see, maybe you're doing it in line 205

Definitely worth checking - with contains_conflicting_aff_strg_class we're checking only for mismatched storage class for the values (vs what the column affinity / preference is). We then go on to check for the like_nulls strings with contains_str_like_null in another set of tests. Line 210 seeks to check each column containing various versions of the like_nulls.
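
As a rough illustration of the second kind of check, the sketch below counts string values which look like nulls and then replaces them with SQLite NULL. The LIKE_NULLS values, the `count_like_nulls` helper, and the table are assumptions; this is not the implementation of contains_str_like_null or clean_like_nulls.

```python
import sqlite3

# assumed values for illustration; the project's LIKE_NULLS may differ
LIKE_NULLS = ("null", "none", "nan")

conn = sqlite3.connect(":memory:")
# columns are nullable here; NOT NULL columns would need altering first
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER, col_text TEXT)")
conn.executemany(
    "INSERT INTO tbl_a VALUES (?, ?)",
    [(1, "sample"), ("nan", "None"), ("NULL", "nan")],
)

placeholders = ", ".join("?" for _ in LIKE_NULLS)


def count_like_nulls(column: str) -> int:
    """Count values in a column matching LIKE_NULLS, ignoring case."""
    # column names are hard-coded below, so interpolation is safe here
    return conn.execute(
        f'SELECT COUNT(*) FROM tbl_a WHERE LOWER("{column}") IN ({placeholders})',
        LIKE_NULLS,
    ).fetchone()[0]


counts_before = {col: count_like_nulls(col) for col in ("col_integer", "col_text")}

for col in ("col_integer", "col_text"):
    conn.execute(
        f'UPDATE tbl_a SET "{col}" = NULL WHERE LOWER("{col}") IN ({placeholders})',
        LIKE_NULLS,
    )

counts_after = {col: count_like_nulls(col) for col in ("col_integer", "col_text")}
null_count = conn.execute(
    "SELECT COUNT(*) FROM tbl_a WHERE col_integer IS NULL"
).fetchone()[0]
```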

@gwaybio

- address comments and code suggestions from @gwaybio for #203
- f-strings instead of format
- remove pylint: disable=consider-using-f-string
- reference engine_from_str for SingleCells class
- reference Row keys with column names for SQLAlchemy executes instead of row indexes for clarity
- update docs for functions for clarity
- fetchone instead of fetchall for simpler implementation
- clearer logging message(s)
- like_nulls as function arg for contains_str_like_null
- LIKE_NULLS as tuple instead of list for immutable like_nulls default arg value with contains_str_like_null
- better formatting for test_sqlite.py SQL strings
- is False or True instead of == False or True (supplementary linting)

Co-Authored-By: Gregory Way <[email protected]>

d33bs · 2022-06-08T15:36:06Z

Thank you for all the great feedback @gwaybio, really appreciate your input along with the questions! I've pushed some changes just now based on your suggestions. I added you as a co-author with these as you suggested specific code changes which landed in the commit as well.

gwaybio

Looks great @d33bs - thanks for your good github commit hygiene! This made re-review very straightforward.

I'm not sure if you have merge privileges, but if so, please do once you and the tests are happy. If not, ping me once it is so, and I'll merge.

gwaybio · 2022-06-08T16:56:14Z

pycytominer/cyto_utils/sqlite.py

+
+            if column_name is not None:
+                # otherwise we will focus on only the column name provided
+                sql_stmt += " WHERE name = :col_name;"


Suggested change

sql_stmt += " WHERE name = :col_name;"

sql_stmt = f"{sql_stmt} WHERE name = :col_name;"

I'm not sure which is more pythonic - to keep f-string consistency or avoid overwriting a variable with itself.

Feel free to ignore this suggestion if you choose option 2.

Thank you - I like the f-string version better, feels more readable and consistent as you mention. I'll opt to add this one in.

- rename contains_conflicting_aff_strg_class to contains_conflicting_aff_storage_class
- f-string instead of a string concatenation for consistency and readability

Co-Authored-By: Gregory Way <[email protected]>

d33bs · 2022-06-08T22:55:42Z

Thank you again @gwaybio for the great feedback on this PR! I'm noticing that automated Github workflow checks may not run against the develop branch, and as a result, test results may not be as visible when merging work together there. What do you think about expanding the conditions in .github/workflows/python-app.yml to run on PR against develop (in addition to master)?

gwaybio · 2022-06-09T14:56:07Z

What do you think about expanding the conditions in .github/workflows/python-app.yml to run on PR against develop (in addition to master)?

Yes, absolutely - it'll be great to do for this and all future PRs. Can you add to this PR?

d33bs · 2022-06-09T15:34:39Z

There appear to be some challenges with how SQLite schema tables are referenced in different versions of Python and possibly the containers used for tests. Investigating this further.

…able Avoid potential sqlite version issues by referencing sqlite_master as per https://sqlite.org/schematab.html#alternative_names ("...alternative (1)[sqlite_master] works anywhere.")
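
A small sketch of why sqlite_master is the safer reference: the sqlite_schema alias was only added in SQLite 3.33.0, and the SQLite library bundled with a given Python build or container may be older. The table below is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_a (col_integer INTEGER)")

# sqlite_master is recognized by every SQLite version
tables_from_master = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]

# the version of the SQLite library linked into this Python build
supports_sqlite_schema = sqlite3.sqlite_version_info >= (3, 33, 0)

if supports_sqlite_schema:
    tables_from_schema = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_schema WHERE type = 'table'"
        )
    ]
else:
    # older libraries raise "no such table: sqlite_schema"
    tables_from_schema = None
```

Checking sqlite3.sqlite_version_info in a failing test environment is a quick way to confirm whether the bundled library predates the alias.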

gwaybio · 2022-06-09T16:27:07Z

🎉 thanks Dave!

d33bs added 12 commits June 2, 2022 12:58

add cyto_utils.sqlite; util for detecting conflicts in data

01f947f

add update_columns_to_nullable and tests

794ffdf

add update_columns_nan_to_null function; repurpose column loops

72225ab

work towards #198

test collect_columns

d93fd25

better test for collect_columns; some docs

b4c1b56

add contains_str_like_null; expand update for like nulls

aaeff28

add clean_like_nulls

698074d

rename to update_values_like_null_to_null for clarity

76f0400

Update __init__.py

14e72a3

adjusting logging

8673dfb

avoid testing conflicts; version bump for sqlite3 backup capabilities

7220f5b

update to optional string types

6d4e5cf

d33bs added the enhancement New feature or request label Jun 6, 2022

d33bs requested a review from gwaybio June 6, 2022 20:56

gwaybio requested changes Jun 7, 2022


gwaybio self-requested a review June 8, 2022 16:50

gwaybio approved these changes Jun 8, 2022


This was referenced Jun 8, 2022

Build SQLite conversion tool #205

Closed

Next pycytominer release #207

Closed

various changes for #203

4cde8f1

- rename contains_conflicting_aff_strg_class to contains_conflicting_aff_storage_class
- f-string instead of a string concatenation for consistency and readability

Co-Authored-By: Gregory Way <[email protected]>

run tests with PR to develop branch

3a53ef9

use sqlite_master instead of sqlite_schema for update_columns_to_null…

6e9eab4

…able Avoid potential sqlite version issues by referencing sqlite_master as per https://sqlite.org/schematab.html#alternative_names ("...alternative (1)[sqlite_master] works anywhere.")

d33bs merged commit 3a69511 into cytomining:develop Jun 9, 2022

gwaybio mentioned this pull request Jun 9, 2022

Add Black Requirement and Workflow Check #208

Merged

13 tasks

gwaybio mentioned this pull request Jun 17, 2022

Scheduling a SQLite deprecation in pycytominer #202

Open
