Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLite Utils: clean_like_nulls and More! #203

Merged
merged 16 commits into from
Jun 9, 2022
Merged

SQLite Utils: clean_like_nulls and More! #203

merged 16 commits into from
Jun 9, 2022

Conversation

d33bs
Copy link
Member

@d33bs d33bs commented Jun 6, 2022

This change adds a new series of utilities under pycytominer/cyto_utils/sqlite.py for detecting the need for and making changes to SQLite databases used by this project. The work seeks to improve performance, especially memory resource challenges as originally cited in #195 and further investigated as part of #198. Minimal logging has been added to the utilities to assist those making use of the functions and also as work towards #5 .

Special notes:

  • Python >= 3.7 is required due to the use of the sqlite3.Connection.backup (a safety measure to avoid database corruption during changes within update_columns_to_nullable).
  • Functions within are provided to assist with someone's ability to investigate SQLite source databases. These are intended to be used optionally within other classes and methods or on their own.

Open questions:

  • Will moving to SQLite Null and related numpy.nan within subsequent Pandas dataframes on read present any challenges (or other opportunities) for using data elsewhere within this library? Would we need any additional changes as a result?
  • Would it be beneficial to add clean_like_nulls as a "pre-step" to functions like merge_single_cells?

Thank you for any feedback, suggestions, and thoughts you may have!

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • I have read the CONTRIBUTING.md guidelines.
  • My code follows the style guidelines of this project.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • New and existing unit tests pass locally with my changes.
  • I have added tests that prove my fix is effective or that my feature works.
  • I have deleted all non-relevant text in this pull request template.

@d33bs d33bs added the enhancement New feature or request label Jun 6, 2022
@d33bs d33bs requested a review from gwaybio June 6, 2022 20:56
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big contribution, thanks @d33bs ! I've made several inline comments and discussion items.

A couple general comments:

  • We typically use f-strings where possible in > python 3.7. I do think some of the SQL statement .formats are super clean, so you may not decide to update all instances. I made some direct suggestions.
  • I understood nearly all of this (with the exception of some of the advanced SQL statements! (Given your extensive testing, I don't think you need to get these reviewed by a SQL expert)), but one area where I think understanding could substantially increase is in adding some brief comments about what the format of some variables you're indexing into. I made some in-line comments to this regard.

Thanks again!

from sqlalchemy import create_engine
from sqlalchemy.engine.base import Engine

# pylint: disable=consider-using-f-string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete comments of this sort

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Comment on lines 133 to 142
sql_stmt = """
SELECT :table_name, name, type, [notnull]
FROM pragma_table_info(:table_name);
"""
else:
# otherwise we will focus on only the column name provided
sql_stmt = """
SELECT :table_name, name, type, [notnull]
FROM pragma_table_info(:table_name) WHERE name = :col_name;
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you can reduce redundancy in sql_stmt by defining the common text and using f-strings to append whats different within the if statement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, I'll make a change to this effect.

# append to column list the results
column_list += connection.execute(
sql_stmt,
{"table_name": str(table[0]), "col_name": str(column_name)},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is table zero indexed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see, line 128 shows a list of tuples

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further reading, we can reference SQLAlchemy's Row object by key name here (in addition to other areas). I'll make a change to increase clarity here. Thank you for calling this forward.

return column_list


def contains_conflicting_aff_strg_class(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does strg mean string? If so, is this SQL convention? Otherwise, I think updating the function name to str might be less confusing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intended strg as a shortened version of storage (stated more fully, something like: contains_conflicting_column_affinity_type_and_data_storage_class), trying to reference to SQLite docs phrasing. Storage class is I feel roughly equivalent to "value actual type" (as opposed to "column preferred type"), but we may lose understandability if we use those words 🙂 . What do you think about expanding strg to the full word, making this: contains_conflicting_aff_storage_class?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I think contains_conflicting_aff_storage_class is much better. Please update throughout

# if we have a table name provided, target only that table for the modifications
sql_stmt = (
"SELECT name, sql FROM sqlite_master "
"WHERE type = 'table' and UPPER(name) = UPPER(:table_name)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a formatted string? (table_name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where I could, I tried to use SQLAlchemy execute arguments as a best practice when it comes to avoiding SQL injection (as per Bandit: B608). This proved to be challenging using variables in the way we need to for this work (column or table names), so I ended up intermixing formatted strings as well. I'm uncertain of the exact risk/attack vector in this case.

Do you feel we should move towards formatted strings for uniformity?

sqlalchemy.engine.base.Engine
A SQLAlchemy engine for the changed database
"""
logger.info("Updating column values with str 'nan' to NULL values.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.info("Updating column values with str 'nan' to NULL values.")
logger.info(f"Updating column values with str {LIKE_NULLS} to NULL values.")

True?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thank you. I'll add this change in as a lazy-formatted string logger.info("... %s ...", like_nulls) (as suggested by Pylint: W1201).

engine = create_engine(sql_path)

# statements for creating database with simple structure
create_stmts = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use black on this? It looks a bit wonky to me

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing - thank you for calling this out. I'll change this around in the next commit. black, multi-line strings, and lists don't seem to get along well.

""",
]

insert_vals = [1, "sample", b"sample_blob", 0.5]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is b?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This notation is used for bytes literals. It turned out to be a simple way to insert a blob into the test database tables and retain the blob storage class (or type) without needing to create or load a file object or similar. Inserting a string without the b notation results in a python string being inserted as a text storage class value. While this isn't crucial for testing with the current set, I felt it might be helpful to have in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha!

connection.execute(
"""
INSERT INTO tbl_a (col_integer, col_text, col_blob, col_real)
VALUES ('nan', 'nan', 'example', 0.5);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it is worth testing values other than nan - others in LIKE_NULLS i mean. it might not be!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh i see, maybe you're doing it in line 205

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely worth checking - with contains_conflicting_aff_strg_class we're checking only for mismatched storage class for the values (vs what the column affinity / preference is). We then go on to check for the like_nulls strings with contains_str_like_null in another set of tests. Line 210 seeks to check each column containing various versions of the like_nulls.

- address comments and code suggestions from @gwaybio for #203
- f-string's instead of format
- remove pylint: disable=consider-using-f-string
- reference engine_from_str for SingleCells class
- reference Row keys with column names for SQLAlchemy executes instead of row indexes for clarity
- update docs for functions for clarity
- fetchone instead of fetchall for simpler implementation
- clearer logging message(s)
- like_nulls as function arg for contains_str_like_null
- LIKE_NULLS as tuple instead of list for immutable like_nulls default arg value with contains_str_like_null
- better formatting for test_sqlite.py SQL strings
- is False or True instead of == False or True (supplementary linting)

Co-Authored-By: Gregory Way <[email protected]>
@d33bs
Copy link
Member Author

d33bs commented Jun 8, 2022

Thank you for all the great feedback @gwaybio, really appreciate your input along with the questions! I've pushed some changes just now based on your suggestions. I added you as a co-author with these as you suggested specific code changes which landed in the commit as well.

@gwaybio gwaybio self-requested a review June 8, 2022 16:50
Copy link
Member

@gwaybio gwaybio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @d33bs - thanks for your good github commit hygiene! This made re-review very straightforward.

I'm not sure if you have merge privileges, but if so, please do once you and the the tests are happy. If not, ping me once it is so, and I'll merge

return column_list


def contains_conflicting_aff_strg_class(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I think contains_conflicting_aff_storage_class is much better. Please update throughout

""",
]

insert_vals = [1, "sample", b"sample_blob", 0.5]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha!


if column_name is not None:
# otherwise we will focus on only the column name provided
sql_stmt += " WHERE name = :col_name;"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
sql_stmt += " WHERE name = :col_name;"
sql_stmt = f"{sql_stmt} WHERE name = :col_name;"

I'm not sure which is more pythonic - to keep f-string consistency or avoid overwriting a variable with itself.

Feel free to ignore this suggestion if you chose option 2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you - I like the f-string version better, feels more readable and consistent as you mention. I'll opt to add this one in.

This was referenced Jun 8, 2022
- rename contains_conflicting_aff_strg_class to contains_conflicting_aff_storage_class
- f-string instead of a string concatenation for consistency and readability

Co-Authored-By: Gregory Way <[email protected]>
@d33bs
Copy link
Member Author

d33bs commented Jun 8, 2022

Thank you again @gwaybio for the great feedback on this PR! I'm noticing that automated Github workflow checks may not run against the develop branch, and as a result, test results may not be as visible when merging work together there. What do you think about expanding the conditions in .github/workflows/python-app.yml to run on PR against develop (in addition to master)?

@gwaybio
Copy link
Member

gwaybio commented Jun 9, 2022

What do you think about expanding the conditions in .github/workflows/python-app.yml to run on PR against develop (in addition to master)?

Yes, absolutely - it'll be great to do for this and all future PRs. Can you add to this PR?

@d33bs
Copy link
Member Author

d33bs commented Jun 9, 2022

There appear to be some challenges with how SQLite schema tables are referenced in different version of Python and possibly the containers used for tests. Investigating this further.

…able

Avoid potential sqlite version issues by referencing sqlite_master as per https://sqlite.org/schematab.html#alternative_names ("...alternative (1)[sqlite_master] works anywhere.")
@d33bs d33bs merged commit 3a69511 into cytomining:develop Jun 9, 2022
@gwaybio
Copy link
Member

gwaybio commented Jun 9, 2022

🎉 thanks Dave!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants