-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQLite Utils: clean_like_nulls and More! #203
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big contribution, thanks @d33bs ! I've made several inline comments and discussion items.
A couple general comments:
- We typically use
f-strings
where possible in > python 3.7. I do think some of the SQL statement.formats
are super clean, so you may not decide to update all instances. I made some direct suggestions. - I understood nearly all of this (with the exception of some of the advanced SQL statements! (Given your extensive testing, I don't think you need to get these reviewed by a SQL expert)), but one area where I think understanding could substantially increase is in adding some brief comments about what the format of some variables you're indexing into. I made some in-line comments to this regard.
Thanks again!
pycytominer/cyto_utils/sqlite.py
Outdated
from sqlalchemy import create_engine | ||
from sqlalchemy.engine.base import Engine | ||
|
||
# pylint: disable=consider-using-f-string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please delete comments of this sort
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pycytominer/cyto_utils/sqlite.py
Outdated
sql_stmt = """ | ||
SELECT :table_name, name, type, [notnull] | ||
FROM pragma_table_info(:table_name); | ||
""" | ||
else: | ||
# otherwise we will focus on only the column name provided | ||
sql_stmt = """ | ||
SELECT :table_name, name, type, [notnull] | ||
FROM pragma_table_info(:table_name) WHERE name = :col_name; | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you can reduce redundancy in sql_stmt
by defining the common text and using f-strings to append whats different within the if statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point, I'll make a change to this effect.
pycytominer/cyto_utils/sqlite.py
Outdated
# append to column list the results | ||
column_list += connection.execute( | ||
sql_stmt, | ||
{"table_name": str(table[0]), "col_name": str(column_name)}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is table
zero indexed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I see, line 128 shows a list of tuples
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further reading, we can reference SQLAlchemy's Row object by key name here (in addition to other areas). I'll make a change to increase clarity here. Thank you for calling this forward.
pycytominer/cyto_utils/sqlite.py
Outdated
return column_list | ||
|
||
|
||
def contains_conflicting_aff_strg_class( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does strg
mean string? If so, is this SQL convention? Otherwise, I think updating the function name to str
might be less confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intended strg
as a shortened version of storage
(stated more fully, something like: contains_conflicting_column_affinity_type_and_data_storage_class
), trying to reference to SQLite docs phrasing. Storage class is I feel roughly equivalent to "value actual type" (as opposed to "column preferred type"), but we may lose understandability if we use those words 🙂 . What do you think about expanding strg
to the full word, making this: contains_conflicting_aff_storage_class
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! I think contains_conflicting_aff_storage_class
is much better. Please update throughout
pycytominer/cyto_utils/sqlite.py
Outdated
# if we have a table name provided, target only that table for the modifications | ||
sql_stmt = ( | ||
"SELECT name, sql FROM sqlite_master " | ||
"WHERE type = 'table' and UPPER(name) = UPPER(:table_name)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a formatted string? (table_name
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where I could, I tried to use SQLAlchemy execute arguments as a best practice when it comes to avoiding SQL injection (as per Bandit: B608). This proved to be challenging using variables in the way we need to for this work (column or table names), so I ended up intermixing formatted strings as well. I'm uncertain of the exact risk/attack vector in this case.
Do you feel we should move towards formatted strings for uniformity?
pycytominer/cyto_utils/sqlite.py
Outdated
sqlalchemy.engine.base.Engine | ||
A SQLAlchemy engine for the changed database | ||
""" | ||
logger.info("Updating column values with str 'nan' to NULL values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.info("Updating column values with str 'nan' to NULL values.") | |
logger.info(f"Updating column values with str {LIKE_NULLS} to NULL values.") |
True?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thank you. I'll add this change in as a lazy-formatted string logger.info("... %s ...", like_nulls)
(as suggested by Pylint: W1201).
engine = create_engine(sql_path) | ||
|
||
# statements for creating database with simple structure | ||
create_stmts = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use black
on this? It looks a bit wonky to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing - thank you for calling this out. I'll change this around in the next commit. black
, multi-line strings, and lists don't seem to get along well.
""", | ||
] | ||
|
||
insert_vals = [1, "sample", b"sample_blob", 0.5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is b
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This notation is used for bytes literals. It turned out to be a simple way to insert a blob into the test database tables and retain the blob storage class (or type) without needing to create or load a file object or similar. Inserting a string without the b
notation results in a python string being inserted as a text storage class value. While this isn't crucial for testing with the current set, I felt it might be helpful to have in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha!
connection.execute( | ||
""" | ||
INSERT INTO tbl_a (col_integer, col_text, col_blob, col_real) | ||
VALUES ('nan', 'nan', 'example', 0.5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it is worth testing values other than nan
- others in LIKE_NULLS
i mean. it might not be!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see, maybe you're doing it in line 205
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely worth checking - with contains_conflicting_aff_strg_class
we're checking only for mismatched storage class for the values (vs what the column affinity / preference is). We then go on to check for the like_nulls strings with contains_str_like_null
in another set of tests. Line 210 seeks to check each column containing various versions of the like_nulls.
- address comments and code suggestions from @gwaybio for #203 - f-string's instead of format - remove pylint: disable=consider-using-f-string - reference engine_from_str for SingleCells class - reference Row keys with column names for SQLAlchemy executes instead of row indexes for clarity - update docs for functions for clarity - fetchone instead of fetchall for simpler implementation - clearer logging message(s) - like_nulls as function arg for contains_str_like_null - LIKE_NULLS as tuple instead of list for immutable like_nulls default arg value with contains_str_like_null - better formatting for test_sqlite.py SQL strings - is False or True instead of == False or True (supplementary linting) Co-Authored-By: Gregory Way <[email protected]>
Thank you for all the great feedback @gwaybio, really appreciate your input along with the questions! I've pushed some changes just now based on your suggestions. I added you as a co-author with these as you suggested specific code changes which landed in the commit as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @d33bs - thanks for your good github commit hygiene! This made re-review very straightforward.
I'm not sure if you have merge privileges, but if so, please do once you and the the tests are happy. If not, ping me once it is so, and I'll merge
pycytominer/cyto_utils/sqlite.py
Outdated
return column_list | ||
|
||
|
||
def contains_conflicting_aff_strg_class( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes! I think contains_conflicting_aff_storage_class
is much better. Please update throughout
""", | ||
] | ||
|
||
insert_vals = [1, "sample", b"sample_blob", 0.5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha!
pycytominer/cyto_utils/sqlite.py
Outdated
|
||
if column_name is not None: | ||
# otherwise we will focus on only the column name provided | ||
sql_stmt += " WHERE name = :col_name;" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sql_stmt += " WHERE name = :col_name;" | |
sql_stmt = f"{sql_stmt} WHERE name = :col_name;" |
I'm not sure which is more pythonic - to keep f-string consistency or avoid overwriting a variable with itself.
Feel free to ignore this suggestion if you chose option 2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you - I like the f-string version better, feels more readable and consistent as you mention. I'll opt to add this one in.
- rename contains_conflicting_aff_strg_class to contains_conflicting_aff_storage_class - f-string instead of a string concatenation for consistency and readability Co-Authored-By: Gregory Way <[email protected]>
Thank you again @gwaybio for the great feedback on this PR! I'm noticing that automated Github workflow checks may not run against the |
Yes, absolutely - it'll be great to do for this and all future PRs. Can you add to this PR? |
There appear to be some challenges with how SQLite schema tables are referenced in different version of Python and possibly the containers used for tests. Investigating this further. |
…able Avoid potential sqlite version issues by referencing sqlite_master as per https://sqlite.org/schematab.html#alternative_names ("...alternative (1)[sqlite_master] works anywhere.")
🎉 thanks Dave! |
This change adds a new series of utilities under pycytominer/cyto_utils/sqlite.py for detecting the need for and making changes to SQLite databases used by this project. The work seeks to improve performance, especially memory resource challenges as originally cited in #195 and further investigated as part of #198. Minimal logging has been added to the utilities to assist those making use of the functions and also as work towards #5 .
Special notes:
Open questions:
Null
and relatednumpy.nan
within subsequent Pandas dataframes on read present any challenges (or other opportunities) for using data elsewhere within this library? Would we need any additional changes as a result?merge_single_cells
?Thank you for any feedback, suggestions, and thoughts you may have!
What is the nature of your change?
Checklist
Please ensure that all boxes are checked before indicating that a pull request is ready for review.