Checkpoint while collecting metadata #31
base: main
Conversation
Congratulations 🎉. DeepCode analyzed your code in 2.69 seconds and we found no issues. Enjoy a moment of no bugs ☀️. 👉 View analysis in DeepCode’s Dashboard
Nice database skills! Looks good, just two questions.
```python
# get set of processed files in case of a restart
wanted = set(filenames)
done = set(
    filename for filename, in session.query(FileMetadata.filename)
)
```
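As a side note, the restart logic in this snippet can be sketched without a database at hand. In the sketch below, a plain dict is a hypothetical stand-in for the `FileMetadata` table, and the metadata values are dummies, not the PR's actual schema:

```python
# Sketch of the checkpoint/restart idea (hypothetical stand-ins: a dict
# replaces the FileMetadata table, and the metadata values are dummies).
filenames = ["a.nc", "b.nc", "c.nc"]
checkpoint = {"a.nc": {"rows": 10}}  # written before the interruption

wanted = set(filenames)
done = set(checkpoint)   # plays the role of the one-column query

todo = wanted - done     # only unprocessed files remain
for filename in sorted(todo):
    checkpoint[filename] = {"rows": 0}  # collect and store metadata here

print(sorted(todo))  # ['b.nc', 'c.nc']
```

After the loop, every wanted file has a checkpoint entry, so a second run would find `todo` empty.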
Is that `,` intended?
Unfortunately yes, because the query returns tuples (of one item) and therefore I need to unpack them.
I have not yet seen a better way to get a list directly from a one-column query.
Ok, I see. What about `for filename, _ in ...`?
Another idea:

```python
import itertools as it

done = set(it.chain(*session.query(FileMetadata.filename)))
```

which is another way to flatten the list of one-element tuples.
Or another way without any import:

```python
done = set(next(zip(*session.query(FileMetadata.filename))))
```
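For comparison, all three variants discussed in this thread behave the same on a plain list of one-element tuples; the `rows` list below is a hypothetical stand-in for the actual query result:

```python
import itertools as it

# Hypothetical stand-in for session.query(FileMetadata.filename),
# which yields one-element tuples.
rows = [("a.nc",), ("b.nc",)]

done_unpack = set(filename for filename, in rows)  # trailing comma unpacks the 1-tuple
done_chain = set(it.chain(*rows))                  # flatten with itertools
done_zip = set(next(zip(*rows)))                   # transpose, take the first "column"

print(done_unpack == done_chain == done_zip == {"a.nc", "b.nc"})  # True
```

One caveat: on an empty query result, the first two variants yield an empty set, while the `zip` variant raises `StopIteration` because there is no first column to take.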
```python
with sqlite3.connect(dbfile) as con:
    mda = pd.read_sql('select * from metadata', con)
mda = mda.set_index(['platform', 'level_1'])
mda.fillna(value=np.nan, inplace=True)
for col in mda.columns:
    if 'time' in col:
        mda[col] = mda[col].astype('datetime64[ns]')
```
Does the new solution convert time columns in the dataframe to `datetime64` and replace NULL items with NaN? I don't remember why, but I think this was needed for the overlap computation to work correctly.
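Some background on why such conversions come up at all: SQLite has no native datetime storage class, so timestamps round-trip as text and SQL NULLs come back as `None`. A minimal stdlib-only illustration (the table and column names here are made up for the example, not the PR's actual schema):

```python
import sqlite3

# SQLite has no native datetime type, so timestamps round-trip as TEXT
# and SQL NULLs come back as None.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (platform TEXT, start_time TEXT)")
con.execute("INSERT INTO metadata VALUES ('noaa15', '2020-01-01 00:00:00')")
con.execute("INSERT INTO metadata VALUES ('noaa15', NULL)")
rows = con.execute("SELECT start_time FROM metadata ORDER BY rowid").fetchall()
con.close()

print(rows)  # [('2020-01-01 00:00:00',), (None,)]
```

Presumably this is why the snippet above re-casts the `time` columns with `astype('datetime64[ns]')` and normalizes missing values with `fillna`.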
That's a nice use case for the system tests, actually.
I was a bit surprised by these lines, so I ran a test and it worked, meaning the types were preserved when writing and re-reading into the DataFrame.
Magic! Curious how the system tests will behave.
Thanks, I learned these things while writing application databases for dashboards using …
@carloshorn I ran the system tests on this PR and got the following error: …
Ohh, strange that I did not run into it before... Luckily, pandas timestamps can be converted using …
You can run them yourself as follows: …
This PR closes #25 by writing a checkpoint database while looping through all filenames, which allows restarting in case of an interruption.