
Checkpoint while collecting metadata #31

Open · wants to merge 6 commits into base: main
Conversation

carloshorn

This PR closes #25 by writing a checkpoint database while looping through all filenames.
This allows restarting in case of an interruption.
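
A minimal, self-contained sketch of the checkpoint idea (the model and loop below are illustrative assumptions, not the actual code of this PR; imports assume SQLAlchemy 1.4 or later):

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class FileMetadata(Base):
    __tablename__ = "metadata"
    id = Column(Integer, primary_key=True)
    filename = Column(String, unique=True)

engine = create_engine("sqlite:///checkpoint.sqlite3")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

def collect_metadata(filenames):
    wanted = set(filenames)
    # Filenames already in the database were processed before an interruption.
    done = {filename for filename, in session.query(FileMetadata.filename)}
    for filename in sorted(wanted - done):
        # ... open the file and extract its metadata here ...
        session.add(FileMetadata(filename=filename))
        session.commit()  # checkpoint after every file, so a restart loses at most one file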

@ghost commented Sep 11, 2020

Congratulations 🎉. DeepCode analyzed your code in 2.69 seconds and we found no issues. Enjoy a moment of no bugs ☀️.


@carloshorn (Author)

Retry DeepCode

@coveralls

Coverage increased (+0.9%) to 57.443% when pulling 3228e76 on carloshorn:checkpoint_while_collecting into ceb31a8 on pytroll:master.

@sfinkens (Member) left a comment

Nice database skills! Looks good, just two questions.

# get set of processed files in case of a restart
wanted = set(filenames)
done = set(
    filename for filename, in session.query(FileMetadata.filename)
Member

Is that `,` intended?

Author

Unfortunately yes, because the query returns one-item tuples, so I need to unpack them.

Author

I have not yet found a better way to get a list directly from a one-column query.

Member

Ok, I see. What about `for filename, _ in ...`?

Author

Another idea:

import itertools as it
done = set(it.chain(*session.query(FileMetadata.filename)))

which is another way to flatten the list of one-element tuples.

Author

Or another way without any import:

done = set(next(zip(*session.query(FileMetadata.filename))))
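
For comparison, a small self-contained sketch of all three variants, using a plain list of one-element tuples as a stand-in for the actual query result:

import itertools as it

# Stand-in for the one-element tuples returned by session.query(FileMetadata.filename).
rows = [("a.nc",), ("b.nc",)]

done_unpack = {filename for filename, in rows}  # trailing comma unpacks each 1-tuple
done_chain = set(it.chain(*rows))               # flatten with itertools.chain
done_zip = set(next(zip(*rows)))                # transpose with zip; raises StopIteration on an empty result
assert done_unpack == done_chain == done_zip == {"a.nc", "b.nc"}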

Comment on lines -148 to -154
with sqlite3.connect(dbfile) as con:
    mda = pd.read_sql('select * from metadata', con)
    mda = mda.set_index(['platform', 'level_1'])
    mda.fillna(value=np.nan, inplace=True)
    for col in mda.columns:
        if 'time' in col:
            mda[col] = mda[col].astype('datetime64[ns]')
Member

Does the new solution convert time columns in the dataframe to datetime64 and replace NULL items with NaN? I don't remember why, but I think this was needed for the overlap computation to work correctly.

Member

That's a nice use case for the system tests, actually.

Author

I was a bit surprised by these lines, so I ran a test and it worked: the types were preserved when writing to the database and reading it back into the DataFrame.

Member

Magic! Curious how the system tests will behave.

@carloshorn (Author)

> Nice database skills!

Thanks, I learned these things while writing application databases for dashboards using Dash. The SQLAlchemy object-relational model convinced me because it does not require a single line of SQL in the code, which makes the whole thing dialect-independent.
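
To illustrate the dialect-independence point, a minimal sketch (the engine URLs below are placeholders, not this project's configuration):

from sqlalchemy import create_engine

engine = create_engine("sqlite:///metadata.sqlite3")
# engine = create_engine("postgresql+psycopg2://user:password@host/dbname")  # needs psycopg2 installed
# Base.metadata.create_all(engine) and sessionmaker(bind=engine) stay the same for either backend.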

@sfinkens (Member)

@carloshorn I ran the system tests on this PR and got the following error:

[INFO: 2020-09-28 09:35:36 : pygac_fdr] Collecting metadata
[DEBUG: 2020-09-28 09:35:36 : pygac_fdr] Collecting metadata from test_data/output/normal/AVHRR-GAC_FDR_1C_N06_19810330T194718Z_19810330T213158Z_R_O_20200101T000000Z_0100.nc
Traceback (most recent call last):
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/dialects/sqlite/base.py", line 776, in process
    "SQLite DateTime type only accepts Python "
TypeError: SQLite DateTime type only accepts Python datetime and date objects as input.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/bin/pygac-fdr-mda-collect", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/cmsaf/nfshome/sfinkens/software/devel/pygac-fdr/pygac-fdr/bin/pygac-fdr-mda-collect", line 46, in <module>
    mda = collector.get_metadata(args.filenames)
  File "/cmsaf/nfshome/sfinkens/software/devel/pygac-fdr/pygac-fdr/pygac_fdr/metadata.py", line 154, in get_metadata
    df = pd.DataFrame(self._collect_metadata(filenames))
  File "/cmsaf/nfshome/sfinkens/software/devel/pygac-fdr/pygac-fdr/pygac_fdr/metadata.py", line 224, in _collect_metadata
    session.commit()
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 1042, in commit
    self.transaction.commit()
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 504, in commit
    self._prepare_impl()
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 483, in _prepare_impl
    self.session.flush()
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
    self._flush(objects)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2664, in _flush
    transaction.rollback(_capture_exception=True)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 69, in __exit__
    exc_value, with_traceback=exc_tb,
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/session.py", line 2624, in _flush
    flush_context.execute()
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/orm/persistence.py", line 1083, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1011, in execute
    return meth(self, multiparams, params)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1130, in _execute_clauseelement
    distilled_params,
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1207, in _execute_context
    e, util.text_type(statement), parameters, None, None
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1511, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 182, in raise_
    raise exception
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1204, in _execute_context
    context = constructor(dialect, self, conn, *args)
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 839, in _init_compiled
    param.append(processors[key](compiled_params[key]))
  File "/cmsaf/cmsaf-ops3/sfinkens/conda/envs/pygac-fdr/lib/python3.7/site-packages/sqlalchemy/dialects/sqlite/base.py", line 776, in process
    "SQLite DateTime type only accepts Python "
sqlalchemy.exc.StatementError: (builtins.TypeError) SQLite DateTime type only accepts Python datetime and date objects as input.
[SQL: INSERT INTO metadata (platform, start_time, end_time, along_track, filename, orbit_number_start, orbit_number_end, equator_crossing_longitude_1, equator_crossing_time_1, equator_crossing_longitude_2, equator_crossing_time_2, midnight_line, overlap_free_start, overlap_free_end, global_quality_flag) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]
[parameters: [{'along_track': 12560, 'overlap_free_end': nan, 'platform': 'NOAA-6', 'orbit_number_end': 9132, 'end_time': numpy.datetime64('1981-03-30T21:31:58.206 ... (382 characters truncated) ... 789, 'equator_crossing_time_2': numpy.datetime64('NaT'), 'orbit_number_start': 9131, 'start_time': numpy.datetime64('1981-03-30T19:47:18.706000000')}]]
ERROR

@carloshorn (Author)

Ohh, strange that I did not run into it before... Luckily, pandas timestamps can be converted using to_pydatetime.
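
A hedged sketch of such a conversion (not necessarily the exact fix pushed to this branch), mapping NaT to None so it ends up as SQL NULL:

import numpy as np
import pandas as pd

def to_python_datetime(value):
    """Convert numpy/pandas timestamps to plain datetime objects for the SQLite DateTime type."""
    timestamp = pd.Timestamp(value)  # accepts numpy.datetime64 and pandas Timestamps
    if pd.isnull(timestamp):         # NaT becomes None, i.e. SQL NULL
        return None
    return timestamp.to_pydatetime()

print(to_python_datetime(np.datetime64("1981-03-30T19:47:18.706")))  # datetime.datetime(1981, 3, 30, 19, 47, 18, 706000)
print(to_python_datetime(np.datetime64("NaT")))                      # None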

@sfinkens (Member)

You can run them yourself as follows:

git merge master
<resolve conflict in setup.py>
cd pygac_fdr/test/
python fetch_data.py
pytest -vs test_end2end.py::EndToEndTestNormal

Successfully merging this pull request may close these issues.

pygac-fdr-mda-collect should write checkpoints for restarts