353 new member field breaks create joined timeseries on existing datasets #355

Merged
4 changes: 2 additions & 2 deletions README.md
@@ -37,8 +37,8 @@ python -m teehr.utils.install_spark_jars
 ```
 Use Docker
 ```bash
-$ docker build -t teehr:v0.4.5 .
-$ docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.4.5 jupyter lab --ip 0.0.0.0 $HOME
+$ docker build -t teehr:v0.4.6 .
+$ docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.4.6 jupyter lab --ip 0.0.0.0 $HOME
 ```
 
 ## Examples
20 changes: 20 additions & 0 deletions docs/sphinx/changelog/index.rst
@@ -2,6 +2,26 @@ Release Notes
 =============
 
 
+0.4.6 - 2024-12-17
+--------------------
+
+Added
+^^^^^
+* Adds `add_missing_columns` to the `_validate` method in the `BaseTable` class
+  to allow for adding missing columns to the schema.
+  - When upgrading from 0.4.4 or earlier, you may need to run the following to add
+    the missing columns to the secondary_timeseries if you have existing datasets:
+    ```
+    sdf = ev.secondary_timeseries.to_sdf()
+    validated_sdf = ev.secondary_timeseries._validate(sdf, add_missing_columns=True)
+    ev.secondary_timeseries._write_spark_df(validated_sdf)
+    ```
+
+Changed
+^^^^^^^
+* None
+
+
 0.4.5 - 2024-12-09
 --------------------
 
4 changes: 2 additions & 2 deletions docs/sphinx/getting_started/index.rst
@@ -37,8 +37,8 @@ Or, if you do not want to install TEEHR in your own virtual environment, you can
 
 .. code-block:: bash
 
-   docker build -t teehr:v0.4.5 .
-   docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.4.5 jupyter lab --ip 0.0.0.0 $HOME
+   docker build -t teehr:v0.4.6 .
+   docker run -it --rm --volume $HOME:$HOME -p 8888:8888 teehr:v0.4.6 jupyter lab --ip 0.0.0.0 $HOME
 
 Project Objectives
 ------------------
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "teehr"
-version = "0.4.5"
+version = "0.4.6"
 description = "Tools for Exploratory Evaluation in Hydrologic Research"
 authors = [
     "RTI International",
2 changes: 1 addition & 1 deletion src/teehr/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.4.5"
+__version__ = "0.4.6"
 
 from teehr.evaluation.evaluation import Evaluation # noqa
 from teehr.models.metrics.metric_models import Metrics # noqa
22 changes: 20 additions & 2 deletions src/teehr/evaluation/tables/base_table.py
@@ -10,6 +10,7 @@
 from teehr.utils.utils import to_path_or_s3path, path_to_spark
 from teehr.models.filters import FilterBaseModel
 import logging
+from pyspark.sql.functions import lit, col
 
 logger = logging.getLogger(__name__)
 
@@ -145,7 +146,12 @@ def _get_schema(self, type: str = "pyspark"):
 
         return self.schema_func()
 
-    def _validate(self, df: ps.DataFrame, strict: bool = True) -> ps.DataFrame:
+    def _validate(
+        self,
+        df: ps.DataFrame,
+        strict: bool = True,
+        add_missing_columns: bool = False
+    ) -> ps.DataFrame:
         """Validate a DataFrame against the table schema.
 
         Parameters
@@ -156,13 +162,25 @@ def _validate(self, df: ps.DataFrame, strict: bool = True) -> ps.DataFrame:
             If True, any extra columns will be dropped before validation.
             If False, will be validated as-is.
             The default is True.
+
+        Returns
+        -------
+        validated_df : ps.DataFrame
+            The validated DataFrame.
         """
         schema = self._get_schema()
 
         logger.info(f"Validating DataFrame with {schema.columns}.")
 
+        schema_cols = schema.columns.keys()
+
+        # Add missing columns
+        if add_missing_columns:
+            for col_name in schema_cols:
+                if col_name not in df.columns:
+                    df = df.withColumn(col_name, lit(None))
Collaborator

Just to make sure I understand: here we're adding empty columns (column names that exist in the schema but not in the DataFrame), and then line 186 coerces those empty columns to the correct data types as defined by the schema?


         if strict:
-            schema_cols = schema.columns.keys()
             df = df.select(*schema_cols)
 
         validated_df = schema.validate(df)
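
The review question above is about the order of operations in `_validate`: add missing columns as nulls, optionally select only the schema columns, then validate. The following is a minimal plain-Python sketch of that flow, not TEEHR's implementation. It assumes the validation step coerces values to the declared types, which is exactly what the reviewer is asking and is not confirmed here. `validate_rows`, the dict-based schema, and the column names other than `member` are hypothetical stand-ins for the Spark DataFrame and the table schema.

```python
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a table schema: column name -> type coercion.
Schema = Dict[str, Callable[[Any], Any]]
Row = Dict[str, Any]


def validate_rows(
    rows: List[Row],
    schema: Schema,
    strict: bool = True,
    add_missing_columns: bool = False,
) -> List[Row]:
    """Mimic the add-missing / select / coerce order on plain dict rows."""
    validated = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated

        # Step 1: fill schema columns that are absent from the row with nulls.
        if add_missing_columns:
            for col_name in schema:
                if col_name not in row:
                    row[col_name] = None

        # Step 2: in strict mode keep only the schema columns (drops extras).
        # A schema column that is still missing raises KeyError here.
        if strict:
            row = {col_name: row[col_name] for col_name in schema}

        # Step 3: coerce non-null values to the schema type (assumed behaviour).
        for col_name, cast in schema.items():
            if row.get(col_name) is not None:
                row[col_name] = cast(row[col_name])

        validated.append(row)
    return validated


# A row written before the `member` field existed, plus one extra column.
schema: Schema = {"location_id": str, "value": float, "member": str}
old_rows = [{"location_id": "gage-1", "value": "1.5", "extra": 1}]

result = validate_rows(old_rows, schema, add_missing_columns=True)
```

With `add_missing_columns=True` the old row gains a null `member` and loses `extra`. With the default `add_missing_columns=False` and `strict=True`, the same row raises `KeyError` in this sketch, which is analogous to the Spark `select` failing on a column that does not exist in an older dataset.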
2 changes: 1 addition & 1 deletion version.txt
@@ -1 +1 @@
-0.4.5
+0.4.6