353 new member field breaks create joined timeseries on existing datasets #355

Status: Open. Wants to merge 2 commits into base: main. Showing changes from 1 commit.
22 changes: 20 additions & 2 deletions src/teehr/evaluation/tables/base_table.py
@@ -10,6 +10,7 @@
 from teehr.utils.utils import to_path_or_s3path, path_to_spark
 from teehr.models.filters import FilterBaseModel
 import logging
+from pyspark.sql.functions import lit, col

 logger = logging.getLogger(__name__)

@@ -145,7 +146,12 @@ def _get_schema(self, type: str = "pyspark"):

         return self.schema_func()

-    def _validate(self, df: ps.DataFrame, strict: bool = True) -> ps.DataFrame:
+    def _validate(
+        self,
+        df: ps.DataFrame,
+        strict: bool = True,
+        add_missing_columns: bool = False
+    ) -> ps.DataFrame:
         """Validate a DataFrame against the table schema.

         Parameters
@@ -156,13 +162,25 @@ def _validate(self, df: ps.DataFrame, strict: bool = True) -> ps.DataFrame:
             If True, any extra columns will be dropped before validation.
             If False, will be validated as-is.
             The default is True.
+
+        Returns
+        -------
+        validated_df : ps.DataFrame
+            The validated DataFrame.
         """
         schema = self._get_schema()

         logger.info(f"Validating DataFrame with {schema.columns}.")

+        schema_cols = schema.columns.keys()
+
+        # Add missing columns
+        if add_missing_columns:
+            for col_name in schema_cols:
+                if col_name not in df.columns:
+                    df = df.withColumn(col_name, lit(None))
Collaborator

Just to make sure I understand: here we're adding empty columns (columns that exist in the schema but not in the DataFrame), and then line 186 coerces those empty columns to the data types defined by the schema?


         if strict:
-            schema_cols = schema.columns.keys()
             df = df.select(*schema_cols)

         validated_df = schema.validate(df)
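
To make the order of operations in this hunk concrete, here is a minimal pure-Python sketch of the same flow, using plain dicts in place of a Spark DataFrame. The `SCHEMA` mapping and the `validate_rows` function are hypothetical stand-ins and not part of teehr. The final coercion step assumes `schema.validate` is configured to coerce types, which this diff does not show.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the table schema: column name -> Python type.
SCHEMA: Dict[str, type] = {"location_id": str, "value": float, "member": str}


def validate_rows(
    rows: List[Dict[str, Any]],
    schema: Dict[str, type] = SCHEMA,
    strict: bool = True,
    add_missing_columns: bool = False,
) -> List[Dict[str, Any]]:
    """Mimic the add-missing, select, validate order used in the diff."""
    validated = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not mutated
        if add_missing_columns:
            for name in schema:
                # Plays the role of df.withColumn(name, lit(None)):
                # a null placeholder for a column the data lacks.
                row.setdefault(name, None)
        if strict:
            # Plays the role of df.select(*schema_cols): drops extra columns
            # and raises KeyError if a schema column is still missing.
            row = {name: row[name] for name in schema}
        # Assumed stand-in for schema.validate(df) with coercion enabled:
        # non-null values are cast to the schema type, nulls stay null.
        for name, typ in schema.items():
            if name in row and row[name] is not None:
                row[name] = typ(row[name])
        validated.append(row)
    return validated


# A row from an "existing dataset" written before the member field existed.
old_rows = [{"location_id": "a", "value": "1.5", "extra": 1}]
result = validate_rows(old_rows, add_missing_columns=True)
```

In Spark itself, `lit(None)` produces a column of null type, so any cast to the schema's declared type has to happen in a later step, for example inside validation if coercion is enabled. Without `add_missing_columns=True`, the strict select in this sketch fails on the old row, which mirrors the breakage described in the PR title.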