Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release 1.4.6 #395

Merged
merged 90 commits into from
Jan 7, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
90 commits
Select commit Hold shift + click to select a range
fc41ab9
[MLOP-634] Butterfree dev workflow, set triggers for branches staging…
moromimay Feb 8, 2021
4be4ffe
[BUG] Fix Staging GithubActions Pipeline (#283)
moromimay Feb 8, 2021
a3a601b
Apply only wheel. (#285)
moromimay Feb 8, 2021
4339608
[BUG] Change version on setup.py to PyPI (#286)
moromimay Feb 9, 2021
a82433c
Keep milliseconds when using 'from_ms' argument in timestamp feature …
hmeretti Feb 9, 2021
dcbf540
Change trigger for pipeline staging (#287)
moromimay Feb 10, 2021
a0a9335
Create a dev package. (#288)
moromimay Feb 10, 2021
7427898
[MLOP-633] Butterfree dev workflow, update documentation (#281)
moromimay Feb 10, 2021
245eaa5
[MLOP-632] Butterfree dev workflow, automate release description (#279)
AlvaroMarquesAndrade Feb 11, 2021
d6ecfa4
[MLOP-636] Create migration classes (#282)
AlvaroMarquesAndrade Feb 18, 2021
32e24d6
[MLOP-635] Rebase Incremental Job/Interval Run branch for test on sel…
moromimay Feb 19, 2021
8da89ed
Allow slide selection (#293)
roelschr Feb 22, 2021
0df07ae
Fix Slide Duration Typo (#295)
AlvaroMarquesAndrade Feb 26, 2021
aeb7999
[MLOP-637] Implement diff method (#292)
moromimay Mar 8, 2021
9afc39c
[MLOP-640] Create CLI with migrate command (#298)
roelschr Mar 15, 2021
bf204f2
[MLOP-645] Implement query method, cassandra (#291)
AlvaroMarquesAndrade Mar 15, 2021
b518dbc
[MLOP-671] Implement get_schema on Spark client (#301)
AlvaroMarquesAndrade Mar 16, 2021
5fe4c40
[MLOP-648] Implement query method, metastore (#294)
AlvaroMarquesAndrade Mar 16, 2021
e8fc0da
Fix Validation Step (#302)
AlvaroMarquesAndrade Mar 22, 2021
3d93a09
[MLOP-647] [MLOP-646] Apply migrations (#300)
roelschr Mar 23, 2021
0d30932
[BUG] Apply create_partitions to historical validate (#303)
moromimay Mar 30, 2021
d607297
[BUG] Fix key path for validate read (#304)
moromimay Mar 30, 2021
3dcd975
[FIX] Add Partition types for Metastore (#305)
AlvaroMarquesAndrade Apr 1, 2021
8077d86
[MLOP-639] Track logs in S3 (#306)
moromimay Apr 1, 2021
6d2a8f9
[BUG] Change logging config (#307)
moromimay Apr 6, 2021
d2c5d39
Change solution for tracking logs (#308)
moromimay Apr 8, 2021
43392f4
Read and write consistency level options (#309)
github-felipe-caputo Apr 13, 2021
0f31164
Fix kafka reader. (#310)
moromimay Apr 14, 2021
e6f67e9
Fix path validate. (#311)
moromimay Apr 14, 2021
baa594b
Add local dc property (#312)
github-felipe-caputo Apr 16, 2021
a74f098
Remove metastore migrate (#313)
moromimay Apr 20, 2021
378f3a5
Fix link in our docs. (#315)
moromimay Apr 20, 2021
3b18b5a
[BUG] Fix Cassandra Connect Session (#316)
moromimay Apr 23, 2021
c46f171
Fix migration query. (#318)
moromimay Apr 26, 2021
bb124f5
Fix migration query add type key. (#319)
moromimay Apr 28, 2021
1c97316
Fix db-config condition (#321)
moromimay May 5, 2021
bb7ed77
MLOP-642 Document migration in Butterfree (#320)
roelschr May 7, 2021
5a0a622
[MLOP-702] Debug mode for Automate Migration (#322)
moromimay May 10, 2021
b1371f1
[MLOP-727] Improve logging messages (#325)
GaBrandao Jun 2, 2021
acf7022
[MLOP-728] Improve logging messages (#324)
moromimay Jun 2, 2021
d0bf61a
Fix method to generate agg feature name. (#326)
moromimay Jun 4, 2021
1cf0dbd
[MLOP-691] Include step to add partition to SparkMetastore during wr…
moromimay Jun 10, 2021
9f42f53
Add the missing link for H3 geohash (#330)
jdvala Jun 16, 2021
78927e3
Update README.md (#331)
Jul 30, 2021
43bb3a3
Update Github Actions Workflow runner (#332)
Aug 22, 2022
2593839
Delete sphinx version. (#334)
moromimay Dec 20, 2022
35bcd30
Update files to staging (#336)
moromimay Dec 21, 2022
3a73ed8
Revert "Update files to staging (#336)" (#337)
moromimay Jan 2, 2023
6b78a50
Less strict requirements (#333)
lecardozo Aug 16, 2023
2a19009
feat: optional row count validation (#340)
ralphrass Aug 18, 2023
ca1a16d
fix: parameter, libs (#341)
ralphrass Aug 18, 2023
60c7ee4
pre-release 1.2.2.dev0 (#342)
ralphrass Aug 21, 2023
f35d665
Rebase staging (#343)
ralphrass Aug 21, 2023
97e44fa
Rebase staging from master (#345)
ralphrass Aug 21, 2023
9bcca0e
feat(MLOP-1985): optional params (#347)
ralphrass Nov 13, 2023
512a0fe
pre-release 1.2.3 (#349)
ralphrass Nov 13, 2023
688a5b3
feat(MLOP-2145): add feature set creation script (#351)
ralphrass Apr 11, 2024
da91b49
Rebase staging from master (#354)
ralphrass Apr 25, 2024
887fbb2
feat(mlop-2269): bump versions (#355)
ralphrass May 29, 2024
5af8a05
fix: sphinx version (#356)
ralphrass Jun 3, 2024
cbda73d
fix: publish and dev versions (#359)
ralphrass Jun 7, 2024
2a5a6e8
feat(MLOP-2236): add NTZ (#360)
ralphrass Jun 14, 2024
6363e03
fix: cassandra configs (#364)
ralphrass Jun 20, 2024
81c2c17
fix: Cassandra config keys (#366)
ralphrass Jun 28, 2024
b1949cd
fix: new type (#368)
ralphrass Jun 28, 2024
12d5e98
Delete .checklist.yaml (#371)
fernandrone Aug 16, 2024
35dd929
Add Delta support (#370)
ralphrass Aug 19, 2024
f6c5db6
Fix dup code (#373)
ralphrass Aug 21, 2024
11cc5d5
fix: performance improvements (#374)
ralphrass Sep 16, 2024
5f7028b
fix: version, format (#376)
ralphrass Oct 3, 2024
52d4911
fix: performance adjustments, migrate (#378)
ralphrass Oct 8, 2024
7f65873
chore: level (#382)
ralphrass Oct 10, 2024
ab551c0
feat(mlop-2456): add protection to host setting on cassandra_client (…
albjoaov Oct 11, 2024
a11a699
fix: rollback repartition (#386)
ralphrass Oct 14, 2024
b802f69
fix: move incremental filter (#388)
ralphrass Oct 18, 2024
51c4aed
Revert "fix: move incremental filter (#388)"
albjoaov Dec 4, 2024
5825634
Revert "fix: rollback repartition (#386)"
albjoaov Dec 4, 2024
dc1647b
Revert "chore: level (#382)"
albjoaov Dec 4, 2024
a6a6615
Revert "fix: performance adjustments, migrate (#378)"
albjoaov Dec 4, 2024
40f7766
Revert "fix: performance improvements (#374)"
albjoaov Dec 4, 2024
b7c7d48
fix(MLOP-2519): avoid configuring logger at lib level (#393)
lecardozo Jan 6, 2025
4e4293d
Merge branch 'staging' into stable-release
michellyrds Jan 6, 2025
5e6f416
fix: Rollback to latest stable release (#391)
michellyrds Jan 6, 2025
b84ac5f
pre-release 1.4.6
michellyrds Jan 6, 2025
9d0cd17
pre-release 1.4.6 (#394)
michellyrds Jan 6, 2025
8a3f703
release 1.4.6
michellyrds Jan 6, 2025
8ec18f0
Merge branch 'master' of github.com:quintoandar/butterfree into relea…
michellyrds Jan 6, 2025
6b1b6e7
chore: remove unnecessary files
michellyrds Jan 6, 2025
38509be
docs: make update-docs
michellyrds Jan 6, 2025
9b2b65d
fix: merge conflict with staging
michellyrds Jan 6, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 0 additions & 30 deletions .checklist.yaml

This file was deleted.

17 changes: 0 additions & 17 deletions .github/workflows/skip_lint.yml

This file was deleted.

6 changes: 5 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,11 @@ All notable changes to this project will be documented in this file.

Preferably use **Added**, **Changed**, **Removed** and **Fixed** topics in each release or unreleased log for a better organization.

## [Unreleased]
## [1.4.6](https://github.com/quintoandar/butterfree/releases/tag/1.4.6)

### Fixed
* [MLOP-2519] avoid configuring logger at lib level ([#393](https://github.com/quintoandar/butterfree/pull/393))
* Rollback to latest stable release ([#391](https://github.com/quintoandar/butterfree/pull/391))

## [1.4.5](https://github.com/quintoandar/butterfree/releases/tag/1.4.5)
* Rollback repartitions ([#386](https://github.com/quintoandar/butterfree/pull/386))
Expand Down
18 changes: 4 additions & 14 deletions butterfree/_cli/migrate.py
Original file line number Diff line number Diff line change
@@ -1,26 +1,26 @@
import datetime
import importlib
import inspect
import logging
import os
import pkgutil
import sys
from typing import Set, Type
from typing import Set

import boto3
import setuptools
import typer
from botocore.exceptions import ClientError

from butterfree.configs import environment
from butterfree.configs.logger import __logger
from butterfree.migrations.database_migration import ALLOWED_DATABASE
from butterfree.pipelines import FeatureSetPipeline

app = typer.Typer(
help="Apply the automatic migrations in a database.", no_args_is_help=True
)

logger = __logger("migrate", True)
logger = logging.getLogger(__name__)


def __find_modules(path: str) -> Set[str]:
Expand Down Expand Up @@ -90,18 +90,8 @@ def __fs_objects(path: str) -> Set[FeatureSetPipeline]:

instances.add(value)

def create_instance(cls: Type[FeatureSetPipeline]) -> FeatureSetPipeline:
sig = inspect.signature(cls.__init__)
parameters = sig.parameters

if "run_date" in parameters:
run_date = datetime.datetime.today().strftime("%Y-%m-%d")
return cls(run_date)

return cls()

logger.info("Creating instances...")
return set(create_instance(value) for value in instances) # type: ignore
return set(value() for value in instances) # type: ignore


PATH = typer.Argument(
Expand Down
6 changes: 6 additions & 0 deletions butterfree/clients/cassandra_client.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
"""CassandraClient entity."""

import logging
from ssl import CERT_REQUIRED, PROTOCOL_TLSv1
from typing import Dict, List, Optional, Union

Expand All @@ -23,6 +24,11 @@
EMPTY_STRING_HOST_ERROR = "The value of Cassandra host is empty. Please fill correctly with your endpoints" # noqa: E501
GENERIC_INVALID_HOST_ERROR = "The Cassandra host must be a valid string, a string that represents a list or list of strings" # noqa: E501

logger = logging.getLogger(__name__)

EMPTY_STRING_HOST_ERROR = "The value of Cassandra host is empty. Please fill correctly with your endpoints" # noqa: E501
GENERIC_INVALID_HOST_ERROR = "The Cassandra host must be a valid string, a string that represents a list or list of strings" # noqa: E501


class CassandraColumn(TypedDict):
"""Type for cassandra columns.
Expand Down
14 changes: 4 additions & 10 deletions butterfree/extract/source.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,6 @@
from typing import List, Optional

from pyspark.sql import DataFrame
from pyspark.storagelevel import StorageLevel

from butterfree.clients import SparkClient
from butterfree.extract.readers.reader import Reader
Expand Down Expand Up @@ -96,21 +95,16 @@ def construct(
DataFrame with the query result against all readers.

"""
# Step 1: Build temporary views for each reader
for reader in self.readers:
reader.build(client=client, start_date=start_date, end_date=end_date)
reader.build(
client=client, start_date=start_date, end_date=end_date
) # create temporary views for each reader

# Step 2: Execute SQL query on the combined readers
dataframe = client.sql(self.query)

# Step 3: Cache the dataframe if necessary, using memory and disk storage
if not dataframe.isStreaming and self.eager_evaluation:
# Persist to ensure the DataFrame is stored in mem and disk (if necessary)
dataframe.persist(StorageLevel.MEMORY_AND_DISK)
# Trigger the cache/persist operation by performing an action
dataframe.count()
dataframe.cache().count()

# Step 4: Run post-processing hooks on the dataframe
post_hook_df = self.run_post_hooks(dataframe)

return post_hook_df
5 changes: 3 additions & 2 deletions butterfree/load/writers/delta_writer.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,11 @@
import logging

from delta.tables import DeltaTable
from pyspark.sql.dataframe import DataFrame

from butterfree.clients import SparkClient
from butterfree.configs.logger import __logger

logger = __logger("delta_writer", True)
logger = logging.getLogger(__name__)


class DeltaWriter:
Expand Down
20 changes: 2 additions & 18 deletions butterfree/migrations/database_migration/cassandra_migration.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,9 +78,6 @@ def _get_alter_table_add_query(self, columns: List[Diff], table_name: str) -> st
def _get_alter_column_type_query(self, column: Diff, table_name: str) -> str:
"""Creates CQL statement to alter columns' types.

In Cassandra 3.4.x to 3.11.x alter type is not allowed.
This method creates a temp column to comply.

Args:
columns: list of Diff objects with ALTER_TYPE kind.
table_name: table name.
Expand All @@ -89,23 +86,10 @@ def _get_alter_column_type_query(self, column: Diff, table_name: str) -> str:
Alter column type query.

"""
temp_column_name = f"{column.column}_temp"

add_temp_column_query = (
f"ALTER TABLE {table_name} ADD {temp_column_name} {column.value};"
)
copy_data_to_temp_query = (
f"UPDATE {table_name} SET {temp_column_name} = {column.column};"
)

drop_old_column_query = f"ALTER TABLE {table_name} DROP {column.column};"
rename_temp_column_query = (
f"ALTER TABLE {table_name} RENAME {temp_column_name} TO {column.column};"
)
parsed_columns = self._get_parsed_columns([column])

return (
f"{add_temp_column_query} {copy_data_to_temp_query} "
f"{drop_old_column_query} {rename_temp_column_query};"
f"ALTER TABLE {table_name} ALTER {parsed_columns.replace(' ', ' TYPE ')};"
)

@staticmethod
Expand Down
Original file line number Diff line number Diff line change
@@ -1,16 +1,16 @@
"""Migration entity."""

import logging
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, List, Optional, Set

from butterfree.clients import AbstractClient
from butterfree.configs.logger import __logger
from butterfree.load.writers.writer import Writer
from butterfree.transform import FeatureSet

logger = __logger("database_migrate", True)
logger = logging.getLogger(__name__)


@dataclass
Expand Down
31 changes: 8 additions & 23 deletions butterfree/pipelines/feature_set_pipeline.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,6 @@

from typing import List, Optional

from pyspark.storagelevel import StorageLevel

from butterfree.clients import SparkClient
from butterfree.dataframe_service import repartition_sort_df
from butterfree.extract import Source
Expand Down Expand Up @@ -211,48 +209,35 @@ def run(
soon. Use only if strictly necessary.

"""
# Step 1: Construct input dataframe from the source.
dataframe = self.source.construct(
client=self.spark_client,
start_date=self.feature_set.define_start_date(start_date),
end_date=end_date,
)

# Step 2: Repartition and sort if required, avoid if not necessary.
if partition_by:
order_by = order_by or partition_by
current_partitions = dataframe.rdd.getNumPartitions()
optimal_partitions = num_processors or current_partitions
if current_partitions != optimal_partitions:
dataframe = repartition_sort_df(
dataframe, partition_by, order_by, num_processors
)

# Step 3: Construct the feature set dataframe using defined transformations.
transformed_dataframe = self.feature_set.construct(
dataframe = repartition_sort_df(
dataframe, partition_by, order_by, num_processors
)

dataframe = self.feature_set.construct(
dataframe=dataframe,
client=self.spark_client,
start_date=start_date,
end_date=end_date,
num_processors=num_processors,
)

if transformed_dataframe.storageLevel != StorageLevel(
False, False, False, False, 1
):
dataframe.unpersist() # Clear the data from the cache (disk and memory)

# Step 4: Load the data into the configured sink.
self.sink.flush(
dataframe=transformed_dataframe,
dataframe=dataframe,
feature_set=self.feature_set,
spark_client=self.spark_client,
)

# Step 5: Validate the output if not streaming and data volume is reasonable.
if not transformed_dataframe.isStreaming:
if not dataframe.isStreaming:
self.sink.validate(
dataframe=transformed_dataframe,
dataframe=dataframe,
feature_set=self.feature_set,
spark_client=self.spark_client,
)
Expand Down
Loading
Loading