Completed

Refactor ProxyStorage-Environment overall
Make write operations respect current model schema
Improve logging.
Come up with a new pattern for the storage uri + read df + write df abstraction; a mixin would probably do, since most SQL databases will share the same mixin
Fix how uri is generated
Make it so we can store newly generated data in different storages
Investigate whether the ModelBase metaclass functions columns, calculate_data, exists, ~~ensure_exists~~, etc. should be moved out of the metaclass
Allow for a model df to be created from a dictionary
Improve class Meta implementation.
Make it so we can store newly generated data in different FORMATS
Create new constructor to build a model from a dictionary
Raise exception when no Meta is present.
Allow for inheritance where fields and Meta options are gotten from bases.
Improve Model.save api: allow always for table_name override.
Improve Model.save api: allow always for environment (as string) override.
Add mysql support
Add column_name option to Columns
Add data types enforcement / casting from the Columns
Add more fields types
Change everything to columns (We now use the words fields/columns in several parts of the code)
Add more options to Columns depending on the column Type, like 'unique', 'column_name'
Make read/write operations to respect Fields options, like 'column_name'
Create factory for models (like factory-boy + faker)
Make factory be able to create dataframe as well
Give the factory the ability to use multiprocessing to create large amounts of data quickly
Change _from_dict constructor into something more generic such as _from_data
Ability to create dummy data from a given Model (maybe implement with factory-boy + faker)
[Bug] When you create a df with from_dict and access model.df twice, the second access raises an error (not a bug, it's expected behaviour)
Make Model.from_dict constructor raise exception when None is passed
Add auto_id_add
Perhaps recalculate = 'Always', 'Once', 'If not storage data' where 'if not storage data' is default could be interesting for entities flow
In storages, rename the 'environment' init param to 'environment_name' for better clarity
Model: 'missing_columns' could accept a case like df.columns: ['col1', 'col2', 'col3'], model: ['col1', 'col2'], because sometimes the source data has too many columns and we just don't want to write them all.
Remove options in Columns such as 'unique'. I tried it in an ETL and did not like working with it at all.
Make columns importable from datasaurus.core.models
Unit test model inheritance.
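
The 'missing_columns' item above can be read as "extra source columns are tolerated, only model columns absent from the source count as missing". A minimal sketch of that rule follows, using plain lists of column names. The two helper functions are hypothetical illustrations and not the datasaurus API.

```python
from typing import List, Sequence


def missing_columns(df_columns: Sequence[str], model_columns: Sequence[str]) -> List[str]:
    """Return the model columns that are absent from the source data.

    Hypothetical helper: extra source columns are ignored on purpose.
    """
    present = set(df_columns)
    return [column for column in model_columns if column not in present]


def columns_to_write(df_columns: Sequence[str], model_columns: Sequence[str]) -> List[str]:
    """Return only the source columns that the model declares, in model order."""
    present = set(df_columns)
    return [column for column in model_columns if column in present]


# The case from the todo item: the source has one more column than the model.
missing = missing_columns(["col1", "col2", "col3"], ["col1", "col2"])
to_write = columns_to_write(["col1", "col2", "col3"], ["col1", "col2"])
```

With this rule, 'col3' is simply never written, while a source lacking 'col2' would still be reported as missing a column.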

Models

Come up with a pattern for 1:n and n:n model transformations.
Cannot write from one mixin to another (create df from one storage and save it to another if it's of different type)

Columns

Add validations.

Storages + IO

Change string SQL queries to something that builds SQL queries safely.
Add support for ndjson format
Investigate and choose the default mysql/mariadb driver https://docs.sqlalchemy.org/en/20/dialects/mysql.html (mysqlclient can be used to read but not to write).
Add support to modify read/write options, like Models.with_options(environment=whatever).save() (that would override what is defined in Meta)
Add support to read/write compressed files like gz, zip, etc.
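
As an illustration of the "build SQL queries safely" item above, here is a minimal sketch of a table-existence check that passes the table name as a bound parameter instead of formatting it into the query string. It uses the standard library sqlite3 module only to stay self-contained. The helper name table_exists is hypothetical, and the real storages may use a different driver or query builder.

```python
import sqlite3


def table_exists(conn: sqlite3.Connection, table_name: str) -> bool:
    """Hypothetical helper: check for a table using a bound parameter."""
    cursor = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    )
    return cursor.fetchone() is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

found = table_exists(conn, "users")
# A hostile value is compared as plain data and never executed as SQL.
injected = table_exists(conn, "users'; DROP TABLE users; --")
still_there = table_exists(conn, "users")
```

Bound parameters only cover values. Identifiers such as table names used inside FROM or INSERT clauses would still need quoting or a whitelist.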

General Stuff

Create custom exceptions, we are currently using Exception and ValueError in many places.
Add CI pipeline with unittests and linting.
Investigate a pattern for ETL/Pipeline creation
Investigate how we are going to manage dependencies between models
Add more debugging logs
Add header to debug logs, ex:
datasaurus - DEBUG - [ModelFactory] Execution strategy will be PythonMultiprocessing(processes=24)
datasaurus - DEBUG - [ModelIO] Trying to read Model from storage {storageinfo}
datasaurus - DEBUG - [StorageIO] Executing SQL {...} to see if table exists
Refactor TransformationMetaOptions with ModelMetaOptions, anything we can elegantly abstract?
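
A rough sketch of how the custom exceptions and the log header items above could fit together, using only the standard logging module. The names DatasaurusError, MissingMetaError and ComponentAdapter are assumptions made for illustration and do not come from the codebase.

```python
import io
import logging


class DatasaurusError(Exception):
    """Hypothetical base class for all library errors."""


class MissingMetaError(DatasaurusError):
    """Hypothetical error for a model declared without a Meta class."""


class ComponentAdapter(logging.LoggerAdapter):
    """Prefix every message with the component name, e.g. [ModelIO]."""

    def process(self, msg, kwargs):
        return f"[{self.extra['component']}] {msg}", kwargs


# Write to an in-memory stream so the example has no side effects.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

base_logger = logging.getLogger("datasaurus")
base_logger.setLevel(logging.DEBUG)
base_logger.addHandler(handler)
base_logger.propagate = False

log = ComponentAdapter(base_logger, {"component": "ModelIO"})
log.debug("Trying to read Model from storage")

try:
    raise MissingMetaError("Model 'Profile' has no Meta class")
except DatasaurusError as exc:
    # Catching the base class covers every library-specific error.
    log.error(str(exc))

output = stream.getvalue()
```

A shared base class lets callers catch library errors without also swallowing unrelated Exception or ValueError instances.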

Things where unittests are missing

Add unit tests for enforce_dtype
Unit test dataframe creation (1)
Unit tests for factory (once the three above are done)

Nice things to have

Support Azure blob storage read/write
Support S3 storage read/write
Create a utility that generates Model code from inferred data, like Django's inspectdb
Unit-test features like chispa
Automatic data lineage from introspection of model inheritance, simple model calculate_data and the transformation pattern.

For the future

Support delta tables
Fully support streaming
Validations and transformations with 2 different patterns:
1. We support inline basic validations/transformations like: even_field = forms.IntegerField(validators=[validate_even])
2. Like https://django-filter.readthedocs.io/en/stable/guide/usage.html where rules are outside the model, in a different class and linked in the Model's Meta.
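
To make the two validation patterns concrete, here is a minimal plain-Python sketch. Every name in it (IntegerColumn, NumberRules, rules_class, collect_errors) is hypothetical and chosen only for illustration, since no validation API exists yet.

```python
from typing import Callable, Dict, List


def validate_even(value: int) -> None:
    if value % 2 != 0:
        raise ValueError(f"{value} is not an even number")


def validate_positive(value: int) -> None:
    if value <= 0:
        raise ValueError(f"{value} is not a positive number")


class IntegerColumn:
    """Hypothetical column carrying inline validators (pattern 1)."""

    def __init__(self, validators: List[Callable[[int], None]] = None):
        self.validators = list(validators or [])


class NumberRules:
    """Hypothetical rule set kept outside the model (pattern 2)."""

    rules: Dict[str, List[Callable[[int], None]]] = {
        "other_field": [validate_positive],
    }


class NumberModel:
    even_field = IntegerColumn(validators=[validate_even])
    other_field = IntegerColumn()

    class Meta:
        rules_class = NumberRules  # link to the external rules


def collect_errors(model, row: Dict[str, int]) -> List[str]:
    """Run inline validators first, then the external rules, per column."""
    errors = []
    for name, value in row.items():
        inline = getattr(model, name).validators
        external = model.Meta.rules_class.rules.get(name, [])
        for validator in [*inline, *external]:
            try:
                validator(value)
            except ValueError as exc:
                errors.append(f"{name}: {exc}")
    return errors


errors = collect_errors(NumberModel, {"even_field": 3, "other_field": -1})
```

Keeping the external rules in their own class means the same model can be paired with different rule sets without touching its column definitions.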
