- Refactor ProxyStorage-Environment overall
- Make write operations respect current model schema
- Improve logging.
- Come up with a new pattern for the storage uri + read df + write df abstraction; a mixin would probably do, since most SQL databases will share the same mixin
- Fix how uri is generated
- Make it so we can store newly generated data in different storages
- Investigate whether the ModelBase metaclass' functions (columns, calculate_data, exists, ensure_exists, etc.) should be out of the metaclass
- Allow for a model df to be created from a dictionary
- Improve class Meta implementation.
- Make it so we can store newly generated data in different FORMATS
- Create new constructor to build a model from a dictionary
- Raise exception when no Meta is present.
- Allow for inheritance where fields and Meta options are inherited from bases.
- Improve Model.save api: always allow a table_name override.
- Improve Model.save api: always allow an environment (as string) override.
- Add mysql support
- Add column_name option to Columns
- Add data types enforcement / casting from the Columns
- Add more fields types
- Change everything to columns (We now use the words fields/columns in several parts of the code)
- Add more options to Columns depending on the column Type, like 'unique', 'column_name'
- Make read/write operations to respect Fields options, like 'column_name'
- Create factory for models (like factory-boy + faker)
- Make factory be able to create dataframe as well
- Give the factory the ability to use multiprocessing to create large amounts of data quickly
- Change _from_dict constructor into something more generic such as _from_data
- Ability to create dummy data from a given Model (maybe implement with factory-boy + faker)
- [Bug] When you create a df with from_dict, if you access model.df twice, the second access raises an error (not a bug, it's expected behaviour)
- Make Model.from_dict constructor raise exception when None is passed
- Add auto_id_add
- Perhaps a recalculate option with values 'Always', 'Once', 'If not storage data', where 'If not storage data' is the default, could be interesting for entities flow
- In storages, rename the 'environment' init param to 'environment_name' for better clarity
- Model: 'missing_columns' could accept columns like this: df.columns: ['col1', 'col2', 'col3'], models: ['col1', 'col2'], because there will be times when the source data has too many columns and we just don't want to write them all.
- Remove options in Columns such as 'unique'; I tried it in an ETL and did not like working with it at all.
- Make columns importable from datasaurus.core.models
- Unit test model inheritance.
- Come up with a pattern for 1:n and n:n model transformations.
- Cannot write from one mixin to another (create df from one storage and save it to another if it's of different type)
- Add validations.
- Change string SQL queries to something that builds SQL queries safely.
- Add support for ndjson format
- Investigate and choose the default mysql/mariadb driver https://docs.sqlalchemy.org/en/20/dialects/mysql.html (mysqlclient can be used to read but not to write).
- Add support for modifying read/write options, like Models.with_options(environment=whatever).save() (That'd modify the read source that's defined in Meta)
- Add support for reading/writing compressed files like gz/zip, etc.
- Create custom exceptions; we are currently using Exception and ValueError in many places.
- Add CI pipeline with unittests and linting.
- Investigate a pattern for ETL/Pipeline creation
- Investigate how we are going to manage dependencies between models
- Add more debugging logs
- Add header to debug logs, ex:
  - datasaurus - DEBUG - [ModelFactory] Execution strategy will be PythonMultiprocessing(processes=24)
  - datasaurus - DEBUG - [ModelIO] Trying to read Model from storage {storageinfo}
  - datasaurus - DEBUG - [StorageIO] Executing SQL {...} to see if table exists
- Refactor TransformationMetaOptions with ModelMetaOptions, anything we can elegantly abstract?
- Add unit tests for enforce_dtype
- Unit test dataframe creation (1)
- Unit tests for factory (once the three above are done)
- Support Azure blob storage read/write
- Support S3 storage read/write
- Create utility that can give you Model code from inferred data like django inspectdb
- Unit test features like chispa
- Automatic data lineage from introspection of model inheritance, simple model calculate_data and the transformation pattern.
- Support delta tables
- Fully support streaming
- Validations and transformations with 2 different patterns:
  1. We support inline basic validations/transformations like: even_field = forms.IntegerField(validators=[validate_even])
  2. Like https://django-filter.readthedocs.io/en/stable/guide/usage.html where rules are outside the model, in a different class and linked in the Model's Meta.
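
For the debug log header item above, one possible approach is sketched here using only the standard library `logging` module. The names `ComponentAdapter`, `get_component_logger` and `LOG_FORMAT` are hypothetical and are not part of the current codebase; the only thing taken from the list is the target line shape `datasaurus - DEBUG - [Component] message`.

```python
import logging

# Assumed format: produces "datasaurus - DEBUG - <message>".
LOG_FORMAT = "%(name)s - %(levelname)s - %(message)s"


class ComponentAdapter(logging.LoggerAdapter):
    """Hypothetical adapter that prefixes messages with e.g. [ModelIO]."""

    def process(self, msg, kwargs):
        # self.extra is the dict passed to the adapter's constructor.
        return f"[{self.extra['component']}] {msg}", kwargs


def get_component_logger(component, stream=None):
    """Return a logger whose messages carry a [component] header."""
    logger = logging.getLogger("datasaurus")
    logger.setLevel(logging.DEBUG)
    if not logger.handlers:
        # Attach a default handler only once (stderr when stream is None).
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    return ComponentAdapter(logger, {"component": component})


if __name__ == "__main__":
    log = get_component_logger("ModelIO")
    log.debug("Trying to read Model from storage")
```

An adapter keeps the header out of every call site: each class asks for its own logger once and then logs as usual.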