Releases: MobileTeleSystems/onetl
0.7.2 (2023-05-24)
Dependencies
- Limited `typing-extensions` version.

  The `typing-extensions==4.6.0` release contains some breaking changes, causing errors like:

  ```
  Traceback (most recent call last):
    File "/Users/project/lib/python3.9/typing.py", line 852, in __subclasscheck__
      return issubclass(cls, self.__origin__)
  TypeError: issubclass() arg 1 must be a class
  ```

  `typing-extensions==4.6.1` was causing another error:

  ```
  Traceback (most recent call last):
    File "/home/maxim/Repo/typing_extensions/1.py", line 33, in <module>
      isinstance(file, ContainsException)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 599, in __instancecheck__
      if super().__instancecheck__(instance):
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 139, in __instancecheck__
      return _abc_instancecheck(cls, instance)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 583, in __subclasscheck__
      return super().__subclasscheck__(other)
    File "/home/maxim/.pyenv/versions/3.7.8/lib/python3.7/abc.py", line 143, in __subclasscheck__
      return _abc_subclasscheck(cls, subclass)
    File "/home/maxim/Repo/typing_extensions/src/typing_extensions.py", line 661, in _proto_hook
      and other._is_protocol
  AttributeError: type object 'PathWithFailure' has no attribute '_is_protocol'
  ```

  We have pinned `typing-extensions<4.6` in the requirements until the compatibility issues are fixed.
Full Changelog: 0.7.1...0.7.2
0.7.1 (2023-05-23)
Bug Fixes
- Fixed `setup_logging` function.

  In onETL==0.7.0 calling `onetl.log.setup_logging()` broke the logging:

  ```
  Traceback (most recent call last):
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 434, in format
      return self._format(record)
    File "/opt/anaconda/envs/py39/lib/python3.9/logging/__init__.py", line 430, in _format
      return self._fmt % record.__dict__
  KeyError: 'levelname:8s'
  ```
- Fixed installation examples.

  The onETL==0.7.0 documentation contained examples of installing onETL with extras like:

  ```
  pip install onetl[files, kerberos, spark]
  ```

  But pip fails to install such a package:

  ```
  ERROR: Invalid requirement: 'onetl[files,'
  ```

  This is caused by the spaces in the extras clause. Fixed:

  ```
  pip install onetl[files,kerberos,spark]
  ```
Full Changelog: 0.7.0...0.7.1
0.7.0 (2023-05-15)
🎉 onETL is now open source 🎉
That was a long road, but we finally did it!
Breaking Changes
- Changed installation method.

  TL;DR: what should I change to restore the previous behavior?

  Simple way:

  | onETL < 0.7.0 | onETL >= 0.7.0 |
  |---------------|----------------|
  | `pip install onetl` | `pip install onetl[files,kerberos]` |

  The right way: enumerate the connectors which should be installed:

  ```
  pip install onetl[hdfs,ftp,kerberos]  # except DB connections
  ```

  Details

  In onetl<0.7 the package installation looked like:

  ```
  pip install onetl
  ```

  But this pulled in all dependencies for all connectors, even if the user does not use them. This caused some issues; for example, users had to install Kerberos libraries just to be able to install onETL, even if they only use S3 (which does not need Kerberos support).

  Since 0.7.0 the installation process has changed:

  ```
  pip install onetl  # minimal installation, only the onETL core
  # there are no extras for DB connections, because they use Java packages which are installed at runtime

  pip install onetl[ftp,ftps,hdfs,sftp,s3,webdav]  # install dependencies for the specified file connections
  pip install onetl[files]  # install dependencies for all file connections
  pip install onetl[kerberos]  # Kerberos auth support
  pip install onetl[spark]  # install PySpark to use DB connections
  pip install onetl[spark,kerberos,files]  # all file connections + Kerberos + PySpark
  pip install onetl[all]  # alias for the previous case
  ```
  There are corresponding documentation items for each extra.

  onETL also checks whether some requirements are missing, and raises an exception with a recommendation on how to install them:

  ```
  Cannot import module "pyspark".

  Since onETL v0.7.0 you should install package as follows:
      pip install onetl[spark]

  or inject PySpark to sys.path in some other way BEFORE creating MongoDB instance.
  ```

  ```
  Cannot import module "ftputil".

  Since onETL v0.7.0 you should install package as follows:
      pip install onetl[ftp]
  or
      pip install onetl[files]
  ```
- Added new `cluster` argument to `Hive` and `HDFS` connections.

  The `Hive` qualified name (used in HWM) contains a cluster name, but in onETL<0.7.0 the cluster name was a hard-coded value `rnd-dwh`, which was not OK for some users.

  The `HDFS` connection qualified name contains a host (the active namenode of the Hadoop cluster), but its value can change over time, leading to the creation of a new HWM.

  Since onETL 0.7.0 both `Hive` and `HDFS` connections have a `cluster` attribute which can be set to a specific cluster name. For `Hive` it is mandatory, for `HDFS` it can be omitted (the host is used as a fallback).

  But passing the cluster name every time could lead to errors. Now `Hive` and `HDFS` have a nested class named `slots` with the following methods:

  - `normalize_cluster_name`
  - `get_known_clusters`
  - `get_current_cluster`
  - `normalize_namenode_host` (only `HDFS`)
  - `get_cluster_namenodes` (only `HDFS`)
  - `get_webhdfs_port` (only `HDFS`)
  - `is_namenode_active` (only `HDFS`)

  There is also a new method `HDFS.get_current` / `Hive.get_current`.

  Developers can implement hooks validating user input or substituting values for automatic cluster detection. This should improve the user experience while using these connectors.

  See the slots documentation.
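  A minimal sketch of such a hook, assuming the decorator-based registration described in the slots documentation (the exact attribute name `Hive.slots` and the location of the `@hook` decorator are assumptions here, not taken verbatim from this release note):

  ```python
  from onetl.connection import Hive
  from onetl.hooks import hook  # assumed location of the hook decorator


  # Register a hook for the normalize_cluster_name slot, so users may pass
  # cluster names in any form and still get a single canonical value.
  @Hive.slots.normalize_cluster_name.bind
  @hook
  def normalize_cluster_name(cluster: str) -> str:
      return cluster.lower().replace("_", "-")
  ```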
- Updated JDBC connection drivers.

  - Greenplum `2.1.3` → `2.1.4`.
  - MSSQL `10.2.1.jre8` → `12.2.0.jre8`. The minimal supported version of MSSQL is now 2014 instead of 2012.
  - MySQL `8.0.30` → `8.0.33`:
    - Package was renamed `mysql:mysql-connector-java` → `com.mysql:mysql-connector-j`.
    - Driver class was renamed `com.mysql.jdbc.Driver` → `com.mysql.cj.jdbc.Driver`.
  - Oracle `21.6.0.0.1` → `23.2.0.0`.
  - Postgres `42.4.0` → `42.6.0`.
  - Teradata `17.20.00.08` → `17.20.00.15`:
    - Package was renamed `com.teradata.jdbc:terajdbc4` → `com.teradata.jdbc:terajdbc`.
    - Teradata driver is now published to Maven.

  See #31.
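  If you build the Spark session yourself and reference the MySQL driver directly, the renamed coordinates look like this (a hedged sketch; the builder options are illustrative and the version pin simply matches this release):

  ```python
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("onetl-mysql")
      # renamed artifact: com.mysql:mysql-connector-j (was mysql:mysql-connector-java)
      .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.0.33")
      .getOrCreate()
  )
  ```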
Features
- Added MongoDB connection.

  It uses the official MongoDB connector for Spark v10. Only Spark 3.2+ is supported.

  There are some differences between MongoDB and other database sources:

  - Instead of a `mongodb.sql` method there is `mongodb.pipeline`.
  - There are no `mongodb.fetch` and `mongodb.execute` methods.
  - `DBReader.hint` and `DBReader.where` have different types than in SQL databases:

    ```python
    where = {
        "col1": {
            "$eq": 10,
        },
    }

    hint = {
        "col1": 1,
    }
    ```

  - Because MongoDB collections do not have a fixed schema, but Spark cannot create a dataframe with a dynamic schema, a new option `DBReader.df_schema` was introduced. It is mandatory for MongoDB, but optional for other sources.
  - Currently DBReader cannot be used with MongoDB and an HWM expression, e.g. `hwm_column=("mycolumn", {"$cast": {"col1": "date"}})`.

  Because there are no tables in MongoDB, some options were renamed in the core classes:

  - `DBReader(table=...)` → `DBReader(source=...)`
  - `DBWriter(table=...)` → `DBWriter(target=...)`

  Old names can be used too, they are not deprecated (#30). A combined read example is sketched below.
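  Putting these pieces together, a hedged sketch of a MongoDB read (the connection parameters and the `onetl.core` import path are assumptions for illustration; `source`, `df_schema`, `where` and `hint` are the options described above):

  ```python
  # A sketch only, not copied from the documentation.
  from pyspark.sql import SparkSession
  from pyspark.sql.types import IntegerType, StringType, StructField, StructType

  from onetl.connection import MongoDB
  from onetl.core import DBReader  # import path assumed for onETL 0.7.x

  # Spark session must include the MongoDB connector package.
  spark = (
      SparkSession.builder
      .config("spark.jars.packages", MongoDB.package_spark_3_2)
      .getOrCreate()
  )

  mongodb = MongoDB(
      host="mongodb.domain.com",  # connection options here are illustrative
      user="someuser",
      password="***",
      database="somedb",
      spark=spark,
  )

  reader = DBReader(
      connection=mongodb,
      source="some_collection",  # renamed from table=...
      df_schema=StructType(  # mandatory for MongoDB
          [
              StructField("_id", StringType()),
              StructField("col1", IntegerType()),
          ]
      ),
      where={"col1": {"$eq": 10}},  # MongoDB query document instead of an SQL clause
      hint={"col1": 1},  # index hint document
  )

  df = reader.run()
  ```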
- Added an option to disable some plugins during import.

  Previously, if some plugin was failing during import, the only way to import onETL was to disable all plugins using an environment variable.

  Now there are several variables with different behavior:

  - `ONETL_PLUGINS_ENABLED=false` - disable plugins autoimport entirely. Previously this variable was named `ONETL_ENABLE_PLUGINS`.
  - `ONETL_PLUGINS_BLACKLIST=plugin-name,another-plugin` - set the list of plugins which should NOT be imported automatically.
  - `ONETL_PLUGINS_WHITELIST=plugin-name,another-plugin` - set the list of plugins which are the ONLY ones imported automatically.

  We also improved the exception message with a recommendation on how to disable a failing plugin:

  ```
  Error while importing plugin 'mtspark' from package 'mtspark' v4.0.0.

  Statement:
      import mtspark.onetl

  Check if plugin is compatible with current onETL version 0.7.0.

  You can disable loading this plugin by setting environment variable:
      ONETL_PLUGINS_BLACKLIST='mtspark,failing-plugin'

  You can also define a whitelist of packages which can be loaded by onETL:
      ONETL_PLUGINS_WHITELIST='not-failing-plugin1,not-failing-plugin2'

  Please take into account that plugin name may differ from package or module name.
  See package metadata for more details
  ```
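  A hedged sketch of using the blacklist from Python, assuming plugins are auto-imported when `onetl` itself is imported (so the variable has to be set earlier, e.g. in the shell or at the very top of the script):

  ```python
  import os

  # Assumption: plugin autoimport happens during "import onetl", so set the variable first.
  os.environ["ONETL_PLUGINS_BLACKLIST"] = "failing-plugin"

  import onetl  # noqa: E402  # plugins listed above are skipped during autoimport
  ```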
Improvements
- Added compatibility with Python 3.11 and PySpark 3.4.0.

  File connections were OK, but `jdbc.fetch` and `jdbc.execute` were failing. Fixed in #28.
- Added check for missing Java packages.

  Previously, if a DB connection tried to use some Java class which was not loaded into the Spark session, it raised an exception with a long Java stacktrace. Most users failed to interpret this trace.

  Now onETL shows the following error message:

  ```
  |Spark| Cannot import Java class 'com.mongodb.spark.sql.connector.MongoTableProvider'.

  It looks like you've created Spark session without this option:
      SparkSession.builder.config("spark.jars.packages", MongoDB.package_spark_3_2)

  Please call `spark.stop()`, restart the interpreter, and then create new SparkSession with proper options.
  ```
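  A hedged sketch of the session setup this message asks for (`MongoDB.package_spark_3_2` is the attribute named in the message above; the other builder options are illustrative):

  ```python
  from pyspark.sql import SparkSession

  from onetl.connection import MongoDB

  spark = (
      SparkSession.builder
      .appName("onetl-mongodb")
      .config("spark.jars.packages", MongoDB.package_spark_3_2)
      .getOrCreate()
  )
  ```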
- Documentation improvements.

  - Changed the documentation site theme: furo instead of the default ReadTheDocs theme. The new theme supports wide screens and dark mode. See #10.
  - Each connection class now has a compatibility table for Spark + Java + Python.
  - Added a global compatibility table for Spark + Java + Python + Scala.
Bug Fixes
- Fixed several SFTP issues.

  - If the SSH config file `~/.ssh/config` contains options not recognized by Paramiko (unknown syntax, unknown option name), previous versions raised an exception until the file was fixed or removed. Since 0.7.0 the exception is replaced with a warning.
  - If the user passed `host_key_check=False` but the server changed its SSH keys, previous versions raised an exception until the new key was accepted. Since 0.7.0 the exception is replaced with a warning if the option value is `False`.

  Fixed in #19.
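  A hedged sketch of the option in question (parameter names other than `host_key_check` - host, port, user, password - are assumptions for illustration):

  ```python
  from onetl.connection import SFTP

  sftp = SFTP(
      host="sftp.domain.com",
      port=22,
      user="someuser",
      password="***",
      host_key_check=False,  # since 0.7.0 a changed server key produces a warning instead of an exception
  )
  ```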
- Fixed several S3 issues.

  There was a bug in the S3 connection which prevented handling files in the root of a bucket: they were invisible to the connector. Fixed in #29.
Full Changelog: 0.6.4...0.7.0