
AWS Data Wrangler

Pandas on AWS

IMPORTANT NOTE: Version 1.0.0 coming soon with several breaking changes.

Please pin the version you are using in your environment.
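For example, a pinned entry in requirements.txt (the exact version below is only illustrative; pin whichever release you are currently running):

```
# requirements.txt -- illustrative pin; use the release you are on today
awswrangler==0.3.2
```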

AWS Data Wrangler is completing its first year, and the team is collecting feedback and feature requests to shape the 1.0.0 release. So far, three major changes are planned:

  • API redesign
  • Nested data types support
  • Deprecation of PySpark support
    • PySpark support takes up a considerable part of the development time, and this effort has not been reflected in user adoption: only 2 of our 66 GitHub issues are related to Spark.
    • In addition, the integration between PySpark and PyArrow/Pandas remains at an experimental stage, and it has been difficult to keep it stable.


Resources

Use Cases

PySpark

FROM | TO | Features
--- | --- | ---
PySpark DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes
PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog
Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays into child tables
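As a rough illustration of the first row (PySpark DataFrame -> Amazon Redshift) with the pre-1.0 Session-style API; the parameter names, staging path and connection handling below are assumptions for the sketch, not a verified signature:

```python
import awswrangler

# Assumed to exist already: an active SparkSession (`spark`), a PySpark
# DataFrame (`df`) and a Glue/Redshift connection identifier (`conn`).
session = awswrangler.Session(spark_session=spark)

# Load the DataFrame into Redshift. Behind the scenes the data is staged as
# parallel Parquet files on S3 (the path below is illustrative).
session.spark.to_redshift(
    dataframe=df,
    path="s3://my-bucket/redshift-staging/",
    connection=conn,          # assumed: connection resolved elsewhere
    schema="public",
    table="my_table",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
    mode="append",            # append / overwrite / upsert
)
```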

Pandas

FROM | TO | Features
--- | --- | ---
Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes, KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto)
Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Fixed-width formatted, Partitions, Parallelism, KMS Encryption, Multiple files
Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines: ctas_approach=False -> Batching, suited to restricted-memory environments; ctas_approach=True -> Blazing fast, parallelism and enhanced data types
Pandas DataFrame | Amazon Redshift | Blazing fast, using parallel Parquet on S3 behind the scenes. Append/Overwrite/Upsert modes
Amazon Redshift | Pandas DataFrame | Blazing fast, using parallel Parquet on S3 behind the scenes
Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL. Blazing fast, using parallel CSV on S3 behind the scenes. Append/Overwrite modes
Amazon Aurora | Pandas DataFrame | Supported engines: MySQL. Blazing fast, using parallel CSV on S3 behind the scenes
CloudWatch Logs Insights | Pandas DataFrame | Query results
Glue Catalog | Pandas DataFrame | List and get table details. Good fit with Jupyter Notebooks.
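A rough sketch of the two most common rows above (Pandas DataFrame -> Amazon S3 with Glue metadata, then Amazon Athena -> Pandas DataFrame), again using the pre-1.0 Session-style API; function and parameter names are assumptions for illustration and may differ between releases:

```python
import pandas as pd
import awswrangler

session = awswrangler.Session()
df = pd.DataFrame({"id": [1, 2, 3], "year": [2019, 2019, 2020]})

# Write partitioned Parquet files to S3 and register the metadata in the
# Glue Catalog, making the table queryable from Athena/Spectrum/Spark/Hive/Presto.
session.pandas.to_parquet(
    dataframe=df,
    database="my_database",            # assumed: Glue database already exists
    path="s3://my-bucket/my_table/",   # illustrative path
    partition_cols=["year"],
    mode="overwrite",
)

# Read the data back through Athena into a Pandas DataFrame.
df2 = session.pandas.read_sql_athena(
    sql="SELECT * FROM my_table",
    database="my_database",
)
```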

General

Feature | Details
--- | ---
List S3 objects | e.g. wr.s3.list_objects("s3://...")
Delete S3 objects | Parallel
Delete listed S3 objects | Parallel
Delete NOT listed S3 objects | Parallel
Copy listed S3 objects | Parallel
Get the size of S3 objects | Parallel
Get CloudWatch Logs Insights query results |
Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE"
Create EMR cluster | "For humans"
Terminate EMR cluster | "For humans"
Get EMR cluster state | "For humans"
Submit EMR step(s) | "For humans"
Get EMR step state | "For humans"
Query Athena to receive python primitives | Returns Iterable[Dict[str, Any]]
Load and unzip SageMaker job outputs |
Dump Amazon Redshift as Parquet files on S3 |
Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine
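The wr.s3.list_objects call in the table above can be used directly; the rest of this sketch (the alias import, the illustrative prefix and the commented-out delete helper) is an assumption about the matching helpers in the same release:

```python
import awswrangler as wr

# List every object under a prefix (example taken from the feature table above).
keys = wr.s3.list_objects("s3://my-bucket/my-prefix/")
print(f"Found {len(keys)} objects")

# A parallel delete of the listed objects would look roughly like this;
# the helper name and parameter below are assumptions, not a verified signature.
# wr.s3.delete_listed_objects(objects_paths=keys)
```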
