Pandas on AWS
Please pin the version you are using in your environment.
AWS Data Wrangler is turning one year old, and the team is collecting feedback and feature requests for version 1.0.0. So far, three major changes are planned:
- API redesign
- Nested data types support
- Deprecation of PySpark support
  - PySpark support takes up a considerable part of the development time, and that effort has not been reflected in user adoption. Only 2 of our 66 issues on GitHub are related to Spark.
  - In addition, the integration between PySpark and PyArrow/Pandas remains experimental, and we have had a hard time keeping it stable.
FROM | TO | Features |
---|---|---|
PySpark DataFrame | Amazon Redshift | Blazing fast using parallel parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
PySpark DataFrame | Glue Catalog | Register Parquet or CSV DataFrame on Glue Catalog |
Nested PySpark DataFrame | Flat PySpark DataFrames | Flatten structs and break up arrays in child tables |
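
The "flatten structs and break up arrays in child tables" idea can be illustrated without Spark. The following is a minimal sketch on plain Python dicts, not the library's implementation: the helper names (`flatten_row`, `flatten_table`), the `root` table naming, and the `pos` column are all assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def flatten_row(row: Row, prefix: str = "") -> Tuple[Row, Dict[str, List[Any]]]:
    """Split one nested record into flat scalar columns and array columns."""
    flat: Row = {}
    arrays: Dict[str, List[Any]] = {}
    for key, value in row.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Struct: recurse and join the field names with underscores.
            sub_flat, sub_arrays = flatten_row(value, prefix=f"{name}_")
            flat.update(sub_flat)
            arrays.update(sub_arrays)
        elif isinstance(value, list):
            # Array: set aside, it becomes rows of a child table.
            arrays[name] = value
        else:
            flat[name] = value
    return flat, arrays


def flatten_table(rows: List[Row], key: str) -> Dict[str, List[Row]]:
    """Return a 'root' table plus one child table per array column.

    `key` must be a top-level scalar column; it links child rows to the root.
    Array elements are copied as-is (nested elements are not flattened further).
    """
    tables: Dict[str, List[Row]] = {"root": []}
    for row in rows:
        flat, arrays = flatten_row(row)
        tables["root"].append(flat)
        for name, items in arrays.items():
            child = tables.setdefault(f"root_{name}", [])
            for pos, item in enumerate(items):
                child.append({key: flat[key], "pos": pos, name: item})
    return tables


# Hypothetical sample data.
orders = [
    {"id": 1, "customer": {"name": "Ana", "address": {"city": "Recife"}}, "items": ["pen", "ink"]},
    {"id": 2, "customer": {"name": "Bo", "address": {"city": "Oslo"}}, "items": []},
]
tables = flatten_table(orders, key="id")
```

Here `tables["root"]` holds one flat row per order (with columns such as `customer_address_city`), and `tables["root_items"]` holds one row per array element, keyed back to the parent by `id`.
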
FROM | TO | Features |
---|---|---|
Pandas DataFrame | Amazon S3 | Parquet, CSV, Partitions, Parallelism, Overwrite/Append/Partitions-Upsert modes, KMS Encryption, Glue Metadata (Athena, Spectrum, Spark, Hive, Presto) |
Amazon S3 | Pandas DataFrame | Parquet (Pushdown filters), CSV, Fixed-width formatted, Partitions, Parallelism, KMS Encryption, Multiple files |
Amazon Athena | Pandas DataFrame | Workgroups, S3 output path, Encryption, and two different engines: ctas_approach=False -> batching, for memory-restricted environments; ctas_approach=True -> blazing fast, parallelism and enhanced data types |
Pandas DataFrame | Amazon Redshift | Blazing fast using parallel parquet on S3 behind the scenes. Append/Overwrite/Upsert modes |
Amazon Redshift | Pandas DataFrame | Blazing fast using parallel parquet on S3 behind the scenes |
Pandas DataFrame | Amazon Aurora | Supported engines: MySQL, PostgreSQL. Blazing fast using parallel CSV on S3 behind the scenes. Append/Overwrite modes |
Amazon Aurora | Pandas DataFrame | Supported engines: MySQL. Blazing fast using parallel CSV on S3 behind the scenes |
CloudWatch Logs Insights | Pandas DataFrame | Query results |
Glue Catalog | Pandas DataFrame | List tables and get table details. Good fit for Jupyter Notebooks. |
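
Partitioned datasets on S3 are conventionally laid out Hive-style, with one `column=value` path segment per partition column, which is the layout Athena, Glue, Spark, Hive and Presto understand. The sketch below shows only that path convention on plain dicts. It assumes this layout applies here; the bucket name and helper names are hypothetical, and nothing is written to S3.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def partition_prefix(base_path: str, row: Row, partition_cols: Sequence[str]) -> str:
    """Build the Hive-style prefix for one row, e.g. .../year=2019/month=12/."""
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return "/".join([base_path.rstrip("/"), *parts]) + "/"


def group_by_partition(
    base_path: str, rows: List[Row], partition_cols: Sequence[str]
) -> Dict[str, List[Row]]:
    """Group rows by target prefix; partition columns live in the path, not the file."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        prefix = partition_prefix(base_path, row, partition_cols)
        data = {k: v for k, v in row.items() if k not in partition_cols}
        groups[prefix].append(data)
    return dict(groups)


# Hypothetical sample data and bucket.
sales = [
    {"year": 2019, "month": 12, "amount": 10.0},
    {"year": 2019, "month": 12, "amount": 2.5},
    {"year": 2020, "month": 1, "amount": 7.0},
]
groups = group_by_partition("s3://my-bucket/sales/", sales, ["year", "month"])
```

Under this layout, a mode that overwrites only some partitions needs to touch just the prefixes that appear as keys of `groups`, leaving every other partition in place.
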
Feature | Details |
---|---|
List S3 objects | e.g. wr.s3.list_objects("s3://...") |
Delete S3 objects | Parallel |
Delete listed S3 objects | Parallel |
Delete NOT listed S3 objects | Parallel |
Copy listed S3 objects | Parallel |
Get the size of S3 objects | Parallel |
Get CloudWatch Logs Insights query results | |
Load partitions on Athena/Glue table | Through "MSCK REPAIR TABLE" |
Create EMR cluster | "For humans" |
Terminate EMR cluster | "For humans" |
Get EMR cluster state | "For humans" |
Submit EMR step(s) | "For humans" |
Get EMR step state | "For humans" |
Query Athena to receive Python primitives | Returns Iterable[Dict[str, Any]] |
Load and unzip SageMaker job outputs | |
Dump Amazon Redshift as Parquet files on S3 | |
Dump Amazon Aurora as CSV files on S3 | Only for MySQL engine |
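
The "Delete NOT listed S3 objects" row boils down to a set difference followed by parallel deletes. The sketch below shows that pattern with an in-memory dict standing in for a bucket. The helper names are hypothetical and are not the library's API; real code would pass a `delete_one` function that calls S3.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def keys_not_listed(all_keys: Iterable[str], keep: Iterable[str]) -> List[str]:
    """Return every key that is NOT in the keep list, sorted for determinism."""
    keep_set = set(keep)
    return sorted(k for k in all_keys if k not in keep_set)


def delete_in_parallel(
    keys: List[str], delete_one: Callable[[str], object], max_workers: int = 8
) -> None:
    """Call delete_one for each key from a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so any exception raised in a worker surfaces here.
        list(pool.map(delete_one, keys))


# In-memory stand-in for a bucket (hypothetical keys).
bucket: Dict[str, bytes] = {
    "data/a.parquet": b"",
    "data/b.parquet": b"",
    "data/c.parquet": b"",
    "data/d.parquet": b"",
}
keep = ["data/a.parquet", "data/c.parquet"]

to_delete = keys_not_listed(bucket.keys(), keep)
delete_in_parallel(to_delete, bucket.pop)
```

Threads suit this workload because each delete is I/O-bound: the time is spent waiting on the network rather than on the CPU.
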