
COPY command relies on column order in DataFrame #340

Open
kyrozetera opened this issue Mar 31, 2017 · 2 comments · May be fixed by #343
Comments

@kyrozetera

The write operation currently generates a COPY command like the following:

COPY "PUBLIC"."some_table" FROM 's3://some-bucket/tmp/manifest.json' CREDENTIALS 'aws_access_key_id=__;aws_secret_access_key=__' FORMAT AS CSV NULL AS '@NULL@' manifest 

This relies on the DataFrame having its columns in the same order as the target table when the table already exists. However, the COPY command supports specifying a column list or JSONPath expressions to map columns (documentation). It would be nice to at least support the column list, potentially as an option on the write operation like:

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .option("include_column_list", "true")
  .mode("error")
  .save()
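
With such an option enabled, the generated command could include an explicit column list derived from the DataFrame's schema (the column names below are illustrative):

COPY "PUBLIC"."some_table" ("id", "name", "created_at") FROM 's3://some-bucket/tmp/manifest.json' CREDENTIALS 'aws_access_key_id=__;aws_secret_access_key=__' FORMAT AS CSV NULL AS '@NULL@' manifest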

Looks like this should be fairly straightforward to add here.
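
For illustration, a minimal sketch of how the column list could be built from the DataFrame's schema; the helper name and option handling are assumptions, not the connector's actual API:

// Hypothetical helper: derive a quoted column list from the DataFrame's schema
// so the generated COPY no longer depends on matching column order.
def copyColumnList(df: org.apache.spark.sql.DataFrame, includeColumnList: Boolean): String =
  if (includeColumnList)
    df.schema.fieldNames
      .map(name => "\"" + name + "\"")  // quote identifiers for Redshift
      .mkString(" (", ", ", ")")        // e.g. " ("id", "name", "created_at")"
  else
    ""                                  // current behavior: no column list

The result would be spliced in right after the table name in the generated COPY statement.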

@kyrozetera kyrozetera linked a pull request Apr 28, 2017 that will close this issue
@farshidz

So this hasn't been merged yet. Is there any other solution (short of making sure the column order in the DataFrame matches the table)? I'd like to use CSV GZIP instead of AVRO, but I can't append to existing tables because the CSV columns are in a different order than the table's.

@kyrozetera
Author

Nope, I've just been making sure to run a final .select() to specify the column order on any of my DataFrame operations before writing to Redshift. Unfortunately, the connector is no longer being maintained, so it doesn't look like this will be merged... 😞
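
For example (the column names here are illustrative), reordering to match the Redshift table just before the write:

// Reorder the DataFrame's columns to match the existing table's column order
val ordered = df.select("id", "name", "created_at")

ordered.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
  .option("dbtable", "my_table_copy")
  .option("tempdir", "s3n://path/for/temp/data")
  .mode("append")
  .save()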
