add parquet file output to the data command #5

Merged (1 commit) on Jul 30, 2024

Contributor

@rinaldodev rinaldodev commented Jul 29, 2024

[Image: Screenshot from 2024-07-29 15-23-03]

We can add some flags later on to:

  • Generate a single output file
  • Choose the output format (json, parquet, ...)
  • Choose compression options
  • Optimize things like row group size

This currently writes one file per component, which might make the output harder to distribute.
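
To make the follow-up flags listed in this comment more concrete, here is a minimal sketch of how they might be declared. It is only an illustration in Python with `argparse`, not the project's actual command-line code: every flag name, default, and the `build_parser` helper are hypothetical, and the codec list is just a set of commonly used Parquet compression codecs.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical 'data' command (all names are assumptions)."""
    parser = argparse.ArgumentParser(prog="data")
    # Write one combined file instead of one file per component.
    parser.add_argument("--single-file", action="store_true",
                        help="write a single output file")
    # Output format; the default here is arbitrary for this sketch.
    parser.add_argument("--format", choices=["json", "parquet"], default="parquet",
                        help="output format")
    # Compression codec for the Parquet output.
    parser.add_argument("--compression", choices=["none", "snappy", "gzip", "zstd"],
                        default="snappy", help="compression codec")
    # Tuning knob: number of rows per Parquet row group (None = writer default).
    parser.add_argument("--row-group-size", type=int, default=None,
                        help="rows per row group")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--single-file", "--compression", "zstd"])
print(args.single_file, args.format, args.compression, args.row_group_size)
```

Keeping each option as its own flag would let the single-file and compression choices be added independently of one another.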

Contributor

@orpiske orpiske left a comment

Thanks! This looks great.

I'm wondering whether we should save the files in separate directories, so that we have two datasets (one in Alpaca JSON and the other in Parquet), or keep everything in the same directory.

I'm leaning toward separate directories: out of caution, I'm afraid that keeping both formats in one place could cause trouble when setting up some tools.
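
A minimal sketch of the "separate directories" option, assuming one subdirectory per format under a common base directory. The `output_path` helper, the directory names, and the example component name are all hypothetical and only illustrate the layout being discussed; the project's real paths are not shown in this conversation.

```python
from pathlib import Path

# Hypothetical mapping from format name to file extension.
EXTENSIONS = {"json": ".json", "parquet": ".parquet"}


def output_path(base_dir, component, fmt):
    """Return the file path for one component, grouped by format.

    Each format gets its own subdirectory, so a tool can be pointed at
    a directory that contains files of a single format only.
    """
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt}")
    return Path(base_dir) / fmt / f"{component}{EXTENSIONS[fmt]}"


# Example: the same component lands in two sibling directories.
print(output_path("dataset", "camel-kafka", "json"))
print(output_path("dataset", "camel-kafka", "parquet"))
```

With this layout, each format directory can be treated as its own dataset, which is the separation the comment above is leaning toward.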

@orpiske orpiske merged commit 2c3303a into megacamelus:main Jul 30, 2024
1 check passed