Be able to specify the maximum number of unique values for an enum #91

Open
spullara opened this issue Jan 25, 2022 · 2 comments

@spullara

I am getting text search instead of an enum by default for a column that has 117 unique values (out of the 18k or so samples provided).

@isabella
Contributor

hi @spullara. You should definitely be able to configure the max unique values so that your column with 117 unique values would be an enum column. Currently, the only way to do that is to pass a config file with the column name, type, and a list of all of the variants. There are two potential implementations that would achieve what you want:

  1. In the config file, allow passing a JSON object that includes the CSV infer options:
#[derive(Clone)]
pub struct FromCsvOptions<'a> {
	pub column_types: Option<BTreeMap<String, TableColumnType>>,
	pub infer_options: InferOptions,
	pub invalid_values: &'a [&'a str],
}

impl<'a> Default for FromCsvOptions<'a> {
	fn default() -> FromCsvOptions<'a> {
		FromCsvOptions {
			column_types: None,
			infer_options: InferOptions::default(),
			invalid_values: DEFAULT_INVALID_VALUES,
		}
	}
}

#[derive(Clone, Debug)]
pub struct InferOptions {
	pub enum_max_unique_values: usize,
}

impl Default for InferOptions {
	fn default() -> InferOptions {
		InferOptions {
			enum_max_unique_values: 100,
		}
	}
}
  2. Allow passing the column name and type without forcing the user to list every unique variant.
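
To make option 1 concrete, here is a minimal sketch of how a threshold like `enum_max_unique_values` could drive inference. Only `InferOptions` mirrors the struct above; `InferredType` and `infer_column_type` are hypothetical names for illustration, and the "at most N unique values means enum" rule is an assumption, not necessarily what the library does today.

```rust
use std::collections::BTreeSet;

/// Mirrors the `InferOptions` struct quoted above.
#[derive(Clone, Debug)]
pub struct InferOptions {
	pub enum_max_unique_values: usize,
}

impl Default for InferOptions {
	fn default() -> InferOptions {
		InferOptions {
			enum_max_unique_values: 100,
		}
	}
}

/// Hypothetical result type for this sketch, not the library's own.
#[derive(Debug, PartialEq)]
pub enum InferredType {
	Enum { variants: Vec<String> },
	Text,
}

/// Hypothetical helper: treat a column as an enum while its number of
/// unique values stays at or below the configured maximum, else as text.
pub fn infer_column_type<'a>(
	values: impl IntoIterator<Item = &'a str>,
	options: &InferOptions,
) -> InferredType {
	let mut unique: BTreeSet<&'a str> = BTreeSet::new();
	for value in values {
		unique.insert(value);
		// Stop early once the column has too many distinct values.
		if unique.len() > options.enum_max_unique_values {
			return InferredType::Text;
		}
	}
	InferredType::Enum {
		variants: unique.into_iter().map(String::from).collect(),
	}
}
```

Under these assumptions, a column with 117 unique values falls back to text with the default of 100, and becomes an enum once the config raises `enum_max_unique_values` to 117 or more.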

I think option 2 is probably closer to the interface you might be looking for? This way you get to configure the type per column but don't have to pass all of the variants (which is cumbersome for enums with a large number of options).

@spullara
Author

I think just labelling the column an enum without having to list the values would be great.
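
For illustration, the "label it an enum, skip the values" behaviour could look roughly like the sketch below. `ColumnTypeSpec` and `resolve_enum_variants` are hypothetical names invented for this example, not the project's actual config types; the idea is simply that an enum column with no listed variants collects them from the data.

```rust
use std::collections::BTreeSet;

/// Hypothetical per-column type setting: an enum column may omit its variants.
#[derive(Clone, Debug, PartialEq)]
pub enum ColumnTypeSpec {
	Enum { variants: Option<Vec<String>> },
	Text,
}

/// Hypothetical helper: return the variants for an enum column, using the
/// listed ones if present and otherwise the sorted unique values in the data.
/// Returns `None` for columns that are not enums.
pub fn resolve_enum_variants<'a>(
	spec: &ColumnTypeSpec,
	values: impl IntoIterator<Item = &'a str>,
) -> Option<Vec<String>> {
	match spec {
		// Explicit variants always win over whatever the data contains.
		ColumnTypeSpec::Enum {
			variants: Some(variants),
		} => Some(variants.clone()),
		// No variants listed: collect them from the column itself.
		ColumnTypeSpec::Enum { variants: None } => {
			let unique: BTreeSet<&str> = values.into_iter().collect();
			Some(unique.into_iter().map(String::from).collect())
		}
		ColumnTypeSpec::Text => None,
	}
}
```

With this shape, the unique-value threshold never comes into play for a column the user has explicitly marked as an enum.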
