Contributor

@hayesgb hayesgb commented Oct 3, 2022

Adds a test for filtering a dataframe by matching a column against a large list of values.

@hayesgb hayesgb requested a review from ncclementi October 4, 2022 16:55
ddf = timeseries(end="2000-05-01", dtypes={"A": float, "B": int}, seed=42)
ddf.A = ddf.A.mul(1e7)
ddf.A = ddf.A.astype(int).persist()
a_column_unique_values = np.arange(1, n // 10)
Contributor

Nitpick: it looks like we only use n once. Do we need to create a variable (line 71)? Is this a number that could potentially change, or did we choose it arbitrarily?

Contributor Author

Cleared up by aligning 1e7 with N. Yes, the value could change.
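
For illustration only, here is a minimal sketch of what aligning the scale factor with a single constant could look like. The helper names `scale_to_int` and `candidate_values` are hypothetical and do not appear in the PR; the sketch uses plain NumPy rather than the Dask `timeseries` frame.

```python
import numpy as np

# Hypothetical module-level constant; the reply only says 1e7 was aligned with N.
N = 10_000_000


def scale_to_int(values, n=N):
    """Scale floats by n and truncate toward zero to get integers."""
    return (np.asarray(values, dtype=float) * n).astype(int)


def candidate_values(n=N):
    """Integers 1 .. n // 10 - 1, derived from the same constant."""
    return np.arange(1, n // 10)


print(scale_to_int([0.5, -0.25]))
print(candidate_values(100))
```

With one constant, changing N changes the scale factor and the candidate range together, so the two cannot drift apart.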

n = 10_000_000
rs = np.random.RandomState(42)
ddf = timeseries(end="2000-05-01", dtypes={"A": float, "B": int}, seed=42)
ddf.A = ddf.A.mul(1e7)
Contributor

It might be worth a comment here on why we need these next two lines. Is it a cardinality issue?

Contributor Author

Added comments.
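
To make the cardinality question concrete, here is a rough sketch of why the scale-then-cast step matters. It rests on two assumptions: a NumPy array of uniform floats in [-1, 1) stands in for column A, and the filter is a membership test (`np.isin`). The thread does not show the filter line itself.

```python
import numpy as np

n = 10_000_000
rs = np.random.RandomState(42)

# Stand-in for the float column "A": uniform floats in [-1, 1).
# (Assumption: a plain NumPy array instead of the Dask timeseries column.)
a = rs.uniform(-1, 1, size=100_000)

# Casting directly truncates toward zero, so the values collapse to 0.
distinct_unscaled = np.unique(a.astype(int)).size

# Scaling first spreads the values over roughly (-1e7, 1e7) before the cast.
a_int = (a * 1e7).astype(int)
distinct_scaled = np.unique(a_int).size

# Membership filter against a large list of candidate values.
# (Assumption: the actual filter expression is not shown in the thread.)
a_column_unique_values = np.arange(1, n // 10)
filter_values = rs.choice(
    a_column_unique_values, size=len(a_column_unique_values) // 2
)
mask = np.isin(a_int, filter_values)

print(distinct_unscaled, distinct_scaled, int(mask.sum()))
```

Without the scaling, an integer cast leaves almost no distinct values, so a filter against a large list of integers would have nothing meaningful to match.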

@ncclementi
Contributor

Thanks, @hayesgb!
When CI finishes, would you uncomment the test.yaml code and add [skip ci] to the commit message?

@ncclementi
Contributor

I'll merge main and uncomment the test.yaml code
