Preview: Indexer groupby #92

Open
wants to merge 4 commits into
base: master
from

RoriCremer
No description provided.

Contributor

@melissachang melissachang left a comment

Is this PR ready for review?

I want to confirm that this PR improves performance. Can you do the following:

  • Create a view of broad-gdr-encode-storage.encode_2018_10_06.files that keeps either only donors with fewer than 500 files, or only files that have a path (WHERE path IS NOT NULL).
  • Delete the index.
  • Run the indexer at head. What happens -- does the indexer crash? If so, how long does it run before crashing, and how many documents are indexed at that point?
  • Run the indexer with this PR. How long does indexing take?
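
The first step above could be sketched as follows. This is only an illustration of the two filters mentioned in the comment, not the project's tooling: the view name, the `donor_id` column name, and the helper `build_view_sql` are assumptions (only the table name and the `path` column come from this thread). The function just builds a BigQuery standard SQL statement as a string; running it is left to the `bq` CLI or a client library.

```python
# Table named in the review comment.
SOURCE_TABLE = "broad-gdr-encode-storage.encode_2018_10_06.files"


def build_view_sql(view_name, mode, donor_column="donor_id", max_files=500):
    """Return a CREATE VIEW statement for a reduced copy of the files table.

    view_name    -- hypothetical fully qualified view name (project.dataset.view)
    mode         -- "path" keeps only files with a non-null path;
                    "small_donors" keeps only donors with fewer than max_files files
    donor_column -- ASSUMED name of the donor column; adjust to the real schema
    """
    if mode == "path":
        select = f"SELECT * FROM `{SOURCE_TABLE}` WHERE path IS NOT NULL"
    elif mode == "small_donors":
        select = (
            f"SELECT * FROM `{SOURCE_TABLE}` "
            f"WHERE {donor_column} IN ("
            f"SELECT {donor_column} FROM `{SOURCE_TABLE}` "
            f"GROUP BY {donor_column} HAVING COUNT(*) < {int(max_files)})"
        )
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"CREATE OR REPLACE VIEW `{view_name}` AS {select}"
```

For example, `build_view_sql("my-project.my_dataset.files_small", "path")` returns a statement that can be pasted into the BigQuery console. Note that in the `small_donors` mode, rows with a NULL donor value are dropped by the `IN` filter.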

participant_row_dicts = []
for _, row in participant_group.iterrows():
sample_index = 0
for sample_id, row in participant_group.iloc[0:5000].iterrows(): #
Contributor

This should not be hard-coded here.
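
One way to address this, as a minimal sketch rather than the change the PR author made: move the limit into a named default and pass it as a parameter. The names `DEFAULT_MAX_SAMPLES_PER_PARTICIPANT` and `limit_rows` are hypothetical; the value 5000 is taken from the diff line above.

```python
from itertools import islice

# Hypothetical module-level default; in practice this could come from a
# config file or command-line flag instead.
DEFAULT_MAX_SAMPLES_PER_PARTICIPANT = 5000


def limit_rows(rows, max_rows=DEFAULT_MAX_SAMPLES_PER_PARTICIPANT):
    """Yield at most max_rows items from rows; None means no limit.

    rows can be any iterable, for example the result of
    participant_group.iterrows() in the indexer.
    """
    if max_rows is None:
        return iter(rows)
    if max_rows < 0:
        raise ValueError("max_rows must be non-negative or None")
    return islice(rows, max_rows)
```

The loop in the diff could then read `for sample_id, row in limit_rows(participant_group.iterrows(), max_rows):`, with `max_rows` supplied by the caller.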

@melissachang
Contributor

Never mind about running indexer at head -- I remember I tried that.

Just let me know when this is ready for review.

@melissachang
Contributor

Sorry, I realized I hadn't tried head with the smaller BQ table. So please do this after all.

  • Create a view of broad-gdr-encode-storage.encode_2018_10_06.files that keeps either only donors with fewer than 500 files, or only files that have a path (WHERE path IS NOT NULL).
  • Delete the index.
  • Run the indexer at head. What happens -- does the indexer crash? If so, how long does it run before crashing, and how many documents are indexed at that point?
  • Run the indexer with this PR. How long does indexing take?
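
To answer the timing questions above for both runs in a comparable way, a small harness along these lines might help. It is a sketch under assumptions: `time_indexing` is a hypothetical helper, and it assumes the indexer can be driven as an iterable that yields the number of documents indexed per batch, which may not match how the real indexer is structured.

```python
import time


def time_indexing(batches):
    """Consume batches and report elapsed time and documents indexed.

    batches -- iterable yielding the number of documents indexed per batch
               (an ASSUMED interface, not the indexer's actual API).
    If a batch raises, the exception is recorded rather than re-raised, so
    the time until failure and the count so far are still reported.
    """
    start = time.monotonic()
    documents_indexed = 0
    error = None
    try:
        for count in batches:
            documents_indexed += count
    except Exception as exc:  # record the failure instead of losing the stats
        error = exc
    elapsed_seconds = time.monotonic() - start
    return {
        "elapsed_seconds": elapsed_seconds,
        "documents_indexed": documents_indexed,
        "error": error,
    }
```

This only captures Python exceptions. If the process is killed from outside (for example by the out-of-memory killer), the numbers would have to come from logs or from the index's document count instead.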

3 participants