Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EVA-3571 New stats calculator #196

Merged
merged 4 commits into from
Aug 12, 2024

Conversation

nitin-ebi
Copy link
Contributor

No description provided.

@nitin-ebi nitin-ebi self-assigned this Aug 2, 2024
Copy link
Member

@tcezard tcezard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Left a few comments mostly for my own benefits.

private int chunkSize;
private String studyId;

private static Map<String, Integer> filesIdNumberOfSamplesMap = new HashMap<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static Map<String, Integer> filesIdNumberOfSamplesMap = new HashMap<>();
\\ Store the map of files to number of sample from the file_2_0 collection
private static Map<String, Integer> filesIdNumberOfSamplesMap = new HashMap<>();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

}

private MongoCursor<Document> initializeCursor() {
Bson query = Filters.elemMatch(VariantDocument.FILES_FIELD, Filters.eq(VariantSourceEntryMongo.STUDYID_FIELD, studyId));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the indexes in the variant warehouse and there is no index on the StudyId alone.
Instead there is one on study + analysis.

db.variants_2_0.getIndexes()
...
  {
    v: 2,
    key: { 'files.sid': 1, 'files.fid': 1 },
    name: 'files.sid_1_files.fid_1',
    ns: 'eva_cfamiliaris_31.variants_2_0'
  },
...

I thought selecting by study could potentially be slower than retrieving by study+analysis but it turns out that compound indexes support search by the firsts elements of the index as well as the whole.

import java.util.Set;
import java.util.stream.Collectors;

public class VariantStatsProcessor implements ItemProcessor<VariantDocument, VariantDocument> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would this class work if no genotypes are present ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case any of the variant does not have the field "files" or if the field "files" is empty, the variant will not be processed and the variant will be returned as it is.
Same will happen in case any object in the files array does not have sample data, calculation of stats for that object will be skipped.

Comment on lines +35 to +36
Update update = new Update();
update.set("st", variant.getVariantStatsMongo());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential optimisation: Compare the stats present before with the result of calculation and only update if the two are different.
Maybe a in a different ticket.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of our discussion this morning, the approach we take will depend on how we decide to handle objects in the files array that share the same sid and fid. Specifically, whether we plan to merge these objects or keep them separate.

If we decide to keep them separate, it will become challenging to accurately map which stats belong to which files object, especially since multiple objects might have the same sid and fid

@nitin-ebi nitin-ebi merged commit 022ca8f into EBIvariation:master Aug 12, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants