-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More efficient search #89
Comments
Searching from the homepage is done using the endpoint
which, as a massive The links on the homepage eg: https://www.mavedb.org/#/search?search=Homo+sapiens also use score set search and so are very slow. There's a few approaches possible here:
just as a little example, at an SQL level, something like:
results in a fast bitmap heap scan instead of a slow sequential scan.
|
(one year later) Perhaps a combination of full text indexing w/ GIN index and some hooks in the API level to denormalize genome name and whatever else into field(s) on scoresets: that would get us the multi-table-spanning search without having to mess around with materialized views and should end up being very fast. Related: #94 |
At an SQL level, a query for Homo sapiens ends up with: SELECT experiments.id, experiments.urn, experiments.title, experiments.short_description, experiments.abstract_text, experiments.method_text, experiments.extra_metadata, experiments.private, experiments.approved, experiments.published_date, experiments.processing_state, experiments.num_scoresets, experiments.experiment_set_id, experiments.created_by_id, experiments.modified_by_id, experiments.creation_date, experiments.modification_date, scoresets.id AS id_1, scoresets.urn AS urn_1, scoresets.title AS title_1, scoresets.method_text AS method_text_1, scoresets.abstract_text AS abstract_text_1, scoresets.short_description AS short_description_1, scoresets.extra_metadata AS extra_metadata_1, scoresets.dataset_columns, scoresets.external_links, scoresets.normalised, scoresets.private AS private_1, scoresets.approved AS approved_1, scoresets.published_date AS published_date_1, scoresets.processing_state AS processing_state_1, scoresets.processing_errors, scoresets.data_usage_policy, scoresets.num_variants, scoresets.experiment_id, scoresets.licence_id, scoresets.replaces_id, scoresets.created_by_id AS created_by_id_1, scoresets.modified_by_id AS modified_by_id_1, scoresets.creation_date AS creation_date_1, scoresets.modification_date AS modification_date_1, experiment_sets_1.id AS id_2, experiment_sets_1.urn AS urn_2, experiment_sets_1.extra_metadata AS extra_metadata_2, experiment_sets_1.private AS private_2, experiment_sets_1.approved AS approved_2, experiment_sets_1.published_date AS published_date_2, experiment_sets_1.processing_state AS processing_state_2, experiment_sets_1.num_experiments, experiment_sets_1.created_by_id AS created_by_id_2, experiment_sets_1.modified_by_id AS modified_by_id_2, experiment_sets_1.creation_date AS creation_date_2, experiment_sets_1.modification_date AS modification_date_2, users_1.id AS id_3, users_1.username, users_1.first_name, users_1.last_name, users_1.email, users_1.is_superuser, users_1.is_staff, users_1.is_active, users_1.date_joined, users_1.is_first_login, users_1.last_login, users_2.id AS id_4, users_2.username AS username_1, users_2.first_name AS first_name_1, users_2.last_name AS last_name_1, users_2.email AS email_1, users_2.is_superuser AS is_superuser_1, users_2.is_staff AS is_staff_1, users_2.is_active AS is_active_1, users_2.date_joined AS date_joined_1, users_2.is_first_login AS is_first_login_1, users_2.last_login AS last_login_1, keywords_1.id AS id_5, keywords_1.text, keywords_1.creation_date AS creation_date_3, keywords_1.modification_date AS modification_date_3, doi_identifiers_1.id AS id_6, doi_identifiers_1.identifier, doi_identifiers_1.db_name, doi_identifiers_1.db_version, doi_identifiers_1.url, doi_identifiers_1.creation_date AS creation_date_4, doi_identifiers_1.modification_date AS modification_date_4, publication_identifiers_1.id AS id_7, publication_identifiers_1.identifier AS identifier_1, publication_identifiers_1.db_name AS db_name_1, publication_identifiers_1.db_version AS db_version_1, publication_identifiers_1.title AS title_2, publication_identifiers_1.abstract, publication_identifiers_1.authors, publication_identifiers_1.doi, publication_identifiers_1.publication_year, publication_identifiers_1.publication_journal, publication_identifiers_1.url AS url_1, publication_identifiers_1.reference_html, publication_identifiers_1.creation_date AS creation_date_5, publication_identifiers_1.modification_date AS modification_date_5, experiment_publication_identifiers_1.experiment_id AS experiment_id_1, experiment_publication_identifiers_1.publication_identifier_id, experiment_publication_identifiers_1."primary", sra_identifiers_1.id AS id_8, sra_identifiers_1.identifier AS identifier_2, sra_identifiers_1.db_name AS db_name_2, sra_identifiers_1.db_version AS db_version_2, sra_identifiers_1.url AS url_2, sra_identifiers_1.creation_date AS creation_date_6, sra_identifiers_1.modification_date AS modification_date_6, licenses_1.id AS id_9, licenses_1.long_name, licenses_1.short_name, licenses_1.text AS text_1, licenses_1.link, licenses_1.version, licenses_1.creation_date AS creation_date_7, licenses_1.modification_date AS modification_date_7, keywords_2.id AS id_10, keywords_2.text AS text_2, keywords_2.creation_date AS creation_date_8, keywords_2.modification_date AS modification_date_8, doi_identifiers_2.id AS id_11, doi_identifiers_2.identifier AS identifier_3, doi_identifiers_2.db_name AS db_name_3, doi_identifiers_2.db_version AS db_version_3, doi_identifiers_2.url AS url_3, doi_identifiers_2.creation_date AS creation_date_9, doi_identifiers_2.modification_date AS modification_date_9, publication_identifiers_2.id AS id_12, publication_identifiers_2.identifier AS identifier_4, publication_identifiers_2.db_name AS db_name_4, publication_identifiers_2.db_version AS db_version_4, publication_identifiers_2.title AS title_3, publication_identifiers_2.abstract AS abstract_1, publication_identifiers_2.authors AS authors_1, publication_identifiers_2.doi AS doi_1, publication_identifiers_2.publication_year AS publication_year_1, publication_identifiers_2.publication_journal AS publication_journal_1, publication_identifiers_2.url AS url_4, publication_identifiers_2.reference_html AS reference_html_1, publication_identifiers_2.creation_date AS creation_date_10, publication_identifiers_2.modification_date AS modification_date_10, scoreset_publication_identifiers_1.scoreset_id, scoreset_publication_identifiers_1.publication_identifier_id AS publication_identifier_id_1, scoreset_publication_identifiers_1."primary" AS primary_1, taxonomies_1.id AS id_13, taxonomies_1.tax_id, taxonomies_1.organism_name, taxonomies_1.common_name, taxonomies_1.rank, taxonomies_1.has_described_species_name, taxonomies_1.url AS url_5, taxonomies_1.article_reference, taxonomies_1.genome_identifier_id, taxonomies_1.creation_date AS creation_date_11, taxonomies_1.modification_date AS modification_date_11, target_sequences_1.id AS id_14, target_sequences_1.sequence_type, target_sequences_1.sequence, target_sequences_1.label, target_sequences_1.taxonomy_id, target_sequences_1.creation_date AS creation_date_12, target_sequences_1.modification_date AS modification_date_12, target_accessions_1.id AS id_15, target_accessions_1.assembly, target_accessions_1.accession, target_accessions_1.gene, target_accessions_1.creation_date AS creation_date_13, target_accessions_1.modification_date AS modification_date_13, ensembl_identifiers_1.id AS id_16, ensembl_identifiers_1.identifier AS identifier_5, ensembl_identifiers_1.db_name AS db_name_5, ensembl_identifiers_1.db_version AS db_version_5, ensembl_identifiers_1.url AS url_6, ensembl_identifiers_1.creation_date AS creation_date_14, ensembl_identifiers_1.modification_date AS modification_date_14, ensembl_offsets_1.identifier_id, ensembl_offsets_1.target_gene_id, ensembl_offsets_1."offset", refseq_identifiers_1.id AS id_17, refseq_identifiers_1.identifier AS identifier_6, refseq_identifiers_1.db_name AS db_name_6, refseq_identifiers_1.db_version AS db_version_6, refseq_identifiers_1.url AS url_7, refseq_identifiers_1.reference_html AS reference_html_2, refseq_identifiers_1.creation_date AS creation_date_15, refseq_identifiers_1.modification_date AS modification_date_15, refseq_offsets_1.identifier_id AS identifier_id_1, refseq_offsets_1.target_gene_id AS target_gene_id_1, refseq_offsets_1."offset" AS offset_1, uniprot_identifiers_1.id AS id_18, uniprot_identifiers_1.identifier AS identifier_7, uniprot_identifiers_1.db_name AS db_name_7, uniprot_identifiers_1.db_version AS db_version_7, uniprot_identifiers_1.url AS url_8, uniprot_identifiers_1.reference_html AS reference_html_3, uniprot_identifiers_1.creation_date AS creation_date_16, uniprot_identifiers_1.modification_date AS modification_date_16, uniprot_offsets_1.identifier_id AS identifier_id_2, uniprot_offsets_1.target_gene_id AS target_gene_id_2, uniprot_offsets_1."offset" AS offset_2, target_genes_1.id AS id_19, target_genes_1.name, target_genes_1.category, target_genes_1.scoreset_id AS scoreset_id_1, target_genes_1.target_sequence_id, target_genes_1.accession_id, target_genes_1.creation_date AS creation_date_17, target_genes_1.modification_date AS modification_date_17
FROM scoresets JOIN experiments ON experiments.id = scoresets.experiment_id LEFT OUTER JOIN experiment_sets AS experiment_sets_1 ON experiment_sets_1.id = experiments.experiment_set_id LEFT OUTER JOIN users AS users_1 ON users_1.id = experiments.created_by_id LEFT OUTER JOIN users AS users_2 ON users_2.id = experiments.modified_by_id LEFT OUTER JOIN (experiment_keywords AS experiment_keywords_1 JOIN keywords AS keywords_1 ON keywords_1.id = experiment_keywords_1.keyword_id) ON experiments.id = experiment_keywords_1.experiment_id LEFT OUTER JOIN (experiment_doi_identifiers AS experiment_doi_identifiers_1 JOIN doi_identifiers AS doi_identifiers_1 ON doi_identifiers_1.id = experiment_doi_identifiers_1.doi_identifier_id) ON experiments.id = experiment_doi_identifiers_1.experiment_id LEFT OUTER JOIN experiment_publication_identifiers AS experiment_publication_identifiers_1 ON experiments.id = experiment_publication_identifiers_1.experiment_id LEFT OUTER JOIN publication_identifiers AS publication_identifiers_1 ON publication_identifiers_1.id = experiment_publication_identifiers_1.publication_identifier_id LEFT OUTER JOIN (experiment_sra_identifiers AS experiment_sra_identifiers_1 JOIN sra_identifiers AS sra_identifiers_1 ON sra_identifiers_1.id = experiment_sra_identifiers_1.sra_identifier_id) ON experiments.id = experiment_sra_identifiers_1.experiment_id LEFT OUTER JOIN licenses AS licenses_1 ON licenses_1.id = scoresets.licence_id LEFT OUTER JOIN (scoreset_keywords AS scoreset_keywords_1 JOIN keywords AS keywords_2 ON keywords_2.id = scoreset_keywords_1.keyword_id) ON scoresets.id = scoreset_keywords_1.scoreset_id LEFT OUTER JOIN (scoreset_doi_identifiers AS scoreset_doi_identifiers_1 JOIN doi_identifiers AS doi_identifiers_2 ON doi_identifiers_2.id = scoreset_doi_identifiers_1.doi_identifier_id) ON scoresets.id = scoreset_doi_identifiers_1.scoreset_id LEFT OUTER JOIN scoreset_publication_identifiers AS scoreset_publication_identifiers_1 ON scoresets.id = scoreset_publication_identifiers_1.scoreset_id LEFT OUTER JOIN publication_identifiers AS publication_identifiers_2 ON publication_identifiers_2.id = scoreset_publication_identifiers_1.publication_identifier_id LEFT OUTER JOIN target_genes AS target_genes_1 ON scoresets.id = target_genes_1.scoreset_id LEFT OUTER JOIN target_sequences AS target_sequences_1 ON target_sequences_1.id = target_genes_1.target_sequence_id LEFT OUTER JOIN taxonomies AS taxonomies_1 ON taxonomies_1.id = target_sequences_1.taxonomy_id LEFT OUTER JOIN target_accessions AS target_accessions_1 ON target_accessions_1.id = target_genes_1.accession_id LEFT OUTER JOIN ensembl_offsets AS ensembl_offsets_1 ON target_genes_1.id = ensembl_offsets_1.target_gene_id LEFT OUTER JOIN ensembl_identifiers AS ensembl_identifiers_1 ON ensembl_identifiers_1.id = ensembl_offsets_1.identifier_id LEFT OUTER JOIN refseq_offsets AS refseq_offsets_1 ON target_genes_1.id = refseq_offsets_1.target_gene_id LEFT OUTER JOIN refseq_identifiers AS refseq_identifiers_1 ON refseq_identifiers_1.id = refseq_offsets_1.identifier_id LEFT OUTER JOIN uniprot_offsets AS uniprot_offsets_1 ON target_genes_1.id = uniprot_offsets_1.target_gene_id LEFT OUTER JOIN uniprot_identifiers AS uniprot_identifiers_1 ON uniprot_identifiers_1.id = uniprot_offsets_1.identifier_id
WHERE NOT (EXISTS (SELECT 1
FROM scoresets AS scoresets_1
WHERE scoresets.id = scoresets_1.replaces_id)) AND ((lower(scoresets.urn) LIKE '%' || lower('homo sapiens') || '%') OR (lower(scoresets.title) LIKE '%' || lower('homo sapiens') || '%') OR (lower(scoresets.short_description) LIKE '%' || lower('homo sapiens') || '%') OR (lower(scoresets.abstract_text) LIKE '%' || lower('homo sapiens') || '%') OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (lower(lower(target_genes.name)) LIKE '%' || lower('homo sapiens') || '%'))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (lower(lower(target_genes.category)) LIKE '%' || lower('homo sapiens') || '%'))) OR (EXISTS (SELECT 1
FROM keywords, scoreset_keywords
WHERE scoresets.id = scoreset_keywords.scoreset_id AND keywords.id = scoreset_keywords.keyword_id AND (lower(lower(keywords.text)) LIKE '%' || lower('homo sapiens') || '%'))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM target_sequences
WHERE target_sequences.id = target_genes.target_sequence_id AND (EXISTS (SELECT 1
FROM taxonomies
WHERE taxonomies.id = target_sequences.taxonomy_id AND (lower(lower(taxonomies.organism_name)) LIKE '%' || lower('homo sapiens') || '%'))))))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM target_sequences
WHERE target_sequences.id = target_genes.target_sequence_id AND (EXISTS (SELECT 1
FROM taxonomies
WHERE taxonomies.id = target_sequences.taxonomy_id AND (lower(lower(taxonomies.common_name)) LIKE '%' || lower('homo sapiens') || '%'))))))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM target_accessions
WHERE target_accessions.id = target_genes.accession_id AND (lower(lower(target_accessions.assembly)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND (lower(lower(publication_identifiers.identifier)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND (lower(lower(publication_identifiers.doi)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND (lower(lower(publication_identifiers.abstract)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND (lower(lower(publication_identifiers.title)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND (lower(lower(publication_identifiers.publication_journal)) LIKE '%' || lower('homo sapiens') || '%'))))) OR (EXISTS (SELECT 1
FROM scoreset_publication_identifiers
WHERE scoresets.id = scoreset_publication_identifiers.scoreset_id AND (EXISTS (SELECT 1
FROM publication_identifiers
WHERE publication_identifiers.id = scoreset_publication_identifiers.publication_identifier_id AND jsonb_path_exists(publication_identifiers.authors, '$[*].name ? (@ like_regex "homo sapiens" flag "i")'))))) OR (EXISTS (SELECT 1
FROM doi_identifiers, scoreset_doi_identifiers
WHERE scoresets.id = scoreset_doi_identifiers.scoreset_id AND doi_identifiers.id = scoreset_doi_identifiers.doi_identifier_id AND (lower(lower(doi_identifiers.identifier)) LIKE '%' || lower('homo sapiens') || '%'))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM uniprot_offsets
WHERE target_genes.id = uniprot_offsets.target_gene_id AND (EXISTS (SELECT 1
FROM uniprot_identifiers
WHERE uniprot_identifiers.id = uniprot_offsets.identifier_id AND (lower(lower(uniprot_identifiers.identifier)) LIKE '%' || lower('homo sapiens') || '%'))))))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM refseq_offsets
WHERE target_genes.id = refseq_offsets.target_gene_id AND (EXISTS (SELECT 1
FROM refseq_identifiers
WHERE refseq_identifiers.id = refseq_offsets.identifier_id AND (lower(lower(refseq_identifiers.identifier)) LIKE '%' || lower('homo sapiens') || '%'))))))) OR (EXISTS (SELECT 1
FROM target_genes
WHERE scoresets.id = target_genes.scoreset_id AND (EXISTS (SELECT 1
FROM ensembl_offsets
WHERE target_genes.id = ensembl_offsets.target_gene_id AND (EXISTS (SELECT 1
FROM ensembl_identifiers
WHERE ensembl_identifiers.id = ensembl_offsets.identifier_id AND (lower(lower(ensembl_identifiers.identifier)) LIKE '%' || lower('homo sapiens') || '%')))))))) ORDER BY experiments.title this takes about 8 seconds on my machine from HTTP and about 5.3 seconds from psql so it could use some work. |
So long the query!! |
It is indeed! Ok so https://github.com/VariantEffect/mavedb-api/tree/nick/89-efficient-search is a start on this,
What's missing (as of right now):
This is probably easiest to do using the arq / worker system we already have, although there are other ways. |
Full-text searches are currently implemented using wildcards; this approach is slow, although it may be faster than the Django version was. Full-text search should be made more efficient.
PostgreSQL's text search capabilities are likely to be sufficient. A separate full-text search engine like Elasticsearch is also a possibility.
Two challenges are anticipated:
The text was updated successfully, but these errors were encountered: