only return one document per file checked (including checks on data files) #454

jeanetteclark · 2024-10-02T20:50:29Z

The goals here are:

- return more atomic documents, as opposed to massive results docs with hundreds of file results
- increase efficiency by parallelizing among data files
- return more atomic results from Solr

The current sequence looks something like this:

sequenceDiagram
    participant engine as Worker
    participant dispatcher as Dispatcher

    engine->>engine: Run getDataPids()
    engine->>dispatcher: create dispatcher
    dispatcher->>dispatcher: Run checks for each pid
    dispatcher->>engine: Return result for all pids
    engine-->>engine: Index into solr

Proposed sequence would potentially look like this:

sequenceDiagram
    participant engine as Worker
    participant dispatcher1 as Dispatcher 1
    participant dispatcher2 as Dispatcher 2

    engine->>engine: Run getDataPids()

    par Parallel Thread 1
        engine->>dispatcher1: Create dispatcher for PID 1
        dispatcher1->>dispatcher1: Run checks for PID 1
        dispatcher1->>engine: Return result for PID 1
        engine-->>engine: Index into Solr
    and Parallel Thread 2
        engine->>dispatcher2: Create dispatcher for PID 2
        dispatcher2->>dispatcher2: Run checks for PID 2
        dispatcher2->>engine: Return result for PID 2
        engine-->>engine: Index into Solr
    end

To do this, we'll need to modify the Solr indexing (which I think needs help anyway), the dispatch system (maybe running dispatchers in parallel?), and possibly the schema for the run document.
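As a rough illustration of the proposed flow, here is a minimal Python sketch of the fan-out pattern: one dispatcher call per PID, run in parallel, with one document indexed per PID. This is not the engine's actual code. `get_data_pids`, `run_checks`, `index_document`, and `run_parallel` are hypothetical stand-ins, the Solr index is replaced by a plain dict, and indexing is done in the main thread as each result completes, whereas the diagram shows it inside each parallel branch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def get_data_pids():
    """Hypothetical stand-in for getDataPids(): the data file PIDs to check."""
    return ["pid-1", "pid-2", "pid-3"]


def run_checks(pid):
    """Hypothetical stand-in for one dispatcher: run all checks for a single PID."""
    return {"pid": pid, "checks": [{"name": "example.check", "status": "SUCCESS"}]}


def index_document(index, doc):
    """Hypothetical stand-in for Solr indexing: store one document per PID."""
    index[doc["pid"]] = doc


def run_parallel(pids, max_workers=2):
    """Run checks for each PID in its own task and index each result separately."""
    index = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per PID, mirroring "Create dispatcher for PID n" in the diagram.
        futures = [pool.submit(run_checks, pid) for pid in pids]
        # Index each result as soon as it is ready, rather than one combined doc.
        for future in as_completed(futures):
            index_document(index, future.result())
    return index


if __name__ == "__main__":
    result_index = run_parallel(get_data_pids())
    print(sorted(result_index))
```

Indexing from a single thread here sidesteps any question of whether the indexing client is thread-safe; whether that holds for the real Solr indexing code would need to be checked.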

jeanetteclark added this to the 3.2 milestone Oct 2, 2024

jeanetteclark added this to Metadig Data Quality Oct 2, 2024

jeanetteclark moved this to Backlog in Metadig Data Quality Oct 2, 2024
