
Schema integration for xDD connections #93

Merged 26 commits into main from macrostrat-xdd-integration on Sep 19, 2024

Conversation

@davenquinn (Member) commented Sep 11, 2024

Starting point to address the schema and management elements of #90, to support:

  • Vector embeddings for map legend data
  • Knowledge graph/structured data extraction from papers

@davenquinn davenquinn changed the title Schema integration for Macrostrat vector embeddings and xDD knowledge graph integration Schema integration for vector embeddings and xDD knowledge graph integration Sep 11, 2024
@davenquinn davenquinn changed the title Schema integration for vector embeddings and xDD knowledge graph integration Schema integration for vector embeddings and xDD knowledge graph subsystem Sep 11, 2024
@davenquinn (Member, Author) commented Sep 12, 2024

This includes migration to the macrostrat_xdd schema, and an overall simpler table design. Major changes include:

  • Using integer ids instead of UUIDs
  • Using an extensible, foreign-keyed table for entity and relationship types (instead of a custom enum)
  • Terser column/table names
  • Model for citations
  • Explicit text start and end indices for entities (one possible shape is sketched below)
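
For illustration only, here is a minimal sketch of the direction (not the actual migration; as merged here the type table is keyed by name, with an integer id discussed further down, and the column names beyond name/description are assumptions):

-- Sketch: extensible, foreign-keyed type table replacing a custom enum,
-- plus an entity row with explicit text start/end indices.
-- Column names such as start_index/end_index are illustrative assumptions.
CREATE TABLE macrostrat_xdd.entity_type (
  name text PRIMARY KEY,
  description text
);

CREATE TABLE macrostrat_xdd.entity (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  type text REFERENCES macrostrat_xdd.entity_type (name),
  start_index integer,
  end_index integer
);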

There are a few remaining inconsistencies to be addressed:

  • A triangular dependency graph between model_run, source_text, and entity/relationship tables.
  • The source_text table contains nearly identical source-text windows that differ only by post-processing, which makes it hard to represent different relationships together. This may be a problem with Weaviate.
  • We haven't yet figured out how to integrate feedback and Macrostrat's larger user model

davenquinn and others added 4 commits September 12, 2024 00:06
* main:
  pull in tile-utils migrations via a git submodule
  grant access on tileserver schemas to macrostrat user
  add import for new migration
  Add tileserver migrations to migration system
@davenquinn davenquinn changed the title Schema integration for vector embeddings and xDD knowledge graph subsystem Schema integration for xDD connections Sep 12, 2024
… macrostrat-xdd-integration

* origin/macrostrat-xdd-integration:
  Format code and sort imports
* main:
  Format code and sort imports
  Fixed paleogeography submodule loading
  Change some references to macrostrat.map_scale to maps.map_scale
  Added migration for custom type
  Started the process of fixing dependency between mariadb migrations and map_scale custom type
  Changed the dependencies of the baseline migration
  Rename function for 'ad-hoc scripts' to separate more clearly from migrations
* main:
  Format code and sort imports
  Updated test subsystem help
  Move runtime tests to a clearer location
  Added a simple runtime test runner
  Initial runtime test runs with Pytest
@sarda-devesh

A couple of thoughts and questions about the updated schema:

  • Currently, the entity_type and relationship_type tables have two columns: name and description. Should we add an integer id column as the primary key rather than using the name as the primary key/identifier?
  • We have a publication table to represent the source of an article, and we can get the necessary information from https://xdd.wisc.edu/api/articles, but what does citation represent? You mentioned that you have a script to populate this field, and I think it makes sense to run it when we insert a record into publication. How can my server trigger/run that script?
  • We are still storing a model_run_id alongside each source_text, but does that make sense? It means we keep a copy of the text in the database for each run that uses that piece of text. Additionally, if we have user feedback, will we create a separate copy for that?
  • Finally, I was thinking we should rename the model_run table to all_runs and add fields run_type (user or model) and feedback_run_id, which user runs would use to store the model_run that the user provided feedback on (sketched below).
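
A sketch of that last proposal (types and constraints are assumptions; only the all_runs name and the run_type/feedback_run_id fields come from the comment above):

-- Hypothetical sketch of the proposed all_runs table.
CREATE TABLE macrostrat_xdd.all_runs (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  run_type text NOT NULL CHECK (run_type IN ('model', 'user')),
  feedback_run_id integer REFERENCES macrostrat_xdd.all_runs (id)  -- set for user runs
);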

@davenquinn (Member, Author) commented Sep 16, 2024

@sarda-devesh to respond to these points in order:

  1. I am fine with either an integer or the string representation of the type field; whatever is easier to set. In fact, I almost made this change but held back from it. Happy to go this direction if you want.
  2. The citation field is just a JSONB blob extracted directly from the xDD articles API (see the sketch after this list). It would be ideal if this citation caching happened up front (in your API) rather than as a separate step.
  3. Yes, this is one of the major remaining issues with the current design. Ideally the source_text model will be independent of an individual model run. I think this key is superfluous now, so we should probably delete it outright.
  4. That could work, but let me think a bit more about this as I do some UI prototyping.
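
To make the caching idea concrete, something along these lines could run inside the ingestion API (a sketch only; the paper_id column name and the upsert shape are assumptions, and only the citation JSONB itself is described above):

-- Sketch: cache the raw record from https://xdd.wisc.edu/api/articles
-- when a publication row is first inserted.
CREATE TABLE IF NOT EXISTS macrostrat_xdd.publication (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  paper_id text UNIQUE NOT NULL,  -- xDD document identifier (assumed name)
  citation jsonb                  -- raw citation record from the articles API
);

INSERT INTO macrostrat_xdd.publication (paper_id, citation)
VALUES (:paper_id, :citation_json)
ON CONFLICT (paper_id) DO UPDATE SET citation = EXCLUDED.citation;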

@sarda-devesh

  1. I think this might be helpful if we want to add more types, or potentially if we want to have more fine-grained definitions for these types.
  2. I can update my server to effectively perform cache_instruction() for each new article that it sees.
  3. I also think doing so might reveal whether we are missing any additional metadata to represent a "unique piece of text".
  4. Finally, I don't have read/write permissions for the entity_type, relationship_type, and publication tables.

@davenquinn (Member, Author)

@sarda-devesh some additional complexities to think about with user feedback:

  1. We want users to be able to provide feedback on individual entities.
  2. We want the "validated entities" to be linked back to the non-validated model output that they are based on.

I think this could be accomplished by giving each entity and relationship a superseded_by field that references its own table. We would need ways to mark an entity/relationship as deleted without replacing it. And of course that action (deletion or updating) would need to be tied to a changeset_id (maybe the extended model_run table you propose?) with a timestamp and username.
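
As a sketch (assuming the entity table lives in the macrostrat_xdd schema and uses integer ids), the self-referencing supersession could be added like this:

-- Sketch: each entity row can point at the row that supersedes it.
-- The current graph is then the set of rows where superseded_by IS NULL.
ALTER TABLE macrostrat_xdd.entity
  ADD COLUMN superseded_by integer REFERENCES macrostrat_xdd.entity (id);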

@davenquinn (Member, Author)

Maybe the model_run table should be renamed entity_set, something like:

CREATE TABLE entity_set (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  user_id integer,       -- set for user feedback (column types are assumed)
  model_version text,    -- set for model runs
  timestamp timestamptz DEFAULT now(),
  CHECK (user_id IS NOT NULL OR model_version IS NOT NULL)
);

I think this is similar to what you were proposing.

@sarda-devesh

In that case, I think that the superseded_by field makes sense, as it allows us to build a chain of updates for a relationship/entity, which we can easily traverse to build a training dataset to fine-tune the models.

Yeah - I was thinking that changeset_id can just be represented by a "user run".

@sarda-devesh commented Sep 16, 2024

For the entity_set table, I think we should have an entity_type field that represents whether this is a user run or a model run, as we might still like to store the model_version for a user run? Finally, we need to have an extraction_pipeline_id field somewhere, which is used to represent which version of the Job Manager was used to produce this result.

@sarda-devesh commented Sep 16, 2024

@davenquinn I added a field called run_id, of type text, to the model_run table to capture the run_id output by the models:

{
    "run_id": "run_2024-04-29_18:56:40.697006",
    "extraction_pipeline_id": "0",
    "model_name": "example_model",
    "model_version": "example_version"
}

Is that fine? I still use the id primary key to reference the run in the rest of the tables.

@davenquinn (Member, Author)

Hey @sarda-devesh – the run_id thing works. I didn't realize that field came from the pipeline, so sorry to have deleted it. Is that reference stored anywhere else, e.g., Weaviate?

The extraction_pipeline_id is fine too.

The thing that worries me about merging the model_run and entity_set tables is that model outputs will require certain metadata to be set (e.g., the extraction_pipeline_id and the run_id) while user-supplied feedback will require a different set of metadata (user_id, mostly). So we'll need a fancy check constraint or something if we want to catch bad data. But this isn't too worrisome.
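
For example, the constraint could look something like this (a sketch only, assuming the merged table is entity_set and keeps user_id, run_id, and extraction_pipeline_id columns):

-- Sketch: require either user metadata or model-run metadata, not a mix.
ALTER TABLE macrostrat_xdd.entity_set
  ADD CONSTRAINT entity_set_metadata_check CHECK (
    (user_id IS NOT NULL AND run_id IS NULL AND extraction_pipeline_id IS NULL)
    OR (user_id IS NULL AND run_id IS NOT NULL AND extraction_pipeline_id IS NOT NULL)
  );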

@sarda-devesh commented Sep 16, 2024

  1. The run_id is stored by the job manager to track jobs
  2. In that case, I think having two separate tables, one for model runs and one for "user runs", makes the most sense?

@davenquinn (Member, Author)

The nice thing about a superseded_by field is that we can get the most up-to-date graph by selecting everything where superseded_by IS NULL. For deletions, I guess we can just have a "deleted" boolean flag for both relationships and entities.
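
A sketch of that query, assuming a deleted boolean on both tables and the assumed table/column names from the earlier sketches:

-- Sketch: latest, non-deleted state of the extraction graph.
SELECT * FROM macrostrat_xdd.entity
WHERE superseded_by IS NULL AND NOT deleted;

SELECT * FROM macrostrat_xdd.relationship
WHERE superseded_by IS NULL AND NOT deleted;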

@sarda-devesh

Just bumping that I don't have permissions for the tables entity_type, relationship_type, and publication.

@davenquinn davenquinn merged commit cab1945 into main Sep 19, 2024