
Schema integration for xDD connections #93

Merged 26 commits into main from macrostrat-xdd-integration on Sep 19, 2024

Conversation

@davenquinn (Member) commented Sep 11, 2024

Starting point to address the schema and management elements of #90, to support:

  • Vector embeddings for map legend data
  • Knowledge graph/structured data extraction from papers

@davenquinn davenquinn changed the title Schema integration for Macrostrat vector embeddings and xDD knowledge graph integration Schema integration for vector embeddings and xDD knowledge graph integration Sep 11, 2024
@davenquinn davenquinn changed the title Schema integration for vector embeddings and xDD knowledge graph integration Schema integration for vector embeddings and xDD knowledge graph subsystem Sep 11, 2024
@davenquinn (Member, Author) commented Sep 12, 2024

This includes migration to the macrostrat_xdd schema, and an overall simpler table design. Major changes include:

  • Using integer ids instead of UUIDs
  • Using an extensible, foreign-keyed table for entity and relationship types (instead of a custom enum)
  • Terser column/table names
  • Model for citations
  • Explicit text start and end indices for entities (one possible shape is sketched below)
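
For illustration only, here is a minimal sketch of the direction (not the actual migration; as merged here the type table is keyed by name, with an integer id discussed further down, and the column names beyond name/description are assumptions):

-- Sketch: extensible, foreign-keyed type table replacing a custom enum,
-- plus an entity row with explicit text start/end indices.
-- Column names such as start_index/end_index are illustrative assumptions.
CREATE TABLE macrostrat_xdd.entity_type (
  name text PRIMARY KEY,
  description text
);

CREATE TABLE macrostrat_xdd.entity (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  type text REFERENCES macrostrat_xdd.entity_type (name),
  start_index integer,
  end_index integer
);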

There are a few remaining inconsistencies to be addressed:

  • A triangular dependency graph between model_run, source_text, and entity/relationship tables.
  • The source_text table contains nearly identical source-text windows that differ only by post-processing, which makes it hard to represent different relationships together. This may be a problem with Weaviate.
  • We haven't yet figured out how to integrate feedback and Macrostrat's larger user model

davenquinn and others added 4 commits September 12, 2024 00:06
* main:
  pull in tile-utils migrations via a git submodule
  grant access on tileserver schemas to macrostrat user
  add import for new migration
  Add tileserver migrations to migration system
@davenquinn davenquinn changed the title Schema integration for vector embeddings and xDD knowledge graph subsystem Schema integration for xDD connections Sep 12, 2024
… macrostrat-xdd-integration

* origin/macrostrat-xdd-integration:
  Format code and sort imports
* main:
  Format code and sort imports
  Fixed paleogeography submodule loading
  Change some references to macrostrat.map_scale to maps.map_scale
  Added migration for custom type
  Started the process of fixing dependency between mariadb migrations and map_scale custom type
  Changed the dependencies of the baseline migration
  Rename function for 'ad-hoc scripts' to separate more clearly from migrations
* main:
  Format code and sort imports
  Updated test subsystem help
  Move runtime tests to a clearer location
  Added a simple runtime test runner
  Initial runtime test runs with Pytest
@sarda-devesh

A couple of thoughts and questions about the updated schema:

  • Currently, the entity_type and relationship_type tables have two columns: name and description. Should we add an integer id column as the primary key rather than using the name as the primary key/identifier?
  • We have a publication table to represent the source of an article, and we can get the necessary information from https://xdd.wisc.edu/api/articles, but what does citation represent? You mentioned that you have a script to populate this field, and I think it makes sense to run it when we insert a record into publication. How can my server trigger/run that script?
  • We are still storing a model_run_id alongside each source_text, but does that make sense? It means we keep a copy of the text in the database for each run that uses that piece of text. Additionally, if we have user feedback, will we create a separate copy for that?
  • Finally, I was thinking we should rename the model_run table to all_runs and add fields run_type (user or model) and feedback_run_id, which user runs would use to store the model_run that the user provided feedback on (sketched below).
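
A sketch of that last proposal (types and constraints are assumptions; only the all_runs name and the run_type/feedback_run_id fields come from the comment above):

-- Hypothetical sketch of the proposed all_runs table.
CREATE TABLE macrostrat_xdd.all_runs (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  run_type text NOT NULL CHECK (run_type IN ('model', 'user')),
  feedback_run_id integer REFERENCES macrostrat_xdd.all_runs (id)  -- set for user runs
);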

@davenquinn (Member, Author) commented Sep 16, 2024

@sarda-devesh to respond to these points in order:

  1. I am fine with either an integer or the string representation of the type field; whatever is easier to set. In fact, I almost made this change but held back from it. Happy to go this direction if you want.
  2. The citation field is just a JSONB blob extracted directly from the xDD articles API (see the sketch after this list). It would be ideal if this citation caching happened up front (in your API) rather than as a separate step.
  3. Yes, this is one of the major remaining issues with the current design. Ideally the source_text model will be independent of an individual model run. I think this key is superfluous now, so we should probably delete it outright.
  4. That could work, but let me think a bit more about this as I do some UI prototyping.
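
To make the caching idea concrete, something along these lines could run inside the ingestion API (a sketch only; the paper_id column name and the upsert shape are assumptions, and only the citation JSONB itself is described above):

-- Sketch: cache the raw record from https://xdd.wisc.edu/api/articles
-- when a publication row is first inserted.
CREATE TABLE IF NOT EXISTS macrostrat_xdd.publication (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  paper_id text UNIQUE NOT NULL,  -- xDD document identifier (assumed name)
  citation jsonb                  -- raw citation record from the articles API
);

INSERT INTO macrostrat_xdd.publication (paper_id, citation)
VALUES (:paper_id, :citation_json)
ON CONFLICT (paper_id) DO UPDATE SET citation = EXCLUDED.citation;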

@sarda-devesh

  1. I think this might be helpful if we want to add more types, or potentially if we want to have more fine-grained definitions for these types.
  2. I can update my server to effectively perform cache_instruction() for each new article that it sees.
  3. I also think doing so might reveal whether we are missing any additional metadata to represent a "unique piece of text".
  4. Finally, I don't have read/write permissions for the entity_type, relationship_type, and publication tables.

@davenquinn (Member, Author)

@sarda-devesh some additional complexities to think about with user feedback:

  1. We want users to be able to provide feedback on individual entities.
  2. We want the "validated entities" to be linked back to the non-validated model output that they are based on.

I think this could be accomplished by giving each entity and relationship a superseded_by field that references its own table. We would need ways to mark an entity/relationship as deleted without replacing it. And of course that action (deletion or updating) would need to be tied to a changeset_id (maybe the extended model_run table you propose?) with a timestamp and username.
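
As a sketch (assuming the entity table lives in the macrostrat_xdd schema and uses integer ids), the self-referencing supersession could be added like this:

-- Sketch: each entity row can point at the row that supersedes it.
-- The current graph is then the set of rows where superseded_by IS NULL.
ALTER TABLE macrostrat_xdd.entity
  ADD COLUMN superseded_by integer REFERENCES macrostrat_xdd.entity (id);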

@davenquinn (Member, Author)

Maybe the model_run table should be renamed entity_set, something like:

CREATE TABLE entity_set (
  id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  user_id integer,       -- set for user feedback (column types are assumed)
  model_version text,    -- set for model runs
  timestamp timestamptz DEFAULT now(),
  CHECK (user_id IS NOT NULL OR model_version IS NOT NULL)
);

I think this is similar to what you were proposing.

@sarda-devesh

In that case, I think that the superseded_by field makes sense, as it allows us to build a chain of updates for a relationship/entity, which we can easily traverse to build a training dataset to fine-tune the models.

Yeah - I was thinking that changeset_id can just be represented by a "user run".

@sarda-devesh commented Sep 16, 2024

For the entity_set table, I think we should have an entity_type field that represents whether this is a user run or a model run, as we might still like to store the model_version for a user run? Finally, we need to have an extraction_pipeline_id field somewhere, which is used to represent which version of the Job Manager was used to produce this result.

@sarda-devesh commented Sep 16, 2024

@davenquinn I added a field called run_id, of type text, to the model_run table to capture the run_id output by the models:

{
    "run_id": "run_2024-04-29_18:56:40.697006",
    "extraction_pipeline_id": "0",
    "model_name": "example_model",
    "model_version": "example_version"
}

Is that fine? I still use the id primary key to reference the run in the rest of the tables.

@davenquinn (Member, Author)

Hey @sarda-devesh – the run_id thing works. I didn't realize that field came from the pipeline, so sorry to have deleted it. Is that reference stored anywhere else, e.g., Weaviate?

The extraction_pipeline_id is fine too.

The thing that worries me about merging the model_run and entity_set tables is that model outputs will require certain metadata to be set (e.g., the extraction_pipeline_id and the run_id) while user-supplied feedback will require a different set of metadata (user_id, mostly). So we'll need a fancy check constraint or something if we want to catch bad data. But this isn't too worrisome.
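
For example, the constraint could look something like this (a sketch only, assuming the merged table is entity_set and keeps user_id, run_id, and extraction_pipeline_id columns):

-- Sketch: require either user metadata or model-run metadata, not a mix.
ALTER TABLE macrostrat_xdd.entity_set
  ADD CONSTRAINT entity_set_metadata_check CHECK (
    (user_id IS NOT NULL AND run_id IS NULL AND extraction_pipeline_id IS NULL)
    OR (user_id IS NULL AND run_id IS NOT NULL AND extraction_pipeline_id IS NOT NULL)
  );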

@sarda-devesh commented Sep 16, 2024

  1. The run_id is stored by the job manager to track jobs
  2. In that case, I think having two separate tables, one for model runs and one for "user runs", makes the most sense?

@davenquinn (Member, Author)

The nice thing about a superseded_by field is that we can get the most up-to-date graph by selecting everything where superseded_by IS NULL. For deletions, I guess we can just have a "deleted" boolean flag for both relationships and entities.
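
A sketch of that query, assuming a deleted boolean on both tables and the assumed table/column names from the earlier sketches:

-- Sketch: latest, non-deleted state of the extraction graph.
SELECT * FROM macrostrat_xdd.entity
WHERE superseded_by IS NULL AND NOT deleted;

SELECT * FROM macrostrat_xdd.relationship
WHERE superseded_by IS NULL AND NOT deleted;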

@sarda-devesh

Just bumping that I don't have permissions for the tables entity_type, relationship_type, and publication.

@davenquinn davenquinn merged commit cab1945 into main Sep 19, 2024