do-me/oss4geo-semantic

oss4geo-semantic

Data exploration scripts for oss4geo.org based on text embeddings for semantic search and topic clustering based on dimensionality reduction.
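
As a rough illustration of the semantic search idea, the sketch below ranks items by cosine similarity between a query vector and stored vectors. The function names, project names, and the 3-dimensional toy vectors are hypothetical stand-ins; real bge-m3 embeddings are much larger and would come from an embedding model, which is not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors standing in for real text embeddings
toy_embeddings = {
    "raster-tool": [0.9, 0.1, 0.0],
    "cad-viewer": [0.1, 0.9, 0.1],
    "stac-client": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # would normally be the embedded search phrase

results = semantic_search(query, toy_embeddings, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With real embeddings the same ranking step applies; only the vectors change.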

This is a work in progress and still needs some fine-tuning, but you can already explore the scatterplot, which is built from text embeddings (bge-m3) followed by dimensionality reduction with t-SNE.
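
The dimensionality reduction step could look roughly like the sketch below. It assumes scikit-learn's `TSNE` and uses random vectors in place of real bge-m3 embeddings; the array sizes and the `perplexity` value are arbitrary choices for the demo, not the settings used in this repository.

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-in for an (n_items, embedding_dim) matrix of text embeddings
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 16)).astype(np.float32)

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(embeddings)  # one (x, y) pair per item

print(coords.shape)
```

Each row of `coords` can then be drawn as one point in the scatterplot. t-SNE output is sensitive to `perplexity` and the random seed, so distances between far-apart clusters should be read with caution.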

To Do

  • improve the script so it can also access repos that are not on GitHub but on self-hosted GitLab instances or similar
  • before creating the embeddings, it might be best to let an LLM create a summary first. Too much code or special formatting can make the embeddings inaccurate; they work best on clear textual descriptions of what the package does
  • the summarization step above would also remove special characters; alternatively, just convert the Markdown to plain text and strip the code
  • create a simple static frontend app (e.g. based on https://github.com/flekschas/regl-scatterplot/ or similar)
  • create LLM-based labels for the clusters and add them to the plot (e.g. one can see clusters for raster processing, CAD, STAC and so on)
  • improve the plot tooltips and add an LLM-generated mini summary of 1-2 sentences
  • color points based on user-specified categories etc.
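
The "convert markdown to normal text and strip code" item from the list above could be approximated with a few regular expressions, as in the sketch below. The function name `markdown_to_text` and the sample input are hypothetical, and the patterns are heuristics rather than a full Markdown parser; a dedicated Markdown library would likely be more robust.

```python
import re

def markdown_to_text(md: str) -> str:
    """Heuristically reduce Markdown to plain prose before embedding."""
    text = re.sub(r"```.*?```", " ", md, flags=re.DOTALL)      # fenced code blocks
    text = re.sub(r"`[^`]*`", " ", text)                       # inline code
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", " ", text)          # images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)       # links: keep the label
    text = re.sub(r"<[^>]+>", " ", text)                       # HTML tags (rough match)
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]*", "", text, flags=re.MULTILINE)  # heading marks
    text = re.sub(r"^[ \t]*[-*+][ \t]+", "", text, flags=re.MULTILINE)       # list bullets
    text = re.sub(r"(\*{1,3}|~~)(.+?)\1", r"\2", text)         # bold, italic, strikethrough
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace

# Hypothetical sample input
sample = (
    "# geo-tool\n\n"
    "A **fast** library for [raster](https://example.org) processing.\n\n"
    "```python\nimport geo\n```\n\n"
    "Use `geo.run()` to start."
)
cleaned = markdown_to_text(sample)
print(cleaned)
```

Underscore-based emphasis is deliberately left alone so that identifiers such as `snake_case` names are not mangled.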

Why?
