
[Question]: knowledge base : pgvector db structure need help #17446

Open
naveenr-btc opened this issue Jan 7, 2025 · 6 comments
Labels
question Further information is requested

Comments

@naveenr-btc

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I have embeddings of text already. The embeddings are available in a table that has structure as

my_embeddings (
    id BIGSERIAL PRIMARY KEY,
    embedding vector(1536)
);

When using the knowledge base, I'm getting an error saying that certain columns are missing, so I need to understand which columns those are. It would also be helpful to have an example of what those columns should hold.
Note: my use case is that the LLMs need to understand the content from the knowledge base to produce better results, not to query the actual content from it.

@naveenr-btc naveenr-btc added the question Further information is requested label Jan 7, 2025

dosubot bot commented Jan 7, 2025

Hey @naveenr-btc! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this together!

To use pgvector with LlamaIndex, your knowledge base table needs to have specific columns. The required columns are:

  • id UUID DEFAULT (gen_random_uuid())
  • embedding VECTOR({num_dimensions})
  • content TEXT
  • metadata JSONB

If tenant_aware is set to True, you also need:

  • tenant_id UUID

Your current table structure is missing the content and metadata columns, which are necessary for LlamaIndex to function correctly. You can refer to the _create_tables method in the NileVectorStore class for more details on the table structure [1].

Additionally, here's an example of setting up a knowledge base table using SQLAlchemy and pgvector:

from pgvector.sqlalchemy import Vector
from sqlalchemy import insert, create_engine, String, Integer
from sqlalchemy.orm import declarative_base, mapped_column

# The connection string is a placeholder; point it at your own database.
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/vector_db")

Base = declarative_base()

class SECTextChunk(Base):
    __tablename__ = "sec_text_chunk"

    id = mapped_column(Integer, primary_key=True)
    page_label = mapped_column(Integer)
    file_name = mapped_column(String)
    text = mapped_column(String)
    embedding = mapped_column(Vector(384))

# Recreate the table from the model above.
Base.metadata.drop_all(engine)
Base.metadata.create_all(engine)

# Insert into the database; `nodes` is assumed to be a list of LlamaIndex
# nodes whose embeddings have already been computed.
for node in nodes:
    row_dict = {
        "text": node.get_content(),
        "embedding": node.embedding,
        **node.metadata,
    }
    stmt = insert(SECTextChunk).values(**row_dict)
    with engine.connect() as connection:
        connection.execute(stmt)
        connection.commit()

This example demonstrates how to define a table schema and insert data with embeddings into a PostgreSQL database with pgvector [2].
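
Alternatively, rather than managing the schema by hand, you can let the LlamaIndex Postgres vector store create and own its table (it prefixes the configured name with data_, e.g. data_my_embeddings). The following is a minimal sketch, assuming the llama-index-vector-stores-postgres package is installed, an embedding model is configured, and the connection details below are placeholders:

from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# All connection parameters below are placeholders for illustration.
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="my_embeddings",  # the store creates and manages "data_my_embeddings"
    embed_dim=1536,              # must match the dimensionality of your embeddings
)

# Building an index this way lets LlamaIndex populate id, node_id, text,
# metadata_ and embedding in the managed table automatically.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    [Document(text="Example ICD code description.")],  # sample document
    storage_context=storage_context,
)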


@logan-markewich
Collaborator

The db structure is here

I haven't taken the time to add options for people to "bring their own" table in the postgres vector store class.

@naveenr-btc
Author

OK, thanks for pointing that out. Can you tell me what those missing columns should hold?

@naveenr-btc
Author

naveenr-btc commented Jan 13, 2025

Hi, can you tell me what node_id and metadata_ should hold? I always get an empty response.

2025-01-13 17:58:38,971 INFO sqlalchemy.engine.Engine select pg_catalog.version()
2025-01-13 17:58:38,971 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-01-13 17:58:38,999 INFO sqlalchemy.engine.Engine select current_schema()
2025-01-13 17:58:38,999 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-01-13 17:58:39,009 INFO sqlalchemy.engine.Engine show standard_conforming_strings
2025-01-13 17:58:39,009 INFO sqlalchemy.engine.Engine [raw sql] {}
2025-01-13 17:58:39,020 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-01-13 17:58:39,021 INFO sqlalchemy.engine.Engine SELECT schema_name FROM information_schema.schemata WHERE schema_name = %(schema_name)s
2025-01-13 17:58:39,021 INFO sqlalchemy.engine.Engine [generated in 0.00036s] {'schema_name': 'public'}
2025-01-13 17:58:39,032 INFO sqlalchemy.engine.Engine COMMIT
2025-01-13 17:58:39,037 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-01-13 17:58:39,037 INFO sqlalchemy.engine.Engine CREATE EXTENSION IF NOT EXISTS vector
2025-01-13 17:58:39,038 INFO sqlalchemy.engine.Engine [generated in 0.00029s] {}
2025-01-13 17:58:39,047 INFO sqlalchemy.engine.Engine COMMIT
2025-01-13 17:58:39,062 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-01-13 17:58:39,069 INFO sqlalchemy.engine.Engine SELECT pg_catalog.pg_class.relname
FROM pg_catalog.pg_class JOIN pg_catalog.pg_namespace ON pg_catalog.pg_namespace.oid = pg_catalog.pg_class.relnamespace
WHERE pg_catalog.pg_class.relname = %(table_name)s AND pg_catalog.pg_class.relkind = ANY (ARRAY[%(param_1)s, %(param_2)s, %(param_3)s, %(param_4)s, %(param_5)s]) AND pg_catalog.pg_namespace.nspname = %(nspname_1)s
2025-01-13 17:58:39,070 INFO sqlalchemy.engine.Engine [generated in 0.00055s] {'table_name': 'data_icd_diag_code_list_embeddings_llama', 'param_1': 'r', 'param_2': 'p', 'param_3': 'f', 'param_4': 'v', 'param_5': 'm', 'nspname_1': 'public'}
2025-01-13 17:58:39,082 INFO sqlalchemy.engine.Engine COMMIT
2025-01-13 17:58:39,088 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2025-01-13 17:58:39,099 INFO sqlalchemy.engine.Engine SELECT public.data_icd_diag_code_list_embeddings_llama.id, public.data_icd_diag_code_list_embeddings_llama.node_id, public.data_icd_diag_code_list_embeddings_llama.text, public.data_icd_diag_code_list_embeddings_llama.metadata_, public.data_icd_diag_code_list_embeddings_llama.embedding <=> %(embedding_1)s AS distance
FROM public.data_icd_diag_code_list_embeddings_llama ORDER BY distance asc
LIMIT %(param_1)s
2025-01-13 17:58:39,100 INFO sqlalchemy.engine.Engine [generated in 0.00437s] {'embedding_1': '[-0.006608377210795879,0.009919406846165657,-0.01648673787713051,-0.01925048977136612,-0.015364819206297398,0.02609146013855934,-0.03855570778250694, ... (32642 characters truncated) ... 0.021056506782770157,0.011862242594361305,-0.017512883991003036,0.020646048709750175,-0.020755503326654434,0.012163245119154453,0.007135131862014532]', 'param_1': 2}
2025-01-13 17:58:39,120 INFO sqlalchemy.engine.Engine COMMIT
output : Empty Response
END TASK d50c8376-b66a-4c21-9cdc-238726c16bac :: end time : 1736771319.1257265 :: execution time : 1.2649040222167969
END PIPELINE Pipeline :: end time : 1736771319.1257265 :: execution time : 1.2649040222167969
Completed

Process finished with exit code 0

@dosu

@naveenr-btc
Author

@dosu can you help me with this issue?


dosubot bot commented Jan 13, 2025

The node_id and metadata_ columns in the PGVectorStore class serve specific purposes:

  1. node_id: This column uniquely identifies each node in the database and is used when managing and retrieving data. It is also used to deduplicate results, ensuring that no duplicate nodes are included in the final query results [1].

  2. metadata_: This column stores additional information about each data entry in a JSON or JSONB format. The metadata can include any supplementary data that might be useful for reconstructing nodes or providing context during query execution. It is retrieved and included in the query results, allowing for more detailed and context-rich data retrieval [1].

If you're getting empty responses, ensure that these columns are correctly populated in your database. The node_id should be unique for each entry, and the metadata_ should contain relevant JSON data that can be used during queries. If these columns are not populated correctly, it could lead to issues with data retrieval and processing.
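
As a concrete illustration of what these columns end up holding, here is a hedged sketch that adds pre-computed embeddings through the vector store itself, so node_id, text, metadata_ and embedding are written by LlamaIndex rather than by hand. The connection details and embedding values are placeholders, and the table name is taken from the query log above:

from llama_index.core.schema import TextNode
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.vector_stores.postgres import PGVectorStore

# Placeholder connection details; embed_dim must match your vectors.
vector_store = PGVectorStore.from_params(
    database="vector_db",
    host="localhost",
    port="5432",
    user="postgres",
    password="password",
    table_name="icd_diag_code_list_embeddings_llama",
    embed_dim=1536,
)

# Each node carries its own pre-computed embedding; id_ becomes node_id,
# text goes into the text column, and metadata is serialized into metadata_.
nodes = [
    TextNode(
        id_="icd-A00",               # stored in node_id (must be unique)
        text="A00: Cholera",         # stored in the text column
        metadata={"code": "A00"},    # stored, with node info, in metadata_
        embedding=[0.0] * 1536,      # replace with the real 1536-dim vector
    ),
]
vector_store.add(nodes)

# Query with a pre-computed query embedding.
result = vector_store.query(
    VectorStoreQuery(query_embedding=[0.0] * 1536, similarity_top_k=2)
)
print([n.get_content() for n in result.nodes])

If the rows were inserted into the table directly instead, metadata_ may not contain the node information LlamaIndex expects when reconstructing nodes, which is one possible cause of an empty response.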

