
[Question]: RAG implementation from existing knowledge graph #17810

Open
laurasck14 opened this issue Feb 13, 2025 · 5 comments
Labels
question Further information is requested

Comments

@laurasck14

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am currently trying to implement RAG from an EXISTING property graph using LlamaIndex. That is, I already have nodes and edges, so I do not need to extract paths from text or embed documents; I only need to embed the nodes themselves.

I have tried two approaches to make this work:

1. Defining the nodes as TextNode objects with their corresponding relationships

Nodes have a format like:
TextNode(id_='99853', embedding=None, metadata={'node_index': '99853', 'node_id': '43959', 'node_type': 'disease', 'node_name': 'pseudolymphoma', 'node_source': 'MONDO', 'mondo_id': '43959', 'mondo_name': 'pseudolymphoma'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.CHILD: '5'>: [RelatedNodeInfo(node_id='35769', node_type=None, metadata={'relation': 'disease_disease', 'display_relation': 'parent-child'}, hash=None)]}, metadata_template='{key}: {value}', metadata_separator='\n', text="A neoplastic process...(shortened)", mimetype='text/plain', start_char_idx=None, end_char_idx=None, metadata_seperator='\n', text_template='{metadata_str}\n\n{content}')

Not all nodes have text, but all of them have metadata.
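
For reference, one node in this shape can be built roughly as follows (a trimmed-down sketch of the example above; most metadata keys are omitted):

from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo

# Trimmed-down version of the node shown above.
node = TextNode(
    id_="99853",
    text="A neoplastic process...",
    metadata={"node_type": "disease", "node_name": "pseudolymphoma", "node_source": "MONDO"},
    relationships={
        NodeRelationship.CHILD: [
            RelatedNodeInfo(
                node_id="35769",
                metadata={"relation": "disease_disease", "display_relation": "parent-child"},
            )
        ]
    },
)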

I use a NebulaGraph property graph store to hold the nodes and edges, and create the index as follows:

graph_store = NebulaPropertyGraphStore(
    space= "PrimeKG2_manual", 
    username = "root",
    password = "nebula",
    url = "nebula://localhost:9669",
    props_schema= "`node_index` STRING, `node_type` STRING, `node_id` STRING, `node_name` STRING, `node_source` STRING, `mondo_id` STRING, `mondo_name` STRING, `group_id_bert` STRING, `group_name_bert` STRING, `orphanet_prevalence` STRING, `umls_description` STRING, `orphanet_definition` STRING, `orphanet_epidemiology` STRING, `orphanet_clinical_description` STRING, `orphanet_management_and_treatment` STRING, `mayo_symptoms` STRING, `mayo_causes` STRING, `mayo_risk_factors` STRING, `mayo_complications` STRING, `mayo_prevention` STRING, `mayo_see_doc` STRING, `display_relation` STRING, `_node_content` STRING, `_node_type` STRING, `document_id` STRING, `doc_id` STRING, `ref_doc_id` STRING, `triplet_source_id` STRING",
)

storage_context = StorageContext.from_defaults(
    property_graph_store=graph_store, 
)

index = PropertyGraphIndex(
    nodes=all_nodes,
    storage_context=storage_context,
    property_graph_store=graph_store,
    vector_store=vector_store,
    llm=llm,
    show_progress=True,
    embed_kg_nodes=True,
    embed_model=Settings.embed_model,
)

index.storage_context.persist(persist_dir="~/scratch-llm/storage/PrimeKG_index_2/")

However, the nodes seem to be treated as if they were documents: graph_store.json ends up empty and the nodes show up in the index's docstore.json instead.

I have also tried defining the nodes as EntityNodes and ChunkNodes, upserting them manually into the Nebula graph store, and constructing the index from there with .from_existing, but that does not work either.
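
For reference, that manual route looks roughly like this (a minimal sketch; the parent node's name is a placeholder, and nodes upserted directly into the store bypass the index's own embedding step):

from llama_index.core import Settings
from llama_index.core.graph_stores.types import EntityNode, Relation
from llama_index.core.indices.property_graph import PropertyGraphIndex

# Two entity nodes and one relation, written straight into the Nebula store.
disease = EntityNode(name="pseudolymphoma", label="disease", properties={"node_index": "99853"})
parent = EntityNode(name="parent-disease-placeholder", label="disease", properties={"node_index": "35769"})
rel = Relation(
    label="parent-child",
    source_id=disease.id,
    target_id=parent.id,
    properties={"relation": "disease_disease"},
)
graph_store.upsert_nodes([disease, parent])
graph_store.upsert_relations([rel])

# Re-open the populated store as an index; embed_kg_nodes only affects nodes
# inserted through the index, so the nodes upserted above stay un-embedded.
index = PropertyGraphIndex.from_existing(
    property_graph_store=graph_store,
    embed_kg_nodes=True,
    embed_model=Settings.embed_model,
)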

2. Using the KnowledgeGraphIndex class, even though it is deprecated
This method does seem to build an index from the existing nodes without treating them as text documents:

from llama_index.core import KnowledgeGraphIndex
index = KnowledgeGraphIndex(
    llm=llm,
    include_embeddings=True,
    show_progress=True,
    property_graph_store=graph_store,
    storage_context=storage_context,
    objects=index_nodes,
    use_async=True,
    embed_model=Settings.embed_model,
)
index.storage_context.persist(persist_dir="~/scratch-llm/storage/PrimeKG_index_2/")

However, no embeddings are generated when the nodes are processed (Generating embeddings: 0it [00:00, ?it/s]), and when I use the index via as_retriever or as a chat_engine it answers that there is no information about the topic in the database, even though there are nodes referring to it.

My questions:

  • Is there a way to correctly implement RAG from an existing knowledge graph using PropertyGraphIndex and NebulaPropertyGraphStore? If so, what would be the correct way to do it? (All the documentation I found is based on generating a graph from text documents.)
  • When using KnowledgeGraphIndex, how can I ensure embeddings are generated for all nodes?

I'm fairly new to NLP and to implementing RAG, so any help or guidance is very much appreciated! :)

@laurasck14 laurasck14 added the question Further information is requested label Feb 13, 2025

dosubot bot commented Feb 13, 2025

Hey @laurasck14! I'm here to help you with any coding conundrums. Let's tackle this challenge together!

To implement RAG from an existing knowledge graph using PropertyGraphIndex and NebulaPropertyGraphStore, and to ensure embeddings are generated for all nodes when using KnowledgeGraphIndex, follow these steps:

  1. Set Up NebulaGraph:
    Ensure your NebulaGraph instance is running with the necessary schema. Define your space, tags, and edges appropriately.

  2. Prepare Storage Context with NebulaGraphStore:

    from llama_index.core import StorageContext
    from llama_index.graph_stores.nebula import NebulaGraphStore
    
    graph_store = NebulaGraphStore(
        space_name="PrimeKG2_manual",
        edge_types=["relationship"],  # Define your edge types
        rel_prop_names=["display_relation"],  # Define your relationship properties
        tags=["entity"],  # Define your tags
    )
    storage_context = StorageContext.from_defaults(graph_store=graph_store)
  3. Generate KnowledgeGraphIndex with Embeddings:

    Use the KnowledgeGraphIndex to create an index from your nodes, ensuring embeddings are included:

    from llama_index.core import KnowledgeGraphIndex
    
    kg_index = KnowledgeGraphIndex(
        llm=llm,
        include_embeddings=True,
        show_progress=True,
        property_graph_store=graph_store,
        storage_context=storage_context,
        objects=index_nodes,  # Your existing nodes
        embed_model=Settings.embed_model,
    )
    kg_index.storage_context.persist(persist_dir="~/scratch-llm/storage/PrimeKG_index_2/")

    Ensure that include_embeddings=True to generate embeddings for all nodes.

  4. Troubleshooting Embedding Generation:
    If embeddings are not being generated, ensure that:

    • The nodes have sufficient metadata or text content to generate embeddings.
    • The embedding model is correctly configured and accessible.
    • The embed_kg_nodes parameter is set to True if using PropertyGraphIndex.

By following these steps, you should be able to implement a RAG from an existing knowledge graph and ensure embeddings are generated for all nodes [1][2].
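
As a quick sanity check for the embedding-model point above (a minimal sketch, assuming Settings.embed_model is the model you configured):

from llama_index.core import Settings

# If this raises or hangs, the embedding model itself is the problem,
# not the graph or index construction.
vector = Settings.embed_model.get_text_embedding("pseudolymphoma")
print(len(vector))  # should print the embedding dimension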


@logan-markewich
Collaborator

@laurasck14 If your property graph was created WITHOUT llama-index, your options are limited. You can create a retriever from an existing graph, but likely only the TextToCypherRetriever or the CypherTemplateRetriever will work.

If your graph was already created with llama-index, any retriever will work fine.

In either case, to initialize from an existing graph like Nebula, all you need is:

index = PropertyGraphIndex.from_existing(nebula_pg_graph_store)
retriever = index.as_retriever(sub_retrievers=[CypherTemplateRetriever(...)], llm=llm)
query_engine = index.as_query_engine(sub_retrievers=[CypherTemplateRetriever(...)], llm=llm)
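
For context, a CypherTemplateRetriever pairs a fixed Cypher template with a small Pydantic model whose fields the LLM fills in at query time. A minimal sketch (the MATCH pattern and labels/properties below are assumptions written in generic Cypher; they would need to be adapted to the nGQL schema NebulaPropertyGraphStore actually uses):

from pydantic import BaseModel, Field
from llama_index.core.indices.property_graph import CypherTemplateRetriever, PropertyGraphIndex

# Hypothetical template -- adjust node labels and property names to the real graph schema.
cypher_query = """
MATCH (d:disease) WHERE d.node_name IN $names
RETURN d.node_name, d.umls_description
"""

class TemplateParams(BaseModel):
    """Parameters the LLM extracts from the question to fill the template."""
    names: list[str] = Field(description="Entity names or keywords to look up in the knowledge graph.")

index = PropertyGraphIndex.from_existing(nebula_pg_graph_store)
template_retriever = CypherTemplateRetriever(index.property_graph_store, TemplateParams, cypher_query)
retriever = index.as_retriever(sub_retrievers=[template_retriever], llm=llm)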

@laurasck14
Author

@dosu @logan-markewich thank you for your answers. Is there a way to generate embeddings from the existing graph, using the text associated with the nodes and their metadata?

The idea would be to retrieve relevant nodes based on semantic similarity against the node embeddings, and then traverse the graph to different depths depending on the question asked.
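
For what it's worth, that retrieval pattern maps onto LlamaIndex's VectorContextRetriever, which embeds the question, matches it against stored node embeddings, and then expands a configurable number of hops from each hit. A minimal sketch, assuming the graph nodes already carry embeddings (or a vector_store holding them is supplied) and that index and Settings.embed_model are set up as above:

from llama_index.core import Settings
from llama_index.core.indices.property_graph import VectorContextRetriever

vector_retriever = VectorContextRetriever(
    index.property_graph_store,
    embed_model=Settings.embed_model,
    similarity_top_k=5,   # how many nodes to match by embedding similarity
    path_depth=2,         # how many hops to traverse out from each matched node
)
retriever = index.as_retriever(sub_retrievers=[vector_retriever])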


dosubot bot commented Feb 17, 2025

To generate embeddings from an existing graph using the text associated with the nodes and their metadata, and to retrieve relevant nodes based on semantic similarity while considering node embeddings and traversing the graph at different depths, you can use the TreeSelectLeafEmbeddingRetriever class from the LlamaIndex framework. This class allows you to traverse the index graph using embedding similarity between the query and the node text.

Here's a general approach:

  1. Initialize the Retriever: Create an instance of TreeSelectLeafEmbeddingRetriever, providing it with a TreeIndex and an optional embedding model (BaseEmbedding).

  2. Manage Node Embeddings: Use vector stores like ChromaVectorStore, ClickHouseVectorStore, or OceanBaseVectorStore to store and manage node embeddings. This allows you to add, query, and filter nodes based on embeddings and metadata.

  3. Query the Graph: Use the _query_level method to recursively query the graph. This method selects nodes based on the highest similarity to the query.

  4. Compute Similarities: The _get_query_text_embedding_similarities method computes the similarity between the query embedding and node embeddings. It caches the embeddings for efficiency.

  5. Select Nodes: The _get_most_similar_nodes method selects nodes with the highest similarity scores, considering the child_branch_factor to determine how many child nodes to explore at each level.

  6. Use Metadata Filters: Utilize metadata filters in your vector store to refine your queries based on node metadata.

Here's a simplified code snippet demonstrating the setup using OceanBaseVectorStore:

from llama_index.core.indices.tree.select_leaf_embedding_retriever import TreeSelectLeafEmbeddingRetriever
from llama_index.core.indices.tree.base import TreeIndex
from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.vector_stores.oceanbase import OceanBaseVectorStore
from llama_index.core.vector_stores.types import VectorStoreQuery, MetadataFilters, MetadataFilter

# Assume `tree_index` is your TreeIndex and `embed_model` is your embedding model
retriever = TreeSelectLeafEmbeddingRetriever(
    index=tree_index,
    embed_model=embed_model,
    child_branch_factor=2  # Adjust based on how many child nodes to consider
)

# Assume `oceanbase_store` is your OceanBaseVectorStore instance
# Add node embeddings to the vector store
oceanbase_store.add(node_embeddings)

# Query with embedding similarity
query_embedding = [1.0, 0.0, 0.0]  # Example query embedding
result = oceanbase_store.query(VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=1))

# Query with metadata filters
filters = MetadataFilters(
    filters=[MetadataFilter(key="author", value="Marie Curie", operator="==")]
)
result_with_filters = oceanbase_store.query(VectorStoreQuery(filters=filters, similarity_top_k=1))

This setup allows you to retrieve nodes based on semantic similarity, considering both the text and metadata of the nodes, and traverse the graph at different depths depending on the query [1][2][3][4][5].


@logan-markewich
Collaborator

@laurasck14 there's not really an easy way to generate embeddings for an existing graph. It would be pretty complex: you'd need to pull down the data to embed, embed it, and then either put the embeddings back into the graph (if supported) or into a standalone vector store that references IDs from your graph.
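
To make that concrete, a rough sketch of the backfill described above (assuming graph_store is the NebulaPropertyGraphStore from earlier, that its get() returns all nodes when called without filters, and that Settings.embed_model is configured; whether the store actually persists the embeddings is store-dependent):

from llama_index.core import Settings

# 1) Pull the graph nodes down from the store.
kg_nodes = graph_store.get()  # depending on the store, you may need to pass ids or properties

# 2) Build a text representation per node (ChunkNodes have .text, EntityNodes have .name).
texts = [getattr(n, "text", None) or getattr(n, "name", "") for n in kg_nodes]

# 3) Embed the texts and attach the vectors to the nodes.
embeddings = Settings.embed_model.get_text_embedding_batch(texts, show_progress=True)
for node, emb in zip(kg_nodes, embeddings):
    node.embedding = emb

# 4) Write the embeddings back into the graph (if the store supports it),
#    or store them in a separate vector store keyed by node.id instead.
graph_store.upsert_nodes(kg_nodes)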
