[Question]: VectorStoreIndex log status for each file #17170
Comments
To address your questions about logging and embedding with `VectorStoreIndex` and Qdrant:
By following these steps, you can effectively manage the embedding process and logging for each file when using `VectorStoreIndex` with Qdrant.
Can you write me the code?
Certainly! Here's an example of how you might implement logging for the embedding process in the `QdrantVectorStore` class:

```python
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class QdrantVectorStore:
    # Existing methods...

    def _build_points(self, nodes, sparse_vector_name):
        points = []
        node_ids = []
        for node in nodes:
            # Assuming `node` has an `embedding` attribute
            point = {
                'id': node.node_id,
                'vector': node.embedding,
                'payload': {
                    'text': node.text,
                    'metadata': node.metadata
                }
            }
            points.append(point)
            node_ids.append(node.node_id)

            # Log the embedding process
            logger.info(f"Processed embedding for node ID: {node.node_id}")

        return points, node_ids

# Example usage
async def create_index_nodes(nodes, embed_model):
    vector_index = VectorStoreIndex(
        nodes=nodes,
        embed_model=embed_model,
        storage_context=storage_context,
        show_progress=True
    )

    # Assuming `reader` is an instance of a class with `iter_data` method
    for doc in reader.iter_data():
        # Convert documents to nodes if necessary
        nodes = convert_documents_to_nodes(doc)
        vector_index.insert(nodes)

# Helper function to convert documents to nodes
def convert_documents_to_nodes(documents):
    # Implement conversion logic here
    return nodes
```

Make sure to replace the placeholders (`storage_context`, `reader`, and the body of `convert_documents_to_nodes`) with your actual objects and conversion logic.
If `VectorStoreIndex` is already generating embeddings for me, then I don't see the use of `vector_index.insert(nodes)`. @dosu
In your case, since the embeddings are generated during the initialization with `VectorStoreIndex`, you don't need to call `vector_index.insert(nodes)` again for the same nodes; `insert` is intended for adding new nodes after the index has been created.
Yes, but I want to be able to do things such as logging for each file after it is finished. If `VectorStoreIndex` will embed everything, then I won't be able to control that. What can I do in this case? @dosu, can I just initialize a `VectorStoreIndex`?
To achieve logging, or to perform actions after each file is processed while using `VectorStoreIndex`, you have a few options:
By implementing one or more of these strategies, you can gain more control over the embedding process and perform logging or other actions after each file or node is processed. |
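For instance, the simplest of these strategies, starting from an empty index and processing one file at a time, can be sketched library-agnostically. This is only a sketch: `embed_and_store` is a hypothetical stand-in for whatever call actually embeds a document's nodes and writes them to Qdrant (e.g. `index.insert(...)`).

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def index_documents_one_by_one(documents, embed_and_store):
    """Process documents individually so each one can be logged on completion.

    `documents` is an iterable of (doc_id, document) pairs; `embed_and_store`
    stands in for the call that embeds a document and writes it to the store.
    Returns the number of documents stored successfully.
    """
    documents = list(documents)
    stored = 0
    for doc_id, doc in documents:
        try:
            embed_and_store(doc)
            stored += 1
            logger.info("Finished embedding and storing document %s", doc_id)
        except Exception:
            # Log the failure but keep going with the remaining files
            logger.exception("Failed to embed/store document %s", doc_id)
    logger.info("Done: %d of %d documents stored", stored, len(documents))
    return stored
```

Because each file gets its own call, a failure in one file is logged without aborting the rest, and the final log line tells you how many files made it in.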
Can you write me the code using the CallbackManager? I will send documents, then I need to parse them with get_nodes_from_documents and send them to VectorStoreIndex. Make sure the CallbackManager logs when all the documents have been stored, so I can log node by node but also log when the whole document is finished. I want to know at the end how many documents have been successfully stored. @dosu
Here's how you can use the `CallbackManager` to log per-node progress and report the total number of stored documents:

```python
from llama_index import CallbackManager, VectorStoreIndex
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CustomCallback:
    def __init__(self):
        self.total_documents = 0

    def on_embedding(self, event):
        # Log the embedding process for each node
        logger.info(f"Embedding {event['num_texts']} texts.")

    def on_node_parsing(self, event):
        # Increment the total number of documents stored
        self.total_documents += len(event['nodes'])

    def on_all_documents_stored(self):
        # Log when all documents have been successfully stored
        logger.info(f"All documents have been stored. Total documents: {self.total_documents}")

# Create a callback manager and add the custom callback
callback_manager = CallbackManager()
custom_callback = CustomCallback()

# Register the custom callback for the EMBEDDING and NODE_PARSING events
callback_manager.register_callback('EMBEDDING', custom_callback.on_embedding)
callback_manager.register_callback('NODE_PARSING', custom_callback.on_node_parsing)

# Function to process documents
def process_documents(documents, embed_model, storage_context):
    # Parse documents into nodes
    nodes = get_nodes_from_documents(documents, callback_manager=callback_manager)

    # Create a VectorStoreIndex with the nodes
    vector_index = VectorStoreIndex(
        nodes=nodes,
        embed_model=embed_model,
        storage_context=storage_context,
        show_progress=True
    )

    # Log when all documents have been stored
    custom_callback.on_all_documents_stored()

# Example usage
documents = [...]  # Your list of documents
embed_model = ...  # Your embedding model
storage_context = ...  # Your storage context
process_documents(documents, embed_model, storage_context)
```
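The register-and-dispatch pattern the snippet above relies on can be reduced to a few lines of plain Python, independent of any library. Note this is only an illustration of the mechanism, not the actual llama_index `CallbackManager` API:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class MiniCallbackManager:
    """Toy event dispatcher: handlers register per event name, then get
    called with the event payload whenever that event is emitted."""

    def __init__(self):
        self._handlers = {}

    def register_callback(self, event_name, handler):
        self._handlers.setdefault(event_name, []).append(handler)

    def emit(self, event_name, payload):
        for handler in self._handlers.get(event_name, []):
            handler(payload)

# Count parsed nodes via a registered handler
nodes_seen = []
manager = MiniCallbackManager()
manager.register_callback("NODE_PARSING", lambda event: nodes_seen.extend(event["nodes"]))
manager.emit("NODE_PARSING", {"nodes": ["n1", "n2"]})
logger.info("Nodes seen so far: %d", len(nodes_seen))
```

The key point is that the code doing the embedding only needs to call `emit`; all logging and counting lives in the handlers, which is why a callback manager lets you track progress without modifying the indexing code itself.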
To insert nodes using `VectorStoreIndex`, you can build the index directly from documents:

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load documents and build index
documents = SimpleDirectoryReader("../../examples/data/paul_graham").load_data()
index = VectorStoreIndex.from_documents(documents)
```

Alternatively, if you want more control over how your documents are indexed, you can use the ingestion pipeline to create nodes:

```python
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline

# Create the pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=25, chunk_overlap=0),
        TitleExtractor(),
        OpenAIEmbedding(),
    ]
)

# Run the pipeline
nodes = pipeline.run(documents=[Document.example()])
index = VectorStoreIndex(nodes)
```

This approach allows you to customize the chunking, metadata, and embedding of the nodes before inserting them into the index [1][2].
OK, but how can I log after the end of each document, from a set of documents that I send to VectorStoreIndex using the pipeline approach? @dosu
To log the completion of each document after processing a set of documents using the pipeline approach with `VectorStoreIndex`, you can subclass the index and poll the ingestion status of each document:

```python
import logging
import time
from urllib.parse import quote_plus

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class CustomVectorStoreIndex(VectorStoreIndex):
    def _wait_for_documents_ingestion(self, doc_ids, verbose=False, raise_on_error=False):
        if verbose:
            print("Loading data: ", end="")

        # Wait until all documents are loaded
        pending_docs = set(doc_ids)
        while pending_docs:
            docs_to_remove = set()
            for doc in pending_docs:
                # Check the status of each document
                status = self._client.pipelines.get_pipeline_document_status(
                    pipeline_id=self.pipeline.id,
                    document_id=quote_plus(quote_plus(doc)),
                )
                if status in [
                    ManagedIngestionStatus.NOT_STARTED,
                    ManagedIngestionStatus.IN_PROGRESS,
                ]:
                    continue

                if status == ManagedIngestionStatus.ERROR:
                    if verbose:
                        print(f"Document ingestion failed for {doc}")
                    if raise_on_error:
                        raise ValueError(f"Document ingestion failed for {doc}")

                # Log the completion of the document
                logger.info(f"Document {doc} ingestion completed successfully.")
                docs_to_remove.add(doc)

            pending_docs -= docs_to_remove
            if pending_docs:
                if verbose:
                    print(".", end="")
                time.sleep(0.5)

        if verbose:
            print("Done!")

        # Wait for pipeline ingestion to complete
        self.wait_for_completion(verbose, raise_on_error)

# Example usage
documents = [...]  # Your list of documents
embed_model = ...  # Your embedding model
storage_context = ...  # Your storage context

# Initialize the custom index
index = CustomVectorStoreIndex(
    nodes=[],
    embed_model=embed_model,
    storage_context=storage_context,
    show_progress=True
)

# Process documents and log completion
index._wait_for_documents_ingestion(doc_ids=[doc.id for doc in documents])
```
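If subclassing feels heavyweight, the same per-document logging can often be had with a plain loop that runs the pipeline one document at a time. This is a hedged sketch: `run_pipeline` is a hypothetical stand-in for something like `pipeline.run(documents=[doc])`, and document identifiers are assumed to be unique.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def ingest_per_document(documents, run_pipeline):
    """Run the ingestion pipeline one document at a time and log each finish.

    `documents` maps doc_id -> document; `run_pipeline` takes a single
    document and returns the list of nodes produced for it.
    """
    nodes_by_doc = {}
    for doc_id, doc in documents.items():
        nodes = run_pipeline(doc)
        nodes_by_doc[doc_id] = nodes
        # At this point all of this file's nodes exist, so this is
        # exactly the "end of the document" moment to log.
        logger.info("Document %s finished: %d nodes produced", doc_id, len(nodes))
    logger.info("All %d documents processed", len(nodes_by_doc))
    return nodes_by_doc
```

Keeping the nodes grouped per document also tells you where each file ends: the last element of `nodes_by_doc[doc_id]` is that file's final node.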
@logan-markewich is there a simple way to do that? |
Question
I would like to be able to log what's happening to each file: what was embedded, what was not, etc.
I use llamaindex VectorStoreIndex with Qdrant.
```python
async def create_index_nodes(self, nodes, embed_model):
    """
    Asynchronously creates an index in Qdrant based on the provided nodes.
    """
    ...
```
I thought someone gave advice to log file by file:
1) However, I was kind of surprised, because I use VectorStoreIndex to create embeddings and store them inside Qdrant. If we call index.insert again, doesn't that mean we will embed them again?
2) I get nodes from all documents. Should I insert docs or nodes? How would I know the end node of a file?
3) What is the way to store embeddings and log things for each file?
@dosu