[Bug]: Document insertion fails in Weaviate Vector Store when metadata is included #14504

Open
terilias opened this issue Jul 2, 2024 · 3 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized


terilias commented Jul 2, 2024

Bug Description

Hello,
While using the Weaviate Vector Store, I found that inserting a Document that has metadata silently fails: the document is not actually stored. No exception or warning is raised. You can detect the failure only by printing the contents of the vector store or by using a retriever.

If the Document does not contain metadata, the insertion completes and the retriever can search the document's chunks. I talked with @logan-markewich on Discord, and the issue is probably connected to Weaviate issue #5202.

Version

0.10.51 (llama-index)
1.0.0 (llama-index-vector-stores-weaviate)
4.6.5 (weaviate-client)

Steps to Reproduce

Create a Weaviate vector store index, then insert one document without metadata and one with metadata. Then use a retriever to retrieve the nodes, or use Weaviate's method for listing the collection's contents, and check whether the nodes from the document with metadata appear in the results. The following Python code, extracted from a Jupyter notebook, reproduces the issue.


```python
import logging
import sys
import os
import weaviate
import tiktoken
from IPython.display import Markdown, display

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))


from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Document
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core.vector_stores import VectorStoreQuery
from llama_index.core.schema import TextNode
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
from llama_index.core.node_parser import SimpleNodeParser


# Embeddings initialization: OpenAI
embed_model = OpenAIEmbedding(model="text-embedding-3-small", api_key=os.environ.get("OPENAI_API_KEY"))
tokenizer = tiktoken.encoding_for_model("text-embedding-3-small").encode

# Tokenizer and node parser initialization
tokenizer_obj = tokenizer
chunk_size = 450
chunk_overlap = 50
token_counter_handler = TokenCountingHandler(tokenizer=tokenizer_obj)
callback_manager = CallbackManager([token_counter_handler])
node_parser = SimpleNodeParser.from_defaults(chunk_size=chunk_size,
                                                  chunk_overlap=chunk_overlap,
                                                  callback_manager=callback_manager)

# ## Connect to local weaviate
client = weaviate.connect_to_local()
assert client.is_live() 
client.collections.delete_all()
client.is_connected()

# ## Use the client in LlamaIndex

# This will create a collection for us if it doesn't exist.
vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="Test"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store_index = VectorStoreIndex.from_vector_store(vector_store=vector_store,
                                                        embed_model=embed_model,
                                                        storage_context=storage_context,
                                                        transformations=[node_parser],
                                                        show_progress=True)


# #### Try to insert a document (without metadata) - Works

# load documents
text = '''Deep learning is the subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be either supervised, semi-supervised or unsupervised.[2]

Deep-learning architectures such as deep neural networks, deep belief networks, recurrent neural networks, convolutional neural networks and transformers have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, climate science, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.[3][4][5]

Early forms of neural networks were inspired by information processing and distributed communication nodes in biological systems, in particular the human brain. However, current neural networks do not intend to model the brain function of organisms, and are generally seen as low quality models for that purpose.'''

llama_index_doc = Document(text=text)

vector_store_index.insert(document=llama_index_doc)

retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.70)
nodes = retriever.retrieve("neural")
len(nodes)


for node in nodes:
    print(node.text)


# See the contents of the collection.
# https://weaviate.io/developers/weaviate/manage-data/read-all-objects
collection = client.collections.get("Test")

for i, item in enumerate(collection.iterator(include_vector=True)):
    print(f"{i}. {item.properties['text']}")
    # print(item.vector)


# #### Try to insert a document (with metadata) - Does not work

# load documents
text = '''In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared differences between the true data and that line (or hyperplane). For specific mathematical reasons (see linear regression), this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary Condition Analysis[1]) or estimate the conditional expectation across a broader collection of non-linear models (e.g., nonparametric regression). 
'''

llama_index_doc = Document(doc_id='12345',
                           text=text,
                           metadata={
                               "title of parental document": "Regression analysis",
                               "source": "https://en.wikipedia.org/wiki/Regression_analysis"
                           })

vector_store_index.insert(document=llama_index_doc)

retriever = vector_store_index.as_retriever(similarity_top_k=10, 
                                            vector_store_query_mode="hybrid",
                                            alpha=0.70)
nodes = retriever.retrieve("regression analysis")
len(nodes)

for node in nodes:
    print(node.text)

# See the contents of the collection.
# https://weaviate.io/developers/weaviate/manage-data/read-all-objects
collection = client.collections.get("Test")

for i, item in enumerate(collection.iterator(include_vector=True)):
    print(f"{i}. {item.properties['text']}")
    # print(item.vector)
```
terilias added the bug and triage labels on Jul 2, 2024

dosubot bot commented Jul 2, 2024

The issue you're encountering with inserting documents with metadata into the Weaviate Vector Store likely stems from how metadata is handled during the insertion process. Here are some steps and suggestions to help debug and resolve this issue:

  1. Check Metadata Conversion:
    Ensure that the metadata is correctly converted and structured before insertion. The node_to_metadata_dict function is responsible for this conversion. If the metadata structure is invalid, it could cause the insertion to fail silently.

    def node_to_metadata_dict(
        node: BaseNode,
        remove_text: bool = False,
        text_field: str = DEFAULT_TEXT_KEY,
        flat_metadata: bool = False,
    ) -> Dict[str, Any]:
        """Common logic for saving Node data into metadata dict."""
        node_dict = node.dict()
        metadata: Dict[str, Any] = node_dict.get("metadata", {})
    
        if flat_metadata:
            _validate_is_flat_dict(metadata)
    
        # store entire node as json string - some minor text duplication
        if remove_text:
            node_dict[text_field] = ""
    
        # remove embedding from node_dict
        node_dict["embedding"] = None
    
        # dump remainder of node_dict to json string
        metadata["_node_content"] = json.dumps(node_dict)
        metadata["_node_type"] = node.class_name()
    
        # store ref doc id at top level to allow metadata filtering
        metadata["document_id"] = node.ref_doc_id or "None"  # for Chroma
        metadata["doc_id"] = node.ref_doc_id or "None"  # for Pinecone, Qdrant, Redis
        metadata["ref_doc_id"] = node.ref_doc_id or "None"  # for Weaviate
    
        return metadata
  2. Add Logging for Debugging:
    Add logging to check the contents of the metadata before insertion. This will help identify any issues with the metadata that could be causing the silent failure.

    import logging
    
    def add_node(
        client: "WeaviateClient",
        node: BaseNode,
        class_name: str,
        batch: Optional[Any] = None,
        text_key: str = DEFAULT_TEXT_KEY,
    ) -> None:
        """Add node."""
        metadata = {}
        metadata[text_key] = node.get_content(metadata_mode=MetadataMode.NONE) or ""
    
        additional_metadata = node_to_metadata_dict(
            node, remove_text=True, flat_metadata=False
        )
        metadata.update(additional_metadata)
    
        vector = node.get_embedding()
        id = node.node_id
    
        # Log metadata for debugging
        logging.debug(f"Inserting node with metadata: {metadata}")
    
        # if batch object is provided (via a context manager), use that instead
        if batch is not None:
            batch.add_object(
                properties=metadata, collection=class_name, uuid=id, vector=vector
            )
        else:
            client.collections.get(class_name).data.insert(
                properties=metadata, uuid=id, vector=vector
            )
  3. Ensure Metadata Conforms to Weaviate's Requirements:
    The metadata dictionary must be flat, with keys as strings and values as one of the types: string, integer, float, or None. Ensure that your metadata conforms to these requirements.

  4. Check for Schema Existence:
    Ensure that the schema for the class exists before inserting documents. If the schema does not exist, it needs to be created.

    def class_schema_exists(client: Any, class_name: str) -> bool:
        """Check if class schema exists."""
        validate_client(client)
        return client.collections.exists(class_name)
    
    def create_default_schema(client: Any, class_name: str) -> None:
        """Create default schema."""
        validate_client(client)
        class_schema = {
            "class": class_name,
            "description": f"Class for {class_name}",
            "properties": NODE_SCHEMA,
        }
        client.collections.create_from_dict(class_schema)

By following these steps, you should be able to identify and resolve the issue with inserting documents with metadata into the Weaviate Vector Store. If the problem persists, further investigation into the specific metadata structure and Weaviate's handling of metadata may be necessary [1][2].
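The flatness requirement described in step 3 can be checked in isolation. The following is a minimal sketch, assuming the rule is exactly as stated there (string keys; string, integer, float, or None values). `is_flat_metadata` and `ALLOWED_VALUE_TYPES` are hypothetical names for illustration, not part of LlamaIndex or Weaviate. Note that the metadata from the report passes this check, so flatness alone would not explain the failure.

```python
from typing import Any, Dict

# Value types that step 3 lists as acceptable.
ALLOWED_VALUE_TYPES = (str, int, float, type(None))


def is_flat_metadata(metadata: Dict[Any, Any]) -> bool:
    """Hypothetical check: string keys and only scalar (or None) values."""
    return all(
        isinstance(key, str) and isinstance(value, ALLOWED_VALUE_TYPES)
        for key, value in metadata.items()
    )


# The metadata used in the report above.
report_metadata = {
    "title of parental document": "Regression analysis",
    "source": "https://en.wikipedia.org/wiki/Regression_analysis",
}

print(is_flat_metadata(report_metadata))       # flat: only string values
print(is_flat_metadata({"tags": ["a", "b"]}))  # not flat: list value
```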


@dudanogueira

The issue here is the spaces in the property name:

"title of parental document"

This will work:

```python
llama_index_doc = Document(doc_id='12345',
                           text=text,
                           metadata={
                               "title_of_parental_document": "Regression analysis",
                               "source": "https://en.wikipedia.org/wiki/Regression_analysis"
                           })
```

The problem with the original code is the name of one of the metadata keys. It does not adhere to what Weaviate expects for property names:
https://weaviate.io/developers/weaviate/config-refs/schema#properties

Property names must match this pattern:
/[_A-Za-z][_0-9A-Za-z]*/

I believe we could add something here that would normalize property names to what Weaviate supports.
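One possible shape for such a normalization, as a rough sketch only: it assumes the pattern quoted above is the whole rule, and `normalize_property_name` and `normalize_metadata` are hypothetical helpers, not existing LlamaIndex or Weaviate functions. Runs of unsupported characters become a single underscore, and a name that would start with a digit gets an underscore prefix.

```python
import re

# Pattern quoted above for valid Weaviate property names.
VALID_PROPERTY_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def normalize_property_name(name: str) -> str:
    """Hypothetical helper: coerce a metadata key into a valid property name."""
    # Replace every run of unsupported characters with one underscore.
    normalized = re.sub(r"[^_0-9A-Za-z]+", "_", name.strip())
    # A name may not be empty or start with a digit, so prefix an underscore.
    if not normalized or normalized[0].isdigit():
        normalized = f"_{normalized}"
    return normalized


def normalize_metadata(metadata: dict) -> dict:
    """Return a copy of the metadata with every key normalized."""
    result = {}
    for key, value in metadata.items():
        new_key = normalize_property_name(key)
        # Two different keys can map to the same name; refuse to drop data.
        if new_key in result:
            raise ValueError(f"Keys collide after normalization: {new_key!r}")
        result[new_key] = value
    return result


print(normalize_metadata({
    "title of parental document": "Regression analysis",
    "source": "https://en.wikipedia.org/wiki/Regression_analysis",
}))
```

A trade-off of this approach is that the stored key no longer matches the key the user supplied, so metadata filters would have to use the normalized name.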


terilias commented Jul 3, 2024

Thank you so much @dudanogueira!
Now it works perfectly!
My initial thought was to just raise an exception when a name does not follow Weaviate's rules, something like the implementation for checking the collection's name (I am referring to this code). I think that could be enough, since it would make clear to the user what the name must be. But of course you know what is better here.
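A minimal sketch of that alternative, assuming the pattern quoted earlier in the thread is the rule to enforce. `validate_metadata_keys` is a hypothetical name, not an existing function in the Weaviate integration; the point is only that the insert would fail loudly instead of silently.

```python
import re

# Pattern quoted earlier in this thread for valid property names.
PROPERTY_NAME_PATTERN = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")


def validate_metadata_keys(metadata: dict) -> None:
    """Hypothetical check: raise if any metadata key is not a valid property name."""
    invalid = [
        key
        for key in metadata
        if not (isinstance(key, str) and PROPERTY_NAME_PATTERN.fullmatch(key))
    ]
    if invalid:
        raise ValueError(
            f"Invalid metadata key(s) {invalid}: names must match "
            f"/{PROPERTY_NAME_PATTERN.pattern}/"
        )


# Passes silently: both keys are valid.
validate_metadata_keys({"title_of_parental_document": "Regression analysis"})

# Raises ValueError: the key contains spaces.
try:
    validate_metadata_keys({"title of parental document": "Regression analysis"})
except ValueError as err:
    print(err)
```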
