[FEA]: Consistent ids when connecting llamaIndex to a Milvus VDB populated by NV-Ingest #205

ChrisJar · 2024-10-30T17:16:22Z

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Significant improvement

Please provide a clear description of problem this feature solves

Currently, when I perform an extraction, upload the results to the Milvus VDB, and then query the VDB with LlamaIndex, the node IDs of the retrieved results change on every retrieval, even when the content is the same. For example, if I upload a document to the VDB:

from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.primitives.tasks import SplitTask
from nv_ingest_client.primitives.tasks import EmbedTask
from nv_ingest_client.primitives.tasks import VdbUploadTask
from nv_ingest_client.util.file_processing.extract import extract_file_content
import logging, time

logger = logging.getLogger("nv_ingest_client")

file_name = "data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)

job_spec = JobSpec(
    document_type=file_type,
    payload=file_content,
    source_id=file_name,
    source_name=file_name,
    extended_options={"tracing_options": {"trace": True, "ts_send": time.time_ns()}},
)

extract_task = ExtractTask(
    document_type=file_type,
    extract_text=True,
    extract_images=False,
    extract_tables=True,
)

embed_task = EmbedTask(
    text=True,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

job_spec.add_task(extract_task)
job_spec.add_task(embed_task)
job_spec.add_task(vdb_upload_task)


client = NvIngestClient()
job_id = client.add_job(job_spec)

client.submit_job(job_id, "morpheus_task_queue")

result = client.fetch_job_result(job_id, timeout=60)

And then connect to the VDB with LlamaIndex:

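# Imports assume the split llama-index packages (llama-index-core,
# llama-index-embeddings-nvidia, llama-index-vector-stores-milvus).
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.nvidia import NVIDIAEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

embed_model = NVIDIAEmbedding(base_url="http://localhost:8012/v1", model="nvidia/nv-embedqa-e5-v5")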

vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="nv_ingest_collection",
    doc_id_field="pk",
    embedding_field="vector",
    text_key="text",
    dim=1024,
    overwrite=False
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=1)

And then retrieve a document:

res = retriever.retrieve("What was the dog doing?")

And get the id:

res[0].id_

I get:

'd87a82ea-c968-42b7-84ae-f628b759eac6'

However, if I run the same retrieval again:

res = retriever.retrieve("What was the dog doing?")
res[0].id_

I get:

'ab450386-7d9b-4d1f-8b58-2735d4cacd76'

This happens even though the retrieved text is the same in both cases:

'locations. Animal Activity Place Giraffe Driving a car. At the beach Lion Putting on sunscreen At the park. Cat Jumping onto a laptop In a home office Dog Chasing a squirrel In the front yard'
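The mismatch can be checked programmatically by repeating the query and counting distinct IDs and texts. The following is only a minimal sketch: `FakeHit` and `fake_retrieve` are hypothetical stand-ins that mimic the observed behavior (a fresh random ID per retrieval, identical text), so the snippet runs without a Milvus instance. With the real setup, `retriever.retrieve` would be passed in their place.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeHit:
    """Hypothetical stand-in for a retrieved node: same text, new random id_ each time."""
    text: str
    id_: str = field(default_factory=lambda: str(uuid.uuid4()))


def fake_retrieve(query):
    # Mimics the behavior reported above; not the real retriever.
    return [FakeHit(text="Dog Chasing a squirrel In the front yard")]


def summarize(retrieve, query, runs=3):
    """Run the same query several times and count distinct ids and texts of the top hit."""
    hits = [retrieve(query)[0] for _ in range(runs)]
    return {
        "distinct_ids": len({hit.id_ for hit in hits}),
        "distinct_texts": len({hit.text for hit in hits}),
    }


report = summarize(fake_retrieve, "What was the dog doing?")
print(report)
```

A stable setup would report one distinct ID for one distinct text. The behavior described in this issue corresponds to one distinct text with as many distinct IDs as there are runs.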

This might be a LlamaIndex issue, but when I upload documents to Milvus through LlamaIndex and set the ID with LlamaIndex, I get a stable ID when retrieving.

Describe the feature, and optionally a solution or implementation and any alternatives

Ideally, I would like the ID to be consistent across retrievals and mapped to the pk field in the nv_ingest_collection collection.
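Until the ID can be mapped to pk, one possible client-side workaround is to derive a deterministic ID from the retrieved content instead of relying on `id_`. This is a minimal sketch using only the Python standard library. The `stable_id` helper and the `NAMESPACE` value are hypothetical names chosen for illustration and are not part of NV-Ingest or LlamaIndex.

```python
import uuid

# Hypothetical fixed namespace for this sketch; any constant UUID works,
# as long as it never changes between runs.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "nv_ingest_collection")


def stable_id(text: str, source: str = "") -> str:
    """Derive a deterministic UUID from the chunk text and, optionally, its source."""
    return str(uuid.uuid5(NAMESPACE, f"{source}\n{text}"))


text = "Dog Chasing a squirrel In the front yard"
first = stable_id(text, "data/multimodal_test.pdf")
second = stable_id(text, "data/multimodal_test.pdf")
print(first == second)  # True: the same input always yields the same ID
```

With the retriever above, this could be applied as `stable_id(res[0].get_content())`. Note the limits of this approach: two chunks with identical text and source collapse to the same ID, and the result is not the Milvus pk, so it does not replace the requested feature.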

Additional context

No response

ChrisJar added the "feature request" label (New feature or request) Oct 30, 2024

ChrisJar changed the title ~~[FEA]: Consistent ids when connecting llamaIndex to Milvus~~ [FEA]: Consistent ids when connecting llamaIndex to a Milvus VDB populated by NV-Ingest Oct 30, 2024
