[FEA]: Consistent ids when connecting llamaIndex to a Milvus VDB populated by NV-Ingest #205

Open
ChrisJar opened this issue Oct 30, 2024 · 0 comments
Labels
feature request New feature or request

Comments

@ChrisJar
Collaborator

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Significant improvement

Please provide a clear description of problem this feature solves

Currently, when I run an extraction, upload the results to the Milvus VDB, and then query the VDB with LlamaIndex, the node ids of the retrieved results change on every retrieval, even when the content is the same. For example, if I upload a document to the VDB:

from nv_ingest_client.client import NvIngestClient
from nv_ingest_client.primitives import JobSpec
from nv_ingest_client.primitives.tasks import ExtractTask
from nv_ingest_client.primitives.tasks import SplitTask
from nv_ingest_client.primitives.tasks import EmbedTask
from nv_ingest_client.primitives.tasks import VdbUploadTask
from nv_ingest_client.util.file_processing.extract import extract_file_content
import logging, time

logger = logging.getLogger("nv_ingest_client")

file_name = "data/multimodal_test.pdf"
file_content, file_type = extract_file_content(file_name)

job_spec = JobSpec(
    document_type=file_type,
    payload=file_content,
    source_id=file_name,
    source_name=file_name,
    extended_options={"tracing_options": {"trace": True, "ts_send": time.time_ns()}},
)

extract_task = ExtractTask(
    document_type=file_type,
    extract_text=True,
    extract_images=False,
    extract_tables=True,
)

embed_task = EmbedTask(
    text=True,
    tables=True,
)

vdb_upload_task = VdbUploadTask()

job_spec.add_task(extract_task)
job_spec.add_task(embed_task)
job_spec.add_task(vdb_upload_task)


client = NvIngestClient()
job_id = client.add_job(job_spec)

client.submit_job(job_id, "morpheus_task_queue")

result = client.fetch_job_result(job_id, timeout=60)

And then connect to the VDB with LlamaIndex:

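from llama_index.core import VectorStoreIndex
from llama_index.embeddings.nvidia import NVIDIAEmbedding
from llama_index.vector_stores.milvus import MilvusVectorStore

# Embedding model used to embed the query at retrieval time
embed_model = NVIDIAEmbedding(base_url="http://localhost:8012/v1", model="nvidia/nv-embedqa-e5-v5")

# Point LlamaIndex at the collection that NV-Ingest already populated
vector_store = MilvusVectorStore(
    uri="http://localhost:19530",
    collection_name="nv_ingest_collection",
    doc_id_field="pk",
    embedding_field="vector",
    text_key="text",
    dim=1024,
    overwrite=False,
)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=1)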

And then retrieve a document:

res = retriever.retrieve("What was the dog doing?")

And get the id:

res[0].id_

I get:

'd87a82ea-c968-42b7-84ae-f628b759eac6'

However, if I run the same retrieval again:

res = retriever.retrieve("What was the dog doing?")
res[0].id_

I get:

'ab450386-7d9b-4d1f-8b58-2735d4cacd76'

The id differs even though the retrieved text is the same in both cases:

'locations. Animal Activity Place Giraffe Driving a car. At the beach Lion Putting on sunscreen At the park. Cat Jumping onto a laptop In a home office Dog Chasing a squirrel In the front yard'
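
The comparison above can be condensed into a small check that repeats one query and collects the ids and texts of the top result. This is only a sketch: `collect_top_hits`, `FakeHit`, and `fake_retrieve` are hypothetical names, and the stand-in retriever merely mimics the reported symptom (a fresh random id per result) so that the snippet runs without a Milvus instance. It assumes each result exposes `id_` and `text`.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeHit:
    """Hypothetical stand-in for a retrieved result; gets a random id on creation."""
    text: str
    id_: str = field(default_factory=lambda: str(uuid.uuid4()))


def fake_retrieve(query: str) -> list[FakeHit]:
    # Always returns the same text, but wrapped in a newly created object.
    return [FakeHit(text="Dog Chasing a squirrel In the front yard")]


def collect_top_hits(retrieve, query: str, runs: int = 3):
    """Run the same query `runs` times; return the distinct ids and texts seen."""
    ids, texts = set(), set()
    for _ in range(runs):
        top = retrieve(query)[0]
        ids.add(top.id_)
        texts.add(top.text)
    return ids, texts


ids, texts = collect_top_hits(fake_retrieve, "What was the dog doing?")
print(f"distinct ids: {len(ids)}, distinct texts: {len(texts)}")
```

With a real retriever, passing `retriever.retrieve` as the first argument should give one distinct id once ids are stable, where more than one distinct id for a single distinct text reproduces the problem described here.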

This might be a LlamaIndex issue, but when I upload documents to Milvus through LlamaIndex and set the id with LlamaIndex, I get a stable id when retrieving.

Describe the feature, and optionally a solution or implementation and any alternatives

Ideally, I would like the id to be consistent across retrievals and mapped to the pk field in the nv_ingest_collection.
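
To illustrate what "consistent" could mean, here is a minimal sketch that derives a deterministic id from a pk value using a name-based UUID (`uuid.uuid5`). It is not how NV-Ingest or LlamaIndex currently behaves. `PK_NAMESPACE` and `stable_node_id` are hypothetical names, and using the stringified pk as the UUID name is an assumption made for illustration.

```python
import uuid

# Hypothetical fixed namespace; any constant UUID works as long as it never changes.
PK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "nv_ingest_collection")


def stable_node_id(pk) -> str:
    """Map a primary-key value to the same UUID string on every call."""
    return str(uuid.uuid5(PK_NAMESPACE, str(pk)))


# The same pk always yields the same id; different pks yield different ids.
print(stable_node_id(42) == stable_node_id(42))
print(stable_node_id(42) == stable_node_id(43))
```

Using the pk value directly as the id would be simpler still. The UUID form is shown only because the ids returned above are UUID strings.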

Additional context

No response

@ChrisJar added the feature request (New feature or request) label Oct 30, 2024
@ChrisJar changed the title from "[FEA]: Consistent ids when connecting llamaIndex to Milvus" to "[FEA]: Consistent ids when connecting llamaIndex to a Milvus VDB populated by NV-Ingest" Oct 30, 2024