refactor: RAG Refactor #985

Merged
merged 58 commits into from
Jan 3, 2024
Changes from 50 commits
3027c50
refactor:rag refactor
Aries-ckt Dec 11, 2023
4460f15
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 12, 2023
b7c3755
refactor:rag system refactor
Aries-ckt Dec 15, 2023
6fcef46
refactor:rag system refactor
Aries-ckt Dec 15, 2023
cb3ea18
refactor:rag system refactor
Aries-ckt Dec 15, 2023
1ea6b3a
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 15, 2023
16cb3b1
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 17, 2023
8b84e62
refactor:rag refactor
Aries-ckt Dec 18, 2023
bff483a
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 18, 2023
d0daf3a
refactor:rag sdk refactor
Aries-ckt Dec 19, 2023
303c223
refactor:embeddings function refactor
Aries-ckt Dec 19, 2023
11fa7fe
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 19, 2023
6a17bcb
refactor:1.rag refactor, 2.add unit tests and examples
Aries-ckt Dec 19, 2023
6d2ae22
refactor:rag knowledge refactor and add unit tests
Aries-ckt Dec 20, 2023
e6db02e
refactor:rag knowledge refactor
Aries-ckt Dec 20, 2023
fa8cce4
refactor:rag add more unit tests
Aries-ckt Dec 21, 2023
e994caf
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 21, 2023
9eef7ba
refactor:update chunk manager
Aries-ckt Dec 22, 2023
0434276
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 22, 2023
ba96cec
feat:add /sync_batch api
Aries-ckt Dec 22, 2023
f773a1d
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 23, 2023
976a213
fix: path error
Aralhi Dec 23, 2023
fbf8fa8
feat: knowledge segment strategy
Aralhi Dec 23, 2023
f882e2b
Merge remote-tracking branch 'origin/rag_sdk' into rag_sdk
Aries-ckt Dec 24, 2023
c3bc5d1
refactor:update rag app api
Aries-ckt Dec 25, 2023
aa34d89
refactor:add documents_by_ids
Aries-ckt Dec 25, 2023
8b8560a
fix: parameter name error
Aralhi Dec 25, 2023
fa8fe8b
feat: batch sync knowledge
Aralhi Dec 25, 2023
7ff4526
Merge branch 'rag_sdk' of github.com:eosphoros-ai/DB-GPT into rag_sdk
Aralhi Dec 25, 2023
047bd7f
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 25, 2023
4f41f8c
refactor:rewrite and summary refactor
Aries-ckt Dec 26, 2023
a71db8d
refactor:update app summary service
Aries-ckt Dec 26, 2023
aa78486
feat: remove sync api
Aralhi Dec 26, 2023
e335f33
fix: use async await
Aralhi Dec 26, 2023
ef6efab
feat: add default value
Aralhi Dec 26, 2023
8fbb57d
docs:add rag documents and more unit tests
Aries-ckt Dec 27, 2023
7a2a54e
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 27, 2023
8409460
feat:chat knowledge web
Aries-ckt Dec 27, 2023
a3e22c4
feat: edit icon size
Aralhi Dec 27, 2023
4c3bf40
feat: sync error display
Aralhi Dec 27, 2023
b0d75c2
fix: clear interval after component destroy
Aralhi Dec 27, 2023
c51da6e
fix: add defaut_value key
Aralhi Dec 27, 2023
b453d8e
feat: use Pagination rather than load more
Aralhi Dec 27, 2023
aa91fc7
test:update unit tests
Aries-ckt Dec 27, 2023
2676069
Merge remote-tracking branch 'origin/rag_sdk' into rag_sdk
Aries-ckt Dec 28, 2023
bae7def
fix:add chunk page limit
Aries-ckt Dec 28, 2023
3bd625e
feat: ChatKnowledge add prompt token count
Aries-ckt Dec 29, 2023
dc85252
Merge remote-tracking branch 'origin/main' into rag_sdk
Aries-ckt Dec 29, 2023
b55ffc9
fix:promp context length is reached llm maximum
Aries-ckt Jan 1, 2024
73292b6
fix: solve web conflict with main branch
Aries-ckt Jan 1, 2024
bc2d482
Merge branch 'main' into rag_sdk
csunny Jan 2, 2024
7ad668c
fix: format and pytest
csunny Jan 2, 2024
c1e660b
fix:session async problem
Aries-ckt Jan 2, 2024
6ffb9e5
fix:merge rag_sdk
Aries-ckt Jan 2, 2024
77694cb
feat: use tailwindcss animate
Aralhi Jan 2, 2024
8a17007
feat: cache upload file
Aralhi Jan 2, 2024
377e2c9
fix: remove unused tsconfig include
Aralhi Jan 2, 2024
743c6d2
feat: update static file
Aralhi Jan 2, 2024
2 changes: 1 addition & 1 deletion dbgpt/app/component_configs.py
@@ -30,7 +30,7 @@ def initialize_components(
     system_app.register_instance(controller)

     # Register global default RAGGraphFactory
-    # from dbgpt.graph_engine.graph_factory import DefaultRAGGraphFactory
+    # from dbgpt.graph.graph_factory import DefaultRAGGraphFactory

     # system_app.register(DefaultRAGGraphFactory)

2 changes: 1 addition & 1 deletion dbgpt/app/initialization/embedding_component.py
@@ -3,7 +3,7 @@
 import logging
 from typing import Any, Type, TYPE_CHECKING
 from dbgpt.component import ComponentType, SystemApp
-from dbgpt.rag.embedding_engine.embedding_factory import EmbeddingFactory
+from dbgpt.rag.embedding.embedding_factory import EmbeddingFactory

 if TYPE_CHECKING:
     from langchain.embeddings.base import Embeddings

2 changes: 1 addition & 1 deletion dbgpt/app/knowledge/_cli/knowledge_client.py
@@ -14,10 +14,10 @@
     DocumentQueryRequest,
 )

-from dbgpt.rag.embedding_engine.knowledge_type import KnowledgeType
 from dbgpt.app.knowledge.request.request import DocumentSyncRequest

 from dbgpt.app.knowledge.request.request import KnowledgeSpaceRequest
+from dbgpt.rag.knowledge.base import KnowledgeType

 HTTP_HEADERS = {"Content-Type": "application/json"}

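Note for downstream code: several module paths move in this diff. The helper below is only a sketch of how out-of-tree imports might be updated in bulk. The mapping is copied from the import changes visible in this PR, while `MODULE_RENAMES` and `rewrite_imports` are hypothetical names and not part of DB-GPT.

```python
import re

# Old -> new module paths, copied from the import changes visible in this diff.
MODULE_RENAMES = {
    "dbgpt.rag.embedding_engine.embedding_factory": "dbgpt.rag.embedding.embedding_factory",
    "dbgpt.rag.embedding_engine.knowledge_type": "dbgpt.rag.knowledge.base",
    "dbgpt._private.chat_util": "dbgpt.util.chat_util",
    "dbgpt.graph_engine.graph_factory": "dbgpt.graph.graph_factory",
}


def rewrite_imports(source: str) -> str:
    """Replace old module paths wherever they appear as whole dotted names."""
    for old, new in MODULE_RENAMES.items():
        # The lookarounds stop a match inside a longer dotted name or identifier.
        pattern = r"(?<![\w.])" + re.escape(old) + r"(?![\w.])"
        source = re.sub(pattern, new, source)
    return source
```

`dbgpt.rag.embedding_engine.embedding_engine.EmbeddingEngine` is left out on purpose: the diff removes that import without a one-to-one replacement, so call sites would need a manual change.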
79 changes: 69 additions & 10 deletions dbgpt/app/knowledge/api.py
@@ -2,6 +2,7 @@
 import shutil
 import tempfile
 import logging
+from typing import List

 from fastapi import APIRouter, File, UploadFile, Form

@@ -13,10 +14,10 @@
 from dbgpt.app.openapi.api_v1.api_v1 import no_stream_generator, stream_generator

 from dbgpt.app.openapi.api_view_model import Result
-from dbgpt.rag.embedding_engine.embedding_engine import EmbeddingEngine
-from dbgpt.rag.embedding_engine.embedding_factory import EmbeddingFactory
+from dbgpt.rag.embedding.embedding_factory import EmbeddingFactory

 from dbgpt.app.knowledge.service import KnowledgeService
+from dbgpt.rag.knowledge.factory import KnowledgeFactory
 from dbgpt.app.knowledge.request.request import (
     KnowledgeQueryRequest,
     KnowledgeQueryResponse,
@@ -27,9 +28,14 @@
     SpaceArgumentRequest,
     EntityExtractRequest,
     DocumentSummaryRequest,
+    KnowledgeSyncRequest,
 )

 from dbgpt.app.knowledge.request.request import KnowledgeSpaceRequest
+from dbgpt.rag.knowledge.base import ChunkStrategy
+from dbgpt.rag.retriever.embedding import EmbeddingRetriever
+from dbgpt.storage.vector_store.base import VectorStoreConfig
+from dbgpt.storage.vector_store.connector import VectorStoreConnector
 from dbgpt.util.tracer import root_tracer, SpanType

 logger = logging.getLogger(__name__)
@@ -103,6 +109,39 @@ def document_add(space_name: str, request: KnowledgeDocumentRequest):
         return Result.failed(code="E000X", msg=f"document add error {e}")


+@router.get("/knowledge/document/chunkstrategies")
+def chunk_strategies():
+    """Get chunk strategies"""
+    print(f"/document/chunkstrategies:")
+    try:
+        return Result.succ(
+            [
+                {
+                    "strategy": strategy.name,
+                    "name": strategy.value[2],
+                    "description": strategy.value[3],
+                    "parameters": strategy.value[1],
+                    "suffix": [
+                        knowledge.document_type().value
+                        for knowledge in KnowledgeFactory.subclasses()
+                        if strategy in knowledge.support_chunk_strategy()
+                        and knowledge.document_type() is not None
+                    ],
+                    "type": set(
+                        [
+                            knowledge.type().value
+                            for knowledge in KnowledgeFactory.subclasses()
+                            if strategy in knowledge.support_chunk_strategy()
+                        ]
+                    ),
+                }
+                for strategy in ChunkStrategy
+            ]
+        )
+    except Exception as e:
+        return Result.failed(code="E000X", msg=f"chunk strategies error {e}")
+
+
 @router.post("/knowledge/{space_name}/document/list")
 def document_list(space_name: str, query_request: DocumentQueryRequest):
     print(f"/document/list params: {space_name}, {query_request}")
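Note on the payload built by `chunk_strategies`: the endpoint indexes into each enum member's tuple value (`value[1]` parameters, `value[2]` name, `value[3]` description) and collects the document suffixes of every knowledge class that supports the strategy. The sketch below reproduces that shape with stand-ins. `DemoChunkStrategy`, `DemoPdfKnowledge`, `DemoUrlKnowledge`, and `describe_strategies` are hypothetical, and the real `ChunkStrategy` members and their parameter format may differ.

```python
from enum import Enum


class DemoChunkStrategy(Enum):
    # Hypothetical layout mirroring the indices used by the endpoint:
    # (splitter placeholder, parameter names, display name, description)
    CHUNK_BY_SIZE = (
        "size_splitter",
        ("chunk_size", "chunk_overlap"),
        "chunk size",
        "Split into fixed-size chunks.",
    )
    CHUNK_BY_PAGE = ("page_splitter", (), "page", "Split a document page by page.")


class DemoPdfKnowledge:
    @staticmethod
    def document_type():
        return "pdf"

    @staticmethod
    def support_chunk_strategy():
        return [DemoChunkStrategy.CHUNK_BY_SIZE, DemoChunkStrategy.CHUNK_BY_PAGE]


class DemoUrlKnowledge:
    @staticmethod
    def document_type():
        return None  # no file suffix, so it is filtered out of "suffix"

    @staticmethod
    def support_chunk_strategy():
        return [DemoChunkStrategy.CHUNK_BY_SIZE]


def describe_strategies(knowledge_classes):
    """Build one descriptor per strategy, in enum definition order."""
    return [
        {
            "strategy": strategy.name,
            "name": strategy.value[2],
            "description": strategy.value[3],
            "parameters": list(strategy.value[1]),
            "suffix": [
                k.document_type()
                for k in knowledge_classes
                if strategy in k.support_chunk_strategy()
                and k.document_type() is not None
            ],
        }
        for strategy in DemoChunkStrategy
    ]
```

Calling `describe_strategies([DemoPdfKnowledge, DemoUrlKnowledge])` yields one entry per strategy, which is roughly the list a front end could use to offer only the strategies valid for an uploaded file type.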
@@ -189,6 +228,18 @@ def document_sync(space_name: str, request: DocumentSyncRequest):
         return Result.failed(code="E000X", msg=f"document sync error {e}")


+@router.post("/knowledge/{space_name}/document/sync_batch")
+def batch_document_sync(space_name: str, request: List[KnowledgeSyncRequest]):
+    logger.info(f"Received params: {space_name}, {request}")
+    try:
+        doc_ids = knowledge_space_service.batch_document_sync(
+            space_name=space_name, sync_requests=request
+        )
+        return Result.succ({"tasks": doc_ids})
+    except Exception as e:
+        return Result.failed(code="E000X", msg=f"document sync error {e}")
+
+
 @router.post("/knowledge/{space_name}/chunk/list")
 def document_list(space_name: str, query_request: ChunkQueryRequest):
     print(f"/document/list params: {space_name}, {query_request}")
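Note on calling `/knowledge/{space_name}/document/sync_batch`: the body is a JSON array with one `KnowledgeSyncRequest` per document (`doc_id`, optional `model_name`, `chunk_parameters`). A minimal sketch of building that body follows. `build_sync_batch_payload` is a hypothetical helper, and the keys inside `chunk_parameters` are assumptions, since `ChunkParameters` is defined in `dbgpt/rag/chunk_manager.py` and is not shown in this diff.

```python
import json


def build_sync_batch_payload(doc_ids, chunk_parameters, model_name=None):
    """Build the request body: a list with one sync request per document."""
    return [
        {
            "doc_id": doc_id,
            "model_name": model_name,
            "chunk_parameters": dict(chunk_parameters),
        }
        for doc_id in doc_ids
    ]


payload = build_sync_batch_payload(
    [1, 2],
    # Assumed ChunkParameters fields, for illustration only.
    {"chunk_strategy": "CHUNK_BY_SIZE", "chunk_size": 512, "chunk_overlap": 50},
)
body = json.dumps(payload)  # send as the POST body with Content-Type: application/json
```

On success the handler above wraps the returned ids as `{"tasks": doc_ids}` inside `Result.succ`.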
@@ -204,15 +255,23 @@ def similar_query(space_name: str, query_request: KnowledgeQueryRequest):
     embedding_factory = CFG.SYSTEM_APP.get_component(
         "embedding_factory", EmbeddingFactory
     )
-    client = EmbeddingEngine(
-        model_name=EMBEDDING_MODEL_CONFIG[CFG.EMBEDDING_MODEL],
-        vector_store_config={"vector_store_name": space_name},
-        embedding_factory=embedding_factory,
+    config = VectorStoreConfig(
+        name=space_name,
+        embedding_fn=embedding_factory.create(
+            EMBEDDING_MODEL_CONFIG[CFG.EMBEDDING_MODEL]
+        ),
     )
+    vector_store_connector = VectorStoreConnector(
+        vector_store_type=CFG.VECTOR_STORE_TYPE,
+        vector_store_config=config,
+    )
+    retriever = EmbeddingRetriever(
+        top_k=query_request.top_k, vector_store_connector=vector_store_connector
+    )
-    docs = client.similar_search(query_request.query, query_request.top_k)
+    chunks = retriever.retrieve(query_request.query)
     res = [
-        KnowledgeQueryResponse(text=d.page_content, source=d.metadata["source"])
-        for d in docs
+        KnowledgeQueryResponse(text=d.content, source=d.metadata["source"])
+        for d in chunks
     ]
     return {"response": res}

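Note on the new retrieval path in `similar_query`: the hunk swaps `EmbeddingEngine.similar_search` for a retriever that takes `top_k` at construction and returns chunks exposing `.content` and `.metadata`. The sketch below illustrates that flow with an in-memory index and cosine similarity. It is a stand-in and not the `EmbeddingRetriever` implementation: `DemoChunk`, `DemoRetriever`, `demo_embed`, and `cosine` are hypothetical, and the letter-frequency embedding is a toy.

```python
import math
from dataclasses import dataclass, field


@dataclass
class DemoChunk:
    content: str
    metadata: dict = field(default_factory=dict)


def demo_embed(text):
    """Toy embedding (assumption): a 26-dim letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class DemoRetriever:
    def __init__(self, top_k, chunks, embedding_fn):
        self.top_k = top_k
        self.embedding_fn = embedding_fn
        # Embed every chunk once, up front.
        self._index = [(embedding_fn(c.content), c) for c in chunks]

    def retrieve(self, query):
        """Return the top_k chunks most similar to the query."""
        q = self.embedding_fn(query)
        ranked = sorted(self._index, key=lambda item: cosine(q, item[0]), reverse=True)
        return [chunk for _, chunk in ranked[: self.top_k]]


chunks = [
    DemoChunk("aaa", {"source": "a.txt"}),
    DemoChunk("zzz", {"source": "z.txt"}),
]
retriever = DemoRetriever(top_k=1, chunks=chunks, embedding_fn=demo_embed)
hits = retriever.retrieve("aa")
response = [{"text": c.content, "source": c.metadata["source"]} for c in hits]
```

The last line mirrors how the handler maps each returned chunk to a `KnowledgeQueryResponse`.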
@@ -254,7 +313,7 @@ async def entity_extract(request: EntityExtractRequest):
     logger.info(f"Received params: {request}")
     try:
         from dbgpt.app.scene import ChatScene
-        from dbgpt._private.chat_util import llm_chat_response_nostream
+        from dbgpt.util.chat_util import llm_chat_response_nostream
         import uuid

         chat_param = {

24 changes: 24 additions & 0 deletions dbgpt/app/knowledge/document_db.py
@@ -1,4 +1,5 @@
 from datetime import datetime
+from typing import List

 from sqlalchemy import Column, String, DateTime, Integer, Text, func

@@ -51,6 +52,12 @@ def create_knowledge_document(self, document: KnowledgeDocumentEntity):
         return doc_id

     def get_knowledge_documents(self, query, page=1, page_size=20):
+        """Get a list of documents that match the given query.
+        Args:
+            query: A KnowledgeDocumentEntity object containing the query parameters.
+            page: The page number to return.
+            page_size: The number of documents to return per page.
+        """
         session = self.get_raw_session()
         print(f"current session:{session}")
         knowledge_documents = session.query(KnowledgeDocumentEntity)
@@ -85,6 +92,23 @@ def get_knowledge_documents(self, query, page=1, page_size=20):
         session.close()
         return result

+    def documents_by_ids(self, ids) -> List[KnowledgeDocumentEntity]:
+        """Get a list of documents by their IDs.
+        Args:
+            ids: A list of document IDs.
+        Returns:
+            A list of KnowledgeDocumentEntity objects.
+        """
+        session = self.get_raw_session()
+        print(f"current session:{session}")
+        knowledge_documents = session.query(KnowledgeDocumentEntity)
+        knowledge_documents = knowledge_documents.filter(
+            KnowledgeDocumentEntity.id.in_(ids)
+        )
+        result = knowledge_documents.all()
+        session.close()
+        return result
+
     def get_documents(self, query):
         session = self.get_raw_session()
         print(f"current session:{session}")
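Note on `documents_by_ids`: the method filters with `id IN (...)` and closes the session after the query, so the close is skipped if the query raises. The sketch below shows the same lookup in plain `sqlite3` with the cursor released by a `with` block. It is not the SQLAlchemy code from this PR, and the `knowledge_document` table name and its `doc_name` column are assumptions made for illustration.

```python
import sqlite3
from contextlib import closing


def documents_by_ids(conn, ids):
    """Fetch (id, doc_name) rows whose id is in `ids`, ordered by id."""
    ids = list(ids)
    if not ids:
        return []  # avoid emitting an empty IN () clause
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT id, doc_name FROM knowledge_document "
        f"WHERE id IN ({placeholders}) ORDER BY id"
    )
    # closing() releases the cursor even if execute() raises.
    with closing(conn.cursor()) as cur:
        cur.execute(sql, ids)
        return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knowledge_document (id INTEGER PRIMARY KEY, doc_name TEXT)")
conn.executemany(
    "INSERT INTO knowledge_document (id, doc_name) VALUES (?, ?)",
    [(1, "a.txt"), (2, "b.txt"), (3, "c.txt")],
)
rows = documents_by_ids(conn, [1, 3])
```

In the SQLAlchemy version, wrapping the query in `try`/`finally` around `session.close()` would give the same guarantee.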

18 changes: 18 additions & 0 deletions dbgpt/app/knowledge/request/request.py
@@ -3,6 +3,8 @@
 from dbgpt._private.pydantic import BaseModel
 from fastapi import UploadFile

+from dbgpt.rag.chunk_manager import ChunkParameters
+

 class KnowledgeQueryRequest(BaseModel):
     """query: knowledge query"""
@@ -43,6 +45,8 @@ class DocumentQueryRequest(BaseModel):
     """doc_name: doc path"""

     doc_name: str = None
+    """doc_ids: doc ids"""
+    doc_ids: Optional[List] = None
     """doc_type: doc type"""
     doc_type: str = None
     """status: status"""
@@ -76,6 +80,20 @@ class DocumentSyncRequest(BaseModel):
     chunk_overlap: Optional[int] = None


+class KnowledgeSyncRequest(BaseModel):
+    """Sync request"""
+
+    """doc_id: doc id"""
+    doc_id: int
+
+    """model_name: model name"""
+    model_name: Optional[str] = None
+
+    """chunk_parameters: chunk parameters
+    """
+    chunk_parameters: ChunkParameters
+
+
 class ChunkQueryRequest(BaseModel):
     """id: id"""
