Commit 66570f9

feat: add semantic chunker & centralize embeddings in rag-core-lib; helm + deps updates (#148)
Summary
- Adds an optional Semantic Chunker to the admin-api-lib and centralizes embedding implementations in rag-core-lib (rag-core-api now re-exports).
- Helm chart gains chunker selection + tuning; admin container now preloads NLTK data at startup.
- Dependency updates across admin libs/services; new tests for chunking logic.

Motivation
- Provide more accurate chunk boundaries (semantic-aware) while retaining the existing recursive splitter as the default.
- Deduplicate embedder logic across projects to reduce drift and config duplication.

Key changes
- Admin chunking
  - New `SemanticTextChunker` backed by LangChain’s `SemanticChunker`, with optional min/max enforcement via `RecursiveCharacterTextSplitter`.
  - Trailing undersized chunks are rebalanced along sentence boundaries (NLTK Punkt with regex fallback) to avoid tiny tails.
  - Configurable via:
    - `CHUNKER_CLASS_TYPE_CHUNKER_TYPE`: `recursive` (default) or `semantic`
    - `CHUNKER_MAX_SIZE` (default `1000`), `CHUNKER_OVERLAP` (default `100`)
    - Semantic-only: `CHUNKER_BREAKPOINT_THRESHOLD_TYPE` (default `percentile`), `CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT` (default `95`), `CHUNKER_BUFFER_SIZE` (default `1`), `CHUNKER_MIN_SIZE` (default `200`)
- DI wiring
  - `DependencyContainer` selects chunker (`recursive` or `semantic`) and, for semantic mode, resolves embeddings via `EmbedderClassTypeSettings`:
    - `stackit` → `StackitEmbedder` (with shared retry settings)
    - `ollama` → `LangchainCommunityEmbedder(OllamaEmbeddings)`
  - Container bootstrapping simplified in `main.py` (internalizes class-type wiring).
- Embeddings centralization
  - New in `rag-core-lib`: `impl/embeddings/*` and embedder settings (`stackit`, `ollama`, `fake`), plus `EmbedderType` and base `Embedder`.
  - `rag-core-api` re-exports these for backward compatibility (no breaking imports).
- Helm / deployment
  - Values (`infrastructure/rag/values.yaml`): new `adminBackend.envs.chunker.*` keys for selection & tuning (chart default `recursive`; overlap default now `100`).
  - Deployment: mounts NLTK data dir and fetches `punkt_tab` + `averaged_perceptron_tagger_eng` at startup; adds `configmap.chunkerName` and `secret.stackitEmbedderName` to env sources.
- Behavior fixes & docs
  - De-duplicate `meta["related"]` in page summaries.
  - Docs: libs README adds “Chunker configuration (multiple chunkers)” and updates DI tables to rag-core-lib classes; admin-backend README adds “Chunking modes”.
- Tests
  - New `semantic_text_chunker_test.py` exercising: supported-kwargs passthrough to LC chunker, empty-input behavior, min/max enforcement + balancing, sentence-aware split.

Configuration / migration
- Default remains `recursive` splitter; to enable semantic chunking:
  1) Set `CHUNKER_CLASS_TYPE_CHUNKER_TYPE=semantic`.
  2) Choose embeddings via `EMBEDDER_CLASS_TYPE_EMBEDDER_TYPE` (`stackit` or `ollama`) and configure:
     - STACKIT: `STACKIT_EMBEDDER_MODEL`, `STACKIT_EMBEDDER_BASE_URL`, `STACKIT_EMBEDDER_API_KEY` (+ optional retry overrides).
     - Ollama: `OLLAMA_EMBEDDER_MODEL`, `OLLAMA_EMBEDDER_BASE_URL`.
  3) Ensure Helm chart has corresponding ConfigMaps/Secrets (`stackitEmbedder`, etc.).
- NLTK data is preloaded on container start; no runtime downloads required.

Dependencies
- Add: `langchain-experimental`, `nltk` (and transitive `joblib`).
- Bump: `fastapi` (0.118.x), `uvicorn` (0.37.x), `langfuse` (3.6.x), `langchain`/`community`/`core` minor versions, `requests` (2.32.5).
- Test note: ensure LC packages (`langchain_core`, etc.) are present to run unit tests locally.

Risks & mitigations
- Startup time increases slightly due to NLTK data fetch → mitigated via one-time download into an emptyDir.
- Semantic mode depends on external embeddings; ensure credentials/secrets are present before switching default.
- Chunk size tuning may affect vector DB costs; start with defaults and adjust based on retrieval quality.

Docs
- libs/README.md: “2.4 Chunker configuration (multiple chunkers)” and corrected DI references.
- services/admin-backend/README.md: “Chunking modes” and Helm guidance.
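
The tail rebalancing mentioned under "Admin chunking" could look roughly like the sketch below. It is not the `SemanticTextChunker` code from this commit: the function name `rebalance_tail`, the regex used to find sentence ends, and the "merge the last two chunks, then cut at the most even sentence boundary" strategy are all assumptions made for illustration.

```python
import re


def rebalance_tail(chunks: list[str], min_size: int) -> list[str]:
    """Rebalance the last two chunks when the final chunk is shorter than min_size.

    Hypothetical helper: joins the last two chunks, splits the result into
    sentences with a simple regex, and cuts again at the sentence boundary
    that leaves both parts closest in length.
    """
    if len(chunks) < 2 or len(chunks[-1]) >= min_size:
        return list(chunks)

    merged = f"{chunks[-2]} {chunks[-1]}".strip()
    # Assumed sentence boundary: ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", merged) if s]
    if len(sentences) < 2:
        # No boundary to cut at, so the tiny tail is absorbed by its predecessor.
        return chunks[:-2] + [merged]

    total = sum(len(s) for s in sentences)
    best_cut, best_diff, running = 1, None, 0
    for cut, sentence in enumerate(sentences[:-1], start=1):
        running += len(sentence)
        diff = abs(total - 2 * running)  # length difference between both parts
        if best_diff is None or diff < best_diff:
            best_cut, best_diff = cut, diff

    head = " ".join(sentences[:best_cut])
    tail = " ".join(sentences[best_cut:])
    return chunks[:-2] + [head, tail]
```

With `CHUNKER_MIN_SIZE` playing the role of `min_size`, a final chunk such as `"Tiny."` would be folded back into the preceding chunk and the pair re-cut between sentences. The sketch does not guarantee that the new tail reaches `min_size`; it only evens out the split.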
1 parent 5ed9880 commit 66570f9

31 files changed

+1053
-420
lines changed

infrastructure/rag/templates/admin-backend/deployment.yaml

Lines changed: 13 additions & 1 deletion
@@ -23,6 +23,8 @@ spec:
           emptyDir: {}
         - name: tmp-dir
           emptyDir: {}
+        - name: nltk-data-dir
+          emptyDir: {}
       {{- if .Values.shared.imagePullSecret }}
       imagePullSecrets:
         - name: {{ .Values.shared.imagePullSecret.name }}
@@ -35,10 +37,16 @@ spec:
             - -c
             - |
               touch /app/services/admin-backend/log/logfile.log && \
-              chmod 600 /app/services/admin-backend/log/logfile.log
+              chmod 600 /app/services/admin-backend/log/logfile.log;
+              wget -q -O /tmp/punkt.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt_tab.zip && \
+              unzip /tmp/punkt.zip -d /home/nonroot/nltk_data/tokenizers && \
+              wget -q -O /tmp/averaged_perceptron_tagger_eng.zip https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/taggers/averaged_perceptron_tagger_eng.zip && \
+              unzip /tmp/averaged_perceptron_tagger_eng.zip -d /home/nonroot/nltk_data/taggers;
           volumeMounts:
             - name: log-dir
               mountPath: /app/services/admin-backend/log
+            - name: nltk-data-dir
+              mountPath: /home/nonroot/nltk_data
       containers:
         - name: {{ .Values.adminBackend.name }}
           securityContext:
@@ -108,6 +116,8 @@ spec:
                 name: {{ template "configmap.sourceUploaderName" . }}
             - configMapRef:
                 name: {{ template "configmap.retryDecoratorName" . }}
+            - configMapRef:
+                name: {{ template "configmap.chunkerName" . }}
             - secretRef:
                 name: {{ template "secret.langfuseName" . }}
             - secretRef:
@@ -116,6 +126,8 @@ spec:
                 name: {{ template "secret.s3Name" . }}
             - secretRef:
                 name: {{ template "secret.stackitVllmName" . }}
+            - secretRef:
+                name: {{ template "secret.stackitEmbedderName" . }}
           env:
             - name: PYTHONPATH
               value: {{ .Values.adminBackend.pythonPathEnv.PYTHONPATH }}
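
The init container above stages the Punkt data under `/home/nonroot/nltk_data`, which NLTK would find if that is the runtime user's home directory (NLTK searches `~/nltk_data` by default). On the Python side, "NLTK Punkt with regex fallback" can be sketched as below. `sent_tokenize` is NLTK's public tokenizer entry point; the function name `split_sentences` and the fallback regex are assumptions, not code from this commit.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, preferring NLTK Punkt when it is usable."""
    try:
        from nltk.tokenize import sent_tokenize  # optional dependency

        return sent_tokenize(text)
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: nltk is installed but the Punkt data was not found.
        # Assumed fallback: break after ., ! or ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
```

Catching `LookupError` is what keeps the chunker working if the download in the init container fails or the volume is not mounted; the regex is cruder than Punkt (it will split on abbreviations such as "e.g. this"), so it is only a safety net.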

infrastructure/rag/values.yaml

Lines changed: 9 additions & 1 deletion
@@ -338,8 +338,16 @@ adminBackend:
     ragapi:
       RAG_API_HOST: "http://backend:8080"
     chunker:
+      # Select which chunker implementation to use. Supported values: "semantic", "recursive"
+      # Defaults to "recursive"; "semantic" additionally applies sentence-aware rebalancing.
+      CHUNKER_CLASS_TYPE_CHUNKER_TYPE: "recursive"
       CHUNKER_MAX_SIZE: 1000
-      CHUNKER_OVERLAP: 300
+      CHUNKER_OVERLAP: 100
+      # The following settings for the Chunker are only used when CHUNKER_CLASS_TYPE_CHUNKER_TYPE is set to "semantic".
+      CHUNKER_BREAKPOINT_THRESHOLD_TYPE: "percentile"
+      CHUNKER_BREAKPOINT_THRESHOLD_AMOUNT: 95
+      CHUNKER_BUFFER_SIZE: 1
+      CHUNKER_MIN_SIZE: 200
     keyValueStore:
       USECASE_KEYVALUE_PORT: 6379
       USECASE_KEYVALUE_HOST: "rag-keydb"

libs/README.md

Lines changed: 63 additions & 14 deletions
Large diffs are not rendered by default.

libs/admin-api-lib/poetry.lock

Lines changed: 55 additions & 1 deletion
Some generated files are not rendered by default.

libs/admin-api-lib/pyproject.toml

Lines changed: 5 additions & 4 deletions
@@ -27,9 +27,9 @@ per-file-ignores = """
     ./src/admin_api_lib/prompt_templates/summarize_prompt.py: E501,
     ./src/admin_api_lib/apis/admin_api.py: B008,WOT001,
     ./src/admin_api_lib/impl/admin_api.py: B008,
-    ./src/admin_api_lib/dependency_container.py: CCE002,CCE001,
+    ./src/admin_api_lib/dependency_container.py: CCE002,CCE001,WOT001,
     ./src/admin_api_lib/apis/admin_api_base.py: WOT001,
-    ./tests/*: S101,S106,D100,D103,PT011,N802
+    ./tests/*: S101,S106,D100,D103,PT011,N802,E501,
     ./src/admin_api_lib/impl/settings/confluence_settings.py: C901,N805,
     ./src/admin_api_lib/impl/utils/comma_separated_bool_list.py: R505,
     ./src/admin_api_lib/impl/utils/comma_separated_str_list.py: R505,
@@ -109,10 +109,11 @@ redis = "^6.0.0"
 pyyaml = "^6.0.2"
 python-multipart = "^0.0.20"
 starlette = ">=0.47.2,<0.49.0"
-langchain-text-splitters = ">=0.3.9"
+langchain-experimental = "^0.3.4"
+nltk = "^3.9.2"

 [tool.pytest.ini_options]
-log_cli = 1
+log_cli = true
 log_cli_level = "DEBUG"
 pythonpath = "src"
 testpaths = "src/tests"

libs/admin-api-lib/src/admin_api_lib/dependency_container.py

Lines changed: 55 additions & 7 deletions
@@ -2,13 +2,9 @@

 from admin_api_lib.impl.api_endpoints.default_file_uploader import DefaultFileUploader
 from dependency_injector.containers import DeclarativeContainer
-from dependency_injector.providers import (  # noqa: WOT001
-    Configuration,
-    List,
-    Selector,
-    Singleton,
-)
+from dependency_injector.providers import Configuration, List, Selector, Singleton
 from langchain.text_splitter import RecursiveCharacterTextSplitter
+from langchain_community.embeddings import OllamaEmbeddings
 from langfuse import Langfuse

 from admin_api_lib.extractor_api_client.openapi_client.api.extractor_api import (
@@ -29,6 +25,7 @@
 from admin_api_lib.impl.api_endpoints.default_documents_status_retriever import (
     DefaultDocumentsStatusRetriever,
 )
+from admin_api_lib.impl.chunker.semantic_text_chunker import SemanticTextChunker
 from admin_api_lib.impl.chunker.text_chunker import TextChunker
 from admin_api_lib.impl.file_services.s3_service import S3Service
 from admin_api_lib.impl.information_enhancer.general_enhancer import GeneralEnhancer
@@ -41,6 +38,7 @@
 from admin_api_lib.impl.mapper.informationpiece2document import (
     InformationPiece2Document,
 )
+from admin_api_lib.impl.settings.chunker_class_type_settings import ChunkerClassTypeSettings
 from admin_api_lib.impl.settings.chunker_settings import ChunkerSettings
 from admin_api_lib.impl.settings.document_extractor_settings import (
     DocumentExtractorSettings,
@@ -59,12 +57,21 @@
 from admin_api_lib.rag_backend_client.openapi_client.configuration import (
     Configuration as RagConfiguration,
 )
+from rag_core_lib.impl.embeddings.langchain_community_embedder import (
+    LangchainCommunityEmbedder,
+)
+from rag_core_lib.impl.embeddings.stackit_embedder import StackitEmbedder
 from rag_core_lib.impl.langfuse_manager.langfuse_manager import LangfuseManager
 from rag_core_lib.impl.llms.llm_factory import chat_model_provider
+from rag_core_lib.impl.settings.embedder_class_type_settings import (
+    EmbedderClassTypeSettings,
+)
 from rag_core_lib.impl.settings.langfuse_settings import LangfuseSettings
+from rag_core_lib.impl.settings.ollama_embedder_settings import OllamaEmbedderSettings
 from rag_core_lib.impl.settings.ollama_llm_settings import OllamaSettings
 from rag_core_lib.impl.settings.rag_class_types_settings import RAGClassTypeSettings
 from rag_core_lib.impl.settings.retry_decorator_settings import RetryDecoratorSettings
+from rag_core_lib.impl.settings.stackit_embedder_settings import StackitEmbedderSettings
 from rag_core_lib.impl.settings.stackit_vllm_settings import StackitVllmSettings
 from rag_core_lib.impl.tracers.langfuse_traced_runnable import LangfuseTracedRunnable
 from rag_core_lib.impl.utils.async_threadsafe_semaphore import AsyncThreadsafeSemaphore
@@ -74,10 +81,14 @@ class DependencyContainer(DeclarativeContainer):
     """Dependency injection container for managing application dependencies."""

     class_selector_config = Configuration()
+    chunker_selector_config = Configuration()

     # Settings
     s3_settings = S3Settings()
     chunker_settings = ChunkerSettings()
+    chunker_embedder_type_settings = EmbedderClassTypeSettings()
+    stackit_chunker_embedder_settings = StackitEmbedderSettings()
+    ollama_chunker_embedder_settings = OllamaEmbedderSettings()
     ollama_settings = OllamaSettings()
     langfuse_settings = LangfuseSettings()
     stackit_vllm_settings = StackitVllmSettings()
@@ -88,6 +99,10 @@ class DependencyContainer(DeclarativeContainer):
     summarizer_settings = SummarizerSettings()
     source_uploader_settings = SourceUploaderSettings()
     retry_decorator_settings = RetryDecoratorSettings()
+    chunker_type_settings = ChunkerClassTypeSettings()
+
+    class_selector_config.from_dict(rag_class_type_settings.model_dump() | chunker_embedder_type_settings.model_dump())
+    chunker_selector_config.from_dict(chunker_type_settings.model_dump())

     key_value_store = Singleton(FileStatusKeyValueStore, key_value_store_settings)
     file_service = Singleton(S3Service, s3_settings=s3_settings)
@@ -96,7 +111,40 @@ class DependencyContainer(DeclarativeContainer):
         chunk_size=chunker_settings.max_size, chunk_overlap=chunker_settings.overlap
     )

-    chunker = Singleton(TextChunker, text_splitter)
+    semantic_chunker_embeddings = Selector(
+        class_selector_config.embedder_type,
+        stackit=Singleton(
+            StackitEmbedder,
+            stackit_chunker_embedder_settings,
+            retry_decorator_settings,
+        ),
+        ollama=Singleton(
+            LangchainCommunityEmbedder,
+            embedder=Singleton(
+                OllamaEmbeddings,
+                model=ollama_chunker_embedder_settings.model,
+                base_url=ollama_chunker_embedder_settings.base_url,
+            ),
+        ),
+    )
+
+    semantic_chunker = Singleton(
+        SemanticTextChunker,
+        embeddings=semantic_chunker_embeddings,
+        breakpoint_threshold_type=chunker_settings.breakpoint_threshold_type,
+        breakpoint_threshold_amount=chunker_settings.breakpoint_threshold_amount,
+        buffer_size=chunker_settings.buffer_size,
+        min_chunk_size=chunker_settings.min_size,
+        max_chunk_size=chunker_settings.max_size,
+        recursive_text_splitter=text_splitter,
+        overlap=chunker_settings.overlap,
+    )
+
+    chunker = Selector(
+        chunker_selector_config.chunker_type,
+        recursive=Singleton(TextChunker, text_splitter),
+        semantic=semantic_chunker,
+    )
     extractor_api_configuration = Singleton(ExtractorConfiguration, host=document_extractor_settings.host)
     document_extractor_api_client = Singleton(ApiClient, extractor_api_configuration)
     document_extractor = Singleton(ExtractorApi, document_extractor_api_client)
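
The `Selector` wiring above picks one provider by a configuration key at resolution time. The following plain-Python analogue illustrates that idea only. It does not use `dependency_injector`, and `LazySingleton`, `SimpleSelector`, `RecursiveStub` and `SemanticStub` are hypothetical stand-ins rather than classes from this repository. Unlike the real library, it does not resolve nested providers passed as arguments.

```python
from typing import Any, Callable


class LazySingleton:
    """Create the wrapped object on first call and reuse it afterwards."""

    def __init__(self, factory: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
        self._factory, self._args, self._kwargs = factory, args, kwargs
        self._instance: Any = None
        self._created = False

    def __call__(self) -> Any:
        if not self._created:
            self._instance = self._factory(*self._args, **self._kwargs)
            self._created = True
        return self._instance


class SimpleSelector:
    """Call only the provider registered under the currently configured key."""

    def __init__(self, key: Callable[[], str], **providers: Callable[[], Any]) -> None:
        self._key, self._providers = key, providers

    def __call__(self) -> Any:
        name = self._key()
        if name not in self._providers:
            raise KeyError(f"no provider registered for {name!r}")
        return self._providers[name]()


class RecursiveStub:
    """Stand-in for a recursive chunker (hypothetical)."""


class SemanticStub:
    """Stand-in for a semantic chunker that needs an embedder (hypothetical)."""

    def __init__(self, embeddings: str) -> None:
        self.embeddings = embeddings


config = {"chunker_type": "recursive"}
chunker = SimpleSelector(
    lambda: config["chunker_type"],
    recursive=LazySingleton(RecursiveStub),
    semantic=LazySingleton(SemanticStub, embeddings="fake-embedder"),
)
```

In this sketch the semantic stub, and with it the embedder, is never constructed while `chunker_type` stays `recursive`, which mirrors why the default mode should not need embedder credentials at resolution time.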
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+"""Module containing the ChunkerType enumeration."""
+
+from enum import StrEnum, unique
+
+
+@unique
+class ChunkerType(StrEnum):
+    """An enumeration representing different types of chunkers."""
+
+    SEMANTIC = "semantic"
+    RECURSIVE = "recursive"
