Text Splitter Customizations

Updating the Model Name

The default text splitter is a SentenceTransformersTokenTextSplitter instance. The text splitter uses a pre-trained model from Hugging Face to identify sentence boundaries. You can change the model by setting the APP_TEXTSPLITTER_MODELNAME environment variable in the chain-server service of your docker-compose.yaml file, as shown in the following example:

services:
  chain-server:
    environment:
      APP_TEXTSPLITTER_MODELNAME: intfloat/e5-large-v2
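
Conceptually, the chain server resolves this setting from the environment at startup. The following sketch is illustrative only, not the actual chain-server code; the fallback value here is an assumption for demonstration:

```python
import os

# Illustrative sketch: resolve the splitter model name from the environment,
# falling back to a default when the variable is not set.
DEFAULT_MODEL = "intfloat/e5-large-v2"  # assumed default, for illustration

model_name = os.environ.get("APP_TEXTSPLITTER_MODELNAME", DEFAULT_MODEL)
print(model_name)
```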

Adjusting Chunk Size and Overlap

The text splitter divides documents into smaller chunks for processing. You can control the chunk size and overlap using environment variables in the chain-server service of your docker-compose.yaml file:

  • APP_TEXTSPLITTER_CHUNKSIZE: Sets the maximum number of tokens allowed in each chunk.
  • APP_TEXTSPLITTER_CHUNKOVERLAP: Defines the number of tokens that overlap between consecutive chunks.

services:
  chain-server:
    environment:
      APP_TEXTSPLITTER_CHUNKSIZE: 256
      APP_TEXTSPLITTER_CHUNKOVERLAP: 128
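
To see how these two settings interact, here is a minimal sketch of token windowing (a simplified stand-in for the real splitter, using plain Python lists instead of model tokens):

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=128):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    step = chunk_size - chunk_overlap  # tokens advanced per new chunk
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# 600 tokens with the example settings above yields overlapping windows:
chunks = chunk_tokens(list(range(600)), chunk_size=256, chunk_overlap=128)
```

With a 128-token overlap, each chunk repeats the last 128 tokens of its predecessor, which helps preserve context that would otherwise be cut at a chunk boundary.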

Using a Custom Text Splitter

While the default text splitter works well, you can also implement a custom splitter for specific needs.

  1. Modify the get_text_splitter method in RAG/src/chain_server/utils.py to return your custom text splitter class. For example, to use LangChain's RecursiveCharacterTextSplitter:

    def get_text_splitter():
        from langchain.text_splitter import RecursiveCharacterTextSplitter

        # Chunk size and overlap come from the app configuration, which is
        # populated from the environment variables described above.
        return RecursiveCharacterTextSplitter(
            chunk_size=get_config().text_splitter.chunk_size - 2,
            chunk_overlap=get_config().text_splitter.chunk_overlap,
        )

    Make sure the chunks created by the function contain fewer tokens than the context length of the embedding model.
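
One way to sanity-check this constraint is a quick count over the produced chunks. This sketch uses a naive whitespace tokenizer as a stand-in for the embedding model's real tokenizer (which typically produces more tokens, so treat this as a lower bound):

```python
def chunks_fit_context(chunks, context_length):
    """Rough check that every chunk stays under the embedding model's
    context length, using whitespace tokens as an approximation."""
    return all(len(chunk.split()) < context_length for chunk in chunks)

# Example with a 512-token context window:
ok = chunks_fit_context(["a short chunk", "another small chunk"], 512)
```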

Build and Start the Container

After you change the get_text_splitter function, build and start the container.

  1. Navigate to the example directory.

    cd RAG/examples/basic_rag/llamaindex
  2. Build and deploy the microservice.

    docker compose up -d --build