feat: generate stream response for chat API #23

Merged · 8 commits · Mar 28, 2024
44 changes: 13 additions & 31 deletions docs/development/04-session.md
@@ -17,10 +17,10 @@ Open the `chat.ts` file and let's make some significant changes to this code.
- `packages/packages/api/functions/chat.ts`:

```typescript
import { HttpRequest, HttpResponseInit, InvocationContext } from '@azure/functions';
import { badRequest, serviceUnavailable } from '../utils';
import { AzureChatOpenAI, AzureOpenAIEmbeddings } from '@langchain/azure-openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';

import 'dotenv/config';

@@ -41,27 +41,21 @@ export async function testChat(request: HttpRequest, context: InvocationContext)
const model = new AzureChatOpenAI();

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
['system', "Answer the user's questions based on the below context:\n\n{context}"],
['human', '{input}'],
]);

return {
status: 200,
body: 'Testing chat function.',
};
} catch (error: unknown) {
const error_ = error as Error;
context.error(`Error when processing request: ${error_.message}`);

return serviceUnavailable(new Error('Service temporarily unavailable. Please try again later.'));
}
}
```

Let's understand what we did here:
@@ -112,18 +106,18 @@ Now that we have created a more dynamic chat, let's implement the `chain` so that…

Let's go through, line by line, what we did:

We created a `combineDocsChain` using the `createStuffDocumentsChain` function.

This function creates a chain that passes a list of documents into a prompt template. It takes a few parameters, including `llm`, which is the language model we are using, and `prompt`, which is the prompt that drives the conversation with the model. We will use them to create the chain.

Just as we did in the `upload` API, we need to store the vectors in the database. To do this, we created a variable called `store` to instantiate the `AzureCosmosDBVectorStore` class. This class provides a vector store backed by Azure Cosmos DB, which is used to store and retrieve the embedding vectors.

We created the `chain` using the `createRetrievalChain` function. This function creates a retrieval chain that retrieves the relevant documents and then passes them on to the chat. That's why it takes two parameters:

- `retriever`: returns a list of relevant documents.
- `combineDocsChain`: produces a string output from those documents.

Finally, we invoked the `chain` using the `invoke` method. This method runs the chain with the input question and returns the response from the language model.
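
To make this easier to follow, here is a rough sketch of how those pieces fit together in the handler (variable names such as `model`, `embeddings`, `questionAnsweringPrompt`, and `question` follow the earlier snippets; the complete file appears in the next session's diff):

```typescript
import { createStuffDocumentsChain } from 'langchain/chains/combine_documents';
import { createRetrievalChain } from 'langchain/chains/retrieval';
import { AzureCosmosDBVectorStore } from '@langchain/community/vectorstores/azure_cosmosdb';

// Sketch of the pieces described above (non-streaming version, ending with invoke()).
// Assumes `model`, `embeddings`, `questionAnsweringPrompt`, and `question` are defined
// earlier in the handler, as in the previous snippets.
const combineDocsChain = await createStuffDocumentsChain({
  llm: model,                      // the AzureChatOpenAI instance
  prompt: questionAnsweringPrompt, // the ChatPromptTemplate defined above
});

// Vector store backed by Azure Cosmos DB, reusing the embeddings instance.
const store = new AzureCosmosDBVectorStore(embeddings, {});

// Retrieval chain: fetch the relevant documents, then stuff them into the prompt.
const chain = await createRetrievalChain({
  retriever: store.asRetriever(),
  combineDocsChain,
});

// Run the chain with the user's question and wait for the complete answer.
const response = await chain.invoke({ input: question });
```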

Wow! We have completed our `chat` API. Now, let's test it together with the `upload` API.

@@ -157,18 +151,6 @@ You will see the exact response requested in the `chat` request. If you want to…

![chat API](./images/chat-final-result.gif)

We haven't finished our project yet. We still have one more very important item that we mustn't forget to implement in a chat: the `stream` response. We'll learn how to do this in the next session.

▶ **[Next Step: Generate `stream` response](./05-session.md)**

239 changes: 238 additions & 1 deletion docs/development/05-session.md
@@ -1,3 +1,240 @@
# Generate a stream response in the `chat` API

In this session, we will learn how to generate a stream response in the `chat` API using LangChain.js, together with the new HTTP streaming support that is also available in v4 of the Azure Functions programming model.

## What is streaming?

Streaming is crucial for large language models (LLMs) for several reasons:

- **Efficient memory management**: it allows models to process long texts without overloading memory.
- **Better scalability**: it makes it easier to process inputs of virtually unlimited size.
- **Lower latency in real-time interactions**: it provides faster responses in virtual assistants and dialog systems.
- **Easier training and inference on large data sets**: it makes the use of LLMs more practical and efficient.
- **Potentially higher-quality generated text**: it helps models focus on smaller pieces of text for greater cohesion and contextual relevance.
- **Support for distributed workflows**: it allows models to be scaled to meet intense processing demands.

For these reasons, some large language models (LLMs) can send their responses sequentially. This means you don't need to wait for the full response to be received before you start working with it. This is especially useful if you want to show the response to the user as it is produced, or if you need to analyze and use the response while it is still being formed.

LangChain.js supports streaming through the `.stream()` method. If you want to know more about it, see the **[official LangChain.js documentation](https://js.langchain.com/docs/use_cases/question_answering/streaming#chain-with-sources)**.
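
As a quick illustration, here is a minimal sketch of streaming directly from a chat model (it assumes an `AzureChatOpenAI` instance configured through environment variables, as in the previous sessions, and an async context such as an ES module with top-level await):

```typescript
import { AzureChatOpenAI } from '@langchain/azure-openai';
import 'dotenv/config';

// .stream() returns an async iterable of message chunks instead of a single,
// complete response.
const model = new AzureChatOpenAI();
const stream = await model.stream('Tell me a short joke.');

for await (const chunk of stream) {
  // Each chunk carries a partial piece of the answer as it is generated.
  process.stdout.write(chunk.content as string);
}
```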

## Support for HTTP Streams in Azure Functions

The Azure Functions product team recently announced the availability of support for HTTP Streams in version 4 of the Azure Functions programming model. With this, it is now possible to return stream responses in HTTP APIs, which is especially useful for real-time data streaming scenarios.

To find out more about streaming support in Azure Functions v4, you can visit Microsoft's Tech Community blog by clicking **[here](https://techcommunity.microsoft.com/t5/apps-on-azure-blog/azure-functions-support-for-http-streams-in-node-js-is-now-in/ba-p/4066575)**.

## Enabling HTTP Streams support in Azure Functions

Well, now that we've understood the importance of using streaming in a chat and how useful it can be, let's learn how we can introduce it into the `chat` API.

The first thing we need to do is enable the new Azure Functions feature, which is streaming support. To do this, open the file `index.ts` and include the following code:

- `index.ts`

```typescript
app.setup({ enableHttpStream: true });
```

So the `index.ts` file will look like this:

- `index.ts`

```typescript
import { app } from '@azure/functions';
import { chat } from './functions/chat';
import { upload } from './functions/upload';

app.setup({ enableHttpStream: true });
app.post('chat', {
route: 'chat',
authLevel: 'anonymous',
handler: chat,
});

app.post('upload', {
route: 'upload',
authLevel: 'anonymous',
handler: upload,
});
```

And that's it! Azure Functions is now enabled to support streaming.

## Generating a stream response in the `chat` API

Now, let's move on and create the logic to generate a stream response in the `chat` API.

Open the `chat.ts` file and let's make some significant changes:

- `chat.ts`

```typescript
import { IterableReadableStream } from '@langchain/core/dist/utils/stream';
import { Document } from '@langchain/core/documents';
import { HttpRequest, InvocationContext, HttpResponseInit } from '@azure/functions';
import { AzureOpenAIEmbeddings, AzureChatOpenAI } from '@langchain/azure-openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { createStuffDocumentsChain } from 'langchain/chains/combine_documents';
import { AzureCosmosDBVectorStore } from '@langchain/community/vectorstores/azure_cosmosdb';
import { createRetrievalChain } from 'langchain/chains/retrieval';
import 'dotenv/config';
import { badRequest, serviceUnavailable, okStreamResponse } from '../utils';
import { Readable } from 'stream';

export async function chat(request: HttpRequest, context: InvocationContext): Promise<HttpResponseInit> {
context.log(`Http function processed request for url "${request.url}"`);

try {
const requestBody: any = await request.json();

if (!requestBody?.question) {
return badRequest(new Error('No question provided'));
}

const { question } = requestBody;

const embeddings = new AzureOpenAIEmbeddings();

const prompt = `Question: ${question}`;
context.log(`Sending prompt to the model: ${prompt}`);

const model = new AzureChatOpenAI();

const questionAnsweringPrompt = ChatPromptTemplate.fromMessages([
['system', "Answer the user's questions based on the below context:\n\n{context}"],
['human', '{input}'],
]);

const combineDocsChain = await createStuffDocumentsChain({
llm: model,
prompt: questionAnsweringPrompt,
});

const store = new AzureCosmosDBVectorStore(embeddings, {});

const chain = await createRetrievalChain({
retriever: store.asRetriever(),
combineDocsChain,
});

const response = await chain.stream({
input: question,
});

return {
body: createStream(response),
headers: {
'Content-Type': 'text/plain',
},
};
} catch (error: unknown) {
const error_ = error as Error;
context.error(`Error when processing chat request: ${error_.message}`);

return serviceUnavailable(new Error('Service temporarily unavailable. Please try again later.'));
}
}

function createStream(
chunks: IterableReadableStream<
{
context: Document[];
answer: string;
} & {
[key: string]: unknown;
}
>,
) {
const buffer = new Readable({
read() {},
});

const stream = async () => {
for await (const chunk of chunks) {
buffer.push(chunk.answer);
}

buffer.push(null);
};

stream();

return buffer;
}
```

Quite a few changes, right? Let's go through what has been changed and added here:

```typescript
const response = await chain.stream({
input: question,
});
```

Before, we called the `chain` using the `invoke()` method. However, since we now want to generate a stream response, we use the `stream()` method instead, passing the `input` parameter with the question the user asked.

After that, we're returning the stream response, using the `createStream()` function.

```typescript
function createStream(
chunks: IterableReadableStream<
{
context: Document[];
answer: string;
} & {
[key: string]: unknown;
}
>,
) {
const buffer = new Readable({
read() {},
});

const stream = async () => {
for await (const chunk of chunks) {
buffer.push(chunk.answer);
}

buffer.push(null);
};

stream();

return buffer;
}
```

The `createStream()` function is responsible for building the response stream. It receives `chunks` as a parameter, an `IterableReadableStream` containing the chunks of the response. Each chunk's `answer` is pushed into `buffer`, a readable stream, and once all chunks have been consumed the stream is closed by pushing `null`. Finally, the buffer is returned.

Note that we are importing:

- `IterableReadableStream` from `@langchain/core/dist/utils/stream`: an iterable stream type used here to type the chunks of the streamed response.
- `Document` from `@langchain/core/documents`: the interface representing a retrieved document.
- `Readable` from Node.js's built-in `stream` module: a class that represents a readable stream of data, which we use to build the response body.

```typescript
return {
headers: { 'Content-Type': 'text/plain' },
body: createStream(response),
};
```

Finally, we return the stream created by `createStream()` as the response body, setting the `Content-Type` header to `text/plain`.
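
As a side note on the design: instead of manually pushing chunks into a `Readable` and closing it with `push(null)`, Node's built-in `Readable.from()` can wrap an async iterable directly. A minimal sketch of that alternative, assuming the same `chunks` shape, is shown below (`createStreamFromIterable` is a hypothetical name, not part of the project):

```typescript
import { Readable } from 'stream';

// Alternative sketch: wrap the chain's async iterable in a Readable directly,
// avoiding the manual push()/push(null) bookkeeping of createStream().
function createStreamFromIterable(chunks: AsyncIterable<{ answer: string }>): Readable {
  async function* answers() {
    for await (const chunk of chunks) {
      yield chunk.answer; // forward only the answer fragment of each chunk
    }
  }

  // objectMode: false keeps the stream emitting raw text chunks,
  // matching the behavior of the original implementation.
  return Readable.from(answers(), { objectMode: false });
}
```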

And that's it! Now the `chat` API is ready to generate stream responses.

Let's test the `chat` API and see how it behaves when generating a stream response. To do this, open the terminal again in the `api` folder and run the command:

```bash
npm run start
```

Then open the `api.http` file and send the `chat` API request; you can see the streamed response in the gif below:

![chat-stream](./images/stream-response.gif)

Note that when we send the request, the response headers include `Transfer-Encoding: chunked`, which indicates that the response is being sent in chunks. The answer is displayed incrementally, i.e. it appears as it is generated.

![chat-stream-response](./images/stream-response.png)
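
If you prefer to exercise the endpoint from code rather than from the `api.http` file, a hypothetical client sketch could consume the chunked response like this (it assumes Node.js 18+ with the global `fetch`, the Functions host running locally on the default `http://localhost:7071` address, and a question of your choice):

```typescript
// Hypothetical client sketch: read the chat response chunk by chunk as it arrives.
const response = await fetch('http://localhost:7071/api/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ question: 'Summarize the uploaded document.' }),
});

const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Each decoded piece is a fragment of the answer, printed as soon as it arrives.
  process.stdout.write(decoder.decode(value, { stream: true }));
}
```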

And that's it! You've now learned how to generate a stream response in the `chat` API using LangChain.js and the new stream feature that is also available for v4 of the Azure Functions programming model.
Binary file added docs/development/images/chat-stream-response.png
Binary file added docs/development/images/stream-response.gif