Commit ad6a166

Added semantic kernel tutorial
1 parent ce09d28 commit ad6a166

1 file changed

+387
-0
lines changed

@@ -0,0 +1,387 @@
---
# frontmatter
path: "/tutorial-csharp-semantic-kernel-vector-search"
# title and description do not need to be added to markdown, start with H2 (##)
title: Build Vector Search with Couchbase .NET Semantic Kernel Connector and OpenAI
short_title: Vector Search with Semantic Kernel
description:
  - Build a semantic search application using Couchbase BHIVE vector index with Semantic Kernel.
  - Learn to use the Couchbase .NET Vector Store Connector for Microsoft Semantic Kernel.
  - Discover how to generate embeddings with OpenAI and store them in Couchbase.
  - Perform vector similarity searches with filtering using SQL++ and ANN_DISTANCE.
content_type: tutorial
filter: sdk
technology:
  - vector-search
  - kv
tags:
  - Semantic Kernel
  - OpenAI
  - Vector Search
sdk_language:
  - csharp
length: 30 Mins
---
25+
26+
## Repository Links
27+
28+
- **Connector Repository**: [couchbase-semantic-kernel](https://github.com/Couchbase-Ecosystem/couchbase-semantic-kernel) - The official Couchbase .NET Vector Store Connector for Microsoft Semantic Kernel
29+
- **This Example**: [CouchbaseVectorSearchDemo](https://github.com/Couchbase-Ecosystem/couchbase-semantic-kernel/tree/Support-Bhive-and-Composite-Index/CouchbaseVectorSearchDemo) - Complete working example demonstrating vector search with Couchbase
30+
31+
## Introduction
32+
33+
This demo showcases the **Semantic Kernel Couchbase connector**, a .NET library that connects Microsoft's Semantic Kernel framework to Couchbase's vector search capabilities. With it, developers can build AI-powered applications using familiar Semantic Kernel abstractions while relying on Couchbase vector indexes for semantic search.
34+
35+
The connector supports three index types:
36+
- **BHIVE** (Hyperscale Vector Index) - for pure vector search at scale ← *Used in this demo*
37+
- **Composite Vector Index** - for vector search with heavy scalar filtering
38+
- **FTS** (Full-Text Search) - for hybrid text + semantic search
39+
40+
This makes the connector ideal for RAG (Retrieval-Augmented Generation) applications, semantic search engines, hybrid search, and recommendation systems.
41+
42+
## Prerequisites
43+
44+
### 1. Couchbase Server Setup
45+
- **Couchbase Server 8.0+**
46+
- Local installation or Couchbase Cloud/Capella
47+
- Bucket with proper read/write permissions
48+
- Query service enabled for SQL++ operations
49+
50+
### 2. OpenAI API Access
51+
- **OpenAI API Key** - Get one from: https://platform.openai.com/api-keys
52+
- Used for generating text embeddings with `text-embedding-ada-002` model
53+
- Ensure you have sufficient API quota for embedding generation
54+
55+
### 3. Development Environment
56+
- **.NET 8.0** or later
57+
- Visual Studio, VS Code, or JetBrains Rider
58+
- Basic understanding of C# and vector databases
59+
60+
61+
## Setting Up the Environment
62+
63+
### 1. Clone and Navigate
64+
```bash
git clone https://github.com/Couchbase-Ecosystem/couchbase-semantic-kernel.git
cd couchbase-semantic-kernel/CouchbaseVectorSearchDemo
```
68+
69+
### 2. Install Dependencies
70+
```bash
dotnet restore
```
73+
74+
### 3. Configuration Setup
75+
76+
Update `appsettings.Development.json` with your credentials:
77+
78+
```json
{
  "OpenAI": {
    "ApiKey": "your-openai-api-key-here",
    "EmbeddingModel": "text-embedding-ada-002"
  },
  "Couchbase": {
    "ConnectionString": "couchbase://localhost",
    "Username": "Administrator",
    "Password": "your-password",
    "BucketName": "demo",
    "ScopeName": "semantic-kernel",
    "CollectionName": "glossary"
  }
}
```
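
Before running the demo, it can help to confirm that every setting has been filled in. The following is a minimal sketch, not part of the demo: `IsRealValue` and `FindMissingSettings` are hypothetical helpers that use only `System.Text.Json`, and treating values that start with `your-` as unfilled placeholders is an assumption based on the sample file above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper (not part of the demo): true when the setting exists,
// is a non-empty string, and is not one of the "your-..." placeholders.
static bool IsRealValue(JsonElement root, string section, string key)
{
    if (root.ValueKind != JsonValueKind.Object ||
        !root.TryGetProperty(section, out var sectionElement) ||
        sectionElement.ValueKind != JsonValueKind.Object ||
        !sectionElement.TryGetProperty(key, out var value) ||
        value.ValueKind != JsonValueKind.String)
    {
        return false;
    }

    var text = value.GetString() ?? string.Empty;
    return text.Trim().Length > 0 && !text.StartsWith("your-", StringComparison.Ordinal);
}

// Hypothetical helper: lists every required "Section:Key" that is missing or unfilled.
static List<string> FindMissingSettings(string json)
{
    var required = new (string Section, string Key)[]
    {
        ("OpenAI", "ApiKey"), ("OpenAI", "EmbeddingModel"),
        ("Couchbase", "ConnectionString"), ("Couchbase", "Username"),
        ("Couchbase", "Password"), ("Couchbase", "BucketName"),
        ("Couchbase", "ScopeName"), ("Couchbase", "CollectionName"),
    };

    using var document = JsonDocument.Parse(json);
    var missing = new List<string>();
    foreach (var (section, key) in required)
    {
        if (!IsRealValue(document.RootElement, section, key))
            missing.Add($"{section}:{key}");
    }
    return missing;
}

// Usage: a config with the Couchbase section filled in but the API key left as a placeholder.
var sampleJson = @"{
  ""OpenAI"": { ""ApiKey"": ""your-openai-api-key-here"", ""EmbeddingModel"": ""text-embedding-ada-002"" },
  ""Couchbase"": {
    ""ConnectionString"": ""couchbase://localhost"", ""Username"": ""Administrator"",
    ""Password"": ""s3cret"", ""BucketName"": ""demo"",
    ""ScopeName"": ""semantic-kernel"", ""CollectionName"": ""glossary""
  }
}";
Console.WriteLine(string.Join(", ", FindMissingSettings(sampleJson))); // OpenAI:ApiKey
```

In a real run you would pass the contents of `appsettings.Development.json` (for example via `File.ReadAllText`) instead of the inline sample.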
94+
95+
## Understanding the Data Model
96+
97+
The demo uses a `Glossary` class that demonstrates Semantic Kernel's vector store data model. The model uses attributes to define how properties are stored and indexed in the vector database.
98+
99+
For a comprehensive guide on data modeling in Semantic Kernel, refer to [Defining your data model](https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/defining-your-data-model?pivots=programming-language-csharp) in the official documentation.
100+
101+
### The Glossary Model
102+
103+
```csharp
using System;
using Microsoft.Extensions.VectorData;

internal sealed class Glossary
{
    // Becomes the Couchbase document ID.
    [VectorStoreKey]
    public string Key { get; set; }

    // IsIndexed = true marks the field as filterable in searches.
    [VectorStoreData(IsIndexed = true)]
    public string Category { get; set; }

    [VectorStoreData]
    public string Term { get; set; }

    [VectorStoreData]
    public string Definition { get; set; }

    // 1536 matches the output size of text-embedding-ada-002.
    [VectorStoreVector(Dimensions: 1536)]
    public ReadOnlyMemory<float> DefinitionEmbedding { get; set; }
}
```
122+
123+
## Step-by-Step Tutorial
124+
125+
### Step 1: Prepare Couchbase
126+
127+
Ensure you have the bucket, scope, and collection ready in Couchbase:
128+
- **Bucket**: `demo`
129+
- **Scope**: `semantic-kernel`
130+
- **Collection**: `glossary`
131+
132+
### Step 2: Data Ingestion and Embedding Generation
133+
134+
This step demonstrates how the connector works with Semantic Kernel's vector store abstractions:
135+
136+
**Getting the Collection** - The demo uses `CouchbaseVectorStore.GetCollection<TKey, TRecord>()` to obtain a collection reference configured for a BHIVE index:
137+
```csharp
var vectorStore = new CouchbaseVectorStore(scope);
var collection = vectorStore.GetCollection<string, Glossary>(
    "glossary",
    new CouchbaseQueryCollectionOptions
    {
        IndexName = "bhive_glossary_index", // BHIVE index name
        SimilarityMetric = "cosine"
    }
);
```
148+
149+
The `CouchbaseQueryCollectionOptions` works with both BHIVE and composite indexes - simply specify the appropriate index name. For FTS indexes, use `CouchbaseSearchCollection` with `CouchbaseSearchCollectionOptions` instead.
150+
151+
**Automatic Embedding Generation** - The connector integrates with Semantic Kernel's `IEmbeddingGenerator` interface, so text is converted to vectors by whichever embedding generator you supply (in this case, OpenAI's `text-embedding-ada-002`). The snippet below calls the generator explicitly and assigns the resulting vector to the record:
152+
153+
```csharp
// Generate embedding from text
var embedding = await embeddingGenerator.GenerateAsync(glossary.Definition);
glossary.DefinitionEmbedding = embedding.Vector;
```
158+
159+
For more details on embedding generation in Semantic Kernel, see [Embedding Generation Documentation](https://learn.microsoft.com/en-us/semantic-kernel/concepts/vector-store-connectors/embedding-generation?pivots=programming-language-csharp).
160+
161+
**Upserting Records** - The demo uses the connector's `UpsertAsync()` method to insert or update records in the collection:
162+
```csharp
await collection.UpsertAsync(glossaryEntries);
```
165+
166+
The demo creates six sample glossary entries with technical terms, generates an embedding for each definition, and stores the entries in Couchbase with the following structure:
167+
168+
**Document ID:** `"1"` (from Key field)
169+
**Document Content:**
170+
```json
{
  "Category": "Software",
  "Term": "API",
  "Definition": "Application Programming Interface. A set of rules...",
  "DefinitionEmbedding": [0.123, -0.456, 0.789, ...] // 1536 floats
}
```
178+
179+
### Step 3: BHIVE Index Creation
180+
181+
This demo uses a **BHIVE (Hyperscale Vector Index)**, which is optimized for pure vector searches without heavy scalar filtering. After the documents are inserted, the demo creates the BHIVE index:
182+
183+
```sql
CREATE VECTOR INDEX `bhive_glossary_index`
ON `demo`.`semantic-kernel`.`glossary` (DefinitionEmbedding VECTOR)
INCLUDE (Category, Term, Definition)
USING GSI WITH {
  "dimension": 1536,
  "similarity": "cosine",
  "description": "IVF,SQ8"
}
```
193+
194+
**BHIVE Index Configuration:**
195+
- **Index Type**: BHIVE (Hyperscale Vector Index) - best for pure vector similarity searches
196+
- **Vector Field**: `DefinitionEmbedding` (1536 dimensions)
197+
- **Similarity**: `cosine` (a common choice for OpenAI embeddings, which are normalized to unit length)
198+
- **Include Fields**: Non-vector fields for faster retrieval
199+
- **Quantization**: `IVF,SQ8` (Inverted File with 8-bit scalar quantization)
200+
201+
> **Note**: Composite vector indexes can be created similarly by adding scalar fields to the index definition. Use composite indexes when your queries frequently filter on scalar values before vector comparison. For this demo, we use BHIVE since we're demonstrating pure semantic search capabilities.
202+
203+
### Step 4: Vector Search Operations
204+
205+
The demo performs two types of searches using the connector's `SearchAsync()` method with the BHIVE index:
206+
207+
#### Pure Vector Search
208+
209+
Using the connector's search API:
210+
```csharp
// Generate embedding from search query
var searchVector = (await embeddingGenerator.GenerateAsync(
    "What is an Application Programming Interface?")).Vector;

// Search using the connector
var results = await collection.SearchAsync(searchVector, top: 1)
    .ToListAsync();
```
219+
220+
Behind the scenes, this executes a SQL++ query with `ANN_DISTANCE`:
221+
```sql
SELECT META().id AS _id, Category, Term, Definition,
       ANN_DISTANCE(DefinitionEmbedding, [0.1,0.2,...], 'cosine') AS _distance
FROM `demo`.`semantic-kernel`.`glossary`
ORDER BY _distance ASC
LIMIT 1
```
228+
229+
**Expected Result**: Finds "API" entry with high similarity
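
To see why the query orders by `_distance ASC`, here is a small in-memory sketch of the same ranking. It is an exact brute-force comparison over made-up three-dimensional vectors, whereas `ANN_DISTANCE` performs an approximate search over the index. `CosineDistance` is a hypothetical helper that uses the usual definition, one minus cosine similarity; that Couchbase computes its cosine distance the same way is an assumption here.

```csharp
using System;
using System.Linq;

// Hypothetical helper: cosine distance = 1 - cosine similarity.
// 0 means the vectors point the same way; larger values mean less similar.
static double CosineDistance(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }
    return 1.0 - dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy data: real embeddings from text-embedding-ada-002 have 1536 dimensions.
var glossary = new (string Term, float[] Vector)[]
{
    ("API", new[] { 1.0f, 0.0f, 0.0f }),
    ("RAG", new[] { 0.0f, 1.0f, 0.0f }),
    ("LLM", new[] { 0.6f, 0.8f, 0.0f }),
};
var queryVector = new[] { 0.9f, 0.1f, 0.0f };

// Equivalent of ORDER BY _distance ASC LIMIT 1.
var nearest = glossary
    .Select(entry => (entry.Term, Distance: CosineDistance(entry.Vector, queryVector)))
    .OrderBy(hit => hit.Distance)
    .First();

Console.WriteLine($"{nearest.Term}: {nearest.Distance:F4}"); // API has the smallest distance
```

Because the value is a distance rather than a similarity, the best match is the one with the lowest score.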
230+
231+
#### Filtered Vector Search
232+
233+
Even with a BHIVE index (designed for pure vector search), the connector supports filtering using LINQ expressions with `VectorSearchOptions`:
234+
```csharp
// Search with scalar filter
var results = await collection.SearchAsync(
    searchVector,
    top: 1,
    new VectorSearchOptions<Glossary>
    {
        Filter = g => g.Category == "AI"
    }).ToListAsync();
```
244+
245+
This translates to SQL++ with a WHERE clause:
246+
```sql
SELECT META().id AS _id, Category, Term, Definition,
       ANN_DISTANCE(DefinitionEmbedding, [0.1,0.2,...], 'cosine') AS _distance
FROM `demo`.`semantic-kernel`.`glossary`
WHERE Category = 'AI'
ORDER BY _distance ASC
LIMIT 1
```
254+
255+
**Query**: *"How do I provide additional context to an LLM?"*
256+
**Expected Result**: Finds "RAG" entry within AI category
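
The effect of the `WHERE` clause can be illustrated with an in-memory equivalent: filter first, then rank the remaining entries by distance and keep the top results. This is only a sketch over made-up two-dimensional vectors. `Search` and `CosineDistance` are hypothetical helpers and do not reflect how the connector or the index executes the query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: cosine distance = 1 - cosine similarity.
static double CosineDistance(float[] a, float[] b)
{
    double dot = a.Zip(b, (x, y) => (double)x * y).Sum();
    double normA = Math.Sqrt(a.Sum(x => (double)x * x));
    double normB = Math.Sqrt(b.Sum(y => (double)y * y));
    return 1.0 - dot / (normA * normB);
}

// Hypothetical helper mirroring WHERE + ORDER BY _distance ASC + LIMIT.
static List<string> Search(
    (string Category, string Term, float[] Vector)[] entries,
    float[] query, int top, Func<string, bool> categoryFilter)
{
    return entries
        .Where(entry => categoryFilter(entry.Category))          // WHERE Category = ...
        .OrderBy(entry => CosineDistance(entry.Vector, query))   // ORDER BY _distance ASC
        .Take(top)                                               // LIMIT top
        .Select(entry => entry.Term)
        .ToList();
}

// Toy data, not the demo's glossary.
var entries = new (string Category, string Term, float[] Vector)[]
{
    ("Software", "API", new[] { 1.0f, 0.0f }),
    ("AI",       "LLM", new[] { 0.8f, 0.6f }),
    ("AI",       "RAG", new[] { 0.0f, 1.0f }),
};
var query = new[] { 0.9f, 0.1f };

var unfiltered = Search(entries, query, top: 1, categoryFilter: _ => true);
var aiOnly = Search(entries, query, top: 1, categoryFilter: category => category == "AI");

Console.WriteLine($"Unfiltered: {unfiltered[0]}"); // API
Console.WriteLine($"AI only:    {aiOnly[0]}");     // LLM
```

With this toy data the filter changes the answer: the closest entry overall is excluded, so the closest entry within the `AI` category is returned instead.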
257+
258+
> **Note**: While BHIVE indexes support filtering as shown above, for scenarios where you frequently filter on scalar values with highly selective filters, consider using a **composite vector index** instead. The index creation syntax is similar - just add the scalar fields to the index definition. The connector's `SearchAsync()` method works identically with both index types.
259+
260+
## Understanding Vector Index Configuration
261+
262+
Couchbase offers three types of vector indexes optimized for different use cases:
263+
264+
### Index Types
265+
266+
**1. Hyperscale Vector Indexes (BHIVE)** ← *This demo uses BHIVE*
267+
- Uses SQL++ queries via `CouchbaseQueryCollection`
268+
- Best for pure vector searches without complex scalar filtering
269+
- Designed to scale to billions of vectors with low memory footprint
270+
- Optimized for high-performance concurrent operations
271+
- Ideal for: Large-scale semantic search, recommendations, content discovery
272+
- **Creation**: Using SQL++ `CREATE VECTOR INDEX` as shown in Step 3
273+
274+
**2. Composite Vector Indexes**
275+
- Uses SQL++ queries via `CouchbaseQueryCollection`
276+
- Best for filtered vector searches combining vector similarity with scalar filters
277+
- Efficient when scalar filters significantly reduce the search space
278+
- Ideal for: Compliance filtering, user-specific searches, time-bounded queries
279+
- **Creation**: Similar to BHIVE but includes scalar fields in the index definition
280+
281+
**3. FTS (Full-Text Search) Indexes**
282+
- Uses Couchbase Search API via `CouchbaseSearchCollection`
283+
- Best for hybrid search scenarios combining full-text search with vector similarity
284+
- Supports text search, faceting, and vector search in a single query
285+
- Ideal for: Hybrid search, text + semantic search, moderate scale deployments
286+
- **Creation**: Using Search Service index configuration with vector field support
287+
288+
289+
All three index types work with the same Semantic Kernel abstractions (`SearchAsync()`, `UpsertAsync()`, etc.). The main difference is which collection class you instantiate and the underlying query engine.
290+
291+
**Choosing the Right Type**:
292+
- Start with **BHIVE** for pure vector searches and large datasets
293+
- Use **Composite** when scalar filters eliminate large portions of data before vector comparison
294+
- Use **FTS** when you need hybrid search combining full-text and semantic search
295+
296+
For more details, see the [Couchbase Vector Index Documentation](https://preview.docs-test.couchbase.com/docs-server-DOC-12565_vector_search_concepts/server/current/vector-index/use-vector-indexes.html).
297+
298+
299+
### Index Configuration (Couchbase 8.0+)
300+
301+
The `description` parameter in the index definition controls vector storage optimization through centroids and quantization:
302+
303+
**Format**: `IVF[<centroids>],{PQ|SQ}<settings>`
304+
305+
**Centroids (IVF - Inverted File)**
306+
- Controls dataset subdivision for faster searches
307+
- More centroids = faster search, slower training
308+
- If omitted (e.g., `IVF,SQ8`), Couchbase auto-selects based on dataset size
309+
310+
**Quantization Options**
311+
- **SQ** (Scalar Quantization): `SQ4`, `SQ6`, `SQ8` (4, 6, or 8 bits per dimension)
312+
- **PQ** (Product Quantization): `PQ<subquantizers>x<bits>` (e.g., `PQ32x8`)
313+
- Higher values = better accuracy, larger index size
314+
315+
**Common Examples**:
316+
- `IVF,SQ8` - Auto centroids, 8-bit quantization (good default)
317+
- `IVF1000,SQ6` - 1000 centroids, 6-bit quantization (faster, less accurate)
318+
- `IVF,PQ32x8` - Auto centroids, product quantization (better accuracy)
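
To make the size trade-off concrete, the sketch below parses a `description` string and estimates the storage needed per quantized vector. `ParseDescription` is a hypothetical helper, not a Couchbase API. It assumes SQ stores the stated number of bits per dimension and that `PQ32x8` means 32 subquantizers of 8 bits each; the estimate ignores centroid tables and other index overhead.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: parses "IVF[<centroids>],{PQ|SQ}<settings>" and estimates
// the number of bits used to store one quantized vector.
static (int? Centroids, string Quantizer, int BitsPerVector) ParseDescription(
    string description, int dimension)
{
    var match = Regex.Match(description, @"^IVF(\d*),(?:SQ(4|6|8)|PQ(\d+)x(\d+))$");
    if (!match.Success)
        throw new FormatException($"Unrecognized description: {description}");

    // An empty centroid count means the server picks one automatically.
    int? centroids = match.Groups[1].Length > 0
        ? int.Parse(match.Groups[1].Value)
        : (int?)null;

    if (match.Groups[2].Success)
    {
        // Scalar quantization: a fixed number of bits for every dimension.
        var bitsPerDimension = int.Parse(match.Groups[2].Value);
        return (centroids, "SQ" + bitsPerDimension, dimension * bitsPerDimension);
    }

    // Product quantization (assumed layout): one code per subquantizer.
    var subquantizers = int.Parse(match.Groups[3].Value);
    var bitsPerCode = int.Parse(match.Groups[4].Value);
    return (centroids, $"PQ{subquantizers}x{bitsPerCode}", subquantizers * bitsPerCode);
}

const int Dimension = 1536; // text-embedding-ada-002
Console.WriteLine($"float32 (unquantized): {Dimension * 4} bytes per vector");

foreach (var spec in new[] { "IVF,SQ8", "IVF1000,SQ6", "IVF,PQ32x8" })
{
    var parsed = ParseDescription(spec, Dimension);
    Console.WriteLine($"{spec}: {parsed.BitsPerVector / 8} bytes per quantized vector");
}
// float32 (unquantized): 6144 bytes per vector
// IVF,SQ8: 1536 bytes per quantized vector
// IVF1000,SQ6: 1152 bytes per quantized vector
// IVF,PQ32x8: 32 bytes per quantized vector
```

Under these assumptions, `SQ8` cuts storage to a quarter of the raw `float32` vectors, while `PQ32x8` compresses far more aggressively.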
319+
320+
For detailed configuration options, see the [Quantization & Centroid Settings](https://preview.docs-test.couchbase.com/docs-server-DOC-12565_vector_search_concepts/server/current/vector-index/hyperscale-vector-index.html#algo_settings) documentation.
321+
322+
## Running the Demo
323+
324+
### Build and Execute
325+
```bash
cd CouchbaseVectorSearchDemo
dotnet build
dotnet run
```
330+
331+
### Expected Output
332+
```
Couchbase BHIVE Vector Search Demo
====================================
Using OpenAI model: text-embedding-ada-002
Step 1: Ingesting data into Couchbase vector store...
Data ingestion completed

Step 2: Creating BHIVE vector index manually...
Executing BHIVE index creation query...
BHIVE vector index 'bhive_glossary_index' already exists.

Step 3: Performing vector search...
Found: API
Definition: Application Programming Interface. A set of rules and specifications that allow software components to communicate and exchange data.
Score: 0.1847

Step 4: Performing filtered vector search...
Found (AI category only): RAG
Definition: Retrieval Augmented Generation - a term that refers to the process of retrieving additional data to provide as context to an LLM to use when generating a response (completion) to a user's question (prompt).
Score: 0.4226

Demo completed successfully!
```
355+
356+
## How the Connector Works
357+
358+
The Couchbase Semantic Kernel connector maps Semantic Kernel's vector store abstractions onto Couchbase's vector search capabilities:
359+
360+
### Data Flow
361+
1. **Initialize** - Create a `CouchbaseVectorStore` instance using a Couchbase scope
362+
2. **Get Collection** - Use `GetCollection<TKey, TRecord>()` to get a typed collection reference
363+
3. **Generate Embeddings** - Use Semantic Kernel's `IEmbeddingGenerator` to convert text to vectors
364+
4. **Upsert Records** - Call `UpsertAsync()` to insert/update records with embeddings
365+
5. **Create Index** - Set up a vector index using SQL++ for optimal search performance
366+
6. **Search** - Use `SearchAsync()` with optional `VectorSearchOptions` for filtered searches
367+
7. **Results** - Receive results ranked by distance score (lower = more similar)
368+
369+
### Key Connector Classes & Methods
370+
371+
**Vector Store Classes:**
372+
- **`CouchbaseVectorStore`** - Main entry point for vector store operations
373+
- **`CouchbaseQueryCollection`** - Collection class for BHIVE and Composite indexes (SQL++)
374+
- **`CouchbaseSearchCollection`** - Collection class for FTS indexes (Search API)
375+
376+
**Common Methods (all index types):**
377+
- **`GetCollection<TKey, TRecord>()`** - Returns a typed collection for CRUD operations
378+
- **`UpsertAsync()`** - Inserts or updates records in the collection
379+
- **`SearchAsync()`** - Performs vector similarity search with optional filters
380+
- **`VectorSearchOptions`** - Configures search behavior including filters and result count
381+
382+
**Configuration Options:**
383+
- **`CouchbaseQueryCollectionOptions`** - For BHIVE and Composite indexes
384+
- **`CouchbaseSearchCollectionOptions`** - For FTS indexes
385+
386+
For more documentation, visit the [connector repository](https://github.com/Couchbase-Ecosystem/couchbase-semantic-kernel).
387+
