diff --git a/_community_members/wronghuy.md b/_community_members/wronghuy.md index c44125956..56a80d0d8 100644 --- a/_community_members/wronghuy.md +++ b/_community_members/wronghuy.md @@ -21,4 +21,4 @@ personas: redirect_from: '/authors/wronghuy/' --- -`Huy Nguyen` is a Software Engineer at Amazon Web Services working on the OpenSearch Project. \ No newline at end of file +**Huy Nguyen** is a Software Engineer at Amazon Web Services working on the OpenSearch Project. \ No newline at end of file diff --git a/_posts/2024-12-17-vscode-osd-setup.markdown b/_posts/2024-12-17-vscode-osd-setup.markdown new file mode 100644 index 000000000..8fb73fec1 --- /dev/null +++ b/_posts/2024-12-17-vscode-osd-setup.markdown @@ -0,0 +1,283 @@ +--- +layout: post +title: "Streamline OpenSearch Dashboards development with VS Code" +authors: + - wronghuy + - kolchfa +date: 2024-12-27 +categories: + - technical-post +meta_keywords: OpenSearch Dashboards, VS Code, development tools, Jest integration, ESLint, multi-root workspaces, unit testing, OSD codebase +meta_description: Boost OpenSearch Dashboards development with VS Code. Learn to set up efficient workflows for testing, linting, and debugging. Enhance productivity using Jest, ESLint, and multi-root workspaces. +excerpt: OpenSearch Dashboards can be a challenge to set up. This blog post shows you how to use VS Code with OpenSearch Dashboards to make development easier. +--- + +Developing OpenSearch Dashboards (OSD) can feel overwhelming, whether you're setting up your environment for the first time or starting work on a significant feature. This blog post introduces developer tools and workflows in [VS Code](https://code.visualstudio.com/) that make OSD development more manageable and efficient. + +VS Code provides many built-in capabilities, such as IntelliSense, code search, and Git integration. In this post, you'll learn how to configure the following: + +* Unit tests that are easy to run and debug at the individual test level. +* A linter that runs automatically on save using OSD rules, eliminating the need to run the linter at commit time and recommit changes (though some linting errors will still require manual fixes). +* A one-click OSD server startup and server-side debugging. +* Multi-root workspaces for development on *both* individual plugins *and* OSD Core. + +## Setting up Jest integration + +OSD uses [Jest](https://jestjs.io/) as the testing framework. With over 2,000 test suites in OSD alone, finding and running specific tests can be a hassle. + +The [vscode-jest](https://marketplace.visualstudio.com/items?itemName=Orta.vscode-jest) extension provides a quick and easy Jest integration in VS Code, as shown in the following image. + +![jest-integration](/assets/media/blog-images/2024-12-17-vscode-osd-setup/jest-integration.gif) + +Here are a few Jest features: + +* A graphical interface used to select which tests to run: you can run all project tests, tests inside a specific directory, specific test files, individual test suites, or even individual test methods. + ![test-suites](/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-suites.png) +* A UI icon that displays passed or failed tests. +* An inline `Test Run` button in your test files so you don't have to run tests in the command line. + ![inline-tests](/assets/media/blog-images/2024-12-17-vscode-osd-setup/inline-tests.png) +* The ability to run a test with debug options. 
+ ![debug-tests](/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-tests.png) + +**Note:** Do not run all tests under `src/plugins*`. There are over 2,000 test cases, so running all tests will consume lots of resources. Instead, run tests on demand for specific subdirectories. + +### Configuring Jest + +1. Install the `vscode-jest` extension. +2. Open your workspace settings by pressing `Cmd` (or `Ctrl`) + `Shift` + `P` and selecting **Open Workspace Settings (JSON)**. If you want the settings to persist globally across all your projects, select **Open User Settings (JSON)**. Because the following step will only apply to the `OpenSearch-Dashboards` repo, we recommend making these changes a **Workspace Setting** (internally, VS Code will create a `.vscode` directory in the workspace root and place a `settings.json` file inside `.vscode`). +3. Add the following settings to the `settings.json` file: + + ```json + { + "jest.jestCommandLine": "yarn test:jest", + "jest.runMode": "on-demand" + } + ``` + + The first line will use the custom `jest.js` OSD script to load the custom `config.js` file and pass any arguments directly to `jest`. The second line prevents the test runner from running every test on file save, which helps to save computing resources. + + + +After performing these steps, you should see the **Test Explorer** ({::nomarkdown}flask icon{:/}) icon in the **Extensions** sidebar. Select this icon to view all detected test files. + + +## Configuring Prettier and ESLint integration + +The OSD repo includes [Husky](https://typicode.github.io/husky/), a precommit hook that runs scripts before commits are made (scripts may include linting or running unit tests). However, rerunning the linter and recommitting your changes may still be time-consuming. VS Code provides the ability to lint on save, but OSD has specific linting rules that may not work out of the box. + +You can use the [Prettier](https://marketplace.visualstudio.com/items?itemName=esbenp.prettier-vscode) and [ESLint](https://marketplace.visualstudio.com/items?itemName=dbaeumer.vscode-eslint) VS Code plugins to lint files automatically on save. Note that some rules that cannot be fixed automatically will still require manual corrections. The Prettier plugin is configured using the `.prettierrc` config file and will execute the Prettier rules in VS Code. ESLint is configured using the workspace's ESLInt configuration to enforce its rules. The linting integration is shown in the following image. + +![linting-integration](/assets/media/blog-images/2024-12-17-vscode-osd-setup/linting-integration.gif) + +### Prerequisites + +Before you start, configure linting on save by pressing `Cmd` (or `Ctrl`) + `Shift` + `P` and selecting **Open User Settings (JSON)** so you can save this setting across all your projects: + +```json +{ + "editor.formatOnSave": true +} +``` + +### Setup + +1. Install the Prettier and ESLint extensions. +2. Press `Cmd` (or `Ctrl`) + `Shift` + `P` and select **Open Workspace Settings (JSON)**. +3. Add the following settings to the `settings.json` file: + + ```json + { + "prettier.configPath": ".prettierrc", + "editor.defaultFormatter": "esbenp.prettier-vscode", + "eslint.autoFixOnSave": true, + "editor.codeActionsOnSave": { + "source.fixAll.eslint": true + }, + } + ``` +**Note**: If you already have some settings in the `editor.codeActionsOnSave` setting, append `"source.fixAll.eslint": true` to the existing settings. + +Now, when you save a file, the file is linted automatically. 
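For reference, if you keep everything at the workspace level (including `editor.formatOnSave`), the combined `.vscode/settings.json` for the Jest and linting setup described above might look like the following sketch:

```json
{
  // Jest integration (vscode-jest extension)
  "jest.jestCommandLine": "yarn test:jest",
  "jest.runMode": "on-demand",

  // Format with Prettier and apply ESLint fixes on save
  "editor.formatOnSave": true,
  "editor.defaultFormatter": "esbenp.prettier-vscode",
  "prettier.configPath": ".prettierrc",
  "editor.codeActionsOnSave": {
    "source.fixAll.eslint": true
  }
}
```

Comments are included for readability; VS Code accepts them in `settings.json`.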
+ +## Configuring OSD server run tasks + +Starting the OSD server typically requires running commands in two terminals: + +* In the first terminal, run `yarn opensearch snapshot`. +* After several seconds, in the second terminal, run `yarn run start --no-base-path`. + +This approach works for many use cases but doesn't allow you to debug server-side changes. To simplify this process, you can set up a series of launch tasks, turning server startup into a one-click operation: + +1. Start the OpenSearch server. +2. Start the Dashboards server. Starting the Dashboards server requires a wait period while the server boots up; this can be accomplished with a VS Code task that specifies to wait for a certain period of time. + +Making this task a VS Code run configuration provides the following benefits: + +* One-click OSD development server startup, eliminating the need to repeatedly use the CLI. +* The ability to debug server-side code (for client-side code [public], you can use your preferred browser's developer tools). + +The run configuration is presented in the following image. + +![run-configurations](/assets/media/blog-images/2024-12-17-vscode-osd-setup/run-configurations.gif) + +### Prerequisites + +Make sure [NVM](https://github.com/nvm-sh/nvm) and Node 18.9.0 are installed on your system: + +```sh +# With Homebrew +brew install nvm + +# Normal +/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" + +# Install 18.9.0 +nvm install 18.9.0 +``` + +### Setup + +1. If the `launch.json` file does not already exist in the `.vscode` directory at the project root, create it and add the following configuration: + + ```json + { + "version": "0.2.0", + "configurations": [ + /** + * This starts the Dashboards server + * - Will wait 13 seconds before starting so OpenSearch server can run + */ + { + "name": "Start Dashboards Server", + "type": "node", + "request": "launch", + // Alternatively, run "which yarn" and change the path of yarn + "program": "~/.nvm/versions/node/v18.19.0/bin/yarn", + "args": ["run", "start", "--no-base-path"], + "cwd": "${workspaceFolder}", + "runtimeExecutable": null, + "runtimeArgs": [], + "env": { + "NODE_ENV": "development" + }, + "console": "integratedTerminal", + "internalConsoleOptions": "openOnSessionStart", + "preLaunchTask": "13 Second Delay Command" + }, + /** + * This starts the OpenSearch snapshot server + * - This will be ran first + * - Everytime the server is stopped, data will NOT be persisted. Thus, if you need to persist data, you can configure this to run your own local OpenSearch server + */ + { + "name": "Start OpenSearch Snapshot", + "type": "node", + "request": "launch", + // Alternatively, run "which yarn" and change the path of yarn + "program": "~/.nvm/versions/node/v18.19.0/bin/yarn", + "args": ["run", "opensearch", "snapshot"], + "cwd": "${workspaceFolder}", + "runtimeExecutable": null, + "runtimeArgs": [], + "env": { + "NODE_ENV": "development" + }, + "console": "integratedTerminal", + "internalConsoleOptions": "openOnSessionStart" + } + ], + "compounds": [ + /** + * This is the run configuration to startup OSD + * 1. Starts up OpenSearch server + * 2. Waits 13 seconds + * 3. 
Starts up Dashboards server + */ + { + "name": "Start Dashboards", + "configurations": ["Start OpenSearch Snapshot", "Start Dashboards Server"], + // If either Dashboards or OpenSearch server is stopped, both servers will be stopped (set this to false to individually turn off a server without turning all off) + "stopAll": true + } + ] + } + ``` + +1. If the `tasks.json` file does not already exist in the `.vscode` directory at the project root, create it and add the following configuration: + + ```json + { + "version": "2.0.0", + "tasks": [ + // Silent task to sleep for 13 seconds; this is the upper limit on the time taken for OpenSearch to boot up + { + "label": "13 Second Delay Command", + "type": "shell", + "command": "sleep 13", + "group": "none", + "presentation": { + "reveal": "silent", + "panel": "new", + "close": true + } + } + ] + } + ``` + +1. On the **Run and Debug** tab, select **Start Dashboards** from the dropdown menu and press the play icon. OSD should start after a period of time. + +Now you can set any breakpoint in `server` code. + +The **Debug** toolbar should appear under the **CALL STACK** in a specific worker node, as shown in the following image. + +![debug-toolbar](/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-toolbar.png) + +Using this toolbar, you can resume breakpoints, step into code, and execute the next instruction. + +## Configuring multi-root workspace integration + +For most development scenarios, working within the OSD Core should be sufficient. However, if you need to develop code for a plugin, the [multi-root workspaces feature](https://code.visualstudio.com/docs/editor/multi-root-workspaces) is a useful option. The setup for Jest, linting, and run configurations for plugins is similar to that of the OSD Core, so we won't cover it in detail here. + +In summary, in order for OSD to recognize plugins during development, the plugin's project root must be located within the `OpenSearch-Dashboards/plugins/` directory. The following image shows multi-root workspace integration. + +![multi-root-workspaces](/assets/media/blog-images/2024-12-17-vscode-osd-setup/multi-root-workspaces.gif) + +### Setup + +In this example, assume you're developing code for the [anomaly-detection-dashboards-plugin](https://github.com/opensearch-project/anomaly-detection-dashboards-plugin) and that the project has been checked out in the `plugins` directory. + +1. Navigate to one directory above the `OpenSearch-Dashboards` project folder and create a file called `OpenSearch-Dashboards.code-workspace`. Add the following configuration to this file: + + ```json + { + "folders": [ + { + // Names are configurable; they will show up in the EXPLORER tab + "name": "OSD Core", + // Path to the project root + "path": "OpenSearch-Dashboards" + }, + { + "name": "Anomaly Detection Plugin", + "path": "OpenSearch-Dashboards/plugins/anomaly-detection-dashboards-plugin" + } + ], + // Define workspace-specific settings here + "settings": {}, + "launch": { + // Specify workspace-specific launch configurations + "configurations": [], + // Specify workspace-specific launch compounds + "compounds": [] + } + } + ``` + +1. Select the **Open Workspace** button to view your new workspace. + +In this workspace, you will have access to launch configurations, Jest test suites, code search, file search, and many other features. + +## Wrapping up + +While these tools aren't a replacement for a thorough understanding of the OSD codebase, they can help to streamline your development workflow. 
By automating tasks like server startup, debugging, and linting, they reduce time spent on configuration and allow you to focus on writing code. These tools make the development and PR process more efficient, saving you time and boosting productivity. diff --git a/_posts/2024-12-30-hybrid-search-optimization.md b/_posts/2024-12-30-hybrid-search-optimization.md new file mode 100644 index 000000000..eb72ae07a --- /dev/null +++ b/_posts/2024-12-30-hybrid-search-optimization.md @@ -0,0 +1,342 @@ +--- +layout: post +title: "Optimizing hybrid search in OpenSearch" +authors: + - dwrigley +date: 2024-12-30 +categories: + - technical-posts + - community +meta_keywords: hybrid query, hybrid search, neural query, lexical search, search relevancy, search result quality optimization +meta_description: Tackle the optimization of hybrid search in a systematic way and train models that dynamically predict the best way to run hybrid search in your search application. +--- + +# Introduction + +[Hybrid search combines lexical and neural search to improve search relevance](https://opensearch.org/docs/latest/search-plugins/hybrid-search); this combination shows promising results across industries and [in benchmarks](https://opensearch.org/blog/semantic-science-benchmarks/). + +In OpenSearch 2.18, [hybrid search](https://opensearch.org/docs/latest/search-plugins/hybrid-search/) is an arithmetic combination of the lexical (match query) and neural (k-NN) search scores. It first normalizes the scores and then combines them with one of three techniques (arithmetic, harmonic, or geometric mean), each of which includes weighting parameters. + +The search pipeline configuration is how OpenSearch users define score normalization, combination, and weighting. + +# Finding the right hybrid search configuration can be difficult + +The primary question for a user of hybrid search in OpenSearch is how to choose the normalization and combination techniques and the weighting parameters for their application. + +What is best depends strongly on the corpus, on user behavior, and on the application domain---there is no one-size-fits-all solution. + +However, there is a systematic way to arrive at this ideal set of parameters. We call identifying the best set of parameters *global hybrid search optimization*: we identify the best parameter set for all incoming queries; it is "global" because it doesn't depend on per-query factors. We will cover this approach first before moving on to a dynamic approach that takes into account per-query signals. + +# Global hybrid search optimizer + +We treat hybrid search configuration as a parameter optimization problem. The parameters and combinations are: + +* Two [normalization techniques: `l2` and `min_max`](https://opensearch.org/blog/How-does-the-rank-normalization-work-in-hybrid-search/). +* Three combination techniques: arithmetic mean, harmonic mean, geometric mean. +* The lexical and neural search weights, which are values ranging from 0 to 1. + + +With this knowledge we can define a collection of parameter combinations to try out and compare. To follow this path we need three things: + +1. Query set: A collection of queries. +2. Judgments: A collection of ratings that indicate the relevance of a result for a given query. +3. Search quality metrics: A numeric expression indicating how well the search system performs in returning relevant documents for queries. + +## Query set + +A query set is a collection of queries. Ideally, query sets contain a representative set of queries. 
"Representative" means that different query classes are included in this query set: + +* Very frequent queries (head queries) but also queries that are rarely used (tail queries) +* Queries that are important to the business +* Queries that express different user intent classes (for example, searching for a product category, searching for product category \+ color, searching for a brand) +* Other classes, depending on the individual search application + +These different queries are best sourced from a query log that captures all queries your users send to your system. One way of sampling these efficiently is [Probability-Proportional-to-Size Sampling](https://opensourceconnections.com/blog/2022/10/13/how-to-succeed-with-explicit-relevance-evaluation-using-probability-proportional-to-size-sampling/) (PPTSS). This method can generate a frequency-weighted sample. + +We will first run each query in the query set against a baseline to determine our search result quality at the beginning of this experimentation phase. + +## Judgments + +Once a query set is available, judgments come next. A judgment describes how relevant a particular document is for a given query. A judgment consists of three parts: the query, the document, and a (typically) numerical rating. + +Ratings can be binary (0 or 1, that is, irrelevant or relevant) or graded (for example, 0 to 3, definitely irrelevant to definitely relevant). In the case of explicit judgments, human raters review query-document pairs and assign these ratings. Implicit judgments, on the other hand, are derived from user behavior: user queries and viewed and clicked documents. Implicit judgments can be modeled with [click models that emerged from web search](https://clickmodels.weebly.com/) in the early 2010s and range from simple click-through rates to more [complex approaches](https://www.youtube.com/watch?v=wa88XShl7hs). All come with limitations and/or deal differently with biases like position bias. + +Recently, a third category of judgment generation has emerged: LLM-as-a-judge. Here a large language model like GPT-4o judges query-doc pairs. + +All three categories have different strengths and weaknesses. Whichever you choose, you need to have a decent amount of judgments. Twice the depth of your default search result page per query is usually a good starting point for explicit judgments. So if you show your users 24 results per result page, you should rate the first 48 results for each query. + +Implicit judgments have the advantage of scale: when already collecting user events (like queries, viewed documents, and clicked documents), this is an enabling step for calculating thousands of judgments by modeling these events as judgments. + +## Search metrics + +With a query set and the corresponding judgments, we can calculate search quality metrics. Widely used [search metrics are Precision, DCG, or NDCG](https://opensourceconnections.com/blog/2020/02/28/choosing-your-search-relevance-metric/). + +Search metrics provide a way of measuring the search result quality of a search system numerically. We calculate search metrics for each configuration, and this enables us to compare them objectively against each other. As a result we know which configuration scored best. 
+ +If you're looking for guidance and support in generating a query set, creating implicit judgments based on user behavior signals, or calculating metrics based on these signals, feel free to [check out the search result quality evaluation framework](https://github.com/o19s/opensearch-search-quality-evaluation/). + +## Create a baseline with the ESCI dataset + +Let's put all the pieces together and calculate search metrics for one particular example: in the [hybrid search optimizer repository](https://github.com/o19s/opensearch-hybrid-search-optimization/) we use the [ESCI dataset](https://github.com/amazon-science/esci-data), and in [notebooks 1--3](https://github.com/o19s/opensearch-hybrid-search-optimization/tree/main/notebooks) we configure OpenSearch to run hybrid queries, index the products of the ESCI dataset, create a query set, and execute each of the queries in a lexical search setting that we assume to be our baseline. The search metrics can be calculated because the ESCI dataset comes not only with products and queries but also with judgments. + +We chose a `multi_match` query of the type `best_fields` as our baseline. We search in the different dataset fields with "best guess" fields weights. In a real-world scenario we recommend techniques like learning to boost based on Bayesian optimization to figure out the best field and field weight combination. + +``` +{ + "_source": { + "excludes": [ + "title_embedding" + ] + }, + "query": { + "multi_match" : { + "type": "best_fields", + "fields": [ + "product_id^100", + "product_bullet_point^3", + "product_color^2", + "product_brand^5", + "product_description", + "product_title^10" + ], + "operator": "and", + "query": query[2] + } + } +} +``` + +To arrive at a query set, we used two random samples: a small one containing 250 queries and a large one containing 5,000 queries. Unfortunately, the ESCI dataset does not contain any information about the frequency of queries, which excludes frequency-weighted approaches like the above-mentioned PPTSS. + +The following are the results of running the test set of both query sets independently. + +| Metric | Baseline BM25 – Small | Baseline BM25 – Large | +| :---: | :---: | :---: | +| DCG@10 | 9.65 | 8.82 | +| NDCG@10 | 0.24 | 0.23 | +| Precision@10 | 0.27 | 0.24 | + +We applied an 80/20 split on the query sets to arrange for a training and test dataset. Every optimization step uses the queries of the training set, whereas search metrics are calculated and compared for the test set. For the baseline, we calculated the metrics for only the test set because there is no actual training occurring. + +These numbers are now the starting point for our optimization journey. We want to maximize these metrics and see how far we get when looking for the best global hybrid search configuration in the next step. + +## Identifying the best hybrid search configuration + +With this starting point, we can explore the parameter space that hybrid search offers. 
Our global hybrid search optimization notebook tries out 66 parameter combinations for hybrid search with the following set: + +* Normalization technique: [`l2`, `min_max`] +* Combination technique: [`arithmetic_mean`, `harmonic_mean`, `geometric_mean`] +* Lexical search weight: [`0.0`, `0.1`, `0.2`, `0.3`, `0.4`, `0.5`, `0.6`, `0.7`, `0.8`, `0.9`, `1.0`] +* Neural search weight: [`1.0`, `0.9`, `0.8`, `0.7`, `0.6`, `0.5`, `0.4`, `0.3`, `0.2`, `0.1`, `0.0`] + +Neural and lexical search weights always add up to 1.0, so we don't need to choose them independently. + +This leaves us with 66 combinations to test: 2 normalization techniques * 3 combination techniques * 11 lexical/neural search weight combinations. + +For each of these combinations, we run the queries of the training set. To do so we use OpenSearch's [temporary search pipeline capability](https://opensearch.org/docs/latest/search-plugins/search-pipelines/using-search-pipeline/#using-a-temporary-search-pipeline-for-a-request), making it unnecessary to pre-create all pipelines for the 66 parameter combinations. + +Here is a template of the temporary search pipelines we use for our hybrid search queries: + +``` +"search_pipeline": { + "request_processors": [ + { + "neural_query_enricher" : { + "description": "one of many search pipelines for experimentation", + "default_model_id": model_id, + "neural_field_default_id": { + "title_embeddings": model_id + } + } + } + ], + "phase_results_processors": [ + { + "normalization-processor": { + "normalization": { + "technique": norm + }, + "combination": { + "technique": combi, + "parameters": { + "weights": [ + lexicalness, + neuralness + ] + } + } + } + } + ] +} +``` + +`norm` is the variable for the normalization technique, `combi` is the variable for the combination technique, `lexicalness` is the lexical search weight, and `neuralness` is the neural search weight. + +The neural part of the hybrid query searches in a field with embeddings that were created based on the title of a product with the model `all-MiniLM-L6-v2`: + +``` +{ + "neural": { + "title_embedding": { + "query_text": query[2], + "k": 100 + } + } +} +``` + +Using the queries of the training dataset and retrieving the results, we calculate the three search metrics DCG@10, NDCG@10, and Precision@10. For the small dataset, there is one pipeline configuration that scores best for all three metrics. The pipeline uses the l2 norm, arithmetic mean, a lexical search weight of 0.4, and a neural search weight of 0.6. + +The following metrics are calculated: + +* DCG: 9.99 +* NDCG: 0.26 +* Precision: 0.29 + +Applying the potentially best hybrid search parameter combination to the test set and calculating the metrics for these queries results in the following numbers. + +| Metric | Baseline BM25 – Small | Global Hybrid Search Optimizer – Small | Baseline BM25 – Large | Global Hybrid Search Optimizer – Large | +| :---: | :---: | :---: | :---: | :---: | +| DCG@10 | 9.65 | 9.99 | 8.82 | 9.30 | +| NDCG@10 | 0.24 | 0.26 | 0.23 | 0.25 | +| Precision@10 | 0.27 | 0.29 | 0.24 | 0.27 | + +Improvements are seen across all metrics for both datasets. To recap, up to this point, we performed the following steps: + +* Create a query set by randomly sampling. +* Generate judgments (to be precise, we only used the existing judgments of the ESCI dataset). +* Calculate search metrics for a baseline. +* Try out several hybrid search combinations. +* Compare search metrics. 
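To make the hybrid query itself concrete, here is how the pieces shown above fit together in a single request body: the lexical `multi_match` clause and the `neural` clause are wrapped in a `hybrid` query, and the temporary search pipeline travels in the same request. The following is a simplified sketch with a placeholder query string and an inline `model_id` (instead of the `neural_query_enricher` processor shown above), not the exact notebook code:

```
{
  "_source": {
    "excludes": ["title_embedding"]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "multi_match": {
            "type": "best_fields",
            "fields": [
              "product_id^100",
              "product_bullet_point^3",
              "product_color^2",
              "product_brand^5",
              "product_description",
              "product_title^10"
            ],
            "operator": "and",
            "query": "espresso machine"
          }
        },
        {
          "neural": {
            "title_embedding": {
              "query_text": "espresso machine",
              "model_id": "<your-model-id>",
              "k": 100
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": { "technique": "l2" },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": { "weights": [0.4, 0.6] }
          }
        }
      }
    ]
  }
}
```

The weights correspond to the best global configuration found above: a lexical search weight of 0.4 and a neural search weight of 0.6.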
+ +Two things are important to note: + +* While the systematic approach can be transferred to other applications, the experiment results cannot. It is necessary to always evaluate and experiment with your own data. +* The ESCI dataset does not provide 100% judgment coverage. On average we saw roughly 35% judgment coverage among the top 10 retrieved results per query. This leaves us with some uncertainty. + +The improvements tell us that we optimize our metrics on average when switching to hybrid search with the above parameter values. But of course there are queries that benefit (as in their search quality metrics improve) and queries that do not benefit (as in their search quality metrics decrease) when conducting this switch. This is something we can virtually always observe when comparing two search configurations with each other. While one configuration outperforms the other on average, not every query will profit from the configuration. + +The following chart shows the DCG@10 values of the training queries of the small query set. The x-axis represents the search pipeline with l2 norm, arithmetic mean, 0.1 lexical search weight, and 0.9 neural search weight (configuration A). The y-axis represents the search pipeline with an identical normalization and combination technique but switched weights: 0.9 lexical search weight and 0.1 neural search weight (configuration B). + +Scatter Plot of DCG values for lexical-heavy search configuration and Neural-heavy search configuration{:style="width: 100%; max-width: 800px; height: auto; text-align: center"} + +The queries with the highest search quality metric improvements of configuration B are those that are located on the y-axis: they have a DCG score of 0 for this configuration. And for configuration A some even score above 15. + +Striving to improve the search quality metrics for all queries raises the following question: improvements on average are fine, but how can we tackle this in a more targeted way to come up with an approach that provides the best configuration per query instead of one good configuration for all queries? + +# Dynamic hybrid search optimizer + +We call identifying a suitable configuration individually per hybrid search query *dynamic hybrid search optimization*. To move in that direction we treat hybrid search as a query understanding challenge: by understanding certain features of the query, we develop an approach to predict the "neuralness" of a query. "Neuralness" is used to describe the neural search weight for the hybrid search queries. + +You may ask: Why predict only the "neuralness" and none of the other parameter values? The results of the global hybrid search optimizer (large query set) showed us that the majority of search configurations share two parameter values: the l2 normalization technique and the arithmetic mean as the combination technique. + +Looking at the top 5 configurations per search metric (DCG@10, NDCG@10, and Precision@10), only 5 out of the 15 pipelines have `min_max` as an alternative normalization technique, and none of these configurations has another combination technique. + +With this knowledge we assume the l2 normalization and the arithmetic mean combination technique to be best suited throughout the whole dataset. + +That leaves us with the parameter values for the neural search weight and the lexical search weight. By predicting one we can calculate the other by subtracting the prediction from 1: by predicting the "neuralness" we can calculate the "lexicalness" by 1 - "neuralness". 
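In practical terms, the model only needs to output this single "neuralness" value per query. If, for example, the predicted "neuralness" for a query were 0.7 (a hypothetical value), the per-query pipeline would keep the l2 normalization and arithmetic mean combination and only swap in the weights, roughly like this:

```
{
  "normalization-processor": {
    "normalization": { "technique": "l2" },
    "combination": {
      "technique": "arithmetic_mean",
      "parameters": {
        "weights": [0.3, 0.7]
      }
    }
  }
}
```

Here `0.3` is the "lexicalness" (1 - 0.7) and `0.7` is the predicted "neuralness", in the same order as in the pipeline template above.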
+ +To validate our hypothesis, we created a couple of feature groups and features within these groups. Afterwards we trained machine learning models to predict an expected NDCG value for the given "neuralness" of a query. + +## Feature groups and features + +We divide the features into three groups: query features, lexical search result features, and neural search result features: + +* Query features: These features describe the user query string. +* Lexical search result features: These features describe the results that the user query retrieves when executed as a lexical search. +* Neural search result features: These features describe the results that the user query retrieves when executed as a neural search. + +### Query features + +* Number of terms: How many terms does the user query have? +* Query length: How long is the user query (measured in characters)? +* Contains number: Does the query contain one or more numbers? +* Contains special character: Does the query contain one or more special characters (non-alphanumeric characters)? + +### Lexical search result features + +* Number of results: The number of results for the lexical query. +* Maximum title score: The maximum score of the titles of the retrieved top 10 documents. The scores are BM25 scores calculated individually per result set. That means that the BM25 score is not calculated on the whole index but only on the retrieved subset for the query, making the scores more comparable to each other and less prone to outliers that could result from high IDF values for very rare query terms. +* Sum of title scores: The sum of the title scores of the top 10 documents, again calculated per result set. We use the sum of the scores (and no average value) as an aggregate to measure how relevant all retrieved top 10 titles are. BM25 scores are not normalized, so using the sum instead of the average seemed reasonable. + +### Neural search result features + +* Maximum semantic score: The maximum semantic score of the retrieved top 10 documents. This is the score we receive for a neural query based on the query's similarity to the title. +* Average semantic score: In contrast to BM25 scores, the semantic scores are normalized and in the range of 0 to 1. Using the average score seems more reasonable than attempting to calculate the sum. + +## Feature engineering + +We used the output of the global hybrid search optimizer as training data. As part of this process, we ran every query 66 times: once per hybrid search configuration. For each query we calculated the search metrics, so we know which pipeline worked best per query and thus also which "neuralness" (neural search weight) worked best. We used the best NDCG@10 value per query as the metric to decide the ideal "neuralness." + +That leaves us with 250 queries (small query set) or 5,000 queries (large query set) together with their "neuralness" values for which they achieved the best NDCG@10 values. Next, we engineered the nine features for each query. This constitutes the training and test data. + +## Model training and evaluation findings + +With the appropriate data at hand, we explored different algorithms and experimented with different model fitting settings to identify patterns and evaluate whether our approach was suitable. +We used two relatively simple algorithms: linear regression and random forest regression. +We applied cross-validation, regularization, and tried out all different feature combinations. 
This resulted in interesting findings that are summarized in the following section. + +**Dataset size matters**: Working with the differently sized datasets revealed that the amount of data matters when training and evaluating the models. The larger dataset reported a smaller Root Mean Squared Error compared to the smaller dataset. The larger dataset also showed less variation of the RMSE scores within the cross-validation runs (that is, when comparing the RMSE scores within one cross-validation run for one feature combination). + +**Model performance differs among the different algorithms**: The best RMSE score for the random forest regressor was 0.18 compared to 0.22 for the best linear regression model (large dataset)---both with different feature combinations, though. The more complex model (random forest) performs better. However, better performance comes with the trade-off of longer training times for this more complex model. + +**Feature combinations of all groups have the lowest RMSE**: The lowest error scores can be achieved when combining features from all three feature groups (query, lexical search result, and neural search result). Looking at RMSE scores for feature combinations within the feature groups shows that working with lexical search result feature combinations serves as the best alternative. + +This is particularly interesting when thinking about productionizing this: putting an approach like this in production means that features need to be calculated per query during query time. Getting lexical search result features and neural search result features requires running these queries, which would add significant latency to the overall query even prior to inference time. + +The following image shows the distribution of RMSE scores within one cross-validation run when fitting random forest regression models with feature combinations within one group (blue: neural search features, red: lexical result features, green: query features) and across the groups (purple: features from all groups). The feature mix (purple) scores lowest (best), followed by training on lexical search result features only (red). +Box plot showing the distribution of RMSE scores within one cross validation run when fitting the random forest regression model{:style="width: 100%; max-width: 800px; height: auto; text-align: center"} + +The overall picture does not change when looking at the numbers for the linear model. +Box plot showing the distribution of RMSE scores within one cross validation run when fitting the linear regression model + +## Model testing + +Let's look at how the trained models perform when applying them dynamically to our test set. +For each query of the test set we engineer the features and let the model make the inference for the "neuralness" values between 0.0 and 1.0 because "neuralness" is also a feature that we pass into the model. We then take the "neuralness" value that resulted in the highest prediction, which is the best NDCG value. By knowing the "neuralness" we can calculate the "lexicalness" by subtracting the "neuralness" from 1. + +We again use the l2 norm and arithmetic mean as our hybrid search normalization and combination parameter values because they scored best in the global hybrid search optimizer experiment. With that, we build the hybrid query, execute it, retrieve the results, and calculate the search metrics like with the baseline and global hybrid search optimizer. + +The following are the metrics for the small dataset. 
+ +| Metric | Baseline BM25 | Global Hybrid Search Optimizer | Dynamic Hybrid Search Optimizer – Linear Model | Dynamic Hybrid Search Optimizer – Random Forest Model | +| :---: | :---: | :---: | :---: | :---: | +| DCG@10 | 9.65 | 9.99 | 10.92 | 10.92 | +| NDCG@10 | 0.24 | 0.26 | 0.28 | 0.28 | +| Precision@10 | 0.27 | 0.29 | 0.32 | 0.32 | + +The following are the metrics for the large dataset. + +| Metric | Baseline BM25 | Global Hybrid Search Optimizer | Dynamic Hybrid Search Optimizer – Linear Model | Dynamic Hybrid Search Optimizer – Random Forest Model | +| :---: | :---: | :---: | :---: | :---: | +| DCG@10 | 8.82 | 9.30 | 10.13 | 10.13 | +| NDCG@10 | 0.23 | 0.25 | 0.27 | 0.27 | +| Precision@10 | 0.24 | 0.27 | 0.29 | 0.29 | + +Looking at these numbers shows us a steady positive trend starting from the baseline and going all the way to the dynamic predictions of "lexicalness" and "neuralness" per query. The large dataset shows a DCG increase of 8.9%, rising from 9.3 to 10.13, and the small dataset shows an increase of 9.3%. The other metrics increase as well: NDCG shows an improvement of 7.4% for the large dataset and 10.3% for the small dataset, and Precision shows an improvement of 8% for the large dataset and 7.7% for the small dataset. + +Interestingly, both models score exactly equally. The reason for this is that while they both predict different NDCG values, they predict the best ones with the same "neuralness" as an input feature. So while the models may differ in RMSE scores during the evaluation phase, they provide equal results when applied to the test set. + +Despite the low judgement coverage, we see improvements for all metrics. This gives us confidence that this approach can provide value not only for search systems switching from lexical to hybrid search but also for those that are already are in production but have never used any systematic process to evaluate and identify the best settings. + +# Conclusion + +We provide a systematic approach to optimizing hybrid search in OpenSearch based on its current state and capabilities (normalization and combination techniques). The results look promising, especially given the low judgment coverage provided by the ESCI dataset. + +We encourage everyone to adopt the approach and explore its usefulness with their dataset. We look forward to hearing the community's feedback on the provided approach on the [OpenSearch forum](https://forum.opensearch.org/). + +# Future work + +The currently planned next steps include replicating the approach with a dataset that has higher judgment coverage and covers a different domain in order to determine its generalizability. + +Optimizing hybrid search is not typically the first step in search result quality optimization. Optimizing lexical search results first is especially important because the lexical search query is part of the hybrid search query. Bayesian optimization is an efficient technique for efficiently identifying the best set of fields and field weights, sometimes also referred to as "learning to boost." + +The straightforward approach of trying out 66 different combinations can be performed more elegantly by applying a technique like Bayesian optimization as well. In particular, we expect this to result in a performance improvement for large search indexes and large numbers of queries. 
+ +Reciprocal rank fusion, currently under active development, is another way of combining lexical search and neural search: + +* [https://github.com/opensearch-project/neural-search/issues/865](https://github.com/opensearch-project/neural-search/issues/865) +* [https://github.com/opensearch-project/neural-search/issues/659](https://github.com/opensearch-project/neural-search/issues/659) + +We also plan to include this technique and to identify the best way of running hybrid search dynamically per query. diff --git a/_solutionsProviders/O11y.md b/_solutionsProviders/O11y.md new file mode 100644 index 000000000..d81d7fd35 --- /dev/null +++ b/_solutionsProviders/O11y.md @@ -0,0 +1,65 @@ +--- + +name: O11y +name_long: O11y.io Ltd + +main_office_location: | + 71-75 Shelton Street, + Covent Garden, + London WC2H 9JQ + United Kingdom + +description: | + O11y Consulting is your trusted partner in transforming observability into a strategic advantage. We help organizations implement, optimize, and scale observability frameworks to improve system reliability, reduce operational costs, and unlock actionable insights.

+ + Our approach focuses on empowering your teams with tailored solutions, actionable guidance, and the tools to monitor, optimize, and innovate effectively. From setup to long-term optimization, we’re here to make sure your observability investments deliver measurable business value. O11y Consulting is a UK-based consultancy with a global reach, serving industries that demand reliability, scalability, and speed. + +link: https://o11y.co/ + +contact: https://o11y.co/get-in-touch/ + +logo: /assets/media/partners/O11y/O11y-logo-large.png +logo_large: /assets/media/partners/O11y/O11y-logo-large.png + + + + + + + +business_type: Consultancy, Platform Integrator, Professional Services, Support, Systems Integrator, Training +opensearch_tech: Analytics, Logs and Metrics, Machine Learning and AI, Observability, Search, Security +region: Global +industries: Software and Technology, Business Services, Consumer Services, Education, Financial Services, Energy and Utilities, Government, Public Sector, Nonprofit, Healthcare, Media and Entertainment, Retail, Telecommunications + + +resources: + - url: 'https://blog.o11y.co/blog/top-10-observability-tools-for-modern-devops-teams/' + title: 'Top 10 Observability Tools for Modern DevOps Teams in 2024' + thumbnail: '/assets/media/partners/O11y/blog-1.png' + type: 'blog' + - url: 'https://blog.o11y.co/blog/maximizing-roi-in-observability-a-guide-to-managing-costs-without/' + title: 'Maximizing ROI in Observability: A Guide to Managing Costs Without Compromising Visibility' + thumbnail: '/assets/media/partners/O11y/blog-2.png' + type: 'blog' + - url: ' https://blog.o11y.co/blog/maximizing-roi-with-observability-best-practices-for-success/' + title: 'Maximizing ROI with Observability: Best Practices for Success' + thumbnail: '/assets/media/partners/O11y/blog-3.png' + type: 'blog' + +product_image: '' +products: + - url: 'https://o11y.co/' + name: 'Observability Framework Implementation' + description: 'End-to-end integration of tools like OpenSearch®, Elastic, New Relic, Grafana®, and more with your systems. Includes SLO design, custom dashboards, and alerting tailored to your business objectives.' + - url: 'https://o11y.co/' + name: 'Cost Management and Optimization' + description: 'Leverage FinOps best practices to optimize data ingestion, manage operational costs, and maximize the value of your observability tools.' + - url: 'https://o11y.co/' + name: 'Observability Training and Support' + description: 'Upskill your teams with hands-on training sessions, certifications, and workshops designed to embed observability practices into your organization.' + - url: 'https://o11y.co/' + name: 'Observability Centre of Excellence' + description: 'Quarterly on-site sessions to review your observability roadmap, ensure alignment with business goals, and increase team engagement.' 
+ +--- diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-tests.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-tests.png new file mode 100644 index 000000000..31798ffdb Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-tests.png differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-toolbar.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-toolbar.png new file mode 100644 index 000000000..d04b75844 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/debug-toolbar.png differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/inline-tests.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/inline-tests.png new file mode 100644 index 000000000..0f732e097 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/inline-tests.png differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/jest-integration.gif b/assets/media/blog-images/2024-12-17-vscode-osd-setup/jest-integration.gif new file mode 100644 index 000000000..02a515152 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/jest-integration.gif differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/linting-integration.gif b/assets/media/blog-images/2024-12-17-vscode-osd-setup/linting-integration.gif new file mode 100644 index 000000000..c6f873500 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/linting-integration.gif differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/multi-root-workspaces.gif b/assets/media/blog-images/2024-12-17-vscode-osd-setup/multi-root-workspaces.gif new file mode 100644 index 000000000..7b218c7b4 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/multi-root-workspaces.gif differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/open-workspace.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/open-workspace.png new file mode 100644 index 000000000..ff684f735 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/open-workspace.png differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/run-configurations.gif b/assets/media/blog-images/2024-12-17-vscode-osd-setup/run-configurations.gif new file mode 100644 index 000000000..f185bfd74 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/run-configurations.gif differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-icon.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-icon.png new file mode 100644 index 000000000..9c358fdc5 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-icon.png differ diff --git a/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-suites.png b/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-suites.png new file mode 100644 index 000000000..5e2c57dd2 Binary files /dev/null and b/assets/media/blog-images/2024-12-17-vscode-osd-setup/test-suites.png differ diff --git a/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/1_search_config_comparison.png b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/1_search_config_comparison.png new file mode 100644 index 000000000..3a6bd77a9 Binary files /dev/null and b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/1_search_config_comparison.png differ diff --git 
a/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/2_random_forest_best_feature_combinations.png b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/2_random_forest_best_feature_combinations.png new file mode 100644 index 000000000..4fded6861 Binary files /dev/null and b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/2_random_forest_best_feature_combinations.png differ diff --git a/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/3_linear_model_best_feature_combinations.png b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/3_linear_model_best_feature_combinations.png new file mode 100644 index 000000000..04056046a Binary files /dev/null and b/assets/media/blog-images/2024-12-xx-optimizing-hybrid-search/3_linear_model_best_feature_combinations.png differ diff --git a/assets/media/partners/O11y/blog-1.png b/assets/media/partners/O11y/blog-1.png new file mode 100644 index 000000000..69c2476d7 Binary files /dev/null and b/assets/media/partners/O11y/blog-1.png differ diff --git a/assets/media/partners/O11y/blog-2.png b/assets/media/partners/O11y/blog-2.png new file mode 100644 index 000000000..bdab66f7f Binary files /dev/null and b/assets/media/partners/O11y/blog-2.png differ diff --git a/assets/media/partners/O11y/blog-3.png b/assets/media/partners/O11y/blog-3.png new file mode 100644 index 000000000..06031dd95 Binary files /dev/null and b/assets/media/partners/O11y/blog-3.png differ diff --git a/assets/media/partners/O11y/o11y-logo-large.png b/assets/media/partners/O11y/o11y-logo-large.png new file mode 100644 index 000000000..f536f3dca Binary files /dev/null and b/assets/media/partners/O11y/o11y-logo-large.png differ