Add TopicNodeParser based on MedGraphRAG paper (#16131)
ravi03071991 authored Sep 20, 2024
1 parent 41166df commit 81ecb2a
Showing 12 changed files with 973 additions and 0 deletions.
394 changes: 394 additions & 0 deletions docs/docs/examples/node_parsers/topic_parser.ipynb


.gitignore
@@ -0,0 +1,153 @@
llama_index/_static
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
bin/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
etc/
include/
lib/
lib64/
parts/
sdist/
share/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
.ruff_cache

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pyvenv.cfg

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Jetbrains
.idea
modules/
*.swp

# VsCode
.vscode

# pipenv
Pipfile
Pipfile.lock

# pyright
pyrightconfig.json
BUILD
@@ -0,0 +1,3 @@
poetry_requirements(
    name="poetry",
)
Makefile
@@ -0,0 +1,17 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)

help:	## Show all Makefile targets.
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'

format:	## Run code autoformatters (black).
	pre-commit install
	git ls-files | xargs pre-commit run black --files

lint:	## Run linters: pre-commit (black, ruff, codespell) and mypy
	pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files

test:	## Run tests via pytest.
	pytest tests

watch-docs:	## Build and watch documentation.
	sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/
README.md
@@ -0,0 +1,42 @@
# LlamaIndex Node_Parser Integration: TopicNodeParser

Implements the topic node parser described in the paper [MedGraphRAG](https://arxiv.org/html/2408.04187), which improves LLM capabilities in the medical domain through a novel graph-based Retrieval-Augmented Generation framework that generates evidence-based results, improving safety and reliability when handling private medical data.

`TopicNodeParser` implements an approximate version of the chunking technique described in the paper.

Here is the technique as outlined in the paper:

```
Large medical documents often contain multiple themes or diverse content. To process these effectively, we first segment the document into data chunks that conform to the context limitations of Large Language Models (LLMs). Traditional methods such as chunking based on token size or fixed characters typically fail to detect subtle shifts in topics accurately. Consequently, these chunks may not fully capture the intended context, leading to a loss in the richness of meaning.
To enhance accuracy, we adopt a mixed method of character separation coupled with topic-based segmentation. Specifically, we utilize static characters (line break symbols) to isolate individual paragraphs within the document. Following this, we apply a derived form of the text for semantic chunking. Our approach includes the use of proposition transfer, which extracts standalone statements from raw text (Chen et al., 2023). Through proposition transfer, each paragraph is transformed into self-sustaining statements. We then conduct a sequential analysis of the document to assess each proposition, deciding whether it should merge with an existing chunk or initiate a new one. This decision is made via a zero-shot approach by an LLM. To reduce noise generated by sequential processing, we implement a sliding window technique, managing five paragraphs at a time. We continuously adjust the window by removing the first paragraph and adding the next, maintaining focus on topic consistency. We set a hard threshold that the longest chunk cannot exceed the context length limitation of the LLM. After chunking the document, we construct a graph on each individual data chunk.
```
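
In pseudocode, the sliding-window decision loop looks roughly like the sketch below. This is an illustration of the idea only, not the package's implementation: `llm_same_topic` is a hypothetical stand-in for the zero-shot LLM judgment the paper describes, proposition extraction is assumed to have happened already, and `max_len` is a stand-in for the hard context-length threshold.

```python
def llm_same_topic(window: list[str], proposition: str) -> bool:
    """Hypothetical helper: ask an LLM whether `proposition` continues
    the topic of the propositions currently in `window`."""
    raise NotImplementedError  # placeholder for the zero-shot LLM call


def chunk_by_topic(
    propositions: list[str], window_size: int = 5, max_len: int = 4096
) -> list[list[str]]:
    """Group self-contained propositions into topic-consistent chunks."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for prop in propositions:
        window = current[-window_size:]  # keep only the trailing window in focus
        too_long = sum(len(p) for p in current) + len(prop) > max_len
        if current and (too_long or not llm_same_topic(window, prop)):
            chunks.append(current)  # topic shift (or size cap): close the chunk
            current = []
        current.append(prop)
    if current:
        chunks.append(current)
    return chunks
```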

## Installation

```bash
pip install llama-index-node-parser-topic
```

## Usage

```python
from llama_index.core import Document
from llama_index.node_parser.topic import TopicNodeParser

# any LlamaIndex LLM works here; OpenAI is shown as an example
# (requires `pip install llama-index-llms-openai`)
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-mini")

node_parser = TopicNodeParser.from_defaults(
    llm=llm,
    max_chunk_size=1000,
    similarity_method="llm",  # can be "llm" or "embedding"
    # embed_model=embed_model,  # used for the "embedding" similarity_method
    # similarity_threshold=0.8,  # used for the "embedding" similarity_method
    window_size=2,  # paper suggests window_size=5
)

nodes = node_parser(
    [
        Document(text="document text 1"),
        Document(text="document text 2"),
    ],
)
```
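
For the embedding-based variant, a sketch along the following lines should work, per the commented-out parameters above. `OpenAIEmbedding` is just an example choice of embedding model, and the assumption that the LLM is still needed for proposition extraction is ours:

```python
from llama_index.embeddings.openai import OpenAIEmbedding  # example embedding backend

node_parser = TopicNodeParser.from_defaults(
    llm=llm,  # assumed still used for proposition extraction
    embed_model=OpenAIEmbedding(),
    similarity_method="embedding",
    similarity_threshold=0.8,  # start a new chunk when similarity drops below this
    window_size=2,
)
```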
BUILD
@@ -0,0 +1 @@
python_sources()
__init__.py
@@ -0,0 +1,6 @@
from llama_index.node_parser.topic.base import (
    TopicNodeParser,
)


__all__ = ["TopicNodeParser"]