Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add raptor #11527

Merged
merged 11 commits into from
Mar 1, 2024
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
153 changes: 153 additions & 0 deletions llama-index-packs/llama-index-packs-raptor/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,153 @@
llama_index/_static
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
bin/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
etc/
include/
lib/
lib64/
parts/
sdist/
share/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
.ruff_cache

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pyvenv.cfg

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Jetbrains
.idea
modules/
*.swp

# VsCode
.vscode

# pipenv
Pipfile
Pipfile.lock

# pyright
pyrightconfig.json
4 changes: 4 additions & 0 deletions llama-index-packs/llama-index-packs-raptor/BUILD
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
poetry_requirements(
name="poetry",
module_mappings={"umap-learn": ["umap"], "scikit-learn": ["sklearn"]}
)
17 changes: 17 additions & 0 deletions llama-index-packs/llama-index-packs-raptor/Makefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)

help: ## Show all Makefile targets.
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'

format: ## Run code autoformatters (black).
pre-commit install
git ls-files | xargs pre-commit run black --files

lint: ## Run linters: pre-commit (black, ruff, codespell) and mypy
pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files

test: ## Run tests via pytest.
pytest tests

watch-docs: ## Build and watch documentation.
sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/
80 changes: 80 additions & 0 deletions llama-index-packs/llama-index-packs-raptor/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,80 @@
# Raptor Retriever LlamaPack

## Embedded Tables Retriever Pack w/ Unstructured.io
logan-markewich marked this conversation as resolved.
Show resolved Hide resolved

This LlamaPack shows how to use an implementation of RAPTOR with llama-index, leveraging the RAPTOR pack.

RAPTOR works by recursively clustering and summarizing clusters in layers for retrieval.

There two retrieval modes:

- tree_traversal -- traversing the tree of clusters, performing top-k at each level in the tree.
- collapsed -- treat the entire tree as a giant pile of nodes, perform simple top-k.

See [the paper](https://arxiv.org/abs/2401.18059) for full algorithm details.

### CLI Usage

You can download llamapacks directly using `llamaindex-cli`, which comes installed with the `llama-index` python package:

```bash
llamaindex-cli download-llamapack RaptorPack --download-dir ./raptor_pack
```

You can then inspect/modify the files at `./raptor_pack` and use them as a template for your own project.

### Code Usage

You can alternaitvely install the package:

`pip install llama-index-packs-raptor`

Then, you can import and initialize the pack! This will perform clustering and summarization over your data.

```python
from llama_index.packs.raptor import RaptorPack

pack = RaptorPack(documents, llm=llm, embed_model=embed_model)
```

The `run()` function is a light wrapper around `retriever.retrieve()`.

```python
nodes = pack.run(
"query",
mode="collapsed", # or tree_traversal
)
```

You can also use modules individually.

```python
# get the retriever
retriever = pack.retriever
```

## Persistence

The `RaptorPack` comes with the `RaptorRetriever`, which offers ways of saving/reloading!

If you are using a remote vector-db, just pass it in

```python
# Pack usage
pack = RaptorPack(..., vector_store=vector_store)

# RaptorRetriever usage
retriever = RaptorRetreiver(..., vector_store=vector_store)
logan-markewich marked this conversation as resolved.
Show resolved Hide resolved
```

Then, to re-connect, just pass in the vector store again and an empty list of documents

```python
# Pack usage
pack = RaptorPack([], ..., vector_store=vector_store)

# RaptorRetriever usage
retriever = RaptorRetreiver([], ..., vector_store=vector_store)
```

Check out the [notebook here for complete details!](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-raptor/examples/raptor.ipynb).
Loading
Loading