feat: support prompt caching and apply Anthropic's long-context prompt format #52

Merged on Dec 3, 2024 (12 commits)
feat: Improve segment reconstruction from chunks.
undo76 committed Nov 25, 2024
commit 04d9eb501777fd9129296cd2f5413b8fa07b4535
37 changes: 19 additions & 18 deletions src/raglite/_database.py
@@ -397,34 +397,35 @@ def chunk_ids(self) -> list[str]:
     def __str__(self) -> str:
         """Return a string representation reconstructing the document with headings.

-        Shows each unique header exactly once, when it first appears.
-        For example:
-        - First chunk with "# A ## B" shows both headers
-        - Next chunk with "# A ## B" shows no headers as they're the same
-        - Next chunk with "# A ## C" only shows "## C" as it's the only new header
+        Treats headings as a stack, showing headers only when they differ from
+        the current stack path.

-        Returns
-        -------
-        str: A string containing content with each heading shown once.
+        For example:
+        - "# A ## B" shows both headers
+        - "# A ## B" shows nothing (already seen)
+        - "# A ## C" shows only "## C" (new branch)
+        - "# D ## B" shows both (new path)
         """
         if not self.chunks:
             return ""

-        result = []
-        seen_headers = set()  # Track headers we've already shown
+        result: list[str] = []
+        stack: list[str] = []

         for chunk in self.chunks:
             # Get all headers in this chunk
             headers = [h.strip() for h in chunk.headings.split("\n") if h.strip()]

-            # Add any headers we haven't seen before
-            new_headers = [h for h in headers if h not in seen_headers]
-            if new_headers:
-                result.extend(new_headers)
-                result.append("")  # Empty line after headers
-                seen_headers.update(new_headers)  # Mark these headers as seen
+            # Find first differing header
+            i = 0
+            while i < len(headers) and i < len(stack) and headers[i] == stack[i]:
+                i += 1
+
+            # Update stack and show new headers
+            stack[i:] = headers[i:]
+            if headers[i:]:
+                result.extend(headers[i:])
+                result.append("")

             # Add the chunk body if it's not empty
             if chunk.body.strip():
                 result.append(chunk.body.strip())
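To make the behavioral difference in this commit concrete, here is a minimal standalone sketch that runs the previous seen-set logic and the new stack logic side by side on the four cases from the docstring. The `Chunk` tuple and the function names `parse_headings`, `render_with_seen_set`, and `render_with_stack` are hypothetical stand-ins, not part of raglite. The sketch also assumes each chunk stores its headings as newline-separated lines, as the `split("\n")` call in the diff suggests.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a chunk: newline-separated headings plus a body."""

    headings: str
    body: str


def parse_headings(chunk: Chunk) -> list[str]:
    # Same parsing as in the diff: one heading per non-empty line.
    return [h.strip() for h in chunk.headings.split("\n") if h.strip()]


def render_with_seen_set(chunks: list[Chunk]) -> list[str]:
    """Previous logic: emit a heading only the first time it is ever seen."""
    result: list[str] = []
    seen: set[str] = set()
    for chunk in chunks:
        new_headers = [h for h in parse_headings(chunk) if h not in seen]
        if new_headers:
            result.extend(new_headers)
            result.append("")
            seen.update(new_headers)
        if chunk.body.strip():
            result.append(chunk.body.strip())
    return result


def render_with_stack(chunks: list[Chunk]) -> list[str]:
    """New logic: emit headings from the first level that differs from the current path."""
    result: list[str] = []
    stack: list[str] = []
    for chunk in chunks:
        headers = parse_headings(chunk)
        # Length of the common prefix between this chunk's path and the stack.
        i = 0
        while i < len(headers) and i < len(stack) and headers[i] == stack[i]:
            i += 1
        # Replace the diverging tail of the stack and emit only that tail.
        stack[i:] = headers[i:]
        if headers[i:]:
            result.extend(headers[i:])
            result.append("")
        if chunk.body.strip():
            result.append(chunk.body.strip())
    return result


chunks = [
    Chunk("# A\n## B", "one"),
    Chunk("# A\n## B", "two"),
    Chunk("# A\n## C", "three"),
    Chunk("# D\n## B", "four"),
]

old_lines = render_with_seen_set(chunks)
new_lines = render_with_stack(chunks)
```

The two versions agree on the first three chunks and diverge on the fourth. The seen-set version emits only `# D` there, because `## B` was already shown under `# A`, so the body "four" loses its subsection heading. The stack version compares whole heading paths, so it emits both `# D` and `## B`.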