[WIP] Add ramalama rag command #501
Open: rhatdan wants to merge 1 commit into containers:main from rhatdan:rag
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,35 @@ | ||
% ramalama-rag 1 | ||
|
||
## NAME | ||
ramalama\-rag - generate RAG (Retrieval Augmented Generation) data from provided documents and convert it into an OCI image | ||
|
||
## SYNOPSIS | ||
**ramalama rag** [options] [path ...] image | ||
|
||
## DESCRIPTION | ||
Generate RAG data from the provided documents and convert it into an OCI image. | ||
|
||
positional arguments: | ||
path Files or directories containing PDF, DOCX, PPTX, XLSX, HTML, AsciiDoc, or Markdown files to process. Can be specified multiple times. | ||
image Name of the OCI image that will contain the processed RAG data | ||
|
||
|
||
## OPTIONS | ||
|
||
#### **--help**, **-h** | ||
Print usage message | ||
|
||
## EXAMPLES | ||
|
||
``` | ||
$ ramalama rag https://arxiv.org/pdf/2408.09869 /tmp/pdf quay.io/rhatdan/myrag | ||
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00, 68509.50it/s] | ||
Neither CUDA nor MPS are available - defaulting to CPU. Note: This module is much faster with a GPU. | ||
2024-12-04 13:49:07.372 ( 70.927s) [ 75AB6740] doc_normalisation.h:448 WARN| found new `other` type: checkbox-unselected | ||
``` | ||
|
||
## SEE ALSO | ||
**[ramalama(1)](ramalama.1.md)** | ||
|
||
## HISTORY | ||
Dec 2024, Originally compiled by Dan Walsh <[email protected]> |
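The synopsis in the man page above (`[path ...] image`) can be sketched with `argparse`. This is only an illustration under stated assumptions: `make_parser` is a hypothetical stand-alone helper, not the way the command is wired into the ramalama CLI (that code is not part of this diff), and the destination names `PATH` and `IMAGE` are taken from the `args.PATH` and `args.IMAGE` references in `generate()`.

```python
import argparse


def make_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-alone parser mirroring the man page synopsis."""
    parser = argparse.ArgumentParser(prog="ramalama rag")
    # Zero or more files, directories, or URLs to process.
    parser.add_argument(
        "PATH",
        nargs="*",
        help="files or directories containing documents to process",
    )
    # The last positional argument is always the target image name.
    parser.add_argument(
        "IMAGE",
        help="OCI image name to contain the processed RAG data",
    )
    return parser
```

With this layout, every positional argument except the last is collected into `PATH`, and the last one becomes `IMAGE`.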
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
import tempfile | ||
import os | ||
import json | ||
import logging | ||
from pathlib import Path | ||
from typing import Iterable | ||
|
||
from ramalama.common import run_cmd | ||
from docling.datamodel.base_models import ConversionStatus | ||
from docling.datamodel.document import ConversionResult | ||
from docling.document_converter import DocumentConverter | ||
|
||
_log = logging.getLogger(__name__) | ||
|
||
ociimage_rag = "org.containers.type=ai.image.rag" | ||
|
||
|
||
def walk(path): | ||
    targets = [] | ||
    for root, dirs, files in os.walk(path, topdown=True): | ||
        if len(files) == 0: | ||
            continue | ||
        for f in files: | ||
            file = os.path.join(root, f) | ||
            if os.path.isfile(file): | ||
                targets.append(file) | ||
    return targets | ||
|
||
|
||
def export_documents( | ||
issue (complexity): Consider using a format mapping dictionary to handle document exports

The export_documents function contains significant repetition that can be simplified without losing clarity. Consider restructuring using a format mapping:

```python
EXPORT_FORMATS = {
    '.json': lambda doc: json.dumps(doc.export_to_dict()),
    '.yaml': lambda doc: yaml.safe_dump(doc.export_to_dict()),
    '.doctags.txt': lambda doc: doc.export_to_document_tokens(),
    '.md': lambda doc: doc.export_to_markdown(),
    '.txt': lambda doc: doc.export_to_markdown(strict_text=True)
}

def export_documents(conv_results: Iterable[ConversionResult], output_dir: Path):
    output_dir.mkdir(parents=True, exist_ok=True)
    stats = {'success': 0, 'partial': 0, 'failed': 0}
    for conv_res in conv_results:
        if conv_res.status == ConversionStatus.SUCCESS:
            stats['success'] += 1
            doc_filename = conv_res.input.file.stem
            for ext, export_fn in EXPORT_FORMATS.items():
                output_file = output_dir / f"{doc_filename}{ext}"
                with output_file.open('w') as fp:
                    fp.write(export_fn(conv_res.document))
        elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:
            stats['partial'] += 1
            _log.info(f"Document {conv_res.input.file} was partially converted with the following errors:")
            for item in conv_res.errors:
                _log.info(f"\t{item.error_message}")
        else:
            stats['failed'] += 1
            _log.info(f"Document {conv_res.input.file} failed to convert.")
    _log.info(f"Processed {sum(stats.values())} docs, of which {stats['failed']} failed "
              f"and {stats['partial']} were partially converted.")
    return stats['success'], stats['partial'], stats['failed']
```

This approach:
|
||
    conv_results: Iterable[ConversionResult], | ||
    output_dir: Path, | ||
): | ||
    output_dir.mkdir(parents=True, exist_ok=True) | ||
|
||
    success_count = 0 | ||
    failure_count = 0 | ||
    partial_success_count = 0 | ||
|
||
    for conv_res in conv_results: | ||
        if conv_res.status == ConversionStatus.SUCCESS: | ||
            success_count += 1 | ||
            doc_filename = conv_res.input.file.stem | ||
|
||
            # Export Docling document format to JSON: | ||
            with (output_dir / f"{doc_filename}.json").open("w") as fp: | ||
                fp.write(json.dumps(conv_res.document.export_to_dict())) | ||
|
||
        elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS: | ||
            _log.info(f"Document {conv_res.input.file} was partially converted with the following errors:") | ||
            for item in conv_res.errors: | ||
                _log.info(f"\t{item.error_message}") | ||
            partial_success_count += 1 | ||
        else: | ||
            _log.info(f"Document {conv_res.input.file} failed to convert.") | ||
            failure_count += 1 | ||
|
||
    _log.info( | ||
        f"Processed {success_count + partial_success_count + failure_count} docs, " | ||
        f"of which {failure_count} failed " | ||
        f"and {partial_success_count} were partially converted." | ||
    ) | ||
    return success_count, partial_success_count, failure_count | ||
|
||
|
||
def build(source, target, args): | ||
    print(f"Building {target}...") | ||
    src = os.path.realpath(source) | ||
    contextdir = os.path.dirname(src) | ||
    model = os.path.basename(src) | ||
    containerfile = tempfile.NamedTemporaryFile(prefix='RamaLama_Containerfile_', delete=True) | ||
    # Open the file for writing. | ||
    with open(containerfile.name, 'w') as c: | ||
        c.write( | ||
            f"""\ | ||
FROM scratch | ||
COPY {model} / | ||
LABEL {ociimage_rag} | ||
""" | ||
        ) | ||
    imageid = ( | ||
        run_cmd( | ||
            [args.engine, "build", "-t", target, "--no-cache", "-q", "-f", containerfile.name, contextdir], | ||
            debug=args.debug, | ||
        ) | ||
        .stdout.decode("utf-8") | ||
        .strip() | ||
    ) | ||
    return imageid | ||
|
||
|
||
def generate(args): | ||
    tmpdir = tempfile.TemporaryDirectory(prefix="ramalama_") | ||
    targets = [] | ||
    for p in args.PATH: | ||
        if os.path.isfile(p): | ||
            targets.append(p)  # Process selected file | ||
            continue | ||
        if os.path.isdir(p): | ||
            targets.extend(walk(p))  # Walk directory and process all files | ||
            continue | ||
        targets.append(p)  # WEB? | ||
|
||
    converter = DocumentConverter() | ||
    conv_results = converter.convert_all(targets, raises_on_error=False) | ||
    success_count, partial_success_count, failure_count = export_documents(conv_results, output_dir=Path(tmpdir.name)) | ||
    if failure_count > 0: | ||
        raise RuntimeError(f"failed to convert {failure_count} target(s) out of {len(targets)} documents.") | ||
|
||
    build(tmpdir.name, args.IMAGE, args) |
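To make the image layout produced by `build()` easier to see, the Containerfile text can be isolated in a small pure function. This is a minimal sketch, not code from the PR: `render_containerfile` and the constant name `OCIIMAGE_RAG` are hypothetical, while the three instructions and the label value are copied from the diff above.

```python
# Label value copied from the diff; the constant name is hypothetical.
OCIIMAGE_RAG = "org.containers.type=ai.image.rag"


def render_containerfile(model: str, label: str = OCIIMAGE_RAG) -> str:
    """Return Containerfile text that copies `model` into an empty image.

    `model` is the name of the directory (relative to the build context)
    that holds the exported documents.
    """
    return f"FROM scratch\nCOPY {model} /\nLABEL {label}\n"
```

Keeping the text generation separate from the call to the container engine would let the layout be checked without running a build.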
issue: The walk() function should consistently return a list in all cases
Currently returns None for empty directories but a list otherwise. This inconsistency could cause runtime errors. Consider returning an empty list for empty directories and removing the early return.
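The `walk()` shown in the diff above appears to build and return `targets` on every path, so this comment may refer to an earlier revision. Either way, a version without any early exit can be sketched as follows; it keeps the same name and behavior as the PR and only drops the redundant empty-directory check.

```python
import os


def walk(path):
    """Return every regular file below `path`; always a list, possibly empty."""
    targets = []
    for root, _dirs, files in os.walk(path, topdown=True):
        for name in files:
            candidate = os.path.join(root, name)
            # Skip entries that are not regular files (e.g. broken symlinks).
            if os.path.isfile(candidate):
                targets.append(candidate)
    return targets
```

Because the result is always a list, callers such as `targets.extend(walk(p))` need no special case for empty directories.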