-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add ramalama rag command #501
base: main
Are you sure you want to change the base?
Conversation
Reviewer's Guide by SourceryThis PR adds a new 'rag' command to the RamaLama CLI that enables generating Retrieval Augmented Generation (RAG) data from various document formats and converting them into an OCI Image. The implementation includes a new command parser, core RAG functionality for document processing, and corresponding documentation. Sequence diagram for the RAG command executionsequenceDiagram
actor User
participant CLI as RamaLama CLI
participant rag as rag
participant Converter as DocumentConverter
participant Result as ConversionResult
User->>CLI: Execute 'ramalama rag PATH IMAGE'
CLI->>rag: rag_cli(args)
rag->>Converter: convert_all(targets)
Converter->>Result: ConversionResult
rag->>Result: export_documents(conv_results, output_dir)
Result-->>rag: Success/Failure counts
rag-->>CLI: Processed results
CLI-->>User: Display results
Class diagram for the new RAG functionalityclassDiagram
class DocumentConverter {
+convert_all(targets, raises_on_error)
}
class ConversionResult {
+input: File
+status: ConversionStatus
+document: Document
+errors: List<Error>
}
class ConversionStatus {
<<enumeration>>
SUCCESS
PARTIAL_SUCCESS
FAILURE
}
class Document {
+export_to_dict()
+export_to_document_tokens()
+export_to_markdown(strict_text)
}
class rag {
+generate(args)
+walk(path)
+export_documents(conv_results, output_dir)
}
DocumentConverter --> ConversionResult
ConversionResult --> ConversionStatus
ConversionResult --> Document
rag --> DocumentConverter
rag --> ConversionResult
rag --> ConversionStatus
rag --> Document
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @rhatdan - I've reviewed your changes - here's some feedback:
Overall Comments:
- Consider improving error message clarity in rag.py - 'The example failed converting X on Y' is confusing. Consider something like 'Failed to convert X out of Y documents'.
- Help text capitalization is inconsistent across commands (e.g. 'display' vs 'Display'). Consider standardizing on sentence case for all help messages.
Here's what I looked at during the review
- 🟡 General issues: 2 issues found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
_log = logging.getLogger(__name__) | ||
|
||
|
||
def walk(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: The walk() function should consistently return a list in all cases
Currently returns None for empty directories but a list otherwise. This inconsistency could cause runtime errors. Consider returning an empty list for empty directories and removing the early return.
ramalama/rag.py
Outdated
|
||
converter = DocumentConverter() | ||
conv_results = converter.convert_all(targets, raises_on_error=False) | ||
success_count, partial_success_count, failure_count = export_documents(conv_results, output_dir=Path("scratch")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Avoid hardcoding the scratch directory path
Consider making this configurable or using a constant defined at module level
OUTPUT_DIR = Path("scratch") # At module level, near top of file
success_count, partial_success_count, failure_count = export_documents(conv_results, output_dir=OUTPUT_DIR)
return targets | ||
|
||
|
||
def export_documents( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (complexity): Consider using a format mapping dictionary to handle document exports
The export_documents function contains significant repetition that can be simplified without losing clarity. Consider restructuring using a format mapping:
EXPORT_FORMATS = {
'.json': lambda doc: json.dumps(doc.export_to_dict()),
'.yaml': lambda doc: yaml.safe_dump(doc.export_to_dict()),
'.doctags.txt': lambda doc: doc.export_to_document_tokens(),
'.md': lambda doc: doc.export_to_markdown(),
'.txt': lambda doc: doc.export_to_markdown(strict_text=True)
}
def export_documents(conv_results: Iterable[ConversionResult], output_dir: Path):
output_dir.mkdir(parents=True, exist_ok=True)
stats = {'success': 0, 'partial': 0, 'failed': 0}
for conv_res in conv_results:
if conv_res.status == ConversionStatus.SUCCESS:
stats['success'] += 1
doc_filename = conv_res.input.file.stem
for ext, export_fn in EXPORT_FORMATS.items():
output_file = output_dir / f"{doc_filename}{ext}"
with output_file.open('w') as fp:
fp.write(export_fn(conv_res.document))
elif conv_res.status == ConversionStatus.PARTIAL_SUCCESS:
stats['partial'] += 1
_log.info(f"Document {conv_res.input.file} was partially converted with the following errors:")
for item in conv_res.errors:
_log.info(f"\t{item.error_message}")
else:
stats['failed'] += 1
_log.info(f"Document {conv_res.input.file} failed to convert.")
_log.info(f"Processed {sum(stats.values())} docs, of which {stats['failed']} failed "
f"and {stats['partial']} were partially converted.")
return stats['success'], stats['partial'], stats['failed']
This approach:
- Separates format configuration from export logic
- Reduces code duplication
- Makes adding new export formats easier
- Maintains explicit error handling and logging
ff9f7fa
to
97a74c1
Compare
@ericcurtin any idea why docling is not being found on Mac? Seems to be found on my MAC? |
Allow users to specify Docx, PDF, Markdown ... files on the command line and then processes them with docling and rag finally putting the output into the specified container image. Signed-off-by: Daniel J Walsh <[email protected]>
Summary by Sourcery
Add a new 'rag' command to the RamaLama CLI for generating RAG data from documents and converting it into an OCI Image, along with corresponding documentation updates.
New Features:
Documentation: