Support JSON format with DoclingDocument as InputFormat #781

Closed
ceberam opened this issue Jan 21, 2025 · 0 comments · Fixed by #783
ceberam commented Jan 21, 2025

Requested feature

Background

  • Docling's DocumentConverter supports several formats, such as PDF, HTML, and .docx, and converts those files into a DoclingDocument object. Each supported format is enumerated as an InputFormat instance.
  • The DocumentConverter is used in many integrations. For instance, in LlamaIndex the DoclingReader can be plugged into SimpleDirectoryReader to convert PDF files:
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.docling import DoclingReader
    
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    dir_reader = SimpleDirectoryReader(
        input_dir="/my/temp/dir",
        file_extractor={".pdf": reader}
    )
  • Since the conversion may be computationally costly, users may want to persist the converted documents as .json files and reuse them later in other data processing pipelines.
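
The persist-and-reuse idea in the last point can be sketched with the standard library alone. This is a minimal illustration, not Docling code: `load_or_convert`, the `convert` callable, and the `<name>.json` cache layout are all hypothetical, and `convert` stands in for whatever expensive conversion produces a JSON-serializable export.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_convert(
    source: Path,
    cache_dir: Path,
    convert: Callable[[Path], dict],
) -> dict:
    """Return the converted document for `source`, converting only on a cache miss.

    `convert` is a hypothetical stand-in for the costly conversion step; it must
    return a JSON-serializable dict.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical cache layout: one JSON file per source file name.
    cache_file = cache_dir / (source.name + ".json")

    if cache_file.exists():
        # Cache hit: skip the conversion and read the persisted export.
        return json.loads(cache_file.read_text(encoding="utf-8"))

    # Cache miss: run the conversion once and persist the result.
    doc = convert(source)
    cache_file.write_text(json.dumps(doc), encoding="utf-8")
    return doc
```

A later pipeline can then read the cached .json files directly, which is the situation the request below addresses.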

Request

  • Create a new conversion backend that simply reads the content of a JSON file that contains a DoclingDocument export.
  • In this way, the example pattern above could be reused:
    from llama_index.core import SimpleDirectoryReader
    from llama_index.readers.docling import DoclingReader
    
    reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
    dir_reader = SimpleDirectoryReader(
        input_dir="/my/temp/dir",
        file_extractor={".json": reader}
    )
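
As a rough illustration of what such a backend could look like, the sketch below dispatches on file extension and adds a reader that only parses and sanity-checks the JSON. It is not Docling's implementation: `read_docling_json`, `BACKENDS`, `convert`, and `UnsupportedFormatError` are hypothetical names, and the check on a `schema_name` field is an assumption about the export layout.

```python
import json
from pathlib import Path
from typing import Callable, Dict


class UnsupportedFormatError(ValueError):
    """Raised when no backend is registered for a file extension."""


def read_docling_json(path: Path) -> dict:
    """Hypothetical JSON backend: read the file, no real conversion involved."""
    data = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: the export is a JSON object tagged with a `schema_name` field.
    if not isinstance(data, dict) or data.get("schema_name") != "DoclingDocument":
        raise ValueError(f"{path} does not look like a DoclingDocument export")
    return data


# Hypothetical registry mapping file extensions to backends; a real converter
# would also register its PDF, HTML, and .docx backends here.
BACKENDS: Dict[str, Callable[[Path], dict]] = {
    ".json": read_docling_json,
}


def convert(path: Path) -> dict:
    """Pick a backend by file extension and delegate to it."""
    backend = BACKENDS.get(path.suffix.lower())
    if backend is None:
        raise UnsupportedFormatError(f"no backend for {path.suffix!r}")
    return backend(path)
```

Validating the payload before returning it matters here because .json is a generic extension: a directory reader may hand the backend JSON files that are not document exports at all.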

Alternatives

  • Delegate the reading of JSON DoclingDocument files to integration frameworks (e.g., create another DoclingReader for JSON in LlamaIndex)
  • Delegate the reading of JSON to docling-core
@ceberam ceberam added the enhancement New feature or request label Jan 21, 2025
@vagenas vagenas self-assigned this Jan 21, 2025