@MthwRobinson

Summary

Adds a generic partition brick that detects the file type and then invokes the appropriate partitioning brick. In support of this functionality, this PR also adds the following:

  • A module for detecting file type
  • magic added to the dependencies, along with instructions for installing libmagic
  • Support for processing file-like objects opened in "rb" mode in partition_html and partition_eml
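For readers skimming the thread, the detect-then-dispatch idea described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: it guesses the type from the file extension rather than calling libmagic, and every name in it (detect_filetype, partition_sketch, the *_stub functions) is hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for the real partitioning bricks.
def _html_stub(filename: str) -> List[str]:
    return [f"html elements from {filename}"]


def _docx_stub(filename: str) -> List[str]:
    return [f"docx elements from {filename}"]


def _pdf_stub(filename: str) -> List[str]:
    return [f"pdf elements from {filename}"]


# Assumption: detection by extension only, as a simple stand-in for libmagic.
EXTENSION_TO_FILETYPE: Dict[str, str] = {
    ".html": "html",
    ".docx": "docx",
    ".pdf": "pdf",
}

FILETYPE_TO_PARTITIONER: Dict[str, Callable[[str], List[str]]] = {
    "html": _html_stub,
    "docx": _docx_stub,
    "pdf": _pdf_stub,
}


def detect_filetype(filename: str) -> str:
    """Map a filename to a file type label, or raise if it is unsupported."""
    suffix = Path(filename).suffix.lower()
    if suffix not in EXTENSION_TO_FILETYPE:
        raise ValueError(f"Unsupported file type: {suffix!r}")
    return EXTENSION_TO_FILETYPE[suffix]


def partition_sketch(filename: str) -> List[str]:
    """Detect the file type, then hand off to the matching partitioner."""
    handler = FILETYPE_TO_PARTITIONER[detect_filetype(filename)]
    return handler(filename)
```

Keeping detection and dispatch in separate tables means a new file type only needs one entry in each, and the detection step can be swapped for a content-based check without touching the dispatcher.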

Testing

  import docx
  from unstructured.partition.auto import partition

  document = docx.Document()
  document.add_paragraph("Important Analysis", style="Heading 1")
  document.add_paragraph("Here is my first thought.", style="Body Text")
  document.add_paragraph("Here is my second thought.", style="Normal")
  document.save("mydoc.docx")
  elements = partition(filename="mydoc.docx")

  with open("mydoc.docx", "rb") as f:
      elements = partition(file=f)

  from unstructured.partition.auto import partition
  elements = partition(filename="example-docs/layout-parser-paper-fast.pdf")

@MthwRobinson requested a review from qued on January 9, 2023 15:21
@MthwRobinson

The test failure is due to libmagic detecting different MIME types for .docx and .xlsx files on Mac vs. Linux (see this issue for details). I'm currently working on a fix.
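One platform-independent way to tell these formats apart, sketched here purely as an illustration and not necessarily the fix that landed, is to look inside the archive: .docx and .xlsx files are ZIP containers, and a Word document normally contains word/document.xml while a workbook contains xl/workbook.xml. The function name sniff_ooxml_type is hypothetical.

```python
import io
import zipfile
from typing import BinaryIO, Optional

DOCX_MARKER = "word/document.xml"
XLSX_MARKER = "xl/workbook.xml"


def sniff_ooxml_type(file: BinaryIO) -> Optional[str]:
    """Return "docx", "xlsx", or None for a seekable binary file object.

    Hypothetical helper: inspects ZIP member names instead of relying on
    the MIME type reported by libmagic.
    """
    start = file.tell()
    try:
        file.seek(0)
        if not zipfile.is_zipfile(file):
            return None
        file.seek(0)
        with zipfile.ZipFile(file) as archive:
            names = set(archive.namelist())
        if DOCX_MARKER in names:
            return "docx"
        if XLSX_MARKER in names:
            return "xlsx"
        return None
    finally:
        # Restore the caller's position so the file can still be partitioned.
        file.seek(start)
```

Because the check restores the read position, it can run on a file opened in "rb" mode before the same object is passed on for partitioning.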


@cragwolfe cragwolfe left a comment


Do we also need to add support in the Dockerfile? I believe the package is file-devel.

@MthwRobinson

Per an offline conversation, I'm removing the Dockerfile since we're not using it for anything.

@MthwRobinson merged commit 5376bc5 into main on Jan 9, 2023
@MthwRobinson deleted the robinson/generic-partition branch on January 9, 2023 21:15

 .PHONY: install-ci
-install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference
+install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference local-inference

Suggested change
-install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference local-inference
+install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference install-local-inference
