feat: generic partition brick with filetype detection #132

Merged 32 commits on Jan 9, 2023

MthwRobinson
Contributor

Summary

Adds a generic partition brick that detects the file type and then invokes the appropriate partitioning brick. In support of this functionality, this PR also adds the following:

  • Adds a module for detecting file type
  • Adds magic to the dependencies, along with instructions for installing libmagic
  • Adds support for processing file-like objects opened in "rb" mode to partition_html and partition_eml
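
As a rough illustration of the detect-then-dispatch idea described above, here is a minimal sketch. It is not the code in this PR: it swaps libmagic for a plain extension table and a check of the leading bytes, and `detect_filetype`, `partition_sketch`, `EXTENSION_TO_FILETYPE`, and the handler callables are all hypothetical names chosen for the example.

```python
import io
import os
from typing import IO, Callable, Dict, List, Optional

# Hypothetical extension table. The PR relies on libmagic for detection;
# this lookup is only a stand-in so the sketch runs without extra packages.
EXTENSION_TO_FILETYPE: Dict[str, str] = {
    ".docx": "docx",
    ".pdf": "pdf",
    ".html": "html",
    ".eml": "eml",
}


def detect_filetype(
    filename: Optional[str] = None, file: Optional[IO[bytes]] = None
) -> Optional[str]:
    """Guess a file type from the filename extension or the leading bytes."""
    if filename is not None:
        _, ext = os.path.splitext(filename)
        return EXTENSION_TO_FILETYPE.get(ext.lower())
    if file is not None:
        header = file.read(5)
        file.seek(0)  # rewind so the downstream handler sees the whole stream
        if header.startswith(b"%PDF-"):
            return "pdf"
        if header.startswith(b"PK\x03\x04"):
            return "zip"  # zip container (docx, xlsx, ...); needs a closer look
    return None


def partition_sketch(
    handlers: Dict[str, Callable[..., List[str]]],
    filename: Optional[str] = None,
    file: Optional[IO[bytes]] = None,
) -> List[str]:
    """Detect the file type, then call the matching (hypothetical) handler."""
    if (filename is None) == (file is None):
        raise ValueError("Exactly one of filename or file must be provided.")
    filetype = detect_filetype(filename=filename, file=file)
    handler = handlers.get(filetype) if filetype is not None else None
    if handler is None:
        raise ValueError(f"Unsupported file type: {filetype!r}")
    return handler(filename=filename, file=file)
```

The handlers are passed in as a dictionary so the sketch stays independent of the real partitioning bricks; in practice each entry would point at the brick for that format.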

Testing

  import docx
  from unstructured.partition.auto import partition

  document = docx.Document()
  document.add_paragraph("Important Analysis", style="Heading 1")
  document.add_paragraph("Here is my first thought.", style="Body Text")
  document.add_paragraph("Here is my second thought.", style="Normal")
  document.save("mydoc.docx")
  elements = partition(filename="mydoc.docx")

  with open("mydoc.docx", "rb") as f:
      elements = partition(file=f)

  from unstructured.partition.auto import partition
  elements = partition(filename="example-docs/layout-parser-paper-fast.pdf")

@MthwRobinson MthwRobinson requested a review from qued January 9, 2023 15:21
@MthwRobinson
Contributor Author

The test failure is due to libmagic detecting different MIME types for .docx and .xlsx files on Mac vs. Linux (see this issue for details). Currently working on a fix.
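
One possible workaround, sketched here and not necessarily the fix applied in this PR, is to stop relying on the reported MIME type for Office files and look inside the archive instead. Both .docx and .xlsx are zip containers, and by convention the main part lives at `word/document.xml` or `xl/workbook.xml` respectively. The function name `classify_office_zip` is hypothetical.

```python
import io
import zipfile
from typing import IO, Optional


def classify_office_zip(file: IO[bytes]) -> Optional[str]:
    """Return "docx" or "xlsx" based on archive member names, else None.

    Assumes the conventional part names; an unusual package layout
    would not be recognized by this check.
    """
    try:
        with zipfile.ZipFile(file) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return None
    finally:
        file.seek(0)  # leave the stream rewound for whoever reads it next
    if "word/document.xml" in names:
        return "docx"
    if "xl/workbook.xml" in names:
        return "xlsx"
    return None
```

Because the check reads only the archive's member list, it gives the same answer on any platform, whatever MIME type libmagic reports.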

Contributor

@cragwolfe cragwolfe left a comment

Do we also need to add libmagic support in the Dockerfile? I believe the package is file-devel.

@MthwRobinson
Contributor Author

Per our offline conversation, I'm removing the Dockerfile since we're not using it for anything.

@MthwRobinson MthwRobinson merged commit 5376bc5 into main Jan 9, 2023
@MthwRobinson MthwRobinson deleted the robinson/generic-partition branch January 9, 2023 21:15
@@ -20,7 +20,7 @@ install-base: install-base-pip-packages install-nltk-models
install: install-base-pip-packages install-dev install-nltk-models install-test install-huggingface install-unstructured-inference

.PHONY: install-ci
-install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference
+install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference local-inference
Contributor

Suggested change
-install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference local-inference
+install-ci: install-base-pip-packages install-test install-nltk-models install-huggingface install-unstructured-inference install-local-inference
