Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: UnstructuredFileConverter meta field #242

Merged
merged 14 commits into from
Jan 23, 2024

Conversation

lambda-science
Copy link
Contributor

@lambda-science lambda-science commented Jan 18, 2024

Fix: #241

This pull request add a new optional meta field to add custom metadata to Unstructured documents.
From:

class UnstructuredFileConverter:

    @component.output_types(documents=List[Document])
    def run(self, paths: Union[List[str], List[os.PathLike]]):

To:

class UnstructuredFileConverter:

    @component.output_types(documents=List[Document])
    def run(
        self,
        paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):

The implementation is inspired from the handling of metadata in the PyPDFToDocument component from Haystack main repo.

Ping @anakin87

@lambda-science lambda-science requested a review from a team as a code owner January 18, 2024 12:35
@lambda-science lambda-science requested review from shadeMe and removed request for a team January 18, 2024 12:35
@CLAassistant
Copy link

CLAassistant commented Jan 18, 2024

CLA assistant check
All committers have signed the CLA.

@lambda-science
Copy link
Contributor Author

Not sure I understand why the test are failing on 3.9

@lambda-science
Copy link
Contributor Author

lambda-science commented Jan 18, 2024

And maybe I should add more test case because all test case only have one file with one metadata dict. Maybe a test case with two files and a meta list of dict

EDIT: Done.

@lambda-science lambda-science changed the title Feat/unstructured meta field Feat: UnstructuredFileConverter meta field Jan 18, 2024
@anakin87
Copy link
Member

@lambda-science tests are failing because of mypy. Something in the code should be fixed.

Thanks for your PR. I will take a look later or tomorrow...

@lambda-science
Copy link
Contributor Author

@lambda-science tests are failing because of mypy. Something in the code should be fixed.

Thanks for your PR. I will take a look later or tomorrow...

No worries, I realized I could use my own fix before it is merged and released with a simple requirement.txt such as:

unstructured-fileconverter-haystack @ git+https://github.com/deepset-ai/haystack-core-integrations@3bc51a788356545dc692183591cf91bdf5391713#subdirectory=integrations/unstructured

That's very cool ! So it's not urgent at all 🌞

@lambda-science
Copy link
Contributor Author

lambda-science commented Jan 19, 2024

I have the feeling that something is off with metadata, because the auto-generated page_number is always the last page with one doc per elem
SOLVED: ..... mutable data in python, always have been...... replacing metadata = meta by metadata = meta.copy() solved it 😭

@anakin87 anakin87 self-assigned this Jan 21, 2024
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @lambda-science, thanks for the good work!

  • I left a comment about metadata handling
  • I would add a test where the path is a directory

Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey... Sorry for the long wait.

I found some small opportunities for improvement, then we can merge this good PR!

Corentin and others added 5 commits January 23, 2024 15:15
@anakin87 anakin87 self-requested a review January 23, 2024 14:36
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good...

Thanks @lambda-science!

I'm going to merge this PR and release a new version...

@anakin87 anakin87 merged commit 5cbb529 into deepset-ai:main Jan 23, 2024
6 checks passed
@anakin87
Copy link
Member

New version: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.0/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UnstructuredFileConverter: support custom metadata dict
3 participants