partition_pdf got TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format' #3253

Closed
liyang79 opened this issue Jun 19, 2024 · 7 comments
Labels: awaiting-response, bug, pdf

Comments

@liyang79

Describe the bug
partition_pdf got TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

To Reproduce

from unstructured.partition.pdf import partition_pdf

filename = "./data/salesforce-fy24-annual-report.pdf"
# file downloaded from https://s23.q4cdn.com/574569502/files/doc_financials/2024/ar/salesforce-fy24-annual-report.pdf
elements = partition_pdf(filename=filename, strategy="hi_res", infer_table_structure=True)

Expected behavior
No error.

Screenshots

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 elements = partition_pdf(
      2     filename=file_path,
      3 
      4     # Unstructured Helpers
      5     strategy="hi_res", 
      6     infer_table_structure=True, 
      7 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/documents/elements.py:593, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    591 @functools.wraps(func)
    592 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> list[Element]:
--> 593     elements = func(*args, **kwargs)
    594     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    596     regex_metadata: dict["str", "str"] = call_args.get("regex_metadata", {})

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:626, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    624 @functools.wraps(func)
    625 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 626     elements = func(*args, **kwargs)
    627     params = get_call_args_applying_defaults(func, *args, **kwargs)
    628     include_metadata = params.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:582, in add_metadata.<locals>.wrapper(*args, **kwargs)
    580 @functools.wraps(func)
    581 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 582     elements = func(*args, **kwargs)
    583     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    584     include_metadata = call_args.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/chunking/dispatch.py:74, in add_chunking_strategy.<locals>.wrapper(*args, **kwargs)
     71 """The decorated function is replaced with this one."""
     73 # -- call the partitioning function to get the elements --
---> 74 elements = func(*args, **kwargs)
     76 # -- look for a chunking-strategy argument --
     77 call_args = get_call_args_applying_defaults(func, *args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:192, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    188 exactly_one(filename=filename, file=file)
    190 languages = check_language_args(languages or [], ocr_languages) or ["eng"]
--> 192 return partition_pdf_or_image(
    193     filename=filename,
    194     file=file,
    195     include_page_breaks=include_page_breaks,
    196     strategy=strategy,
    197     infer_table_structure=infer_table_structure,
    198     languages=languages,
    199     metadata_last_modified=metadata_last_modified,
    200     hi_res_model_name=hi_res_model_name,
    201     extract_images_in_pdf=extract_images_in_pdf,
    202     extract_image_block_types=extract_image_block_types,
    203     extract_image_block_output_dir=extract_image_block_output_dir,
    204     extract_image_block_to_payload=extract_image_block_to_payload,
    205     date_from_file_object=date_from_file_object,
    206     starting_page_number=starting_page_number,
    207     extract_forms=extract_forms,
    208     form_extraction_skip_tables=form_extraction_skip_tables,
    209     **kwargs,
    210 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:288, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    286     with warnings.catch_warnings():
    287         warnings.simplefilter("ignore")
--> 288         elements = _partition_pdf_or_image_local(
    289             filename=filename,
    290             file=spooled_to_bytes_io_if_needed(file),
    291             is_image=is_image,
    292             infer_table_structure=infer_table_structure,
    293             include_page_breaks=include_page_breaks,
    294             languages=languages,
    295             metadata_last_modified=metadata_last_modified or last_modification_date,
    296             hi_res_model_name=hi_res_model_name,
    297             pdf_text_extractable=pdf_text_extractable,
    298             extract_images_in_pdf=extract_images_in_pdf,
    299             extract_image_block_types=extract_image_block_types,
    300             extract_image_block_output_dir=extract_image_block_output_dir,
    301             extract_image_block_to_payload=extract_image_block_to_payload,
    302             starting_page_number=starting_page_number,
    303             extract_forms=extract_forms,
    304             form_extraction_skip_tables=form_extraction_skip_tables,
    305             **kwargs,
    306         )
    307         out_elements = _process_uncategorized_text_elements(elements)
    309 elif strategy == PartitionStrategy.FAST:

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:580, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, hi_res_model_name, pdf_image_dpi, metadata_last_modified, pdf_text_extractable, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, analysis, analyzed_image_output_dir_path, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    573         # NOTE(christine): merged_document_layout = extracted_layout + inferred_layout
    574         merged_document_layout = merge_inferred_with_extracted_layout(
    575             inferred_document_layout=inferred_document_layout,
    576             extracted_layout=extracted_layout,
    577             hi_res_model_name=hi_res_model_name,
    578         )
--> 580         final_document_layout = process_file_with_ocr(
    581             filename,
    582             merged_document_layout,
    583             extracted_layout=extracted_layout,
    584             is_image=is_image,
    585             infer_table_structure=infer_table_structure,
    586             ocr_languages=ocr_languages,
    587             ocr_mode=ocr_mode,
    588             pdf_image_dpi=pdf_image_dpi,
    589         )
    590 else:
    591     inferred_document_layout = process_data_with_model(
    592         file,
    593         is_image=is_image,
    594         model_name=hi_res_model_name,
    595         pdf_image_dpi=pdf_image_dpi,
    596     )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:166, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    164 except Exception as e:
    165     if os.path.isdir(filename) or os.path.isfile(filename):
--> 166         raise e
    167     else:
    168         raise FileNotFoundError(f'File "{filename}" not found!') from e

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:154, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    152     extracted_regions = extracted_layout[i] if i < len(extracted_layout) else None
    153     with PILImage.open(image_path) as image:
--> 154         merged_page_layout = supplement_page_layout_with_ocr(
    155             page_layout=out_layout.pages[i],
    156             image=image,
    157             infer_table_structure=infer_table_structure,
    158             ocr_languages=ocr_languages,
    159             ocr_mode=ocr_mode,
    160             extracted_regions=extracted_regions,
    161         )
    162         merged_page_layouts.append(merged_page_layout)
    163 return DocumentLayout.from_pages(merged_page_layouts)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:232, in supplement_page_layout_with_ocr(page_layout, image, infer_table_structure, ocr_languages, ocr_mode, extracted_regions)
    229     if tables.tables_agent is None:
    230         raise RuntimeError("Unable to load table extraction agent.")
--> 232     page_layout.elements[:] = supplement_element_with_table_extraction(
    233         elements=cast(List["LayoutElement"], page_layout.elements),
    234         image=image,
    235         tables_agent=tables.tables_agent,
    236         ocr_languages=ocr_languages,
    237         ocr_agent=ocr_agent,
    238         extracted_regions=extracted_regions,
    239     )
    241 return page_layout

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:279, in supplement_element_with_table_extraction(elements, image, tables_agent, ocr_languages, ocr_agent, extracted_regions)
    264 cropped_image = image.crop(
    265     (
    266         padded_element.bbox.x1,
   (...)
    270     ),
    271 )
    272 table_tokens = get_table_tokens(
    273     table_element_image=cropped_image,
    274     ocr_languages=ocr_languages,
   (...)
    277     table_element=padded_element,
    278 )
--> 279 tatr_cells = tables_agent.predict(
    280     cropped_image, ocr_tokens=table_tokens, result_format="cells"
    281 )
    283 # NOTE(christine): `tatr_cells == ""` means that the table was not recognized
    284 text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)

TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

Environment Info
Please run python scripts/collect_env.py and paste the output here.

/data/projects/collect_env.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-glibc2.28
Python version:  3.11.5
unstructured version:  0.14.6
unstructured-inference version:  0.7.15
pytesseract version:  0.3.10
Torch version:  2.3.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.11
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 5.3.6.1 30(Build:1)

@liyang79 added the bug label Jun 19, 2024
@SystemAgent

Hi! I have been getting the same error today when calling partition_pdf: TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

With infer_table_structure=False the PDF is partitioned successfully, but that is not a solution in my case, since tables are the critical elements I need to extract.

@IngLP

IngLP commented Jun 26, 2024

I have the same problem here!

@christinestraub
Contributor

Hi @liyang79 @IngLP

I think you're using an old version of the unstructured-inference library (0.7.15). You shouldn't get this error if you upgrade both unstructured-inference and unstructured to their latest versions.
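
Before and after upgrading, it may help to confirm which versions the active interpreter actually sees. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of unstructured.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as they appear in the environment info above.
for name in ("unstructured", "unstructured-inference"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Run this with the same interpreter (or notebook kernel) that raises the error, since a different environment may report different versions.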

@liyang79
Author

@christinestraub You're right. The problem was solved after upgrading to the latest unstructured-inference library. Thanks.

@IngLP

IngLP commented Jun 27, 2024

@christinestraub I have Python 3.10 and unstructured = {extras = ["pdf"], version = "^0.14.8"} in my Poetry config, with unstructured 0.14.8 and unstructured-inference 0.7.36, but I still get the error.

@christinestraub
Contributor

christinestraub commented Jun 27, 2024

@IngLP Can you please provide a PDF document that we could use to reproduce this?

@IngLP

IngLP commented Jul 2, 2024

Hi @christinestraub , I deleted and recreated the whole Python environment and now everything works. Thank you for your help.
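
A stale or mixed environment like this can possibly be detected before rebuilding it, by checking whether the `predict` method that is actually imported accepts the `result_format` keyword named in the traceback. This is only a sketch: `accepts_keyword` is a hypothetical helper, and the import path `unstructured_inference.models.tables` is an assumption inferred from the class name in the error message.

```python
import inspect
from typing import Callable


def accepts_keyword(func: Callable, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


try:
    # Assumed module path; adjust if your installed version differs.
    from unstructured_inference.models import tables
except ImportError:
    print("unstructured-inference is not importable in this environment")
else:
    predict = tables.UnstructuredTableTransformerModel.predict
    # The file path shows which site-packages directory the module came from.
    print("loaded from:", tables.__file__)
    print("accepts result_format:", accepts_keyword(predict, "result_format"))
```

If the check prints `False` even though the package manager reports a recent version, the interpreter is probably importing an older copy, which is consistent with recreating the environment fixing the problem.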
