partition_pdf got TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format' #3253

liyang79 · 2024-06-19T07:39:38Z

Describe the bug
partition_pdf got TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

To Reproduce

from unstructured.partition.pdf import partition_pdf

filename = "./data/salesforce-fy24-annual-report.pdf"
# file downloaded from https://s23.q4cdn.com/574569502/files/doc_financials/2024/ar/salesforce-fy24-annual-report.pdf
elements = partition_pdf(filename=filename, strategy="hi_res", infer_table_structure=True)

Expected behavior
No error.

Screenshots

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[6], line 1
----> 1 elements = partition_pdf(
      2     filename=file_path,
      3 
      4     # Unstructured Helpers
      5     strategy="hi_res", 
      6     infer_table_structure=True, 
      7 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/documents/elements.py:593, in process_metadata.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    591 @functools.wraps(func)
    592 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> list[Element]:
--> 593     elements = func(*args, **kwargs)
    594     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    596     regex_metadata: dict["str", "str"] = call_args.get("regex_metadata", {})

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:626, in add_filetype.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    624 @functools.wraps(func)
    625 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 626     elements = func(*args, **kwargs)
    627     params = get_call_args_applying_defaults(func, *args, **kwargs)
    628     include_metadata = params.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/file_utils/filetype.py:582, in add_metadata.<locals>.wrapper(*args, **kwargs)
    580 @functools.wraps(func)
    581 def wrapper(*args: _P.args, **kwargs: _P.kwargs) -> List[Element]:
--> 582     elements = func(*args, **kwargs)
    583     call_args = get_call_args_applying_defaults(func, *args, **kwargs)
    584     include_metadata = call_args.get("include_metadata", True)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/chunking/dispatch.py:74, in add_chunking_strategy.<locals>.wrapper(*args, **kwargs)
     71 """The decorated function is replaced with this one."""
     73 # -- call the partitioning function to get the elements --
---> 74 elements = func(*args, **kwargs)
     76 # -- look for a chunking-strategy argument --
     77 call_args = get_call_args_applying_defaults(func, *args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:192, in partition_pdf(filename, file, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, include_metadata, metadata_filename, metadata_last_modified, chunking_strategy, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    188 exactly_one(filename=filename, file=file)
    190 languages = check_language_args(languages or [], ocr_languages) or ["eng"]
--> 192 return partition_pdf_or_image(
    193     filename=filename,
    194     file=file,
    195     include_page_breaks=include_page_breaks,
    196     strategy=strategy,
    197     infer_table_structure=infer_table_structure,
    198     languages=languages,
    199     metadata_last_modified=metadata_last_modified,
    200     hi_res_model_name=hi_res_model_name,
    201     extract_images_in_pdf=extract_images_in_pdf,
    202     extract_image_block_types=extract_image_block_types,
    203     extract_image_block_output_dir=extract_image_block_output_dir,
    204     extract_image_block_to_payload=extract_image_block_to_payload,
    205     date_from_file_object=date_from_file_object,
    206     starting_page_number=starting_page_number,
    207     extract_forms=extract_forms,
    208     form_extraction_skip_tables=form_extraction_skip_tables,
    209     **kwargs,
    210 )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:288, in partition_pdf_or_image(filename, file, is_image, include_page_breaks, strategy, infer_table_structure, ocr_languages, languages, metadata_last_modified, hi_res_model_name, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, date_from_file_object, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    286     with warnings.catch_warnings():
    287         warnings.simplefilter("ignore")
--> 288         elements = _partition_pdf_or_image_local(
    289             filename=filename,
    290             file=spooled_to_bytes_io_if_needed(file),
    291             is_image=is_image,
    292             infer_table_structure=infer_table_structure,
    293             include_page_breaks=include_page_breaks,
    294             languages=languages,
    295             metadata_last_modified=metadata_last_modified or last_modification_date,
    296             hi_res_model_name=hi_res_model_name,
    297             pdf_text_extractable=pdf_text_extractable,
    298             extract_images_in_pdf=extract_images_in_pdf,
    299             extract_image_block_types=extract_image_block_types,
    300             extract_image_block_output_dir=extract_image_block_output_dir,
    301             extract_image_block_to_payload=extract_image_block_to_payload,
    302             starting_page_number=starting_page_number,
    303             extract_forms=extract_forms,
    304             form_extraction_skip_tables=form_extraction_skip_tables,
    305             **kwargs,
    306         )
    307         out_elements = _process_uncategorized_text_elements(elements)
    309 elif strategy == PartitionStrategy.FAST:

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf.py:580, in _partition_pdf_or_image_local(filename, file, is_image, infer_table_structure, include_page_breaks, languages, ocr_mode, model_name, hi_res_model_name, pdf_image_dpi, metadata_last_modified, pdf_text_extractable, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, analysis, analyzed_image_output_dir_path, starting_page_number, extract_forms, form_extraction_skip_tables, **kwargs)
    573         # NOTE(christine): merged_document_layout = extracted_layout + inferred_layout
    574         merged_document_layout = merge_inferred_with_extracted_layout(
    575             inferred_document_layout=inferred_document_layout,
    576             extracted_layout=extracted_layout,
    577             hi_res_model_name=hi_res_model_name,
    578         )
--> 580         final_document_layout = process_file_with_ocr(
    581             filename,
    582             merged_document_layout,
    583             extracted_layout=extracted_layout,
    584             is_image=is_image,
    585             infer_table_structure=infer_table_structure,
    586             ocr_languages=ocr_languages,
    587             ocr_mode=ocr_mode,
    588             pdf_image_dpi=pdf_image_dpi,
    589         )
    590 else:
    591     inferred_document_layout = process_data_with_model(
    592         file,
    593         is_image=is_image,
    594         model_name=hi_res_model_name,
    595         pdf_image_dpi=pdf_image_dpi,
    596     )

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:166, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    164 except Exception as e:
    165     if os.path.isdir(filename) or os.path.isfile(filename):
--> 166         raise e
    167     else:
    168         raise FileNotFoundError(f'File "{filename}" not found!') from e

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:154, in process_file_with_ocr(filename, out_layout, extracted_layout, is_image, infer_table_structure, ocr_languages, ocr_mode, pdf_image_dpi)
    152     extracted_regions = extracted_layout[i] if i < len(extracted_layout) else None
    153     with PILImage.open(image_path) as image:
--> 154         merged_page_layout = supplement_page_layout_with_ocr(
    155             page_layout=out_layout.pages[i],
    156             image=image,
    157             infer_table_structure=infer_table_structure,
    158             ocr_languages=ocr_languages,
    159             ocr_mode=ocr_mode,
    160             extracted_regions=extracted_regions,
    161         )
    162         merged_page_layouts.append(merged_page_layout)
    163 return DocumentLayout.from_pages(merged_page_layouts)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:232, in supplement_page_layout_with_ocr(page_layout, image, infer_table_structure, ocr_languages, ocr_mode, extracted_regions)
    229     if tables.tables_agent is None:
    230         raise RuntimeError("Unable to load table extraction agent.")
--> 232     page_layout.elements[:] = supplement_element_with_table_extraction(
    233         elements=cast(List["LayoutElement"], page_layout.elements),
    234         image=image,
    235         tables_agent=tables.tables_agent,
    236         ocr_languages=ocr_languages,
    237         ocr_agent=ocr_agent,
    238         extracted_regions=extracted_regions,
    239     )
    241 return page_layout

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/utils.py:249, in requires_dependencies.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
    246 @wraps(func)
    247 def wrapper(*args: _P.args, **kwargs: _P.kwargs):
    248     run_check()
--> 249     return func(*args, **kwargs)

File ~/miniconda3/envs/fastchat/lib/python3.11/site-packages/unstructured/partition/pdf_image/ocr.py:279, in supplement_element_with_table_extraction(elements, image, tables_agent, ocr_languages, ocr_agent, extracted_regions)
    264 cropped_image = image.crop(
    265     (
    266         padded_element.bbox.x1,
   (...)
    270     ),
    271 )
    272 table_tokens = get_table_tokens(
    273     table_element_image=cropped_image,
    274     ocr_languages=ocr_languages,
   (...)
    277     table_element=padded_element,
    278 )
--> 279 tatr_cells = tables_agent.predict(
    280     cropped_image, ocr_tokens=table_tokens, result_format="cells"
    281 )
    283 # NOTE(christine): `tatr_cells == ""` means that the table was not recognized
    284 text_as_html = "" if tatr_cells == "" else cells_to_html(tatr_cells)

TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

Environment Info
Please run python scripts/collect_env.py and paste the output here.

/data/projects/collect_env.py:5: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
OS version:  Linux-3.10.0-1160.31.1.el7.x86_64-x86_64-with-glibc2.28
Python version:  3.11.5
unstructured version:  0.14.6
unstructured-inference version:  0.7.15
pytesseract version:  0.3.10
Torch version:  2.3.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.11
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version:  LibreOffice 5.3.6.1 30(Build:1)


SystemAgent · 2024-06-19T08:17:01Z

Hi! I have been getting the same error today when trying to use partition_pdf: TypeError: UnstructuredTableTransformerModel.predict() got an unexpected keyword argument 'result_format'

With infer_table_structure=False it manages to partition the PDF file, but that is not a solution in my case, since the tables are the critical elements that need to be extracted.

IngLP · 2024-06-26T17:11:46Z

I have the same problem here!

christinestraub · 2024-06-26T17:33:43Z

Hi @liyang79 @IngLP

I think you're using an old version of the unstructured-inference library (0.7.15). You won't get this error if you upgrade both the unstructured-inference and unstructured libraries to their latest versions.
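To confirm which versions are actually installed before and after upgrading, a minimal sketch using only the standard library might look like the following. The helper name installed_versions is hypothetical and not part of unstructured; it simply reads the installed package metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # The two distributions discussed in this issue.
    names = ["unstructured", "unstructured-inference"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Run it with the same interpreter that raises the error. If unstructured-inference still reports 0.7.15, upgrading both packages together (for example with pip install --upgrade unstructured unstructured-inference) is the step suggested above.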

liyang79 · 2024-06-27T01:47:12Z

@christinestraub You're right. The problem is solved after upgrading to the latest unstructured-inference library. Thanks.

IngLP · 2024-06-27T08:29:18Z

@christinestraub I have Python 3.10 and unstructured = {extras = ["pdf"], version = "^0.14.8"} in my Poetry config, with unstructured 0.14.8 and unstructured-inference 0.7.36 installed, but I still get the error.

christinestraub · 2024-06-27T17:05:34Z

@IngLP Can you please provide a pdf document that we could use to reproduce?

IngLP · 2024-07-02T18:15:09Z

Hi @christinestraub , I deleted and recreated the whole Python environment and now everything works. Thank you for your help.
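When the reported versions look right but the error persists, one possible check before rebuilding the environment is to see which file the module is imported from and whether its predict method accepts result_format. This is only a sketch: accepts_keyword and diagnose are hypothetical helpers, and the module path unstructured_inference.models.tables is an assumption inferred from the tables.tables_agent reference in the traceback.

```python
import importlib
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with name=... as a keyword argument."""
    params = inspect.signature(func).parameters
    param = params.get(name)
    if param is not None and param.kind in (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    ):
        return True
    # A **kwargs catch-all also accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def diagnose():
    # Assumption: the model class lives in unstructured_inference.models.tables.
    try:
        tables = importlib.import_module("unstructured_inference.models.tables")
    except ImportError as exc:
        print(f"could not import unstructured_inference.models.tables: {exc}")
        return
    model_cls = getattr(tables, "UnstructuredTableTransformerModel", None)
    if model_cls is None:
        print("UnstructuredTableTransformerModel not found in", tables.__file__)
        return
    # The path shows which site-packages copy is actually being imported.
    print("loaded from:", tables.__file__)
    print(
        "predict accepts result_format:",
        accepts_keyword(model_cls.predict, "result_format"),
    )


if __name__ == "__main__":
    diagnose()
```

If the printed path points at an unexpected location, or the check reports False even though a recent version is listed as installed, the interpreter is probably importing a stale copy, and recreating the environment as described above is a reasonable fix.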

liyang79 added the bug label Jun 19, 2024

christinestraub added the pdf label Jun 21, 2024

christinestraub added the awaiting-response label Jun 26, 2024

christinestraub closed this as completed Jul 2, 2024
