feat: type strategy output #216

Merged on Jan 16, 2025 (24 commits)
Commits
0730ff9
feat: create first format modules
chloedia Dec 1, 2024
a052e15
Merge remote-tracking branch 'origin' into feat/make_modular
chloedia Dec 9, 2024
5b63dc6
add: example file
chloedia Dec 9, 2024
eea6cfd
add: structured output formatter
chloedia Dec 9, 2024
099780d
Merge remote-tracking branch 'origin' into feat/make_modular
chloedia Dec 26, 2024
8a98694
Merge remote-tracking branch 'origin' into feat/make_modular
chloedia Jan 5, 2025
7917ae9
fix: all parsers outputs list of elements & compatibility formatters
chloedia Jan 6, 2025
351b63a
feat: new basemodel for document
chloedia Jan 6, 2025
52e2c02
add: structured output
chloedia Jan 7, 2025
50f4bb6
fix: test
chloedia Jan 7, 2025
01cab33
fix: add uncategorized text handling
chloedia Jan 8, 2025
05dc7b0
Merge remote-tracking branch 'origin/main' into feat/make_modular
chloedia Jan 8, 2025
04a858f
add: skip on flaky pdf
chloedia Jan 8, 2025
4b5b0d4
Merge branch 'feat/make_modular' into feat/type_strategy_output
chloedia Jan 8, 2025
2dcd952
add: section block
chloedia Jan 8, 2025
ab5bae4
Merge branch 'feat/make_modular' into feat/type_strategy_output
chloedia Jan 8, 2025
790bba3
fix: change load logic & create page element
chloedia Jan 8, 2025
79354e4
fix: add pages
chloedia Jan 9, 2025
adc69b1
add: split onnxtr det and reco
chloedia Jan 10, 2025
e0c0db0
feat: Doctr in MegaParse
chloedia Jan 13, 2025
eb65e2b
merge main
chloedia Jan 13, 2025
1960167
fix : Update ReadMe
chloedia Jan 13, 2025
be62a68
fix: add config as constructor parameters
chloedia Jan 13, 2025
1e11031
add: to_numpy to bbox
chloedia Jan 14, 2025
add: example file
chloedia committed Dec 9, 2024

commit 5b63dc6e13cb2ae3e85eed12c9081a4d1550ef5b
14 changes: 14 additions & 0 deletions libs/megaparse/src/megaparse/examples/parse_file.py
@@ -0,0 +1,14 @@
from megaparse.formatter.unstructured_formatter.md_formatter import MarkDownFormatter
from megaparse.megaparse import MegaParse
from megaparse.parser.unstructured_parser import UnstructuredParser

if __name__ == "__main__":
    # Parse a file
    parser = UnstructuredParser()
    formatter = MarkDownFormatter()

    megaparse = MegaParse(parser=parser, formatters=[formatter])

    file_path = "libs/megaparse/tests/pdf/sample_pdf.pdf"
    result = megaparse.load(file_path=file_path)
    print(result)
@@ -0,0 +1,11 @@
# from typing import List

# from megaparse.formatter.base import BaseFormatter
# from pydantic import BaseModel


# class StructuredFormatter(BaseFormatter):
#     async def format_string(
#         self, text: str, file_path: str | None = None, model: BaseModel | None = None
#     ) -> BaseModel:
#         raise NotImplementedError()
@@ -3,9 +3,8 @@

 from langchain_core.language_models.chat_models import BaseChatModel
 from langchain_core.prompts import ChatPromptTemplate
-from unstructured.documents.elements import Element
-
 from megaparse.formatter.table_formatter import TableFormatter
+from unstructured.documents.elements import Element


 class SimpleMDTableFormatter(TableFormatter):
@@ -4,12 +4,11 @@

 from langchain_core.language_models.chat_models import BaseChatModel
 from langchain_core.messages import HumanMessage
+from megaparse.formatter.table_formatter import TableFormatter
 from pdf2image import convert_from_path
 from PIL import Image
 from unstructured.documents.elements import Element

-from megaparse.formatter.table_formatter import TableFormatter
-
 TABLE_OCR_PROMPT = """
 You are tasked with transcribing the content of a table into markdown format. Your goal is to create a well-structured, readable markdown table that accurately represents the original content while adding appropriate formatting.
 Answer uniquely with the parsed table. Do not include the fenced code blocks backticks.
@@ -1,8 +1,7 @@
 from typing import List

-from unstructured.documents.elements import Element
-
 from megaparse.formatter.base import BaseFormatter
+from unstructured.documents.elements import Element


 class UnstructuredFormatter(BaseFormatter):
2 changes: 1 addition & 1 deletion libs/megaparse/src/megaparse/megaparse.py
@@ -61,7 +61,7 @@ async def aload(

         try:
             parsed_document = await self.parser.convert(file_path=file_path, file=file)
-            # @chloe FIXME: format_checker needs unstructured Elements as input which is to change
+            # @chloe FIXME: format_checker needs unstructured Elements as input which is to change to a megaparse element
             if self.formatters:
                 for formatter in self.formatters:
                     parsed_document = await formatter.format(parsed_document)
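
The loop in `aload` hands each formatter the previous formatter's output, so the order of `formatters` matters and each output type has to match what the next formatter accepts. Below is a minimal, standard-library-only sketch of that chaining pattern. `UpperCaseFormatter`, `JoinFormatter`, `run_formatters`, and `demo` are hypothetical names made up for illustration; they are not MegaParse classes, and the real formatters work on parsed elements rather than plain strings.

```python
import asyncio
from typing import Any, List, Sequence


class UpperCaseFormatter:
    """Hypothetical formatter: upper-cases every element of the document."""

    async def format(self, document: List[str]) -> List[str]:
        return [element.upper() for element in document]


class JoinFormatter:
    """Hypothetical formatter: collapses the list of elements into one string."""

    async def format(self, document: List[str]) -> str:
        return "\n".join(document)


async def run_formatters(document: Any, formatters: Sequence[Any]) -> Any:
    # Same shape as the loop in the diff: each formatter receives the
    # output of the one before it, so the order of the list matters.
    for formatter in formatters:
        document = await formatter.format(document)
    return document


def demo() -> str:
    # Stand-in for the parser output (assumption: a list of text elements).
    elements = ["title", "body text"]
    return asyncio.run(
        run_formatters(elements, [UpperCaseFormatter(), JoinFormatter()])
    )


if __name__ == "__main__":
    print(demo())
```

Swapping the two formatters in this sketch would fail, because `UpperCaseFormatter` expects a list while `JoinFormatter` returns a string. That is the same kind of input-type coupling the FIXME comment points at.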