feat: add test case (#645)
* feat: add table case

---------

Co-authored-by: quyuan <[email protected]>
dt-yy and quyuan authored Sep 23, 2024
1 parent 24c143f commit 0aa4577
Showing 17 changed files with 288 additions and 160 deletions.
21 changes: 9 additions & 12 deletions .github/workflows/cli.yml
@@ -10,20 +10,18 @@ on:
paths-ignore:
- "cmds/**"
- "**.md"
- "**.yml"
pull_request:
branches:
- "master"
- "dev"
paths-ignore:
- "cmds/**"
- "**.md"
- "**.yml"
workflow_dispatch:
jobs:
cli-test:
runs-on: pdf
timeout-minutes: 120
timeout-minutes: 240
strategy:
fail-fast: true

@@ -33,17 +31,16 @@ jobs:
with:
fetch-depth: 2

- name: install
- name: install&test
run: |
echo $GITHUB_WORKSPACE && sh tests/retry_env.sh
- name: unit test
run: |
cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
cd $GITHUB_WORKSPACE && export PYTHONPATH=. && coverage run -m pytest tests/unittest --cov=magic_pdf/ --cov-report term-missing --cov-report html
source activate mineru
conda env list
pip show coverage
# cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
cd $GITHUB_WORKSPACE && python tests/get_coverage.py
- name: cli test
run: |
source ~/.bashrc && cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli.py
cd $GITHUB_WORKSPACE && pytest -m P0 -s -v tests/test_cli/test_cli_sdk.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
55 changes: 55 additions & 0 deletions .github/workflows/daily.yml
@@ -0,0 +1,55 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: mineru
on:
schedule:
- cron: '0 22 * * *' # runs every day at 10 PM
jobs:
cli-test:
runs-on: pdf
timeout-minutes: 240
strategy:
fail-fast: true

steps:
- name: PDF cli
uses: actions/checkout@v3
with:
fetch-depth: 2

- name: install&test
run: |
source activate mineru
conda env list
pip show coverage
# cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
cd $GITHUB_WORKSPACE && python tests/get_coverage.py
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
needs: cli-test
runs-on: pdf
steps:
- name: get_actor
run: |
metion_list="dt-yy"
echo $GITHUB_ACTOR
if [[ $GITHUB_ACTOR == "drunkpig" ]]; then
metion_list="xuchao"
elif [[ $GITHUB_ACTOR == "myhloli" ]]; then
metion_list="zhaoxiaomeng"
elif [[ $GITHUB_ACTOR == "icecraft" ]]; then
metion_list="xurui1"
fi
echo $metion_list
echo "METIONS=$metion_list" >> "$GITHUB_ENV"
echo ${{ env.METIONS }}
- name: notify
run: |
echo ${{ secrets.USER_ID }}
curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
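The notify step above assembles its JSON body by splicing shell variables into a single-quoted string, which is easy to break when a value contains a quote. As a rough sketch only (not part of this commit), the same payload shape could be built with `json.dumps`. The function name `build_feishu_payload` and its parameters are hypothetical; the field names and message text are copied from the `curl` command in the workflow.

```python
import json


def build_feishu_payload(repository: str, run_id: str, user_id: str) -> str:
    """Return a JSON body with the same shape as the one the notify step posts.

    Hypothetical helper: the structure mirrors the curl command in the
    workflow; nothing here talks to the network.
    """
    run_url = f"https://github.com/{repository}/actions/runs/{run_id}"
    payload = {
        "msg_type": "post",
        "content": {
            "post": {
                "zh_cn": {
                    "title": f"{repository} GitHubAction Failed",
                    "content": [[
                        {"tag": "text", "text": ""},
                        {"tag": "a", "text": "Please click here for details ", "href": run_url},
                        {"tag": "at", "user_id": user_id},
                    ]],
                }
            }
        },
    }
    # json.dumps handles quoting and escaping of every value.
    return json.dumps(payload, ensure_ascii=False)
```

The resulting string could then be passed to `curl -d` or to any HTTP client; how the secrets reach the script is left open here.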
61 changes: 61 additions & 0 deletions .github/workflows/huigui.yml
@@ -0,0 +1,61 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: mineru
on:
push:
branches:
- "master"
- "dev"
paths-ignore:
- "cmds/**"
- "**.md"
workflow_dispatch:
jobs:
cli-test:
runs-on: pdf
timeout-minutes: 240
strategy:
fail-fast: true

steps:
- name: PDF cli
uses: actions/checkout@v3
with:
fetch-depth: 2

- name: install&test
run: |
source activate mineru
conda env list
pip show coverage
# cd $GITHUB_WORKSPACE && sh tests/retry_env.sh
cd $GITHUB_WORKSPACE && python tests/clean_coverage.py
cd $GITHUB_WORKSPACE && coverage run -m pytest tests/unittest/ --cov=magic_pdf/ --cov-report html --cov-report term-missing
cd $GITHUB_WORKSPACE && python tests/get_coverage.py
cd $GITHUB_WORKSPACE && pytest -s -v tests/test_cli/test_cli_sdk.py
notify_to_feishu:
if: ${{ always() && !cancelled() && contains(needs.*.result, 'failure') && (github.ref_name == 'master') }}
needs: cli-test
runs-on: pdf
steps:
- name: get_actor
run: |
metion_list="dt-yy"
echo $GITHUB_ACTOR
if [[ $GITHUB_ACTOR == "drunkpig" ]]; then
metion_list="xuchao"
elif [[ $GITHUB_ACTOR == "myhloli" ]]; then
metion_list="zhaoxiaomeng"
elif [[ $GITHUB_ACTOR == "icecraft" ]]; then
metion_list="xurui1"
fi
echo $metion_list
echo "METIONS=$metion_list" >> "$GITHUB_ENV"
echo ${{ env.METIONS }}
- name: notify
run: |
echo ${{ secrets.USER_ID }}
curl -X POST -H "Content-Type: application/json" -d '{"msg_type":"post","content":{"post":{"zh_cn":{"title":"'${{ github.repository }}' GitHubAction Failed","content":[[{"tag":"text","text":""},{"tag":"a","text":"Please click here for details ","href":"https://github.com/'${{ github.repository }}'/actions/runs/'${GITHUB_RUN_ID}'"},{"tag":"at","user_id":"'${{ secrets.USER_ID }}'"}]]}}}}' ${{ secrets.WEBHOOK_URL }}
22 changes: 0 additions & 22 deletions .github/workflows/update_base.yml

This file was deleted.

3 changes: 2 additions & 1 deletion .gitignore
@@ -1,5 +1,6 @@
*.tar
*.tar.gz
*.zip
venv*/
envs/
slurm_logs/
@@ -31,7 +32,7 @@ tmp
.vscode
.vscode/
ocr_demo

.coveragerc
/app/common/__init__.py
/magic_pdf/config/__init__.py
source.dev.env
3 changes: 2 additions & 1 deletion requirements-qa.txt
@@ -16,4 +16,5 @@ pypandoc
pyopenssl==24.0.0
struct-eqtable==0.1.0
pytest-cov
beautifulsoup4
beautifulsoup4
coverage
3 changes: 2 additions & 1 deletion tests/clean_coverage.py
@@ -21,4 +21,5 @@ def delete_file(path):
print(f"Error deleting directory '{path}': {e}")

if __name__ == "__main__":
delete_file("htmlcov")
delete_file("htmlcov/")
#delete_file(".coverage")
11 changes: 4 additions & 7 deletions tests/retry_env.sh
@@ -1,16 +1,13 @@
#!/bin/bash

# Define the maximum number of retries
max_retries=5
retry_count=0

while true; do
# prepare env
source activate MinerU
pip install -r requirements-qa.txt
pip uninstall magic-pdf
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
#python -m pip install -r requirements-qa.txt
python -m pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple
python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/
exit_code=$?
if [ $exit_code -eq 0 ]; then
echo "test.sh executed successfully!"
@@ -22,6 +19,6 @@ while true; do
exit 1
fi
echo "test.sh failed (exit code: $exit_code). Retry attempt $retry_count..."
sleep 5 # wait 5 seconds before retrying
sleep 5
fi
done
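For readers who want the retry logic of `tests/retry_env.sh` in one place, here is a minimal Python sketch of the same idea. It is an illustration under assumptions, not a port of the script: part of the loop is collapsed in this diff, so the counting behavior is a guess, and `run_with_retries` is a hypothetical name. One deliberate difference is that the sketch checks the exit status of every command, whereas `exit_code=$?` in the script only reflects the last `pip install`.

```python
import subprocess
import sys
import time
from typing import Sequence


def run_with_retries(
    commands: Sequence[Sequence[str]],
    max_retries: int = 5,
    delay: float = 5.0,
) -> bool:
    """Run every command in order; rerun the whole batch if any one fails.

    Returns True as soon as one full pass succeeds, False after
    max_retries failed passes.
    """
    for attempt in range(1, max_retries + 1):
        # all() stops at the first command that exits non-zero.
        if all(subprocess.run(cmd).returncode == 0 for cmd in commands):
            return True
        print(f"attempt {attempt} failed", file=sys.stderr)
        if attempt < max_retries:
            time.sleep(delay)  # pause before the next pass
    return False
```

A caller would pass each install step as an argument list, for example `[sys.executable, "-m", "pip", "install", "-r", "requirements-qa.txt"]`.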
2 changes: 1 addition & 1 deletion tests/test_cli/conf/conf.py
@@ -4,5 +4,5 @@
"pdf_dev_path" : os.environ.get('GITHUB_WORKSPACE') + "/tests/test_cli/pdf_dev",
"pdf_res_path": "/tmp/magic-pdf",
"jsonl_path": "s3://llm-qatest-pnorm/mineru/test/line1.jsonl",
"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test.pdf"
"s3_pdf_path": "s3://llm-qatest-pnorm/mineru/test/test_rearch_report.pdf"
}
17 changes: 17 additions & 0 deletions tests/test_cli/conftest.py
@@ -0,0 +1,17 @@
import pytest
import torch

def clear_gpu_memory():
'''
clear GPU memory
'''
torch.cuda.empty_cache()
print("GPU memory cleared.")

@pytest.hookimpl(tryfirst=True, hookwrapper=True)
def pytest_runtest_teardown(item, nextitem):
'''
clear GPU memory after each test
'''
yield
clear_gpu_memory()
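The new `conftest.py` imports `torch` unconditionally, so test collection would fail on a machine without it. A defensive variant of the helper might look like the sketch below. This is an assumption about what a CPU-only setup would need, not something the commit does; only `torch.cuda.is_available()` and `torch.cuda.empty_cache()` are used from the library.

```python
def clear_gpu_memory() -> bool:
    """Release cached CUDA memory when possible.

    Returns True if the cache was cleared, False when torch is missing
    or no CUDA device is available.
    """
    try:
        import torch  # imported lazily so CPU-only environments still work
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

The `pytest_runtest_teardown` hook wrapper could call this version unchanged, since it ignores the return value.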
42 changes: 39 additions & 3 deletions tests/test_cli/lib/common.py
@@ -1,13 +1,20 @@
"""common definitions."""
import os
import shutil


import re
import json
def check_shell(cmd):
"""shell successful."""
res = os.system(cmd)
assert res == 0

def update_config_file(file_path, key, value):
"""update config file."""
with open(file_path, 'r', encoding="utf-8") as f:
config = json.loads(f.read())
config[key] = value
with open(file_path, 'w', encoding="utf-8") as f:
f.write(json.dumps(config))

def cli_count_folders_and_check_contents(file_path):
"""" count cli files."""
@@ -40,4 +47,33 @@ def delete_file(path):
shutil.rmtree(path)
print(f"Directory '{path}' and its contents deleted.")
except TypeError as e:
print(f"Error deleting directory '{path}': {e}")
print(f"Error deleting directory '{path}': {e}")

def check_latex_table_exists(file_path):
"""check latex table exists."""
pattern = r'\\begin\{tabular\}.*?\\end\{tabular\}'
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
matches = re.findall(pattern, content, re.DOTALL)
return len(matches) > 0

def check_html_table_exists(file_path):
"""check html table exists."""
pattern = r'<table.*?>.*?</table>'
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
matches = re.findall(pattern, content, re.DOTALL)
return len(matches) > 0

def check_close_tables(file_path):
"""return True when the file contains neither a LaTeX nor an HTML table."""
latex_pattern = r'\\begin\{tabular\}.*?\\end\{tabular\}'
html_pattern = r'<table.*?>.*?</table>'
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
latex_matches = re.findall(latex_pattern, content, re.DOTALL)
html_matches = re.findall(html_pattern, content, re.DOTALL)
if len(latex_matches) == 0 and len(html_matches) == 0:
return True
else:
return False
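The three table helpers above share two regular expressions and differ only in how they combine the results. The sketch below shows how those patterns behave on in-memory strings. `table_kinds` is a hypothetical helper introduced for illustration; the patterns themselves are copied from the diff.

```python
import re

# Patterns copied from check_latex_table_exists / check_html_table_exists.
# re.DOTALL lets ".*?" span line breaks, so multi-line tables still match.
LATEX_TABLE = re.compile(r"\\begin\{tabular\}.*?\\end\{tabular\}", re.DOTALL)
HTML_TABLE = re.compile(r"<table.*?>.*?</table>", re.DOTALL)


def table_kinds(text: str) -> set:
    """Return the kinds of table markup found in text: 'latex', 'html', both, or none."""
    kinds = set()
    if LATEX_TABLE.search(text):
        kinds.add("latex")
    if HTML_TABLE.search(text):
        kinds.add("html")
    return kinds
```

In these terms, `check_close_tables` corresponds to `table_kinds(content) == set()`, which is the condition a test would assert when table recognition is switched off.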