Possibility of using YOLOv8 to extract images and tables (#33)
* docs: todo

* feat: use DocLayNet-YOLOv8 to pre-segment pages (see the sketch after this list)

* feat: yolo integration

* feat: yolo results

* chore: yolo

* feat: download from hf hub

* fix: 4 element colour not necessarily CMYK

* test: actually test the extractor

* refactor: extract_images -> alexi.recognize.Objets

* docs: todo

* feat: basic YOLO support for tables/figures

* fix: annotate plan d'urbanisme a bit

* chore: retrain

* fix: more patch

* feat: optional yolo and scripts

* fix: isort

* fix: add plan d'urbanisme
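
A rough, hedged sketch of the pre-segmentation idea described above (not the code added by this commit, which lives in alexi.recognize): download a DocLayNet-trained YOLOv8 checkpoint from the Hugging Face Hub and run it on a page rendered at the model's input size. The repository id is the one listed in TODO.md; the weight filename, confidence threshold, and example PDF path are assumptions.

# Hedged sketch only: the real integration lives in alexi.recognize; names marked
# "assumed" are illustrative and not taken from this commit.
import pdfplumber
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

weights = hf_hub_download(
    repo_id="DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet",  # model listed in TODO.md
    filename="yolov8x-doclaynet.pt",  # assumed filename
)
model = YOLO(weights)

with pdfplumber.open("download/plan-urbanisme.pdf") as pdf:  # hypothetical input path
    page = pdf.pages[0]
    # Render so the longest side is roughly 640 px, the YOLO input size noted in TODO.md.
    res = max(1, round(72 * 640 / max(page.width, page.height)))
    image = page.to_image(resolution=res, antialias=True).original
    result = model.predict(image, imgsz=640, conf=0.25)[0]  # assumed confidence threshold
    scale = res / 72  # pixels per PDF point
    for box in result.boxes:
        label = result.names[int(box.cls)]
        # Convert pixel coordinates back to PDF points for later cropping with pdfplumber.
        x0, top, x1, bottom = (v / scale for v in box.xyxy[0].tolist())
        print(label, round(x0), round(top), round(x1), round(bottom))
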
dhdaines committed Aug 2, 2024
1 parent 537b506 commit 3e9f63e
Showing 23 changed files with 4,008 additions and 190 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/analyse.yml
@@ -41,7 +41,7 @@ jobs:
key: reglements-urbanisme
- name: Download
run: |
alexi -v download --exclude=Plan --exclude=/derogation \
alexi -v download --exclude=/derogation \
--exclude='\d-[aA]dopt' --exclude='Z-\d' \
--exclude='-[rR]eso'
for d in download/*.pdf; do
49 changes: 24 additions & 25 deletions TODO.md
@@ -3,32 +3,33 @@ DATA

- Correct titles in zonage glossary
- Correct extraction (see below, use RNN) of titles, numbers, etc.
- Annotate multiple TOCs in Sainte-Agathe urbanisme DONE
- Add Sainte-Agathe to download DONE
- Add Sainte-Agathe to export (under /vsadm) DONE
- Do the same thing for Saint-Sauveur
- Redo alexi download to not use wget (httpx is nice)

DERP LERNING
------------

Pre-training
============

- DocBank is not useful unfortunately
- No BIO tags on paragraphs and list items (WTF Microsoft!)
- Hard to even get their dataset and it is full of junk
- But their extraction *is* useful
- We could redo their extraction with French data
- Look at other document structure analysis models
- DocLayNet is more interesting: https://huggingface.co/datasets/ds4sd/DocLayNet
- Yes: it separates paragraphs and section headings
- Need to download the huge image archive to get this though ;(
- Check out its leaderboard
- Evaluate models already trained on DocLayNet:
- https://github.com/moured/YOLOv10-Document-Layout-Analysis
- https://huggingface.co/spaces/atlury/document-layout-comparison
- https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet
- https://huggingface.co/spaces/omoured/YOLOv10-Document-Layout-Analysis
- DocLayNet is more interesting: https://huggingface.co/datasets/ds4sd/DocLayNet
- Specifically the legal subset
- PubLayNet maybe (check the annotations)
- Evaluating DocLayNet YOLO models for my task: DONE
- can simply evaluate F1 on entire train set DONE (see the sketch after this list)
- test dpi, antialias, rendering engines DONE
- best results: render at YOLO model size (max dimension 640px) with
antialiasing using Cairo (pdfium is less good... why?) DONE
- Other pre-trained DocLayNet models?
- pre-train Detectron2 / SSD / R-CNN / other?
- Pre-train ALEXI LSTM on LAU and other relevant laws (code civil, etc)
- Get list of URLs from alexi link
- NOTE: license does not permit redistribution, use for modeling
(especially for layout analysis) should be okay though
- Make a script for downloading and processing data
- Pre-train an LSTM on DocLayNet legal?
- Generic Titre, Alinea, Liste only
- Layout embeddings and binary features only
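
A minimal sketch of the box-level F1 evaluation mentioned in the list above: greedy one-to-one matching of predicted boxes against ground-truth boxes at a fixed IoU threshold. The threshold and the matching policy are assumptions, not the evaluation actually used for the DONE items.

# Hedged sketch: detection F1 via greedy IoU matching; threshold and policy are assumptions.
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # x0, top, x1, bottom


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: Sequence[Box], gold: Sequence[Box], thresh: float = 0.5) -> float:
    """A prediction counts as a true positive if it matches an unused gold box at IoU >= thresh."""
    used: set[int] = set()
    tp = 0
    for p in pred:
        candidates = [(iou(p, g), i) for i, g in enumerate(gold) if i not in used]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= thresh:
                used.add(best_i)
                tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0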


Segmentation
============
@@ -46,12 +47,10 @@ Segmentation
- Could *possibly* train a CRF to do this, in fact DONE
- Do prediction with Transformers (LayoutLM) DONE
- heuristic chunking based on line gap (not indent) DONE
- Move Amendement from segmentation to sequence tagging
- update all training data
- compare main and `more_rnn_feats` branches
- Do pre-segmentation with YOLO+DocLayNet
- get bboxes and classes DONE
-
- Move Amendement from segmentation to sequence tagging DONE
- Do pre-segmentation with YOLO+DocLayNet DONE
- Integrate DocLayNet pre-segmentation into pipeline
- Do image/table identification with it
- Tokenize from chars
- Add functionality to pdfplumber
- Use Transformers for embeddings
25 changes: 1 addition & 24 deletions alexi/__init__.py
@@ -7,15 +7,13 @@
import argparse
import csv
import dataclasses
import itertools
import json
import logging
import operator
import sys
from pathlib import Path

from . import annotate, download, extract
from .analyse import Analyseur, Bloc, merge_overlaps
from .analyse import Analyseur, Bloc
from .convert import Converteur, write_csv
from .format import format_html
from .index import index
@@ -36,24 +34,6 @@ def convert_main(args: argparse.Namespace):
else:
pages = None
conv = Converteur(args.pdf)
if args.images is not None:
args.images.mkdir(parents=True, exist_ok=True)
images: list[dict] = []
for _, group in itertools.groupby(
conv.extract_images(pages), operator.attrgetter("page_number")
):
merged = merge_overlaps(group)
for bloc in merged:
images.append(dataclasses.asdict(bloc))
img = (
conv.pdf.pages[bloc.page_number - 1]
.crop(bloc.bbox)
.to_image(resolution=150, antialias=True)
)
LOGGER.info("Extraction de %s", args.images / bloc.img)
img.save(args.images / bloc.img)
with open(args.images / "images.json", "wt") as outfh:
json.dump(images, outfh, indent=2)
write_csv(conv.extract_words(pages), sys.stdout)


@@ -135,9 +115,6 @@ def make_argparse() -> argparse.ArgumentParser:
convert.add_argument(
"--pages", help="Liste de numéros de page à extraire, séparés par virgule"
)
convert.add_argument(
"--images", help="Répertoire pour écrire des images des tableaux", type=Path
)
convert.set_defaults(func=convert_main)

segment = subp.add_parser(
115 changes: 10 additions & 105 deletions alexi/convert.py
@@ -1,20 +1,15 @@
"""Conversion de PDF en CSV"""

import csv
import itertools
import logging
import operator
from collections import deque
from pathlib import Path
from typing import Any, Iterable, Iterator, Optional, TextIO

from pdfplumber import PDF
from pdfplumber.page import Page
from pdfplumber.structure import PDFStructElement, PDFStructTree, StructTreeMissing
from pdfplumber.utils import geometry
from pdfplumber.utils.geometry import T_bbox

from .analyse import Bloc
from .types import T_obj

LOGGER = logging.getLogger("convert")
@@ -46,50 +41,24 @@ def write_csv(
writer.writerows(doc)


def bbox_contains(bbox: T_bbox, ibox: T_bbox) -> bool:
"""Déterminer si une BBox est contenu entièrement par une autre."""
x0, top, x1, bottom = bbox
ix0, itop, ix1, ibottom = ibox
return ix0 >= x0 and ix1 <= x1 and itop >= top and ibottom <= bottom


def get_element_bbox(page: Page, el: PDFStructElement, mcids: Iterable[int]) -> T_bbox:
"""Obtenir le BBox autour d'un élément structurel."""
bbox = el.attributes.get("BBox", None)
if bbox is not None:
x0, y0, x1, y1 = bbox
top = page.height - y1
bottom = page.height - y0
return (x0, top, x1, bottom)
else:
mcidset = set(mcids)
mcid_objs = [
c
for c in itertools.chain.from_iterable(page.objects.values())
if c.get("mcid") in mcidset
]
if not mcid_objs:
return (-1, -1, -1, -1) # An impossible BBox
return geometry.objects_to_bbox(mcid_objs)


def get_rgb(c: T_obj) -> str:
"""Extraire la couleur d'un objet en 3 chiffres hexadécimaux"""
"""Extraire la couleur d'un objet en chiffres hexadécimaux"""
couleur = c.get("non_stroking_color", c.get("stroking_color"))
if couleur is None:
if couleur is None or couleur == "":
return "#000"
elif len(couleur) == 1:
r = g = b = couleur[0]
elif len(couleur) == 3:
r, g, b = couleur
elif len(couleur) == 4:
return "CMYK#" + "".join(
("%x" % int(min(0.999, val) * 16) for val in (couleur))
return "#" + "".join(
(
"%x" % int(min(0.999, val) * 16)
for val in (couleur[0], couleur[0], couleur[0])
)
)
elif len(couleur) == 3 or len(couleur) == 4:
# Could be RGB, RGBA, CMYK...
return "#" + "".join(("%x" % int(min(0.999, val) * 16) for val in couleur))
else:
LOGGER.warning("Espace couleur non pris en charge: %s", couleur)
return "#000"
return "#" + "".join(("%x" % int(min(0.999, val) * 16) for val in (r, g, b)))


def get_word_features(
@@ -188,67 +157,3 @@ def extract_words(self, pages: Optional[Iterable[int]] = None) -> Iterator[T_obj
feats = get_word_features(word, page, chars, elmap)
feats["path"] = str(self.path)
yield feats

def make_bloc(
self, el: PDFStructElement, page_number: int, mcids: Iterable[int]
) -> Bloc:
page = self.pdf.pages[page_number - 1]
x0, top, x1, bottom = get_element_bbox(page, el, mcids)
return Bloc(
type="Tableau" if el.type == "Table" else el.type,
contenu=[],
_page_number=int(page_number),
_bbox=(round(x0), round(top), round(x1), round(bottom)),
)

def extract_images(self, pages: Optional[Iterable[int]] = None) -> Iterator[Bloc]:
"""Trouver des éléments qui seront représentés par des images
(tableaux et figures pour le moment)"""
if self.tree is None:
return
if pages is None:
pages = range(1, len(self.pdf.pages) + 1)
pageset = set(pages)

# tables *might* span multiple pages (in practice, no...) so
# we have to split them at page breaks, but also, their
# top-level elements don't have page numbers for this reason.
# So, we find them in a first traversal, then gather their
# children in a second one.
def gather_elements() -> Iterator[PDFStructElement]:
"""Traverser l'arbre structurel en profondeur pour chercher les
figures et tableaux."""
if self.tree is None:
return
d = deque(self.tree)
while d:
el = d.popleft()
if el.type == "Table":
yield el
elif el.type == "Figure":
yield el
else:
d.extendleft(reversed(el.children))

def get_child_mcids(el: PDFStructElement) -> Iterator[tuple[int, int]]:
"""Trouver tous les MCIDs (avec numeros de page, sinon ils sont
inutiles!) à l'intérieur d'un élément structurel"""
for mcid in el.mcids:
assert el.page_number is not None
yield el.page_number, mcid
d = deque(el.children)
while d:
el = d.popleft()
for mcid in el.mcids:
assert el.page_number is not None
yield el.page_number, mcid
d.extend(el.children)

for el in gather_elements():
# Note: we must sort them as we can't guarantee they come
# in any particular order
mcids = list(get_child_mcids(el))
mcids.sort()
for page_number, group in itertools.groupby(mcids, operator.itemgetter(0)):
if page_number in pageset:
yield self.make_bloc(el, page_number, (mcid for _, mcid in group))