Possibility of using YOLOv8 to extract images and tables (#33)
* docs: todo

* feat: use DocLayNet-YOLOv8 to pre-segment pages (see the sketch after this list)

* feat: yolo integration

* feat: yolo results

* chore: yolo

* feat: download from hf hub

* fix: 4 element colour not necessarily CMYK

* test: actually test the extractor

* refactor: extract_images -> alexi.recognize.Objets

* docs: todo

* feat: basic YOLO support for tables/figures

* fix: annotate plan d'urbanisme a bit

* chore: retrain

* fix: more patch

* feat: optional yolo and scripts

* fix: isort

* fix: add plan d'urbanisme
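
A rough, hedged sketch of the pre-segmentation idea described above (not the code added by this commit, which lives in alexi.recognize): download a DocLayNet-trained YOLOv8 checkpoint from the Hugging Face Hub and run it on a page rendered at the model's input size. The repository id is the one listed in TODO.md; the weight filename, confidence threshold, and example PDF path are assumptions.

# Hedged sketch only: the real integration lives in alexi.recognize; names marked
# "assumed" are illustrative and not taken from this commit.
import pdfplumber
from huggingface_hub import hf_hub_download
from ultralytics import YOLO

weights = hf_hub_download(
    repo_id="DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet",  # model listed in TODO.md
    filename="yolov8x-doclaynet.pt",  # assumed filename
)
model = YOLO(weights)

with pdfplumber.open("download/plan-urbanisme.pdf") as pdf:  # hypothetical input path
    page = pdf.pages[0]
    # Render so the longest side is roughly 640 px, the YOLO input size noted in TODO.md.
    res = max(1, round(72 * 640 / max(page.width, page.height)))
    image = page.to_image(resolution=res, antialias=True).original
    result = model.predict(image, imgsz=640, conf=0.25)[0]  # assumed confidence threshold
    scale = res / 72  # pixels per PDF point
    for box in result.boxes:
        label = result.names[int(box.cls)]
        # Convert pixel coordinates back to PDF points for later cropping with pdfplumber.
        x0, top, x1, bottom = (v / scale for v in box.xyxy[0].tolist())
        print(label, round(x0), round(top), round(x1), round(bottom))
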
dhdaines committed Aug 2, 2024
1 parent 537b506 commit 3e9f63e
Showing 23 changed files with 4,008 additions and 190 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/analyse.yml
@@ -41,7 +41,7 @@ jobs:
key: reglements-urbanisme
- name: Download
run: |
alexi -v download --exclude=Plan --exclude=/derogation \
alexi -v download --exclude=/derogation \
--exclude='\d-[aA]dopt' --exclude='Z-\d' \
--exclude='-[rR]eso'
for d in download/*.pdf; do
49 changes: 24 additions & 25 deletions TODO.md
@@ -3,32 +3,33 @@ DATA

- Correct titles in zonage glossary
- Correct extraction (see below, use RNN) of titles, numbers, etc.
- Annotate multiple TOCs in Sainte-Agathe urbanisme DONE
- Add Sainte-Agathe to download DONE
- Add Sainte-Agathe to export (under /vsadm) DONE
- Do the same thing for Saint-Sauveur
- Redo alexi download to not use wget (httpx is nice)

DERP LERNING
------------

Pre-training
============

- DocBank is not useful unfortunately
- No BIO tags on paragraphs and list items (WTF Microsoft!)
- Hard to even get their dataset and it is full of junk
- But their extraction *is* useful
- We could redo their extraction with French data
- Look at other document structure analysis models
- DocLayNet is more interesting: https://huggingface.co/datasets/ds4sd/DocLayNet
- Yes: it separates paragraphs and section headings
- Need to download the huge image archive to get this though ;(
- Check out its leaderboard
- Evaluate models already trained on DocLayNet:
- https://github.com/moured/YOLOv10-Document-Layout-Analysis
- https://huggingface.co/spaces/atlury/document-layout-comparison
- https://huggingface.co/DILHTWD/documentlayoutsegmentation_YOLOv8_ondoclaynet
- https://huggingface.co/spaces/omoured/YOLOv10-Document-Layout-Analysis
- DocLayNet is more interesting: https://huggingface.co/datasets/ds4sd/DocLayNet
- Specifically the legal subset
- PubLayNet maybe (check the annotations)
- Evaluating DocLayNet YOLO models for my task: DONE
- can simply evaluate F1 on entire train set DONE (see the sketch after this list)
- test dpi, antialias, rendering engines DONE
- best results: render at YOLO model size (max dimension 640px) with
antialiasing using Cairo (pdfium is less good... why?) DONE
- Other pre-trained DocLayNet models?
- pre-train Detectron2 / SSD / R-CNN / other?
- Pre-train ALEXI LSTM on LAU and other relevant laws (code civil, etc)
- Get list of URLs from alexi link
- NOTE: license does not permit redistribution, use for modeling
(especially for layout analysis) should be okay though
- Make a script for downloading and processing data
- Pre-train an LSTM on DocLayNet legal?
- Generic Titre, Alinea, Liste only
- Layout embeddings and binary features only
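
A minimal sketch of the box-level F1 evaluation mentioned in the list above: greedy one-to-one matching of predicted boxes against ground-truth boxes at a fixed IoU threshold. The threshold and the matching policy are assumptions, not the evaluation actually used for the DONE items.

# Hedged sketch: detection F1 via greedy IoU matching; threshold and policy are assumptions.
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # x0, top, x1, bottom


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_f1(pred: Sequence[Box], gold: Sequence[Box], thresh: float = 0.5) -> float:
    """A prediction counts as a true positive if it matches an unused gold box at IoU >= thresh."""
    used: set[int] = set()
    tp = 0
    for p in pred:
        candidates = [(iou(p, g), i) for i, g in enumerate(gold) if i not in used]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= thresh:
                used.add(best_i)
                tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0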


Segmentation
============
@@ -46,12 +47,10 @@ Segmentation
- Could *possibly* train a CRF to do this, in fact DONE
- Do prediction with Transformers (LayoutLM) DONE
- heuristic chunking based on line gap (not indent) DONE
- Move Amendement from segmentation to sequence tagging
- update all training data
- compare main and `more_rnn_feats` branches
- Do pre-segmentation with YOLO+DocLayNet
- get bboxes and classes DONE
-
- Move Amendement from segmentation to sequence tagging DONE
- Do pre-segmentation with YOLO+DocLayNet DONE
- Integrate DocLayNet pre-segmentation into pipeline
- Do image/table identification with it
- Tokenize from chars
- Add functionality to pdfplumber
- Use Transformers for embeddings
25 changes: 1 addition & 24 deletions alexi/__init__.py
@@ -7,15 +7,13 @@
import argparse
import csv
import dataclasses
import itertools
import json
import logging
import operator
import sys
from pathlib import Path

from . import annotate, download, extract
from .analyse import Analyseur, Bloc, merge_overlaps
from .analyse import Analyseur, Bloc
from .convert import Converteur, write_csv
from .format import format_html
from .index import index
@@ -36,24 +34,6 @@ def convert_main(args: argparse.Namespace):
else:
pages = None
conv = Converteur(args.pdf)
if args.images is not None:
args.images.mkdir(parents=True, exist_ok=True)
images: list[dict] = []
for _, group in itertools.groupby(
conv.extract_images(pages), operator.attrgetter("page_number")
):
merged = merge_overlaps(group)
for bloc in merged:
images.append(dataclasses.asdict(bloc))
img = (
conv.pdf.pages[bloc.page_number - 1]
.crop(bloc.bbox)
.to_image(resolution=150, antialias=True)
)
LOGGER.info("Extraction de %s", args.images / bloc.img)
img.save(args.images / bloc.img)
with open(args.images / "images.json", "wt") as outfh:
json.dump(images, outfh, indent=2)
write_csv(conv.extract_words(pages), sys.stdout)


@@ -135,9 +115,6 @@ def make_argparse() -> argparse.ArgumentParser:
convert.add_argument(
"--pages", help="Liste de numéros de page à extraire, séparés par virgule"
)
convert.add_argument(
"--images", help="Répertoire pour écrire des images des tableaux", type=Path
)
convert.set_defaults(func=convert_main)

segment = subp.add_parser(
115 changes: 10 additions & 105 deletions alexi/convert.py
@@ -1,20 +1,15 @@
"""Conversion de PDF en CSV"""

import csv
import itertools
import logging
import operator
from collections import deque
from pathlib import Path
from typing import Any, Iterable, Iterator, Optional, TextIO

from pdfplumber import PDF
from pdfplumber.page import Page
from pdfplumber.structure import PDFStructElement, PDFStructTree, StructTreeMissing
from pdfplumber.utils import geometry
from pdfplumber.utils.geometry import T_bbox

from .analyse import Bloc
from .types import T_obj

LOGGER = logging.getLogger("convert")
@@ -46,50 +41,24 @@ def write_csv(
writer.writerows(doc)


def bbox_contains(bbox: T_bbox, ibox: T_bbox) -> bool:
"""Déterminer si une BBox est contenu entièrement par une autre."""
x0, top, x1, bottom = bbox
ix0, itop, ix1, ibottom = ibox
return ix0 >= x0 and ix1 <= x1 and itop >= top and ibottom <= bottom


def get_element_bbox(page: Page, el: PDFStructElement, mcids: Iterable[int]) -> T_bbox:
"""Obtenir le BBox autour d'un élément structurel."""
bbox = el.attributes.get("BBox", None)
if bbox is not None:
x0, y0, x1, y1 = bbox
top = page.height - y1
bottom = page.height - y0
return (x0, top, x1, bottom)
else:
mcidset = set(mcids)
mcid_objs = [
c
for c in itertools.chain.from_iterable(page.objects.values())
if c.get("mcid") in mcidset
]
if not mcid_objs:
return (-1, -1, -1, -1) # An impossible BBox
return geometry.objects_to_bbox(mcid_objs)


def get_rgb(c: T_obj) -> str:
"""Extraire la couleur d'un objet en 3 chiffres hexadécimaux"""
"""Extraire la couleur d'un objet en chiffres hexadécimaux"""
couleur = c.get("non_stroking_color", c.get("stroking_color"))
if couleur is None:
if couleur is None or couleur == "":
return "#000"
elif len(couleur) == 1:
r = g = b = couleur[0]
elif len(couleur) == 3:
r, g, b = couleur
elif len(couleur) == 4:
return "CMYK#" + "".join(
("%x" % int(min(0.999, val) * 16) for val in (couleur))
return "#" + "".join(
(
"%x" % int(min(0.999, val) * 16)
for val in (couleur[0], couleur[0], couleur[0])
)
)
elif len(couleur) == 3 or len(couleur) == 4:
# Could be RGB, RGBA, CMYK...
return "#" + "".join(("%x" % int(min(0.999, val) * 16) for val in couleur))
else:
LOGGER.warning("Espace couleur non pris en charge: %s", couleur)
return "#000"
return "#" + "".join(("%x" % int(min(0.999, val) * 16) for val in (r, g, b)))


def get_word_features(
@@ -188,67 +157,3 @@ def extract_words(self, pages: Optional[Iterable[int]] = None) -> Iterator[T_obj
feats = get_word_features(word, page, chars, elmap)
feats["path"] = str(self.path)
yield feats

def make_bloc(
self, el: PDFStructElement, page_number: int, mcids: Iterable[int]
) -> Bloc:
page = self.pdf.pages[page_number - 1]
x0, top, x1, bottom = get_element_bbox(page, el, mcids)
return Bloc(
type="Tableau" if el.type == "Table" else el.type,
contenu=[],
_page_number=int(page_number),
_bbox=(round(x0), round(top), round(x1), round(bottom)),
)

def extract_images(self, pages: Optional[Iterable[int]] = None) -> Iterator[Bloc]:
"""Trouver des éléments qui seront représentés par des images
(tableaux et figures pour le moment)"""
if self.tree is None:
return
if pages is None:
pages = range(1, len(self.pdf.pages) + 1)
pageset = set(pages)

# tables *might* span multiple pages (in practice, no...) so
# we have to split them at page breaks, but also, their
# top-level elements don't have page numbers for this reason.
# So, we find them in a first traversal, then gather their
# children in a second one.
def gather_elements() -> Iterator[PDFStructElement]:
"""Traverser l'arbre structurel en profondeur pour chercher les
figures et tableaux."""
if self.tree is None:
return
d = deque(self.tree)
while d:
el = d.popleft()
if el.type == "Table":
yield el
elif el.type == "Figure":
yield el
else:
d.extendleft(reversed(el.children))

def get_child_mcids(el: PDFStructElement) -> Iterator[tuple[int, int]]:
"""Trouver tous les MCIDs (avec numeros de page, sinon ils sont
inutiles!) à l'intérieur d'un élément structurel"""
for mcid in el.mcids:
assert el.page_number is not None
yield el.page_number, mcid
d = deque(el.children)
while d:
el = d.popleft()
for mcid in el.mcids:
assert el.page_number is not None
yield el.page_number, mcid
d.extend(el.children)

for el in gather_elements():
# Note: we must sort them as we can't guarantee they come
# in any particular order
mcids = list(get_child_mcids(el))
mcids.sort()
for page_number, group in itertools.groupby(mcids, operator.itemgetter(0)):
if page_number in pageset:
yield self.make_bloc(el, page_number, (mcid for _, mcid in group))