Merge pull request #5 from acsenrafilho/develop

Develop
acsenrafilho · Nov 13, 2024 · be95131
2 parents 689fe2a + ec394fe
commit be95131
Showing 22 changed files with 377 additions and 11 deletions.
diff --git a/.github/workflows/ci-lib.yml b/.github/workflows/ci-lib.yml
@@ -43,6 +43,7 @@ jobs:
           token: ${{ secrets.CODECOV_TOKEN }}
           verbose: true
 
+
   windows:
     runs-on: windows-latest
     strategy:

diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 <img src="https://raw.githubusercontent.com/acsenrafilho/cucaracha/refs/heads/main/docs/assets/cucaracha-logo.png" width=700>
 
-[![Documentation Status](https://readthedocs.org/projects/cucaracha/badge/?version=latest)](https://cucaracha.readthedocs.io/en/latest/?badge=latest)
+[![Documentation Status](https://readthedocs.org/projects/cucaracha/badge/?version=main)](https://cucaracha.readthedocs.io/en/main/?badge=main)
 [![codecov](https://codecov.io/gh/acsenrafilho/cucaracha/graph/badge.svg?token=TgmCLPoIbW)](https://codecov.io/gh/acsenrafilho/cucaracha)
 [![CI Main](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml/badge.svg?branch=main)](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml)
 [![CI Develop](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml/badge.svg?branch=develop)](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml)
@@ -28,7 +28,7 @@ The name `cucaracha` reflects the tireless, behind-the-scenes nature of the tool
 
 ### Getting Started
 
-Check out the [full documentation](https://cucaracha.readthedocs.io/en/latest/) for detailed instructions on how to use, implement, and keep up with updates to `cucaracha`. 
+Check out the [full documentation](https://cucaracha.readthedocs.io/en/main/) for detailed instructions on how to use, implement, and keep up with updates to `cucaracha`. 
 
 ### Contributing to `cucaracha`
 
@@ -45,4 +45,4 @@ A quick to use install is via `pip`, as follows:
 
 ```bash
 pip install cucaracha
-```
+```
diff --git a/cucaracha/__init__.py b/cucaracha/__init__.py
@@ -4,6 +4,12 @@
 import numpy as np
 import pymupdf
 from pymupdf import Page
+from rich import print
+from rich.progress import track
+
+from cucaracha.aligment import inplane_deskew
+from cucaracha.noise_removal import sparse_dots
+from cucaracha.threshold import otsu
 
 
 class Document:
@@ -267,9 +273,96 @@ def get_page(self, page: int):
 
         return self._doc_file[page]
 
-    def batch_processing(self, processors: list):
-        # TODO Make a loop processor to make image processing to the doc_file
-        pass
+    def set_page(self, page: np.ndarray, index: int):
+        """Replace a page in the document file.
+
+        The page index must be passed considering the total range of pages
+        in the document. See the metadata to get this information.
+
+        Examples:
+            >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
+            >>> doc.get_metadata('pages')
+            {'pages': 1}
+
+            The original information is loaded as usual
+            >>> np.max(doc.get_page(0))
+            255
+
+            A page can then be replaced like this:
+            >>> new_page = np.ones(doc.get_page(0).shape)
+            >>> doc.set_page(new_page, 0)
+
+            Then the new page is placed in the document object
+            >>> np.max(doc.get_page(0))
+            1.0
+
+        Args:
+            page (np.ndarray): A numpy array with the same shape as the page it replaces
+            index (int): The index where the new page should be placed
+
+        Raises:
+            ValueError: Page index is out of range (total page is ... and must be a positive integer)
+            ValueError: New page is not a numpy array or has different shape from previous pages
+        """
+        if index >= len(self._doc_file) or index < 0:
+            raise ValueError(
+                f'Page index is out of range (total page is {len(self._doc_file)} and must be a positive integer)'
+            )
+
+        if (
+            not isinstance(page, np.ndarray)
+            or page.shape != self.get_page(index).shape
+        ):
+            raise ValueError(
+                'New page is not a numpy array or has different shape from previous pages'
+            )
+
+        self._doc_file[index] = page
+
+    def run_pipeline(self, processors: list):
+        """Apply a list of image processing methods to the document file
+        held by the `Document` object.
+
+        The processing order is the same as indicated in the list of processors.
+
+        Examples:
+            A processor can be defined as a wrapper function:
+            >>> def proc2(input): return sparse_dots(input, 3)
+            >>> def proc3(input): return inplane_deskew(input, 25)
+            >>> proc_list = [otsu, proc2, proc3]
+
+            Once `proc_list` has been created, the pipeline can be
+            executed using:
+            >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
+            >>> doc.run_pipeline(proc_list) # doctest: +SKIP
+            Applying processors... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
+
+            Hence, the inner document file in the `doc` object is updated:
+            >>> type(doc.get_page(0))
+            <class 'numpy.ndarray'>
+
+        Warning:
+            Every processor in the list must be a `cucaracha` filter.
+            That is, each processor must accept a numpy array as input
+            and return a tuple with a numpy array and a dictionary of
+            extra parameters (`(np.ndarray, dict)`).
+
+        Note:
+            All the pages in the document object are processed. To apply
+            a processor only to specific pages, process those pages
+            individually and then update each one using the method
+            `set_page`.
+
+        Args:
+            processors (list): Processors to apply, in order. Each one takes a numpy array and returns `(np.ndarray, dict)`.
+        """
+        self._check_processor_list(processors)
+
+        for proc in track(
+            processors, description='[green]Applying processors...'
+        ):
+            for idx, page in enumerate(self._doc_file):
+                self._doc_file[idx] = proc(page)[0]
 
     def _read_by_ext(self, path, dpi):
         _, file_ext = os.path.splitext(path)
@@ -312,3 +405,20 @@ def _collect_inner_metadata(self, doc_path):
 
             # Set file number of pages
             self._doc_metadata['pages'] = len(self._doc_file)
+
+    def _check_processor_list(self, processors):
+        if not isinstance(processors, list):
+            raise ValueError(
+                'processors must be a list of valid cucaracha filter methods'
+            )
+
+        for proc in processors:
+            out_test = proc(self.get_page(0))   # Test the processor output
+            if (
+                not isinstance(out_test, tuple)
+                or not isinstance(out_test[0], np.ndarray)
+                or not isinstance(out_test[1], dict)
+            ):
+                raise TypeError(
+                    f'Processor: {proc.__name__} is not valid. Ensure that the processor returns (np.ndarray, dict).'
+                )
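
Reviewer note: the contract that `run_pipeline` and `_check_processor_list` enforce can be illustrated without the library. This is a minimal sketch assuming NumPy only. The names `invert`, `mean_threshold`, `is_valid_processor`, and `apply_pipeline` are hypothetical and not part of `cucaracha`, and a plain list of arrays stands in for the document file.

```python
import numpy as np


def invert(page: np.ndarray) -> tuple:
    """Hypothetical processor: invert an 8-bit image, no extra output."""
    return 255 - page, {}


def mean_threshold(page: np.ndarray) -> tuple:
    """Hypothetical processor: binarize at the mean and report the threshold."""
    threshold = float(page.mean())
    binary = np.where(page > threshold, 255, 0).astype(np.uint8)
    return binary, {'threshold': threshold}


def is_valid_processor(proc, sample: np.ndarray) -> bool:
    """Check the (np.ndarray, dict) output contract on one sample page."""
    out = proc(sample)
    return (
        isinstance(out, tuple)
        and len(out) == 2
        and isinstance(out[0], np.ndarray)
        and isinstance(out[1], dict)
    )


def apply_pipeline(pages: list, processors: list) -> list:
    """Apply each processor, in list order, to every page (in place)."""
    for proc in processors:
        for idx, page in enumerate(pages):
            pages[idx] = proc(page)[0]  # keep the image, drop the extras
    return pages


pages = [np.array([[0, 100], [200, 255]], dtype=np.uint8)]
processors = [invert, mean_threshold]
assert all(is_valid_processor(p, pages[0]) for p in processors)
result = apply_pipeline(pages, processors)
```

As in the diff, only the first tuple item is written back to the page, so any extra parameters a processor reports are discarded by the loop. The sketch also checks the tuple length before indexing, which the committed check does not do.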
diff --git a/cucaracha/aligment.py b/cucaracha/aligment.py
@@ -25,7 +25,9 @@ def inplane_deskew(input: np.ndarray, max_skew=10):
     height, width = input.shape[0], input.shape[1]
 
     # Create a grayscale image and denoise it
-    im_gs = cv.cvtColor(input, cv.COLOR_BGR2GRAY)
+    im_gs = input
+    if len(im_gs.shape) == 3:
+        im_gs = cv.cvtColor(input, cv.COLOR_BGR2GRAY)
     im_gs = cv.fastNlMeansDenoising(im_gs, h=3)
 
     # Create an inverted B&W copy using Otsu (automatic) thresholding
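
Reviewer note: the guard added above lets `inplane_deskew` accept pages that are already single-channel, which presumably happens when a thresholding step runs earlier in a pipeline. A minimal sketch of the same pattern, assuming NumPy only; `ensure_grayscale` is a hypothetical helper, and the channel mean is a stand-in for `cv.cvtColor`, not an equivalent of it.

```python
import numpy as np


def ensure_grayscale(image: np.ndarray) -> np.ndarray:
    """Return a 2-D image, converting only when a channel axis exists."""
    if image.ndim == 3:
        # Channel mean as a dependency-free stand-in for
        # cv.cvtColor(image, cv.COLOR_BGR2GRAY), which weights the channels.
        return image.mean(axis=2).astype(image.dtype)
    return image


color = np.full((4, 5, 3), 128, dtype=np.uint8)
gray = np.full((4, 5), 128, dtype=np.uint8)
```

A 2-D input is returned untouched, so the function is safe to call at any position in a processor list.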

diff --git a/cucaracha/noise_removal.py b/cucaracha/noise_removal.py
@@ -24,9 +24,9 @@ def sparse_dots(input: np.ndarray, kernel_size: int = 1):
         ValueError: Kernel size must be an odd value
 
     Returns:
-        (np.ndarray): Output image without major sparse dots noise
+        (np.ndarray, dict): Output image without major sparse dot noise. This method returns no extra information, so the dict is empty.
     """
     if kernel_size % 2 == 0:
         raise ValueError('Kernel size must be an odd value.')
 
-    return cv.medianBlur(input, kernel_size)
+    return cv.medianBlur(input, kernel_size), {}
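
Reviewer note: since `sparse_dots` takes a `kernel_size` argument while the pipeline calls each processor with the page only, the parameter has to be bound beforehand. A minimal sketch of two ways to do that, assuming NumPy only; `clip_above` is a hypothetical filter standing in for a parameterized `cucaracha` filter, and `processor_name` is a hypothetical helper.

```python
from functools import partial

import numpy as np


def clip_above(image: np.ndarray, limit: int = 255) -> tuple:
    """Hypothetical filter with a tunable parameter and no extra output."""
    return np.minimum(image, limit), {}


# Option 1: a named wrapper function, as in the run_pipeline docstring
def clip_200(image):
    return clip_above(image, 200)


# Option 2: functools.partial binds the keyword argument
clip_100 = partial(clip_above, limit=100)


def processor_name(proc) -> str:
    """partial objects have no __name__, so fall back to repr()."""
    return getattr(proc, '__name__', repr(proc))


image = np.array([[50, 150, 250]], dtype=np.uint8)
```

One design consequence: `_check_processor_list` builds its error message from `proc.__name__`, so a `partial` that fails validation would raise `AttributeError` rather than the intended `TypeError`. Named wrapper functions avoid this.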
diff --git a/docs/contribute.md b/docs/contribute.md
@@ -0,0 +1,143 @@
+# How to Contribute
+
+## Preparing the coding environment
+
+To start coding new features or fixing bugs in the `cucaracha` library, first fork the repository directly on GitHub and then clone your fork:
+
+```bash
+git clone git@github.com:<YOUR_USERNAME>/cucaracha.git
+```
+
+Here `<YOUR_USERNAME>` is the GitHub account that holds the repository fork.
+
+!!! tip
+    See the [GitHub documentation](https://docs.github.com/pt/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) for more details on forking a repository.
+
+Once the repository is cloned to your local machine, run the following setup steps to prepare the coding environment:
+
+!!! warning
+    The project uses Poetry for project management, so make sure that your Poetry version is 1.8 or above. See more information about [Poetry installation](https://python-poetry.org/docs/#installing-with-pipx).
+
+```bash
+cd cucaracha
+poetry shell && poetry install
+```
+
+This creates the virtual environment and installs all the dependencies. Once that completes successfully, the `test` and `doc` shortcuts can be called:
+
+```bash
+task test
+```
+
+```bash
+task doc
+```
+
+More details about the entire project configuration are provided in the `pyproject.toml` file.
+
+### Basic tools
+
+The project relies on the following development, testing and documentation tools:
+
+1. blue
+2. isort
+3. numpy
+4. OpenCV
+5. PyMuPDF
+6. rich
+7. pytest
+8. taskipy
+9. mkdocs-material
+10. pymdown-extensions
+
+The set of tools for the project may be adjusted in the future. Any such change, including the specific tool versions, is reported in new releases (more details in the Version Control section).
+
+## Code Structure
+
+The general structure of the `cucaracha` library is as follows:
+
+``` mermaid
+classDiagram
+  class Document{
+    +string doc_path
+    +dict metadata
+  }
+  class Aligment{
+    +function inplane_deskew
+  }
+  class Noise_Removal{
+    +function sparse_dots
+  }
+  class Threshold{
+    +function otsu
+    +function binary_threshold
+  }
+```
+
+The `Document` class provides the basic data structure for representing a document file. All the other files are Python modules that contain the image processing methods, each implemented as a single function.
+
+!!! note
+    An image processing method must follow this pattern: i) input = numpy array, ii) output = a tuple whose first item is a numpy array (data output) and whose second item is a dictionary with any additional output parameters that the function may offer.
+
+
+!!! question
+    In case of any doubt, discuss it with the community using an [issue card](https://github.com/acsenrafilho/cucaracha/issues) in the repo.
+
+## Testing
+
+Another coding pattern expected in new contributions to the `cucaracha` library is the use of unit tests.
+
+!!! info
+    A good way to implement tests together with the coding steps is Test-Driven Design (TDD). Further details can be found in this [TDD tutorial](https://codefellows.github.io/sea-python-401d2/lectures/tdd_with_pytest.html) and in many other sources on the internet.
+
+Each module or class implemented in the `cucaracha` library should have a series of tests to ensure code quality and more stability in production use. We adopted the Python `codecov` tool to help collect the code coverage status, which can be viewed in the HTML page generated by the call:
+
+```bash
+task test
+```
+
+## Code Documentation
+
+The coding documentation pattern is the [Google Docstring](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).
+
+Please provide as much detail as possible in the methods, classes and modules implemented in the `cucaracha` library. If you want to explain some functionality in more depth, use the documentation webpage itself, where it is easier to add figures, graphs and diagrams, and which is much simpler to read.
+
+!!! tip
+    A good way to assist future users is to provide `Examples` in the Google Docstring format. Whenever possible, add a few examples to the code documentation.
+
+!!! info
+    The docstrings are also run as tests, so take care when adding `Examples` to a docstring and respect the same input/output usage pattern that the code provides.
+
+## Version Control
+
+The `cucaracha` project adopts the [Semantic Versioning 2.0.0 (SemVer)](https://semver.org/) versioning pattern. Please also take care with the specific version changes introduced by further implementations.
+
+Another important consideration is that the `cucaracha` repository has two permanent branches: `main` and `develop`. The `main` branch holds stable, version-controlled releases, and the `develop` branch holds the unstable, most up-to-date functionality. To keep the library as reliable as possible, please open a Pull Request (PR) against the `develop` branch before the change is passed on to the `main` branch.
+
+!!! info
+    The `main` branch is marked by repository tags using the standard `vM.m.p`, where `M` is a major update, `m` a minor update and `p` a patch update, all based on the SemVer pattern.
+
+
+## Extending the library
+
+### Extending core functionalities
+
+If you want to provide new functionality in `cucaracha`, e.g. a new module that supports a novel image processing method, please keep the same data and coding structure as described in the `Code Structure` section.
+
+Any new ideas to improve the project readability and code organization are also welcome. In that case, please raise a new issue ticket on GitHub, using the Feature option, to open a community debate about your suggestion. Once it is approved, a new project version is released with the new implementations merged into the core code.
+
+### Scripts
+
+An easier and less bureaucratic way to contribute new code to the project is a Python script. A simple calling script can be added to the repository, under the `scripts` folder, and run directly with the python command:
+
+```bash
+python -m cucaracha.scripts.YOUR_SCRIPT [input options]
+```
+
+In this way, you can share code that is called for a specific execution and can be used as a command-line interface (CLI). There are some examples already implemented in `cucaracha.scripts`, and you can use them to get a general idea of how to apply it.
+
+!!! tip
+    Feel free to add new scripts to the `cucaracha` project. A quick way to start is to copy an existing Python script and make your specific modifications.
+
+!!! info
+    We adopted the standard Python `argparse` module to create standardized code. More details on how to use it can be found in the [official documentation](https://docs.python.org/3/library/argparse.html).
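
Reviewer note: the Scripts section describes `argparse`-based command-line scripts without showing their shape. The following is a minimal sketch of what such a script might look like, not a copy of any existing file in `cucaracha.scripts`. The argument names (`input`, `--dpi`, `--verbose`) and the function names are assumptions chosen for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line interface for a hypothetical script."""
    parser = argparse.ArgumentParser(
        prog='python -m cucaracha.scripts.YOUR_SCRIPT',
        description='Hypothetical example of a cucaracha script.',
    )
    parser.add_argument('input', help='Path to the input document')
    parser.add_argument(
        '--dpi', type=int, default=300, help='Resolution used to load pages'
    )
    parser.add_argument(
        '--verbose', action='store_true', help='Print progress messages'
    )
    return parser


def main(argv=None) -> int:
    """Parse the arguments and run the script; returns the exit code."""
    args = build_parser().parse_args(argv)
    if args.verbose:
        print(f'Processing {args.input} at {args.dpi} dpi')
    # The document loading and processing calls would go here.
    return 0


# A real script would end with:
#     if __name__ == '__main__':
#         raise SystemExit(main())
```

Keeping the parser in its own function makes it possible to test the argument handling by passing a list of strings, without invoking the script from a shell.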
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -41,7 +41,6 @@ extra_css:
 nav:
   - 'index.md'
   - 'installation_guide.md'
-  - 'getting_started.md'
   - 'faq.md'
   - 'api/document.md'
   - 'api/threshold.md'

diff --git a/tests/doc_samples/doc_1.jpg b/tests/doc_samples/doc_1.jpg
diff --git a/tests/doc_samples/doc_10.jpg b/tests/doc_samples/doc_10.jpg
diff --git a/tests/doc_samples/doc_11.JPG b/tests/doc_samples/doc_11.JPG
diff --git a/tests/doc_samples/doc_12.jpg b/tests/doc_samples/doc_12.jpg
diff --git a/tests/doc_samples/doc_13.jpg b/tests/doc_samples/doc_13.jpg
diff --git a/tests/doc_samples/doc_2.png b/tests/doc_samples/doc_2.png
diff --git a/tests/doc_samples/doc_3.jpg b/tests/doc_samples/doc_3.jpg
diff --git a/tests/doc_samples/doc_4.jpg b/tests/doc_samples/doc_4.jpg
diff --git a/tests/doc_samples/doc_5.jpg b/tests/doc_samples/doc_5.jpg
diff --git a/tests/doc_samples/doc_6.jpg b/tests/doc_samples/doc_6.jpg
diff --git a/tests/doc_samples/doc_7.jpg b/tests/doc_samples/doc_7.jpg
diff --git a/tests/doc_samples/doc_8.jpg b/tests/doc_samples/doc_8.jpg
diff --git a/tests/doc_samples/doc_9.jpg b/tests/doc_samples/doc_9.jpg