Merge pull request #5 from acsenrafilho/develop
Develop
acsenrafilho authored Nov 13, 2024
2 parents 689fe2a + ec394fe commit be95131
Showing 22 changed files with 377 additions and 11 deletions.
1 change: 1 addition & 0 deletions .github/workflows/ci-lib.yml
Original file line number Diff line number Diff line change
@@ -43,6 +43,7 @@ jobs:
token: ${{ secrets.CODECOV_TOKEN }}
verbose: true


windows:
runs-on: windows-latest
strategy:
6 changes: 3 additions & 3 deletions README.md
@@ -1,6 +1,6 @@
<img src="https://raw.githubusercontent.com/acsenrafilho/cucaracha/refs/heads/main/docs/assets/cucaracha-logo.png" width=700>

[![Documentation Status](https://readthedocs.org/projects/cucaracha/badge/?version=latest)](https://cucaracha.readthedocs.io/en/latest/?badge=latest)
[![Documentation Status](https://readthedocs.org/projects/cucaracha/badge/?version=main)](https://cucaracha.readthedocs.io/en/main/?badge=main)
[![codecov](https://codecov.io/gh/acsenrafilho/cucaracha/graph/badge.svg?token=TgmCLPoIbW)](https://codecov.io/gh/acsenrafilho/cucaracha)
[![CI Main](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml/badge.svg?branch=main)](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml)
[![CI Develop](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml/badge.svg?branch=develop)](https://github.com/acsenrafilho/cucaracha/actions/workflows/ci-lib.yml)
@@ -28,7 +28,7 @@ The name `cucaracha` reflects the tireless, behind-the-scenes nature of the tool

### Getting Started

Check out the [full documentation](https://cucaracha.readthedocs.io/en/latest/) for detailed instructions on how to use, implement, and keep up with updates to `cucaracha`.
Check out the [full documentation](https://cucaracha.readthedocs.io/en/main/) for detailed instructions on how to use, implement, and keep up with updates to `cucaracha`.

### Contributing to `cucaracha`

@@ -45,4 +45,4 @@ A quick to use install is via `pip`, as follows:
```bash
pip install cucaracha
```
```
116 changes: 113 additions & 3 deletions cucaracha/__init__.py
@@ -4,6 +4,12 @@
import numpy as np
import pymupdf
from pymupdf import Page
from rich import print
from rich.progress import track

from cucaracha.aligment import inplane_deskew
from cucaracha.noise_removal import sparse_dots
from cucaracha.threshold import otsu


class Document:
@@ -267,9 +273,96 @@ def get_page(self, page: int):

return self._doc_file[page]

def batch_processing(self, processors: list):
# TODO Make a loop processor to make image processing to the doc_file
pass
    def set_page(self, page: np.ndarray, index: int):
        """Replace a page in the document file.

        The page index must lie within the total range of pages in the
        document. See the metadata to get this information.

        Examples:
            >>> doc = Document('./'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> doc.get_metadata('pages')
            {'pages': 1}

            The original information is loaded as usual

            >>> np.max(doc.get_page(0))
            255

            But a page can be replaced like this:

            >>> new_page = np.ones(doc.get_page(0).shape)
            >>> doc.set_page(new_page, 0)

            Then the new page is placed in the document object

            >>> np.max(doc.get_page(0))
            1.0

        Args:
            page (np.ndarray): A numpy array with the same shape as the other pages
            index (int): The index where the new page should be placed

        Raises:
            ValueError: Page index is out of range (total page is ... and must be a positive integer)
            ValueError: New page is not a numpy array or has different shape from previous pages
        """
        # Valid indexes are 0 .. len - 1, so `index == len` is also out of range
        if index >= len(self._doc_file) or index < 0:
            raise ValueError(
                f'Page index is out of range (total page is {len(self._doc_file)} and must be a positive integer)'
            )

        if (
            not isinstance(page, np.ndarray)
            or page.shape != self.get_page(index).shape
        ):
            raise ValueError(
                'New page is not a numpy array or has different shape from previous pages'
            )

        self._doc_file[index] = page

    def run_pipeline(self, processors: list):
        """Execute a list of image processing methods on the document file
        allocated in the `Document` object.

        The processing order is the same as indicated in the list of processors.

        Examples:
            One can define a processor as a function caller:

            >>> def proc2(input): return sparse_dots(input, 3)
            >>> def proc3(input): return inplane_deskew(input, 25)
            >>> proc_list = [otsu, proc2, proc3]

            After the `proc_list` is created, the execution can be
            called using:

            >>> doc = Document('.'+os.sep+'tests'+os.sep+'files'+os.sep+'sample-text-en.pdf')
            >>> doc.run_pipeline(proc_list) # doctest: +SKIP
            Applying processors... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

            Hence, the inner document file in the `doc` object is updated:

            >>> type(doc.get_page(0))
            <class 'numpy.ndarray'>

        Warning:
            Every processor in the list must be a `cucaracha` filter.
            Hence, make sure that each processor accepts a numpy array
            as input and returns a tuple with a numpy array and a dictionary of
            extra parameters (`(np.ndarray, dict)`).

        Note:
            All the pages present in the document object are processed. To
            apply a processor only to specific pages, process them
            individually and then update each page using the method
            `set_page`.

        Args:
            processors (list): Callables applied in order; each one takes a numpy array and returns a `(np.ndarray, dict)` tuple.
        """
        self._check_processor_list(processors)

        for proc in track(
            processors, description='[green]Applying processors...'
        ):
            for idx, page in enumerate(self._doc_file):
                self._doc_file[idx] = proc(page)[0]

def _read_by_ext(self, path, dpi):
_, file_ext = os.path.splitext(path)
@@ -312,3 +405,20 @@ def _collect_inner_metadata(self, doc_path):

# Set file number of pages
self._doc_metadata['pages'] = len(self._doc_file)

    def _check_processor_list(self, processors):
        if not isinstance(processors, list):
            raise ValueError(
                'processors must be a list of valid cucaracha filter methods'
            )

        for proc in processors:
            out_test = proc(self.get_page(0))   # Test the processor output
            if (
                not isinstance(out_test, tuple)
                or not isinstance(out_test[0], np.ndarray)
                or not isinstance(out_test[1], dict)
            ):
                raise TypeError(
                    f'Processor: {proc.__name__} is not valid. Ensure that the processor output is valid.'
                )
4 changes: 3 additions & 1 deletion cucaracha/aligment.py
@@ -25,7 +25,9 @@ def inplane_deskew(input: np.ndarray, max_skew=10):
height, width = input.shape[0], input.shape[1]

# Create a grayscale image and denoise it
im_gs = cv.cvtColor(input, cv.COLOR_BGR2GRAY)
im_gs = input
if len(im_gs.shape) == 3:
im_gs = cv.cvtColor(input, cv.COLOR_BGR2GRAY)
im_gs = cv.fastNlMeansDenoising(im_gs, h=3)

# Create an inverted B&W copy using Otsu (automatic) thresholding
4 changes: 2 additions & 2 deletions cucaracha/noise_removal.py
@@ -24,9 +24,9 @@ def sparse_dots(input: np.ndarray, kernel_size: int = 1):
ValueError: Kernel size must be an odd value
Returns:
(np.ndarray): Output image without major sparse dots noise
(np.ndarray, dict): Output image without major sparse dots noise. This method does not return any extra information, so the dict is empty.
"""
if kernel_size % 2 == 0:
raise ValueError('Kernel size must be an odd value.')

return cv.medianBlur(input, kernel_size)
return cv.medianBlur(input, kernel_size), {}
143 changes: 143 additions & 0 deletions docs/contribute.md
@@ -0,0 +1,143 @@
# How to Contribute

## Preparing the coding environment

The first step to start coding new features or fixing bugs in the `cucaracha` library is to fork the repository directly on GitHub and then clone your fork:

```bash
git clone git@github.com:<YOUR_USERNAME>/cucaracha.git
```

Where `<YOUR_USERNAME>` is the GitHub account that holds your fork of the repository.

!!! tip
    See more details on [GitHub](https://docs.github.com/pt/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) about forking a repository.

After the repository is set up on your local machine, the following steps prepare the coding environment:

!!! warning
    We use Poetry for project management, so make sure that your Poetry version is 1.8 or above. See more information about [Poetry installation](https://python-poetry.org/docs/#installing-with-pipx).

```bash
cd cucaracha
poetry shell && poetry install
```

All the dependencies are then installed and the virtual environment is created. Once this finishes successfully, the shortcuts for `test` and `doc` can be called:

```bash
task test
```

```bash
task doc
```

More details about the entire project configuration are provided in the `pyproject.toml` file.

### Basic tools

We assume the following list of development, testing and documentation tools:

1. blue
2. isort
3. numpy
4. OpenCV
5. PyMuPDF
6. rich
7. pytest
8. taskipy
9. mkdocs-material
10. pymdown-extensions

The set of tools for the project may be adjusted in the future. Any such change is reported in new releases, together with the specific tool versions (more details in the Version Control section).

## Code Structure

The general structure of the `cucaracha` library is as follows:

``` mermaid
classDiagram
class Document{
+string doc_path
+dict metadata
}
class Aligment{
+function inplane_deskew
}
class Noise_Removal{
+function sparse_dots
}
class Threshold{
+function otsu
+function binary_threshold
}
```

Where the `Document` class provides the basic data structure for the document file representation. All the other files are Python modules that contain the image processing methods, each represented by a single function.

!!! note
    The general pattern to follow when creating an image processing method is: i) input = numpy array, ii) output = a tuple with the first item as a numpy array (data output) and the second item as a dictionary with any additional output parameter that the function may offer.

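To make the note above concrete, here is a minimal sketch of a function that follows the input/output pattern. The function name `invert` and the `'max_value'` key are hypothetical and chosen only for illustration; they are not part of `cucaracha`. The sketch also assumes 8-bit pixel values.

```python
import numpy as np


def invert(input: np.ndarray) -> tuple:
    """Hypothetical filter (illustration only): invert an 8-bit image.

    Follows the pattern: numpy array in, (np.ndarray, dict) tuple out.
    """
    if not isinstance(input, np.ndarray):
        raise ValueError('input must be a numpy array')

    output = 255 - input  # assumes uint8 pixel values in [0, 255]
    extra = {'max_value': int(output.max())}  # illustrative extra parameter
    return output, extra


page = np.zeros((4, 4), dtype=np.uint8)
out, info = invert(page)
```

A function with this shape should be usable in a processor list, since the first tuple item is the processed page and the second item carries any optional extras (an empty dict when there are none).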

!!! question
    In case of any doubt, discuss with the community by opening an [issue](https://github.com/acsenrafilho/cucaracha/issues) in the repo.

## Testing

Another coding pattern expected in new contributions to the `cucaracha` library is the use of unit tests.

!!! info
    A good way to write tests together with the code is Test-Driven Development (TDD). Further details can be found in this [TDD tutorial](https://codefellows.github.io/sea-python-401d2/lectures/tdd_with_pytest.html) and in many other sources on the internet.

Each module or class implemented in the `cucaracha` library should have a series of tests to ensure code quality and stability for production usage. We adopted the `codecov` tool to help collect the code coverage status, which can be accessed through the HTML page generated by the call:

```bash
task test
```
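
As an illustration of what such tests may look like, the sketch below checks both the expected output and the error path of a filter. The function `threshold_stub` is a hypothetical stand-in defined only for this example; a real test would import the method under test from `cucaracha` instead.

```python
import numpy as np


def threshold_stub(input: np.ndarray, level: int = 128):
    """Hypothetical stand-in for a cucaracha filter, used only in this sketch."""
    if not 0 <= level <= 255:
        raise ValueError('level must be in [0, 255]')
    output = np.where(input >= level, 255, 0).astype(np.uint8)
    return output, {'level': level}


def test_threshold_stub_returns_array_and_dict():
    image = np.array([[0, 200], [130, 50]], dtype=np.uint8)
    output, extra = threshold_stub(image)
    assert isinstance(output, np.ndarray)
    assert isinstance(extra, dict)
    assert output.tolist() == [[0, 255], [255, 0]]


def test_threshold_stub_rejects_invalid_level():
    # Plain try/except keeps the sketch dependency-free
    try:
        threshold_stub(np.zeros((2, 2), dtype=np.uint8), level=300)
    except ValueError:
        return
    raise AssertionError('expected ValueError')
```

Functions named `test_*` are collected automatically by pytest. With pytest available, `pytest.raises(ValueError)` is the more idiomatic way to write the second test.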

## Code Documentation

The code documentation pattern is the [Google Docstring](https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html).

Please provide as much detail as possible in the methods, classes and modules implemented in the `cucaracha` library. If you want to go deeper into the explanation of some functionality, use the documentation webpage itself, where it is easier to add figures, graphs and diagrams, and which is simpler to read.

!!! tip
    A good way to assist future users is to provide `Examples` in the Google Docstring format. Whenever possible, add a few examples to the code documentation.

!!! info
    The docstrings are also run as tests, so take care when adding `Examples` to a docstring and respect the same input/output usage pattern that the code provides.

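A minimal sketch of a Google-style docstring whose `Examples` section can be executed as a test is shown below. The function `scale` is hypothetical and unrelated to the `cucaracha` API; the last lines run the embedded examples with the standard `doctest` module, which is one possible way to check them.

```python
import doctest


def scale(values: list, factor: float = 2.0) -> list:
    """Multiply every item of a list by a constant factor.

    Examples:
        >>> scale([1, 2, 3])
        [2.0, 4.0, 6.0]
        >>> scale([1, 2], factor=0.5)
        [0.5, 1.0]

    Args:
        values (list): Numbers to be scaled.
        factor (float, optional): Multiplier applied to each item. Defaults to 2.0.

    Returns:
        (list): A new list with the scaled values.
    """
    return [value * factor for value in values]


# Parse the docstring and run its examples, as a doctest runner would do
test = doctest.DocTestParser().get_doctest(
    scale.__doc__, {'scale': scale}, 'scale', None, 0
)
results = doctest.DocTestRunner().run(test)
```

Note the blank line between the last expected output and the `Args:` header: without it, the following text is read as part of the expected output and the example fails.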
## Version Control

The `cucaracha` project adopts the [Semantic Versioning 2.0.0 (SemVer)](https://semver.org/) versioning pattern. Please also take care with the specific version changes that further implementations will introduce.

Another important consideration is that the `cucaracha` repository has two permanent branches: `main` and `develop`. The `main` branch holds stable, version-controlled releases, and the `develop` branch holds unstable, most up-to-date functionality. To keep the library as reliable as possible, please open a Pull Request (PR) against the `develop` branch before passing it to the `main` branch.

!!! info
    The `main` branch is marked by a repository `tag` using the standard `vM.m.p`, where `M` is a major update, `m` a minor update and `p` a patch update, all based on the SemVer pattern.

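The `vM.m.p` tag pattern can be illustrated with a short sketch. The helper names `parse_tag` and `bump` are hypothetical and not part of the project tooling; the sketch only covers plain `vM.m.p` tags, without the pre-release or build metadata that SemVer also allows.

```python
import re

# Matches tags such as v1.4.2 (major, minor and patch as integers)
TAG_PATTERN = re.compile(r'^v(\d+)\.(\d+)\.(\d+)$')


def parse_tag(tag: str) -> tuple:
    """Split a vM.m.p tag into a (major, minor, patch) tuple of ints."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f'{tag!r} does not follow the vM.m.p pattern')
    return tuple(int(part) for part in match.groups())


def bump(tag: str, level: str) -> str:
    """Return the next tag for a 'major', 'minor' or 'patch' update."""
    major, minor, patch = parse_tag(tag)
    if level == 'major':
        return f'v{major + 1}.0.0'
    if level == 'minor':
        return f'v{major}.{minor + 1}.0'
    if level == 'patch':
        return f'v{major}.{minor}.{patch + 1}'
    raise ValueError("level must be 'major', 'minor' or 'patch'")
```

Under SemVer, a major or minor update resets the lower-order numbers to zero, which is why `bump` does not simply increment a single field.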

## Extending the library

### Extending core functionalities

If you want to provide new functionality in `cucaracha`, e.g. a new class that supports a novel image processing method, please keep the same data and coding structure as described in the `Code Structure` section.

Any new ideas to improve the project readability and code organization are also welcome. In that case, please raise a new issue ticket on GitHub, using the Feature option to open a community debate about your suggestion. Once it is approved, a new project version is released with the new implementation merged into the core code.

### Scripts

An easier and less bureaucratic way to provide new code to the project is a Python script. A simple calling script can be added to the repository, under the `scripts` folder, and then run directly with the python command:

```bash
python -m cucaracha.scripts.YOUR_SCRIPT [input options]
```

In this way, you can share code that can be called for a specific execution and used as a command-line interface (CLI). There are some examples already implemented in `cucaracha.scripts`, and you can use them to get a general idea of how to apply it.

!!! tip
    Feel free to get inspired and add new scripts to the `cucaracha` project. A quick way to do this is to copy an existing Python script and make your specific modifications.

!!! info
    We adopted the standard Python `argparse` module to create standardized code. More details on how to use it can be found in the [official documentation](https://docs.python.org/3/library/argparse.html).
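
A minimal sketch of how such a script could declare its command-line options with `argparse` follows. The program name `my_script` and the options `input`, `--dpi` and `--verbose` are assumptions made for illustration; they do not describe any existing script in `cucaracha.scripts`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create the command-line parser for a hypothetical script."""
    parser = argparse.ArgumentParser(
        prog='my_script',
        description='Hypothetical cucaracha script (illustration only).',
    )
    parser.add_argument('input', help='Path to the input document')
    parser.add_argument(
        '--dpi', type=int, default=300, help='Resolution used to read the file'
    )
    parser.add_argument(
        '--verbose', action='store_true', help='Print progress messages'
    )
    return parser


# Parsing an explicit argument list, as if typed on the command line
args = build_parser().parse_args(['scan.pdf', '--dpi', '150'])
```

In a real script, `parse_args()` would be called without arguments inside an `if __name__ == '__main__':` block so that the options are read from the actual command line.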
1 change: 0 additions & 1 deletion mkdocs.yml
@@ -41,7 +41,6 @@ extra_css:
nav:
- 'index.md'
- 'installation_guide.md'
- 'getting_started.md'
- 'faq.md'
- 'api/document.md'
- 'api/threshold.md'
Binary file added tests/doc_samples/doc_1.jpg
Binary file added tests/doc_samples/doc_10.jpg
Binary file added tests/doc_samples/doc_11.JPG
Binary file added tests/doc_samples/doc_12.jpg
Binary file added tests/doc_samples/doc_13.jpg
Binary file added tests/doc_samples/doc_2.png
Binary file added tests/doc_samples/doc_3.jpg
Binary file added tests/doc_samples/doc_4.jpg
Binary file added tests/doc_samples/doc_5.jpg
Binary file added tests/doc_samples/doc_6.jpg
Binary file added tests/doc_samples/doc_7.jpg
Binary file added tests/doc_samples/doc_8.jpg
Binary file added tests/doc_samples/doc_9.jpg