The class is used to classify letters according to its type. Every letter is stored in a pdf file and has its code (TID) in the header of the file:
In order to classify properly the class uses 3rd party libraries to read that code and move that pdf file into a separate folder according to its type.
The logic behind the code is:
-
Uses ghostscript library to transform the pdf into a jpg file.
-
Uses imagemagick library to crop that jpg file into a smaller file containing only the TID code
-
Uses tessaract-ocr library to read that line and extract a character which represent the type.
(There are 4 known types of letters: B, E, D, F)
-
Finally, you will have a folder for each type of letter, and inside this folder you will have the pdf files.
Class was made in plain PHP
but some external libraries are required to work properly.
Open your terminal and use apt-get
to install the following packages:
1.1. # apt-get install tesseract-ocr
1.2. # apt-get install ghostscript
1.3. # apt-get install imagemagick
You MUST set the pathToFiles
parameter stored in you config.json
file.
"pathToFiles": "/home/[PROYECT_FOLDER]/pdfs"
Place the pdfs that you want to classify inside the folder configured in the previous step
(A couple of pdfs were provided for testing purpose)
Inside the project folder type the following command to run the class.
# php classifier-cli.php
If everything goes well you should have the pdf files inside a folder for each type of letters.