GitHub - servierhub/Archery: Framework to manipulate semi structured documents and extract data from them

forked from RomualdRousseau/Archery

Framework to manipulate semi structured documents and extract data from them

romualdrousseau.github.io/archery/

GPL-3.0 license

Name
.github/workflows
.mvn
archery-commons
archery-csv
archery-dbf
archery-documents
archery-examples
archery-excel
archery-layex-parser
archery-llm-classifier
archery-models/sales-english
archery-net-classifier
archery-parquet
archery-pdf
archery
libs/xelem-3.1
.editorconfig
.gitignore
LICENSE
README.md
SECURITY.md
justfile
pom.xml

A Java API to manipulate semi-structured documents and extract data from them.

Description

Semi-structured documents come in diverse formats and lack standardization, so manipulating and analyzing them often requires specialized skills. Archery is a framework that addresses this challenge. It combines algorithms and machine learning techniques to give you control over the data extraction process through tweakable and repeatable settings. Automating the extraction saves time and reduces errors, which is particularly beneficial for industries dealing with large volumes of such documents. The framework also integrates with machine learning workflows, opening up possibilities for data enrichment and predictive modeling. Because it relies on deterministic algorithms, it is well suited to preparing data for training processes in a predictable and reproducible manner. Aligned with the paradigm of data as a service, it offers a scalable and efficient means of managing semi-structured data, expanding the toolkit of data services available to organizations.

Visit our full documentation and learn more about how it works, try our tutorials and find a full list of plugins and models.

Getting Started

Dependencies

The Java Development Kit (JDK), version 17.
Apache Maven, version 3.0 or above.

Apache Maven Installation

For more details, see the Installation Guide.

Update dependencies

Run the following command line to list the dependencies for which newer versions are available:

mvn -DcreateChecksum=true versions:display-dependency-updates

Update pom.xml plugins

Run the following command line to list the plugins for which newer versions are available:

mvn -DcreateChecksum=true versions:display-plugin-updates

Build and install locally

Run the following command line:

mvn clean install

Build and deploy a snapshot to the Maven repository

Run the following command line:

mvn -P snapshot clean deploy

Build and deploy a release to the Maven repository

Run the following command line:

mvn -P release clean deploy

Build the javadoc documentation

Run the following command line:

mvn -P documentation clean site site:stage

Do not forget to configure GitHub authentication in ~/.m2/settings.xml as follows:

<server>
    <id>github</id>
    <password>PERSONAL_TOKEN_CLASSIC</password>
</server>
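
For context, a `<server>` entry only takes effect when it sits inside the `<servers>` element of the settings file. The fragment below is a minimal sketch of a complete `~/.m2/settings.xml`, not the project's official configuration. The `<username>` element and its placeholder value are assumptions that do not appear in this README; registries such as GitHub Packages usually expect one alongside the token. The `github` id is taken from the snippet above and is assumed to match the repository id declared in the project's pom.xml.

```xml
<!-- ~/.m2/settings.xml: minimal sketch wrapping the <server> entry shown above -->
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 https://maven.apache.org/xsd/settings-1.0.0.xsd">
  <servers>
    <server>
      <!-- Assumed to match the repository id used in the project's pom.xml -->
      <id>github</id>
      <!-- Assumption: not in the original snippet; placeholder for your GitHub user name -->
      <username>YOUR_GITHUB_USERNAME</username>
      <!-- Placeholder from the README: replace with a classic personal access token -->
      <password>PERSONAL_TOKEN_CLASSIC</password>
    </server>
  </servers>
</settings>
```

If the file already exists, add only the `<server>` element to the existing `<servers>` section rather than replacing the whole file.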

Documentation

The following links provide background information, walk through some implementation details, and give step-by-step instructions for getting the most out of Archery:

Using Archery: here.
API Reference: here.

Contribute

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Authors

Romuald Rousseau, [email protected]

Version History

2.37
...
Initial Release
