documentation updates (#19)
* cleanup hash logic, make `rdf_graph` private

* checkpoint

* more cleanup

* update action versions

* checkpoint

* update tests

* checkpoint: green tests

* checkpoint: cases `14_2` and `15` are failing

* fix: make `__hash()` public

* cleanup `conftest`

* fix `v_count/e_count` assertions, more test cleanup

* new: case 15_1, 15_2, 15_3

* cleanup case 13

* checkpoint: all tests green (except 1 flaky)

15_2 RPT is flaky, need to revisit

* update case 14_1

* cleanup tests

* checkpoint: `__process_subject_predicate_object`

* fix lint

* fix conftest

* set `continue-on-error`

* update case `15_2` (still flaky)

* fix: `type` instead of `isinstance`

* update case `14_1`

* update tests

* update case `container.ttl`

* update `test_main`

add `+ RDFGraph()` hack, update `test_pgt_container`

* new: `pgt_remove_blacklisted_statements`, `pgt_parse_literal_statements`, remove `adb_col_blacklist`

so many todos...

* fix flake

* new: case 13_1 and 13_2

* new: case 14_3

* new: case 15_4

* update tests

* fix: rdf namespacing

* fix case 7 and native graph

* new: `adb_col_statements`, `write_adb_col_statements` (pgt)

* update 13_2, 15_2, 14_3

* new test cases, use `pytest.xfail` on flaky assertions

tests should be green now...

* flake ignore

can't reproduce

* new: `explicit_metagraph`, optimize `fetch_adb_docs`

* fix typo

* cleanup: `**adb_kwargs`

* doc cleanup

* cleanup

* cleanup: `flatten_reified_triples`

* cleanup: progress/spinner bars via `rich`

* more `rich` cleanup

* update cases 10, 14.3, 15_4

* final main checkpoint: `arango_rdf`

* minor cleanup

* new: `__pgt_process_rdf_literal`

* new: `serialize` as a conversion mode

* new: `test_open_intelligence_graph`

* fix lint

* fix: `statements`, `rdf_graph` ref

* cleanup: `write_adb_col_statements`

* initial commit

* fix lint

* update notebook

* checkpoint

* fix lint

* Create .readthedocs.yaml

* Update README.md

* Update requirements.txt

* fix: code block warning

* cleanup

* nit

* fix hyperlinks

* fix docstring
aMahanna authored Jan 23, 2024
1 parent ff23744 commit 6aa9ecf
Showing 28 changed files with 56,821 additions and 6,613 deletions.
29 changes: 29 additions & 0 deletions .github/workflows/docs.yaml
@@ -0,0 +1,29 @@
name: Docs

on:
  pull_request:
  workflow_dispatch:

jobs:
  docs:
    runs-on: ubuntu-latest

    name: Docs

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Fetch all tags and branches
        run: git fetch --prune --unshallow

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install .[dev] && pip install -r docs/requirements.txt

      - name: Generate Sphinx HTML
        run: cd docs && make html
4 changes: 2 additions & 2 deletions .github/workflows/release.yml
@@ -19,10 +19,10 @@ jobs:
           python-version: "3.10"

       - name: Install release packages
-        run: pip install setuptools wheel twine setuptools-scm[toml]
+        run: pip install build twine

       - name: Build distribution
-        run: python setup.py sdist bdist_wheel
+        run: python -m build

       - name: Publish to Test PyPi
         env:
29 changes: 29 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,29 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the OS, Python version and other tools you might need
build:
  os: ubuntu-22.04
  tools:
    python: "3.12"

# Build documentation in the "docs/" directory with Sphinx
sphinx:
  configuration: docs/conf.py
  fail_on_warning: true

# Optionally build your docs in additional formats such as PDF and ePub
# formats:
# - pdf
# - epub

# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
  install:
    - requirements: docs/requirements.txt
162 changes: 47 additions & 115 deletions README.md
@@ -29,6 +29,7 @@ Resources to get started:
* [RDF Primer](https://www.w3.org/TR/rdf11-concepts/)
* [RDFLib (Python)](https://pypi.org/project/rdflib/)
* [One Example for Modeling RDF as ArangoDB Graphs](https://www.arangodb.com/docs/stable/data-modeling-graphs-from-rdf.html)

## Installation

#### Latest Release
@@ -41,69 +42,73 @@ pip install git+https://github.com/ArangoDB-Community/ArangoRDF
```

## Quickstart
Run the full version with Google Colab: <a href="https://colab.research.google.com/github/ArangoDB-Community/ArangoRDF/blob/main/examples/ArangoRDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<a href="https://colab.research.google.com/github/ArangoDB-Community/ArangoRDF/blob/main/examples/ArangoRDF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```py
from rdflib import Graph
from arango import ArangoClient
from arango_rdf import ArangoRDF

db = ArangoClient(hosts="http://localhost:8529").db("_system_", username="root", password="")
db = ArangoClient().db()

adbrdf = ArangoRDF(db)

g = Graph()
g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")

# RDF to ArangoDB
###################################################################################
def beatles():
    g = Graph()
    g.parse("https://raw.githubusercontent.com/ArangoDB-Community/ArangoRDF/main/tests/data/rdf/beatles.ttl", format="ttl")
    return g
```

# 1.1: RDF-Topology Preserving Transformation (RPT)
adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, overwrite_graph=True)
### RDF to ArangoDB

# 1.2: Property Graph Transformation (PGT)
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)
**Note**: RDF-to-ArangoDB functionality has been implemented using concepts described in the paper
*[Transforming RDF-star to Property Graphs: A Preliminary Analysis of Transformation Approaches](https://arxiv.org/abs/2210.05781)*. ArangoRDF therefore offers two transformation approaches:

g = adbrdf.load_meta_ontology(g)
1. [RDF-Topology Preserving Transformation (RPT)](https://arangordf.readthedocs.io/en/docs/rdf_to_arangodb_rpt.html)
2. [Property Graph Transformation (PGT)](https://arangordf.readthedocs.io/en/docs/rdf_to_arangodb_pgt.html)

# 1.3: RPT w/ Graph Contextualization
adbrdf.rdf_to_arangodb_by_rpt("Beatles", g, contextualize_graph=True, overwrite_graph=True)
```py
# 1. RDF-Topology Preserving Transformation (RPT)
adbrdf.rdf_to_arangodb_by_rpt(name="BeatlesRPT", rdf_graph=beatles(), overwrite_graph=True)

# 1.4: PGT w/ Graph Contextualization
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, contextualize_graph=True, overwrite_graph=True)
# 2. Property Graph Transformation (PGT)
adbrdf.rdf_to_arangodb_by_pgt(name="BeatlesPGT", rdf_graph=beatles(), overwrite_graph=True)
```

# 1.5: PGT w/ ArangoDB Document-to-Collection Mapping Exposed
adb_mapping = adbrdf.build_adb_mapping_for_pgt(g)
print(adb_mapping.serialize())
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, adb_mapping, contextualize_graph=True, overwrite_graph=True)
### ArangoDB to RDF

# ArangoDB to RDF
###################################################################################
```py
# pip install arango-datasets
from arango_datasets import Datasets

# Start from scratch!
g = Graph()
g.parse("https://raw.githubusercontent.com/stardog-union/stardog-tutorials/master/music/beatles.ttl")
adbrdf.rdf_to_arangodb_by_pgt("Beatles", g, overwrite_graph=True)
name = "OPEN_INTELLIGENCE_ANGOLA"
Datasets(db).load(name)

# 2.1: Via Graph Name
g2, adb_mapping_2 = adbrdf.arangodb_graph_to_rdf("Beatles", Graph())
# 1. Graph to RDF
rdf_graph = adbrdf.arangodb_graph_to_rdf(name, rdf_graph=Graph())

# 2.2: Via Collection Names
g3, adb_mapping_3 = adbrdf.arangodb_collections_to_rdf(
    "Beatles",
    Graph(),
    v_cols={"Album", "Band", "Class", "Property", "SoloArtist", "Song"},
    e_cols={"artist", "member", "track", "type", "writer"},
# 2. Collections to RDF
rdf_graph_2 = adbrdf.arangodb_collections_to_rdf(
    name,
    rdf_graph=Graph(),
    v_cols={"Event", "Actor", "Source"},
    e_cols={"eventActor", "hasSource"},
)

print(len(g2), len(adb_mapping_2))
print(len(g3), len(adb_mapping_3))

print('--------------------')
print(g2.serialize())
print('--------------------')
print(adb_mapping_2.serialize())
print('--------------------')
# 3. Metagraph to RDF
rdf_graph_3 = adbrdf.arangodb_to_rdf(
    name=name,
    rdf_graph=Graph(),
    metagraph={
        "vertexCollections": {
            "Event": {"date", "description", "fatalities"},
            "Actor": {"name"}
        },
        "edgeCollections": {
            "eventActor": {}
        },
    },
)
```

## Development & Testing
@@ -123,76 +128,3 @@ def pytest_addoption(parser):
    parser.addoption("--username", action="store", default="root")
    parser.addoption("--password", action="store", default="")
```

## Additional Info: RDF to ArangoDB

RDF-to-ArangoDB functionality has been implemented using concepts described in the paper *[Transforming RDF-star to Property Graphs: A Preliminary Analysis of Transformation Approaches](https://arxiv.org/abs/2210.05781)*.

In other words, `ArangoRDF` offers 2 RDF-to-ArangoDB transformation methods:
1. RDF-topology Preserving Transformation (RPT): `ArangoRDF.rdf_to_arangodb_by_rpt()`
2. Property Graph Transformation (PGT): `ArangoRDF.rdf_to_arangodb_by_pgt()`

RPT preserves the RDF Graph structure by transforming each RDF Statement into an ArangoDB Edge.

PGT on the other hand ensures that Datatype Property Statements are mapped as ArangoDB Document Properties.

```ttl
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:book ex:publish_date "1963-03-22"^^xsd:date .
ex:book ex:pages "100"^^xsd:integer .
ex:book ex:cover 20 .
ex:book ex:index 55 .
```

| RPT | PGT |
|:-------------------------:|:-------------------------:|
| ![image](https://user-images.githubusercontent.com/43019056/232347662-ab48ebfb-e215-4aff-af28-a5915414a8fd.png) | ![image](https://user-images.githubusercontent.com/43019056/232347681-c899ef09-53c7-44de-861e-6a98d448b473.png) |
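
To make the contrast concrete, here is a minimal sketch in plain Python of how the four statements above could be shaped under each approach. It is not ArangoRDF's implementation: the helper names `to_rpt_edges` and `to_pgt_document`, the dictionary keys (`from`, `to`, `label`, `uri`), and the use of URI local names as property keys are all assumptions made for illustration.

```python
# Illustrative only: plain-Python stand-ins, not the ArangoRDF API.
EX = "http://example.org/"

# (subject, predicate, literal value) for the four statements above.
statements = [
    (EX + "book", EX + "publish_date", "1963-03-22"),
    (EX + "book", EX + "pages", 100),
    (EX + "book", EX + "cover", 20),
    (EX + "book", EX + "index", 55),
]


def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]


def to_rpt_edges(triples):
    """RPT-style: every statement becomes one edge, literals included."""
    return [
        {"from": s, "to": str(o), "label": local_name(p)}
        for s, p, o in triples
    ]


def to_pgt_document(triples):
    """PGT-style: literal-valued statements become properties of one document.

    Assumes every triple shares the same subject (hypothetical simplification).
    """
    doc = {"uri": triples[0][0]}
    for _, p, o in triples:
        doc[local_name(p)] = o
    return doc


rpt_edges = to_rpt_edges(statements)
pgt_doc = to_pgt_document(statements)
print(len(rpt_edges))  # one edge per statement
print(pgt_doc)         # a single document carrying the literal values
```

Under these assumptions RPT yields one edge per statement, while PGT yields a single document for `ex:book` whose properties hold the literal values.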

--------------------
### RPT


The `ArangoRDF.rdf_to_arangodb_by_rpt` method will store the RDF Resources of your RDF Graph under the following ArangoDB Collections:

- {graph_name}_URIRef: The Document collection for `rdflib.term.URIRef` resources.
- {graph_name}_BNode: The Document collection for `rdflib.term.BNode` resources.
- {graph_name}_Literal: The Document collection for `rdflib.term.Literal` resources.
- {graph_name}_Statement: The Edge collection for all triples/quads.
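
The naming scheme above can be sketched as a small lookup. This is an assumption-laden illustration rather than library code: terms are written as N-Triples-style strings instead of `rdflib` term objects, and `rpt_vertex_collection` and `rpt_edge_collection` are hypothetical helper names.

```python
def rpt_vertex_collection(graph_name, term):
    """Pick the RPT vertex collection for a term in N-Triples-style syntax.

    Hypothetical helper: the real method works on rdflib term objects.
    """
    if term.startswith("<") and term.endswith(">"):
        kind = "URIRef"   # e.g. <http://example.com/Bob>
    elif term.startswith("_:"):
        kind = "BNode"    # e.g. _:b0
    elif term.startswith('"'):
        kind = "Literal"  # e.g. "Bob"
    else:
        raise ValueError(f"Unrecognized RDF term: {term!r}")
    return f"{graph_name}_{kind}"


def rpt_edge_collection(graph_name):
    """All triples/quads share a single edge collection."""
    return f"{graph_name}_Statement"


print(rpt_vertex_collection("Beatles", "<http://example.com/Bob>"))
print(rpt_vertex_collection("Beatles", "_:b0"))
print(rpt_vertex_collection("Beatles", '"Bob"'))
print(rpt_edge_collection("Beatles"))
```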

--------------------
### PGT

In contrast to RPT, the `ArangoRDF.rdf_to_arangodb_by_pgt` method will rely on the nature of the RDF Resource/Statement to determine which ArangoDB Collection it belongs to. This is referred to as the **ArangoDB Collection Mapping Process**. This process relies on 2 fundamental URIs:

1) `<http://www.arangodb.com/collection>` (adb:collection)
- Any RDF Statement of the form `<http://example.com/Bob> <adb:collection> "Person"` will map the Subject to the ArangoDB "Person" document collection.

2) `<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>` (rdf:type)
- This strategy is divided into 3 cases:

1. If an RDF Resource only has one `rdf:type` statement,
then the local name of the RDF Object is used as the ArangoDB
Document Collection name. For example,
`<http://example.com/Bob> <rdf:type> <http://example.com/Person>`
would create a JSON Document for `<http://example.com/Bob>`,
and place it under the `Person` Document Collection.
NOTE: The RDF Object will also have its own JSON Document
created, and will be placed under the "Class"
Document Collection.

2. If an RDF Resource has multiple `rdf:type` statements,
with some (or all) of the RDF Objects of those statements
belonging in an `rdfs:subClassOf` Taxonomy, then the
local name of the "most specific" Class within the Taxonomy is
used (i.e., the Class with the biggest depth). If there is a
tie between 2+ Classes, then the URIs are alphabetically
sorted & the first one is picked.

3. If an RDF Resource has multiple `rdf:type` statements, with none
of the RDF Objects of those statements belonging in an
`rdfs:subClassOf` Taxonomy, then the URIs are
alphabetically sorted & the first one is picked. The local
name of the selected URI will be designated as the Document
collection for that Resource.
--------------------
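
The three `rdf:type` cases above can be sketched as follows. This is a simplified illustration, not the library's `identify_best_class` implementation: the function name `pick_collection` is hypothetical, and the `rdfs:subClassOf` taxonomy is assumed to be available as a plain dictionary mapping each class URI in the taxonomy to its depth. The `adb:collection` strategy is not covered.

```python
def local_name(uri):
    """Return the part of a URI after the last '/' or '#'."""
    return uri.replace("#", "/").rsplit("/", 1)[-1]


def pick_collection(type_uris, class_depth):
    """Choose a collection name from the rdf:type objects of one resource.

    type_uris:   set of class URIs the resource is typed with.
    class_depth: {class URI: depth} for classes inside the subClassOf
                 taxonomy (an assumed, simplified representation).
    """
    if not type_uris:
        raise ValueError("resource has no rdf:type statement")

    if len(type_uris) == 1:
        # Case 1: a single type, use it directly.
        best = next(iter(type_uris))
    else:
        in_taxonomy = [uri for uri in type_uris if uri in class_depth]
        if in_taxonomy:
            # Case 2: keep the deepest class; break ties alphabetically.
            deepest = max(class_depth[uri] for uri in in_taxonomy)
            best = sorted(u for u in in_taxonomy if class_depth[u] == deepest)[0]
        else:
            # Case 3: no taxonomy information, pick alphabetically.
            best = sorted(type_uris)[0]

    return local_name(best)


EX = "http://example.com/"
depth = {EX + "Agent": 0, EX + "Person": 1, EX + "Singer": 2, EX + "Drummer": 2}

print(pick_collection({EX + "Person"}, depth))
print(pick_collection({EX + "Agent", EX + "Singer"}, depth))
print(pick_collection({EX + "Singer", EX + "Drummer"}, depth))
print(pick_collection({EX + "Zebra", EX + "Apple"}, {}))
```

Keeping the taxonomy as a depth map makes the tie-break explicit: only classes at the maximum depth compete, and the alphabetical sort makes the result deterministic.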
1 change: 1 addition & 0 deletions arango_rdf/__init__.py
@@ -1 +1,2 @@
from arango_rdf.controller import ArangoRDFController # noqa: F401
from arango_rdf.main import ArangoRDF # noqa: F401
67 changes: 27 additions & 40 deletions arango_rdf/controller.py
@@ -10,9 +10,19 @@


class ArangoRDFController(AbstractArangoRDFController):
"""ArangoDB-RDF controller.
"""Controller used in RDF-to-ArangoDB (PGT).
You can derive your own custom ArangoRDFController.
Responsible for handling how the ArangoDB Collection Mapping Process
identifies the "ideal RDFS Class" among a selection of RDFS Classes
for a given RDF Resource.
The "ideal RDFS Class" is defined as an RDFS Class whose local name best
represents the RDF Resource in question. This local name will be
used as the ArangoDB Collection name that will store **rdf_resource**.
`Read more about how the PGT ArangoDB Collection Mapping
Process works here
<./rdf_to_arangodb_pgt.html#arangodb-collection-mapping-process>`_.
"""

def __init__(self) -> None:
@@ -28,42 +38,19 @@ def identify_best_class(
"""Find the ideal RDFS Class among a selection of RDFS Classes. Essential
for the ArangoDB Collection Mapping Process used in RDF-to-ArangoDB (PGT).
The "ideal RDFS Class" is defined as an RDFS Class whose local name can be
used as the ArangoDB Document Collection that will store **rdf_resource**.
`Read more about how the PGT ArangoDB Collection Mapping
Process works here
<./rdf_to_arangodb_pgt.html#arangodb-collection-mapping-process>`_.
The "ideal RDFS Class" is defined as an RDFS Class whose local name best
represents the RDF Resource in question. This local name will be
used as the ArangoDB Collection name that will store **rdf_resource**.
This system is a work-in-progress. Users are welcome to overwrite this
method via their own implementation of the `ArangoRDFController`
Python Class.
NOTE: Users are able to access the RDF Graph of the current
RDF-to-ArangoDB transformation via the `self.rdf_graph`
instance variable, and the database instance via the
`self.db` instance variable.
The current identification process goes as follows:
1) If an RDF Resource only has one `rdf:type` statement
(either by explicit definition or by domain/range inference),
then the local name of the single RDFS Class is used as the ArangoDB
Document Collection name. For example,
<http://example.com/Bob> <rdf:type> <http://example.com/Person>
would place the JSON Document for <http://example.com/Bob>
under the ArangoDB "Person" Document Collection.
2) If an RDF Resource has multiple `rdf:type` statements
(either by explicit definition or by domain/range inference),
with some (or all) of the RDFS Classes of those statements
belonging in an `rdfs:subClassOf` Taxonomy, then the
local name of the "most specific" Class within the Taxonomy is
used (i.e., the Class with the biggest depth). If there is a
tie between 2+ Classes, then the URIs are alphabetically
sorted & the first one is picked. Relies on **subclass_tree**.
3) If an RDF Resource has multiple `rdf:type` statements, with
none of the RDFS Classes of those statements belonging in an
`rdfs:subClassOf` Taxonomy, then the URIs are
alphabetically sorted & the first one is picked. The local
name of the selected URI will be designated as the Document
Collection for **rdf_resource**.
Class. Users are able to access the RDF Graph of the current
RDF-to-ArangoDB transformation via `self.rdf_graph`, and the
database instance via `self.db`.
:param rdf_resource: The RDF Resource in question.
:type rdf_resource: URIRef | BNode
@@ -73,12 +60,12 @@ def identify_best_class(
domain/range inference.
:type class_set: Set[str]
:param subclass_tree: The Tree data structure representing
the RDFS subClassOf Taxonomy. See `ArangoRDF.__build_subclass_tree()`
for more info.
the RDFS subClassOf Taxonomy.
See :func:`arango_rdf.main.ArangoRDF.__build_subclass_tree` for more info.
:type subclass_tree: arango_rdf.utils.Tree
:return: The most suitable RDFS Class URI among the set of RDFS Classes
to use as the ArangoDB Document Collection name associated to
**rdf_resource**.
:return: The string representation of the URI of the most suitable
RDFS Class among the set of RDFS Classes to use as the ArangoDB
Document Collection name for **rdf_resource**.
:rtype: str
"""
# These are accessible!