This repository contains the implementation and evaluation code for my master's thesis "NLP-Based Semantic Matching on ECLASS: Design and Validation of an Industrie 4.0 Matching Service" at the Chair of Information and Automation Systems for Process and Material Technology at RWTH Aachen University.
The aim is to develop a proof-of-concept Semantic Matching Service that leverages NLP techniques to semantically match concept definitions from the IEC 61360-2-compliant ECLASS dictionary. Furthermore, the project investigates the matching patterns, outliers and errors that occur.
semantic-matching-nlp-eclass/
│
├── data/ # All data
│ ├── embedded/ # Embeddings
│ │ ├── filtered/ # Filtered Embeddings
│ │ └── unfiltered/ # Unfiltered Embeddings
│ ├── extracted/ # Extracted data
│ ├── original/ # Original data
│ └── scores/ # Matching scores
│
├── src/ # Source code
│ ├── embedding/ # Data preprocessing and embeddings generation
│ ├── evaluation/ # Data evaluation and visualisation
│ ├── service/ # Semantic Matching Service
│ └── utils/ # Helper functions
│
├── test/ # Unit testing
│
├── test_data/ # Data for testing
│
└── visualisation/ # Visualised results
Please note that, due to ECLASS copyright restrictions, files in data/ and visualisation/ cannot be included in this public repository.
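Since these folders are missing from a fresh clone, it may help to recreate the empty layout before running anything. The following is only a minimal sketch: the directory names are copied from the tree above, while the helper name `create_data_dirs` and the assumption that the scripts expect the folders to exist beforehand are mine and not confirmed by the repository.

```python
from pathlib import Path

# Directory names copied from the repository tree; the assumption that the
# scripts need them to exist beforehand is mine, not the repository's.
DATA_DIRS = [
    "data/embedded/filtered",
    "data/embedded/unfiltered",
    "data/extracted",
    "data/original",
    "data/scores",
    "visualisation",
]


def create_data_dirs(repo_root="."):
    """Create the empty data and visualisation folders under repo_root."""
    created = []
    for rel in DATA_DIRS:
        path = Path(repo_root) / rel
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_data_dirs()` from the repository root creates any missing folders and leaves existing ones untouched.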
How to get this project running, assuming you have a CUDA-capable GPU on your Windows machine:
- Run
  nvidia-smi
  to see which CUDA version your driver supports (for me that was CUDA v12.9).
- Visit https://pytorch.org/get-started/locally/ and select the options matching your setup to get an installation command that looks like this (here for CUDA 12.8, which also runs on a 12.9 driver):
  pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
- To verify that it worked, try the following snippet:
  import torch
  print(torch.cuda.is_available())  # True if PyTorch can see the GPU
- Run
  pip install -r requirements.txt
  (Note that pip complained about the PyTorch dependency; since PyTorch was already installed above, I commented it out.)
- Add the raw ECLASS Basic files (ECLASS15_0_BASIC_EN_SG_01.xml) to the ./data/raw/ directory.
- Run
  src/data/extract_xml_to_csv.py
- Run
  src/data/embeddings_<model>.py
Note
Theoretically, the script should download the necessary model files automatically. However, I had to download the models manually and use a local file path, since the download from within the script kept getting stuck.
If that happens to you, edit the line that loads the model so that it points to the local path:
model = SentenceTransformer("<put path here instead of model name>", ...)
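The workaround from the note can also be expressed without editing the script each time. This is a hedged sketch rather than the repository's actual code: `resolve_model_source`, the model name `example-model` and the folder `models/example-model` are hypothetical placeholders. It relies only on the fact that `SentenceTransformer` accepts either a model name or a local path as its first argument.

```python
from pathlib import Path


def resolve_model_source(model_name, local_dir=None):
    """Return local_dir if it is an existing directory, else model_name."""
    if local_dir is not None and Path(local_dir).is_dir():
        return str(Path(local_dir))
    return model_name


# Hypothetical usage; requires the sentence-transformers package:
#   from sentence_transformers import SentenceTransformer
#   source = resolve_model_source("example-model", "models/example-model")
#   model = SentenceTransformer(source)
```

With this, a manually downloaded copy is used when present, and the script falls back to the regular download otherwise.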