
Permatrago

Causal LM for commit messages

Model

The model chosen to suggest text for commit messages is a deep neural network based on the decoder-only transformer architecture. Specifically, we fine-tuned DistilGPT2, a lighter and faster version of OpenAI's GPT-2 developed by Hugging Face and trained on OpenWebText, a reproduction of OpenAI's WebText dataset (https://skylion007.github.io/OpenWebTextCorpus/). The model has 6 layers, a hidden dimension of 768, and 12 attention heads, totaling 82M parameters.
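
These figures can be read directly from the model configuration. A minimal sketch, assuming the Hugging Face transformers library is installed:

from transformers import AutoModelForCausalLM

# Load the pretrained DistilGPT2 checkpoint from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# 6 layers, 768-dimensional embeddings, 12 attention heads
print(model.config.n_layer, model.config.n_embd, model.config.n_head)

# Total parameter count (~82M)
print(sum(p.numel() for p in model.parameters()))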

However, the network can neither process raw text as input nor generate raw text as output. Both in the training phase and in the inference phase, the raw text must first be translated into a numeric vocabulary before it can be forwarded through the network. For this purpose, the creators of the model provide a tokenizer alongside it, which converts the raw text input into token IDs the network can process and converts the network's output back into text.
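
A minimal sketch of this round trip with the Hugging Face transformers library; the prompt string is illustrative only:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# Raw text -> token IDs
prompt = "Fix bug in"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a short continuation (sampling keeps suggestions varied);
# GPT-2 has no pad token, so the EOS token is reused for padding
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Token IDs -> raw text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))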

Execution

All files in the repository are Python scripts and are straightforward to execute.

To run the demo, the Python library Streamlit must be installed. Once it is, execute the following command from the directory where the file is located (src/visualization/):

streamlit run demo.py
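
If Streamlit is missing, installing it with pip (or installing the whole environment from requirements.txt) is the usual route:

pip install streamlit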

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

Team

This project was developed by:

Carlos Hurtado (CarlOwOs)
Alex Ferrando (Naxel100)
Marc Fuentes (marcfuon)
Arnau Turch (turcharnau)

Students of Data Science and Engineering at UPC. This repository was developed as their TAED2 project.
