Skip to content

This is the code I will be using for the "Big Code" course in Big Data 2020 (https://bigdat2020.irdta.eu/)

Notifications You must be signed in to change notification settings

jayanti-prasad/BigCode

Repository files navigation

Introduction :

In general a software  package consists of a set of computer programs written in a  
high level programming language such as C++, Python or Java. A computer program or source 
code is more than a sequence of tokens and have quite different properties than that of
text written in natural languages. However, just like we can mine a huge corpora of
natural language text and discover interesting patters using machine learning that is 
possible for software repositories also. There are multiple challenges in applying 
machine learning on source code directly. The project discussed here is inspired from 
models of natural language processing and outlines a road-map for applying machine learning on 
source code.  

Big Code : What it is ?

  • It is prototype pipeline to apply machine learning on source code. It has five main programs:

    A program to find repositories on Github on the basis of some criteria and dump the results in a csv file.

    A program to process (get AST also using Babelfish) a git repo and write the output file in a csv file which can be feed to a neural network models.

    A program to train a set of Neural Network Models (LSTM) on the basis of the data created in the last step.

    A program to get inference on the basis of the trained model.

    A set of examples to demonstrate how to build neural network models.

Requirements:

  • Babelfish - Universal Code Parser

    (1) Get Babelfish Docker container with the command: ref -> https://hub.docker.com/r/bblfsh/bblfshd

    docker pull bblfsh/bblfshd

    (2) Install the required driver with the command (make sure docker/babelfish is up and running)

    docker exec -it bblfshd bblfshctl driver install --all

    Install Python client : ref -> https://github.com/bblfsh/python-client

    pip3 install bblfsh

  • Machine Learning Tools :

    conda create --name bigcode_env python=3.6

    $ conda activate bigcode_env

    $ pip install -r requirements.txt

Running the code :

- Review the input parameter files (data_input.ini and ml_input.ini) in 'config' directory.  

- (1) Get a list of repos by running the following [This is optional]

> `$ python github_data_crawler.py --help`

- (2) Process the giit data :

> `$ python driver_data.py -c config/input_data.ini `

- (3) Launch the ML Pipeline 

> `$ python driver_ml.py -c config/input_ml.ini`

References :

https://doc.bblf.sh/

https://github.com/src-d/awesome-machine-learning-on-source-code

About

This is the code I will be using for the "Big Code" course in Big Data 2020 (https://bigdat2020.irdta.eu/)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages