Tinder for researchers

Good job to everyone. I love you all 💖

Source code structure

Inside the src folder you will find the 4 main divisions of our project: fetching, database, engine and GUI. They belong respectively to the data gathering, data engineering, algorithms and GUI teams. The tests folder contains tests for each of these divisions.

Architecture

Crawler

The crawler's goal is to amass data from arXiv. We build a bipartite graph of authors and papers according to the following rule: there is an edge between an author and a paper if

1. the author published the paper
2. the author cited the paper in one of their publications
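
As an illustration of the rule above, here is a minimal in-memory sketch of such a bipartite graph. The names AcademicGraph, EdgeKind, addEdge, papersOf, authorsOf and hasEdge are hypothetical and chosen for this example only; the project's actual graph lives in the custom key-value store described in the Database section.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical edge kinds, one per rule: the author published or cited the paper.
enum class EdgeKind { Published, Cited };

// Illustrative bipartite author/paper graph kept in memory.
// Edges only ever connect an author to a paper, never two nodes of the same side.
class AcademicGraph {
public:
    void addEdge(const std::string& author, const std::string& paper, EdgeKind kind) {
        assert(!author.empty() && !paper.empty());
        byAuthor_[author][paper].insert(kind);
        byPaper_[paper].insert(author);
    }

    // All papers adjacent to the given author (published or cited).
    std::set<std::string> papersOf(const std::string& author) const {
        std::set<std::string> out;
        auto it = byAuthor_.find(author);
        if (it != byAuthor_.end())
            for (const auto& entry : it->second) out.insert(entry.first);
        return out;
    }

    // All authors adjacent to the given paper.
    std::set<std::string> authorsOf(const std::string& paper) const {
        auto it = byPaper_.find(paper);
        return it == byPaper_.end() ? std::set<std::string>{} : it->second;
    }

    // Whether an edge of the given kind exists between author and paper.
    bool hasEdge(const std::string& author, const std::string& paper, EdgeKind kind) const {
        auto a = byAuthor_.find(author);
        if (a == byAuthor_.end()) return false;
        auto p = a->second.find(paper);
        return p != a->second.end() && p->second.count(kind) > 0;
    }

private:
    std::map<std::string, std::map<std::string, std::set<EdgeKind>>> byAuthor_;
    std::map<std::string, std::set<std::string>> byPaper_;
};
```

Keeping one index per side makes both "papers of an author" and "authors of a paper" lookups cheap, at the cost of storing each edge twice.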

The data we must obtain from arXiv is therefore two-fold. For edges of type (1.), we use the arXiv API to request author metadata. For edges of type (2.), we download PDFs en masse from arXiv's S3 bucket and extract citations from their text using pattern matching.
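
To give an idea of what pattern matching on extracted text can look like, here is a minimal sketch using std::regex. It assumes citations appear as new-style identifiers such as arXiv:1706.03762 (optionally followed by a version like v5); the function name extractArxivIds and the exact pattern are assumptions made for this example, not the crawler's actual code, and old-style identifiers are not handled.

```cpp
#include <regex>
#include <set>
#include <string>

// Collect the distinct new-style arXiv identifiers mentioned in a piece of text.
// The version suffix (e.g. "v5") is matched but dropped from the result.
std::set<std::string> extractArxivIds(const std::string& text) {
    static const std::regex pattern(
        R"(arXiv:\s*(\d{4}\.\d{4,5})(?:v\d+)?)",
        std::regex::ECMAScript | std::regex::icase);

    std::set<std::string> ids;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it) {
        ids.insert((*it)[1].str());  // capture group 1 is the bare identifier
    }
    return ids;
}
```

Each identifier returned this way would become one edge of type (2.) between the citing author and the cited paper.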

@ghjuliasialelli @YassineMarrakchi @mariabenkhadra @deivisbanys @TotoJean @Smakson @ltricot

Database

The database is subdivided into a number of storage units:

the academic graph as a key-value store (custom)
the user/paper vectors as a key-value store (custom)
paper summaries (custom)
the user likes stored directly in text files on the server
the cluster labels as a csv

@Abdelrahmansameh

Server & API: (to come)

An API to the recommendation service.

@deivisbanys @TotoJean @ltricot

Recommendation

A number of algorithms come together to produce the final recommendations. We implement the following:

MinHash @jjbl99
Label propagation for clustering @MarineHoche
KNN for candidate production @clemie
Matrix Factorization for match scoring and user/paper embedding @shrey183
TF-IDF for content based embedding @MarineHoche
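
As an example of one of these building blocks, here is a minimal MinHash sketch. It summarizes a set (for instance the papers liked by a user) as a short signature and estimates the Jaccard similarity of two sets from their signatures. The function names, the use of std::hash and the mixing function are assumptions made for this illustration and may differ from the engine's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <set>
#include <string>
#include <vector>

// 64-bit mixer (splitmix64-style), used to derive many hash functions from one base hash.
inline std::uint64_t mix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Signature entry i is the minimum of the i-th hash function over all items of the set.
std::vector<std::uint64_t> minhashSignature(const std::set<std::string>& items,
                                            std::size_t numHashes) {
    std::vector<std::uint64_t> sig(numHashes, std::numeric_limits<std::uint64_t>::max());
    for (const auto& item : items) {
        const std::uint64_t base = std::hash<std::string>{}(item);
        for (std::size_t i = 0; i < numHashes; ++i) {
            const std::uint64_t h = mix64(base ^ mix64(i));
            if (h < sig[i]) sig[i] = h;
        }
    }
    return sig;
}

// The fraction of positions where two signatures agree estimates the Jaccard similarity.
double estimateJaccard(const std::vector<std::uint64_t>& a,
                       const std::vector<std::uint64_t>& b) {
    assert(a.size() == b.size() && !a.empty());
    std::size_t equal = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] == b[i]) ++equal;
    return static_cast<double>(equal) / static_cast<double>(a.size());
}
```

More hash functions give a more precise estimate at the cost of longer signatures; comparing two signatures stays linear in the signature length regardless of the size of the original sets.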

Putting it all together

Documentation

Doxygen

We use Doxygen as our documentation system. It is configured to generate an html folder containing all the documentation in a format that can be explored with a web browser. Use

doxygen Doxyfile

to generate the documentation. Open the pages file inside the html folder in your web browser and you are set to explore the documentation.

Commenting conventions

The template for a comment documenting a function is as follows:

/** @brief can we build a wall?
 * 
 * @details evaluate whether the US can build a wall at the Mexican
 * border given their ambitions and budget. If the function returns
 * false, the US government shuts down.
 * 
 * @param height the height of the wanted wall in meters
 * @param budget the budget of the government in dollars
 * @return whether the government succeeds or not
 */
bool buildWall(int height, int budget) {
    if(height > 1e6 * budget) {
        // government shutdown
        return false;
    }

    return true;
}

The @brief tag is followed by a brief description of the function's responsibilities. The @details tag is optional and should only be filled in for non-trivial functions. The @param tag is followed by the name of a parameter along with its description; it is mandatory for every parameter. The @return tag must always be filled in for non-void functions and describes the meaning of the return value.

Building

We use cmake to build our project. As of now we wish to build 4 executables:

The crawler which will upload data to our database
The engine responsible for the training of the recommender system
The recommender capable of answering recommendation requests
The GUI

Build commands:

git clone https://github.com/ltricot/CSE201_prototype
cd CSE201_prototype
mkdir build
cd build
cmake ..
make

Building the GUI

We use Qt Creator to build and run the Graphical User Interface. Procedure:

Download the Qt project named GUI_final
Open Qt Creator on your computer
File > Open file or project and select GUI_final
When Qt Creator asks you to, build the project
Run the project (shortcut Ctrl + R)

Running the tests

Once the build is over, from the root of the repository, run

cd build
ctest

or run the test executable itself for a more detailed output. An example:

cd build/tests/fetching
./testcrawler

Adding an executable

How it works

cmake uses CMakeLists.txt files inside each directory containing files relevant to the build process (most often source code files). Suppose you have worked on some work.cpp file which contains a main function. Suppose work.cpp includes oldwork.hpp. You may append the following code to the CMakeLists.txt file in work.cpp's folder:

add_executable (work work.cpp oldwork.cpp)

to create an executable cmake will refer to as work. If your work does not contain a main function, it is called a library and added to the build in the following way:

add_library (worklib workwithoutmain.cpp)

Observe that we only list the .cpp files, not the headers. If you use an external library such as curl, you should manage this in the top-level CMakeLists.txt file (or in the lowest-level folder that contains all the .cpp files using the library). How cmake includes such a library depends on its nature. You may study the CMakeLists.txt files of the project to see how we manage external libraries. In any case, once the library is available, and supposing it is stored in some variable curl, adding it as a dependency is as simple as:

target_link_libraries (work curl)

where work is a target (executable or even another library). Once this is added to CMakeLists.txt, your .cpp files need only include curl as they always do.

Our case

Eigen is a header-only library. It is included in this repository as a submodule (essentially a link to another git repository). The top-level CMakeLists.txt file includes it for the whole project to use, so that you need only write

#include <Eigen/Dense>

at the top of your file to use the Dense part of the Eigen library.

rapidxml is treated the same way but is only included for specific targets, i.e. the crawler.

catch is also a header only library and is used only by the tests folder, so is included only for test targets.

curl is included in the fetching-level CMakeLists.txt file. It is not header-only, but it is well integrated with cmake, so there is nothing more to do than #include it as you would for the 3 libraries above.

zlib is treated the same way as curl.

pistache is used for the HTTP server. Instructions to install it can be found on their GitHub.
