Automated scraper built on top of Stack Exchange API.

adjaskam/stack-code-finder

stack-code-finder

Automated scraper built on top of Stack Exchange API.
Searches Stack Overflow for code fragments that contain a given phrase, considering only the code snippets in the selected threads.
Report Bug

Screenshot 2022-03-20 at 16.43.09
An example of a code snippet

About the project

Built With

Getting Started

Prerequisites

  • Set all of the following environment variables, for example:
HOST=0.0.0.0
PORT=3000

MONGO_DB_USER=root
MONGO_DB_PASSWORD=example
MONGO_DB_NAME=appdb
MONGO_DB_PORT=27017
MONGO_DB_SERVICE_NAME=mongodb

CODE_FRAGMENTS_FETCH_LIMIT=10
JWT_TOKEN_SECRET=access_token_secret
STACK_API_KEY=stack_api_key
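As a rough illustration of how these variables might be consumed, the sketch below checks that all of them are set and assembles a MongoDB connection string. The helper name `loadConfig` and the exact URI layout are assumptions for illustration, not code taken from the repository.

```javascript
// Hypothetical helper: not part of the repository.
const REQUIRED = [
  "HOST", "PORT",
  "MONGO_DB_USER", "MONGO_DB_PASSWORD", "MONGO_DB_NAME",
  "MONGO_DB_PORT", "MONGO_DB_SERVICE_NAME",
  "CODE_FRAGMENTS_FETCH_LIMIT", "JWT_TOKEN_SECRET", "STACK_API_KEY",
];

function loadConfig(env = process.env) {
  // Fail early with a list of every variable that is missing or empty.
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  // Credentials are URL-encoded so special characters cannot break the URI.
  const user = encodeURIComponent(env.MONGO_DB_USER);
  const password = encodeURIComponent(env.MONGO_DB_PASSWORD);
  return {
    host: env.HOST,
    port: Number(env.PORT),
    fetchLimit: Number(env.CODE_FRAGMENTS_FETCH_LIMIT),
    // Assumed URI layout: service name as host, database name as path.
    mongoUri: `mongodb://${user}:${password}@${env.MONGO_DB_SERVICE_NAME}:${env.MONGO_DB_PORT}/${env.MONGO_DB_NAME}`,
  };
}
```

Calling `loadConfig()` once at startup would surface a missing variable before the server begins accepting requests.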

For details on obtaining a STACK_API_KEY, see https://api.stackexchange.com/docs/authentication.

Important: the Stack Exchange API throttles requests. See https://api.stackexchange.com/docs/throttle.
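One part of the throttling rules is the `backoff` field that the Stack Exchange API may include in a response wrapper: it gives the number of seconds to wait before calling the same method again. The sketch below shows one way a client could honor it. The names `backoffMs`, `sleep`, `fetchPages`, and the `fetchPage` callback are hypothetical and not taken from this project.

```javascript
// Hypothetical helpers: not part of the repository.

// Convert the wrapper's `backoff` value (seconds) into milliseconds.
// Returns 0 when the field is absent or not a positive number.
function backoffMs(wrapper) {
  const seconds = Number(wrapper && wrapper.backoff);
  return Number.isFinite(seconds) && seconds > 0 ? seconds * 1000 : 0;
}

function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// `fetchPage(page)` is an assumed callback that resolves to one parsed
// response wrapper, for example { items: [...], backoff: 10 }.
async function fetchPages(fetchPage, pages) {
  const results = [];
  for (let page = 1; page <= pages; page++) {
    const wrapper = await fetchPage(page);
    results.push(...(wrapper.items || []));
    const wait = backoffMs(wrapper);
    if (wait > 0) {
      await sleep(wait); // pause before the next call to the same method
    }
  }
  return results;
}
```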

Dev installation

  1. Clone the repo
    git clone https://github.com/adjaskam/stack-code-finder.git
  2. Install NPM packages for the client
    cd client
    npm i
  3. Start the project with concurrently (invoke from the root directory)
    npm run dev:fullstack

Note: The backend is built from a Dockerfile, so backend development takes place inside the container.

Endpoints

  • POST /api/codefragments - start a job for the given tag (includes the scraping procedure). The application supports:
    • Preventing duplicate extracted fragments by comparing MD5 hashes computed from each code fragment's factors.
    • Handling user-specific documents: a single code fragment can be owned by multiple users.
    • Web scraping with either Puppeteer or Cheerio.
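The duplicate check described above can be sketched with Node's built-in `crypto` module. Which "factors" go into the hash is not stated here, so the choice of `questionId`, `tag`, `searchPhrase`, and `codeFragment` below is an assumption, as are the helper names and the in-memory `Map` standing in for the database.

```javascript
const { createHash } = require("node:crypto");

// Assumption: these four fields are the "factors" that identify a fragment.
function hashFragment({ questionId, tag, searchPhrase, codeFragment }) {
  return createHash("md5")
    // A NUL separator keeps adjacent fields from running together.
    .update([questionId, tag, searchPhrase, codeFragment].join("\u0000"))
    .digest("hex");
}

// Hypothetical in-memory store keyed by hash; returns false for duplicates.
function addIfNew(store, fragment) {
  const hashMessage = hashFragment(fragment);
  if (store.has(hashMessage)) {
    return false; // the same fragment was already extracted
  }
  store.set(hashMessage, { ...fragment, hashMessage });
  return true;
}
```

MD5 is used here only as a fingerprint for equality checks, which is the role it plays in the endpoint description, not as a security measure.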

Body of the example request:

{
   "tag": "Java",
   "searchPhrase": "int",
   "amount": 1,
   "scraperType": "cheerio"
}

Response:

{
   "items":[
      {
         "questionId": "71860220",
         "tag": "Java",
         "searchPhrase": "int",
         "codeFragment": "public class TekuciRacun implements IRacun{\n private String vlasnik;\n private int isplate;\n private int kredit;\nthis.stanje = stanje;\n    }\n    \n    \n    \n}\n",
         "hashMessage": "2de6aac5afba3f6f44aa7f9e91cb9d8d",
         "usersOwn": [
            "[email protected]"
         ],
         "_id": "625712efea1e61ff34001739",
         "createdAt": "2022-04-13T18:14:07.563Z",
         "updatedAt": "2022-04-13T18:14:07.563Z",
         "__v": 0
      }
   ],
   "amount": 1,
   "executionTime": 960
}
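A client could assemble the request shown above roughly as follows. This is a sketch under stated assumptions: the helper name `buildCodeFragmentsRequest` is hypothetical, sending the JWT as a `Bearer` token is an assumption, and `"puppeteer"` is assumed to be the second accepted `scraperType` value (only `"cheerio"` appears in the example).

```javascript
// Hypothetical client-side helper: not part of the repository.
// Assumed scraperType values; only "cheerio" is confirmed by the example.
const SCRAPER_TYPES = ["cheerio", "puppeteer"];

function buildCodeFragmentsRequest(baseUrl, token, job) {
  const { tag, searchPhrase, amount, scraperType = "cheerio" } = job;
  if (!SCRAPER_TYPES.includes(scraperType)) {
    throw new Error(`Unknown scraperType: ${scraperType}`);
  }
  return {
    url: new URL("/api/codefragments", baseUrl).toString(),
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: the JWT is passed as a Bearer token.
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ tag, searchPhrase, amount, scraperType }),
    },
  };
}

// Usage (Node 18+ ships a global fetch):
// const { url, options } = buildCodeFragmentsRequest("http://localhost:3000", token,
//   { tag: "Java", searchPhrase: "int", amount: 1, scraperType: "cheerio" });
// const { items, amount, executionTime } = await (await fetch(url, options)).json();
```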
  • GET /api/codefragments/my - get all code fragments obtained by the current user

  • DELETE /api/codefragments/:hashMessage - delete a code fragment by its MD5 hash

    • Available to authenticated users.
    • While other users still own the code fragment, the delete is soft.
    • Once the usersOwn array of the code fragment is empty, the item is hard deleted.
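The soft and hard delete rules above can be sketched on an in-memory store. The function name `deleteFragment`, the `Map` standing in for the database, and the returned status strings are all hypothetical. The sketch also assumes that a soft delete means removing the requesting user from `usersOwn`.

```javascript
// Hypothetical in-memory sketch: not the project's actual persistence code.
// `store` maps hashMessage -> fragment document with a `usersOwn` array.
function deleteFragment(store, hashMessage, userEmail) {
  const fragment = store.get(hashMessage);
  if (!fragment || !fragment.usersOwn.includes(userEmail)) {
    return "not_found"; // unknown hash, or the user does not own it
  }
  // Soft delete (assumed meaning): drop only this user's ownership.
  fragment.usersOwn = fragment.usersOwn.filter((user) => user !== userEmail);
  if (fragment.usersOwn.length === 0) {
    store.delete(hashMessage); // no owners remain: hard delete
    return "hard_deleted";
  }
  return "soft_deleted";
}
```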

Authentication is required to handle user-specific documents and is based on the JWT standard. Registration requires no confirmation, but the email must be unique. All forms in the application are validated.

  • POST /api/register - register new user
  • POST /api/login - log in to the service

TODO

  • Handle user-specific documents - create authentication & owning the documents by the specific user
  • Work on performance - added cheerio as the main scraper
  • Make the search for searchPhrase in the obtained content more precise (currently, the search only checks whether the fragment includes the given searchPhrase)
  • Work on refresh tokens
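For the searchPhrase item above, one possible direction is to match the phrase only as a whole identifier instead of as an arbitrary substring. The sketch below is an illustration of that idea and not the project's planned approach. The names `escapeRegExp` and `containsIdentifier` are hypothetical.

```javascript
// Hypothetical helpers: not part of the repository.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True when searchPhrase occurs without an identifier character
// (letter, digit, underscore, dollar sign) directly before or after it.
function containsIdentifier(codeFragment, searchPhrase) {
  const pattern = new RegExp(
    `(?<![A-Za-z0-9_$])${escapeRegExp(searchPhrase)}(?![A-Za-z0-9_$])`
  );
  return pattern.test(codeFragment);
}
```

With a plain `includes` check, the phrase `int` also matches `println`. The whole-identifier check accepts `private int kredit;` but rejects `System.out.println(x);`.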
