Added classifier for category filtering.
abishekmuthian committed Feb 19, 2022
1 parent df9ef6a commit 71728a0
Showing 8 changed files with 328 additions and 51 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -7,4 +7,4 @@ hntoebook_linux_amd64.zip
hntoebook_linux_arm64.zip
hntoebook_windows_amd64.zip
db

models
42 changes: 27 additions & 15 deletions README.md
@@ -17,25 +17,30 @@ Hence I've stripped down HN to Kindle code to enable local transfer to any e-boo

### How
1. Retrieve HN stories using official API with a [Go wrapper](https://github.com/hoenn/go-hn/).
2. Filter best (as determined by HN) stories older than 9 hours but newer than 24 hours, with at least 20 comments and a top comment older than 2 hours.
3. Convert the HTML to .pdf after applying cosmetic changes using WKhtmlTopdf with a [Go wrapper](https://github.com/SebastiaanKlippert/go-wkhtmltopdf).
4. Convert the .pdf to .mobi using Calibre command line tool.
5. Place the .mobi file on the device.
6. Store the item id in the K,V database to prevent duplicates.
2. Filter best (as determined by HN) stories older than 9 hours but newer than 24 hours, with at least 20 comments and a top comment older than 2 hours. When transferring a specific HN item, the story or comment is retrieved by its item id.
3. A Python classifier server runs in the background and classifies story titles against the category keywords when the category filter option is chosen.
4. Convert the HTML to .pdf after applying cosmetic changes using WKhtmlTopdf with a [Go wrapper](https://github.com/SebastiaanKlippert/go-wkhtmltopdf).
5. Convert the .pdf to .mobi using Calibre command line tool.
6. Place the .mobi file on the device.
7. Store the item id in the K,V database to prevent duplicates.
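
The selection rule in step 2 can be sketched as a single predicate. This is only an illustration of the stated thresholds, not the code used in `hntoebook`; the function name `eligible` and its parameters are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// eligible reports whether a story passes the rule from step 2.
// The name and all three parameters are hypothetical inputs chosen for
// this sketch; they are not taken from the hntoebook source.
func eligible(age, topCommentAge time.Duration, comments int) bool {
	return age > 9*time.Hour &&
		age < 24*time.Hour &&
		comments >= 20 &&
		topCommentAge > 2*time.Hour
}

func main() {
	fmt.Println(eligible(10*time.Hour, 3*time.Hour, 35)) // inside the window
	fmt.Println(eligible(5*time.Hour, 3*time.Hour, 35))  // story too recent
	fmt.Println(eligible(10*time.Hour, 1*time.Hour, 35)) // top comment too new
}
```

Keeping the rule in one pure function would make the thresholds easy to adjust and test without calling the HN API.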

### Requirements
1. [WKhtmlTopdf](https://wkhtmltopdf.org/downloads.html)
2. [Calibre CLI](https://calibre-ebook.com/download)
3. [hntoebook](https://github.com/abishekmuthian/hntoebook/releases)
#### For Category Filter (Optional)
4. git lfs
5. pytorch
(Other Python packages are installed through requirements.txt; this could update your existing packages.)

### Usage

#### Operating System
1. Linux amd64 (Tested)
2. Linux arm64 (Not tested)
3. darwin amd64 (Not tested)
4. darwin arm64 (Not tested)
5. windows amd64 (Not tested)
2. Linux arm64 (Not tested, Reports are welcome)
3. darwin amd64 (Not tested, Reports are welcome)
4. darwin arm64 (Not tested, Reports are welcome)
5. Windows amd64 (Not tested, Reports are welcome)

#### Set the path to store the .mobi file on the e-book reader
./hntoebook -c
@@ -46,6 +51,9 @@ Hence I've stripped down HN to Kindle code to enable local transfer to any e-boo
#### Send particular HN story or HN comment to the e-book reader
./hntoebook -i

#### Filter HN story categories(Whitelist)
./hntoebook -f

### Feature parity with HN To Kindle
#### Email
Local file transfer is used instead of Email.
@@ -54,19 +62,23 @@ Local file transfer is used instead of Email.
Individual HN items (Story or Comment) can be sent to the e-book reader.

#### Category Filter
Filtering is not implemented as it requires python server for classifier with large sized models. If there's enough interest for the feature then I will include it in the project.

#### Misc
No web server, Relational database, Concurrency etc. as there's no need for user accounts or subscriptions.
HN stories can be whitelisted by using category keywords and are filtered using a classifier.

### Troubleshooting

#### Errors with mobiPath
Make sure that the path for .mobi files on E-Reader ends with a trailing slash / .
Make sure that the path for .mobi files on E-Reader ends with a trailing slash / and the folder where you want to place .mobi files exists prior to running this program.
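
Both conditions above (trailing slash, folder already present) could be checked up front before any conversion work starts. The following is a minimal sketch of such a check; `validateMobiPath` is a hypothetical helper and is not part of the program.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// validateMobiPath is a hypothetical helper (not part of hntoebook) that
// checks the two conditions from the troubleshooting note: the path ends
// with a trailing slash and the folder already exists.
func validateMobiPath(path string) error {
	if !strings.HasSuffix(path, "/") {
		return errors.New("path must end with a trailing slash")
	}
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("cannot access %q: %w", path, err)
	}
	if !info.IsDir() {
		return fmt.Errorf("%q is not a directory", path)
	}
	return nil
}

func main() {
	// The sample paths are placeholders for illustration only.
	for _, p := range []string{"./", "./hntoebook-missing-folder/", "no-trailing-slash"} {
		if err := validateMobiPath(p); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("ok:", p)
	}
}
```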

#### Errors with Category Filter
The category filter has additional requirements such as Python and PyTorch. The models are downloaded during config.

#### Not functioning after an error
Check whether a process started by the program (e.g. uvicorn, calibre, wkhtmltopdf) is still running. If so, stop it before running the program again.

#### Database errors
Delete the db folder and start again. If you were using < v0.0.3 and upgraded to v0.0.3 then the db folder needs to be deleted regardless of any error as v0.0.3 uses new database.
##### SEGFAULT

Delete the db folder and start again. If you were using < v0.0.3 and upgraded to v0.0.3 then the db folder needs to be deleted regardless of any error as v0.0.3 uses new database.

### License

17 changes: 17 additions & 0 deletions bert.py
@@ -0,0 +1,17 @@
#!/usr/bin/python
from transformers import pipeline
import sys

classifier = pipeline("zero-shot-classification", model="models/distilbert-base-uncased-mnli")

sequence = sys.argv[1]
candidate_labels = sys.argv[2].split(",")

res = classifier(sequence, candidate_labels, multi_label=True, truncation=False)

for i, label in enumerate(candidate_labels):
    print("%d. %s [%.2f]" % (i, res['labels'][i], res['scores'][i]))
    if res['scores'][i] > 0.75:
        print("Keyword is True")
    else:
        print("Keyword is False")
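
The script above treats a label as matched when its score exceeds 0.75. How the Go side applies the scores is not visible in this excerpt, so the following is only a sketch of the same threshold check in Go. The function name `matchedLabels` is an assumption, and the sample scores are made up. It pairs each score with the label at the same index of the classifier result, which as far as I know is ordered by score rather than by the order in which the candidate labels were passed in.

```go
package main

import "fmt"

// matchedLabels returns the labels whose score is strictly above threshold.
// labels and scores are parallel slices, as in the classifier result.
// The function name is hypothetical and not taken from hntoebook.
func matchedLabels(labels []string, scores []float64, threshold float64) []string {
	var matched []string
	for i, label := range labels {
		if i < len(scores) && scores[i] > threshold {
			matched = append(matched, label)
		}
	}
	return matched
}

func main() {
	// Made-up values for illustration only.
	labels := []string{"Tech", "Gaming", "Climate"}
	scores := []float64{0.91, 0.76, 0.40}
	fmt.Println(matchedLabels(labels, scores, 0.75))
}
```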
189 changes: 169 additions & 20 deletions main.go
@@ -2,15 +2,36 @@ package main

import (
"bufio"
"context"
"fmt"
"github.com/dgraph-io/badger/v3"
"github.com/hoenn/go-hn/pkg/hnapi"
"hntoebook/stories/operations"
"io/ioutil"
"log"
"os"
"os/exec"
"regexp"
"strings"
"sync"

"github.com/dgraph-io/badger/v3"
"github.com/hoenn/go-hn/pkg/hnapi"
)

func startClassifierServer(ctx context.Context) {
	cmd := exec.CommandContext(ctx, "uvicorn", "main:app")
	err := cmd.Start()
	if err != nil {
		// Run could also return this error and push the program
		// termination decision to the `main` method.
		log.Fatalln("Starting classifier server, Error when starting the server. Check if all the requirements are fulfilled", err)
	}

	err = cmd.Wait()
	if err != nil {
		log.Println("waiting on cmd:", err)
	}
}
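
Because `startClassifierServer` blocks in `cmd.Wait` until the child process exits, the caller has to run it in a goroutine, cancel the context when the foreground work is done (`exec.CommandContext` kills the process once the context is done), and wait for the goroutine before returning. The sketch below shows that lifecycle with a stand-in worker instead of a real `uvicorn` process, so that it runs without Python; `runUntilCancelled` and `withBackgroundWorker` are hypothetical names used only here.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// runUntilCancelled stands in for startClassifierServer: it blocks until ctx
// is cancelled, the way cmd.Wait blocks until the child process exits.
func runUntilCancelled(ctx context.Context) string {
	<-ctx.Done()
	return "stopped: " + ctx.Err().Error()
}

// withBackgroundWorker starts the worker, runs work in the foreground,
// cancels the worker, and waits for it to finish before returning the
// worker's final status. Both function names are hypothetical.
func withBackgroundWorker(work func()) string {
	var wg sync.WaitGroup
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var status string
	wg.Add(1) // increment before the goroutine starts to avoid a race
	go func() {
		defer wg.Done()
		status = runUntilCancelled(ctx)
	}()

	work()    // foreground work, e.g. fetching and converting stories
	cancel()  // ask the worker to stop
	wg.Wait() // return only after the worker has really stopped
	return status
}

func main() {
	status := withBackgroundWorker(func() {
		time.Sleep(10 * time.Millisecond) // placeholder for real work
	})
	fmt.Println(status)
}
```

Reading `status` after `wg.Wait()` is safe because `wg.Done()` in the goroutine happens before `Wait` returns.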

func OperationsMode(db *badger.DB, mode string) {
var mobiPath string

@@ -27,7 +48,7 @@ func OperationsMode(db *badger.DB, mode string) {
// This func with val would only be called if item.Value encounters no error.

// Accessing val here is valid.
fmt.Printf("The answer is: %s\n", val)
fmt.Printf("The .mobi path is: %s\n", val)

// Copying or parsing val is valid.
mobiPath = string(append([]byte{}, val...))
@@ -82,16 +103,83 @@ func OperationsMode(db *badger.DB, mode string) {
}
os.Remove(dir)
break
case "filter":

var wg sync.WaitGroup
ctx, cancel := context.WithCancel(context.Background())

// Increment the WaitGroup synchronously in the main method, to avoid
// racing with the goroutine starting.
wg.Add(1)
go func() {
startClassifierServer(ctx)
// Signal the goroutine has completed
wg.Done()
}()

fmt.Println("Enter the categories for filtering separated by a comma, e.g. Tech,Climate,Gaming:")
var categories []string
var categoryParam string

scanner := bufio.NewScanner(os.Stdin)
scanner.Scan()
if scanner.Err() != nil {
log.Fatalln("Error in getting the categories")
} else {
categoryParam = scanner.Text()
}

if len(categoryParam) == 0 {
log.Fatalln("No categories were entered, try again")
} else {
re := regexp.MustCompile("^(\\w+ *\\w*)+( *, *\\w* *\\w*)*$")
if !(re.MatchString(categoryParam)) {
log.Fatalln("Invalid category entered, Enter categories separated by a comma, e.g. Tech,Climate,Gaming:")
}
}

categoriesTemp := strings.Split(categoryParam, ",")

for _, category := range categoriesTemp {
if len(category) > 1000 {
fmt.Println("Was the category name in german?")
}
categories = append(categories, strings.TrimSpace(category))
}

log.Println("Categories: ", categories)

log.Println("Creating temporary directory for storing .pdf files")
dir, err := ioutil.TempDir("", "hn")
if err != nil {
log.Fatal(err)
}

log.Println("Temporary directory name:", dir)

operations.UpdateStories(db, dir+"/", mobiPath, categories)
os.Remove(dir)

log.Println("closing via ctx")
cancel()

// Wait for the child goroutine to finish, which will only occur when
// the child process has stopped and the call to cmd.Wait has returned.
// This prevents main() exiting prematurely.
wg.Wait()

break

default:
fmt.Println("Creating temporary directory for storing .pdf files")
log.Println("Creating temporary directory for storing .pdf files")
dir, err := ioutil.TempDir("", "hn")
if err != nil {
log.Fatal(err)
}

fmt.Println("Temporary directory name:", dir)
log.Println("Temporary directory name:", dir)

operations.UpdateStories(db, dir+"/", mobiPath)
operations.UpdateStories(db, dir+"/", mobiPath, nil)
os.Remove(dir)
break
}
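
The category handling in the `filter` case (validate with a regular expression, split on commas, trim whitespace) can be isolated into one function, sketched below. `parseCategories` is a hypothetical name. The pattern is the one from the diff, which appears to accept an empty entry such as `Tech,,Gaming`, so this sketch also drops empty entries; that extra step is an assumption about the desired behavior, not something the commit does.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// categoryPattern is the expression used in the filter case of the diff.
var categoryPattern = regexp.MustCompile(`^(\w+ *\w*)+( *, *\w* *\w*)*$`)

// parseCategories validates the raw input, splits it on commas, trims
// whitespace, and drops empty entries. The function is a hypothetical
// helper and dropping empty entries is an addition made in this sketch.
func parseCategories(input string) ([]string, error) {
	if !categoryPattern.MatchString(input) {
		return nil, errors.New("invalid category list")
	}
	var categories []string
	for _, part := range strings.Split(input, ",") {
		if c := strings.TrimSpace(part); c != "" {
			categories = append(categories, c)
		}
	}
	return categories, nil
}

func main() {
	for _, in := range []string{"Tech, Climate ,Gaming", "Tech,,Gaming", "Tech;Gaming"} {
		cats, err := parseCategories(in)
		fmt.Printf("%q -> %q, err: %v\n", in, cats, err)
	}
}
```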
@@ -109,7 +197,7 @@ func OperationsMode(db *badger.DB, mode string) {
func main() {
var mobiPath string

fmt.Println("***HN to E-book***")
log.Println("***HN to E-book***")

db, err := badger.Open(badger.DefaultOptions("db"))
if err != nil {
@@ -120,30 +208,91 @@ func main() {
if len(args) > 0 {
switch args[0] {
case "-c":
fmt.Println("Entering config mode")
log.Println("Entering config mode")

fmt.Println("Enter a path for storing .mobi files on the e-reader e.g. /run/media/username/Kindle/documents/Downloads/hn/ :")
log.Println("Would you like to set up the path for .mobi? Y/N")
scanner := bufio.NewScanner(os.Stdin)
scanner.Scan()
if scanner.Err() != nil {
log.Fatalln("Error in getting the path for storing the ebook")
log.Fatalln("Error in getting the answer for storing path for the ebook")
} else {
mobiPath = scanner.Text()
mobiAnswer := scanner.Text()
if mobiAnswer == "Y" {
fmt.Println("Enter a path for storing .mobi files on the e-reader(After creating the folder) e.g. /run/media/username/Kindle/documents/Downloads/hn/ :")
scanner.Scan()
if scanner.Err() != nil {
log.Fatalln("Error in getting the path for storing the ebook")
} else {
mobiPath = scanner.Text()
}
err = db.Update(func(txn *badger.Txn) error {
err := txn.Set([]byte("mobiPath"), []byte(mobiPath))
return err
})
if err != nil {
log.Fatal(err)
}
log.Println("Stored mobiPath for future operations")
} else if mobiAnswer == "N" {
log.Println("Not setting up path for .mobi this time")
} else {
log.Println("Invalid answer, Enter Y (or) N")
}
}
err = db.Update(func(txn *badger.Txn) error {
err := txn.Set([]byte("mobiPath"), []byte(mobiPath))
return err
})
if err != nil {
log.Fatal(err)

fmt.Println("Would you like to setup category filter? Y/N")
scanner.Scan()
if scanner.Err() != nil {
log.Fatalln("Error in getting the answer for category filter")
} else {
categoryFilter := scanner.Text()

if categoryFilter == "Y" {
log.Println("Downloading model for classification")
log.Println("This would take a while....")

var out []byte
out, err = exec.Command("git", "clone", "https://huggingface.co/typeform/distilbert-base-uncased-mnli", "models/distilbert-base-uncased-mnli/").CombinedOutput()

// if there is an error with our execution
// handle it here
if err != nil {
log.Println("Downloading models, Error executing command to download models. Check if the models folder is empty", err)
return
}
log.Println("Command Successfully Executed")
output := string(out[:])
log.Println(output)

log.Println("Installing necessary python packages")

out, err = exec.Command("pip", "install", "-r", "requirements.txt").CombinedOutput()

// if there is an error with our execution
// handle it here
if err != nil {
log.Println("Installing python packages, Error executing command to install python packages. Install the packages manually.", err)
return
}
log.Println("Command Successfully Executed")
output = string(out)
log.Println(output)

} else if categoryFilter == "N" {
log.Println("Category filter not enabled")
} else {
log.Println("Invalid answer, Enter Y (or) N")
}
}
fmt.Println("Stored mobiPath for future operations")

OperationsMode(db, "default")
log.Println("Configuration done, You can now use ./hntoebook")
break
case "-i":
fmt.Println("Entering item mode")
log.Println("Entering item mode")
OperationsMode(db, "item")
case "-f":
log.Println("Entering filter mode")
OperationsMode(db, "filter")
default:
log.Fatalln("Invalid argument, Use -c for config mode, -i for item mode (or) -f for filter mode")
}
23 changes: 23 additions & 0 deletions main.py
@@ -0,0 +1,23 @@
from fastapi import FastAPI
from pydantic import BaseModel, constr, conlist
from typing import List
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="models/distilbert-base-uncased-mnli")
app = FastAPI()


class UserRequestIn(BaseModel):
    text: constr(min_length=1)
    labels: conlist(str, min_items=1)


class ScoredLabelsOut(BaseModel):
    labels: List[str]
    scores: List[float]


@app.post("/classification", response_model=ScoredLabelsOut)
def read_classification(user_request_in: UserRequestIn):
    return classifier(user_request_in.text, user_request_in.labels)
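
The endpoint above accepts a JSON body with `text` and `labels` and returns `labels` and `scores`. The Go client that calls it lives in the `operations` package, which is not shown in this excerpt, so the following is only a sketch of what such a call could look like. The names `classifyRequest`, `classifyResponse`, `classify` and `newStubServer` are assumptions, and the stub server returns a made-up score of 0.5 for every label so the example runs without Python. Against the real server, the base URL would presumably be uvicorn's default of `http://127.0.0.1:8000`, since no host or port is passed when it is started.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// classifyRequest mirrors the UserRequestIn model of the endpoint.
type classifyRequest struct {
	Text   string   `json:"text"`
	Labels []string `json:"labels"`
}

// classifyResponse mirrors the ScoredLabelsOut model of the endpoint.
type classifyResponse struct {
	Labels []string  `json:"labels"`
	Scores []float64 `json:"scores"`
}

// classify posts text and labels to baseURL+"/classification" and decodes
// the scored labels. It is a hypothetical client, not the hntoebook one.
func classify(baseURL, text string, labels []string) (classifyResponse, error) {
	var out classifyResponse
	body, err := json.Marshal(classifyRequest{Text: text, Labels: labels})
	if err != nil {
		return out, err
	}
	resp, err := http.Post(baseURL+"/classification", "application/json", bytes.NewReader(body))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("classifier returned %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// newStubServer returns a local server that answers like the endpoint but
// gives every label the same made-up score of 0.5.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var in classifyRequest
		if r.URL.Path != "/classification" || json.NewDecoder(r.Body).Decode(&in) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		scores := make([]float64, len(in.Labels))
		for i := range scores {
			scores[i] = 0.5
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(classifyResponse{Labels: in.Labels, Scores: scores})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	res, err := classify(srv.URL, "Show HN: a new e-reader", []string{"Tech", "Gaming"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res.Labels, res.Scores)
}
```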
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
uvicorn
torch
fastapi
pydantic
typing
transformers