chore(bench): add benchmarks
j-mendez committed Dec 27, 2023
1 parent bbb8a83 commit 7c73e75
Showing 7 changed files with 123 additions and 9 deletions.
35 changes: 35 additions & 0 deletions .github/workflows/bench.yml
@@ -0,0 +1,35 @@
name: Bench Compare

on:
push:
branches:
- main
pull_request:
branches:
- main

jobs:
checkout_and_test:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["pypy3.9", "pypy3.10", "3.9", "3.10", "3.11", "3.12"]

steps:
- name: Checkout code from ${{ github.repository }}
uses: actions/checkout@v4

- name: Setup python
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
cache: 'pip'

- name: Install Deps
run: pip install scrapy && pip install spider_rs

- name: Run Bench @spider-rs/spider-rs
run: python ./bench/spider.py

- name: Run Bench Scrapy
run: python ./bench/scrappy.py
3 changes: 2 additions & 1 deletion .gitignore
@@ -203,4 +203,5 @@ __test__/*.js
 /storage
 /bench/*.js
 /bench/case/**.js
-/bench/storage/
+/bench/storage/
+/bench/__pycache__
8 changes: 4 additions & 4 deletions bench/README.md
@@ -3,7 +3,7 @@
 You can run the benches with python in terminal.
 
 ```sh
-python scrappy.py && python spider.py
+python scrapy.py && python spider.py
 ```
 
 ## Cases
@@ -16,15 +16,15 @@ mac Apple M1 Max
 
 URL used `https://rsseau.fr`
 
-[Scrapy](scrappy.py)
+[Scrapy](scrapy.py)
 
 ```
-Scrappy
+Scrapy
 pages found 188
 elapsed duration 9.301506042480469
 ```
 
-[Spider-Rs](spider.py)
+[Spider-RS](spider.py)
 
 ```
 Spider
7 changes: 3 additions & 4 deletions bench/scrappy.py
@@ -1,7 +1,6 @@
-import time
-import scrapy
-from scrapy.spiders import CrawlSpider, Rule
+import time, scrapy
 from scrapy.linkextractors import LinkExtractor
+from scrapy.spiders import CrawlSpider, Rule
 from scrapy.crawler import CrawlerProcess
 
 class MySpider(CrawlSpider):
@@ -23,8 +22,8 @@ def parse_item(self, response):
 
 print("benching scrappy(python)...")
 process = CrawlerProcess()
-start = time.time()
 spider = MySpider
+start = time.time()
 process.crawl(spider)
 process.start()
 end = time.time()
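For context, a minimal sketch of what the full `bench/scrappy.py` plausibly looks like after this change, reconstructed from the visible hunks; the spider's `rules`, the `links` counter, and the final print statements are assumptions based on the README output above, not the commit's exact code:

```py
import time, scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.crawler import CrawlerProcess

class MySpider(CrawlSpider):
    name = "rsseau"
    allowed_domains = ["rsseau.fr"]
    start_urls = ["https://rsseau.fr"]
    links = 0
    # follow every in-domain link and count each crawled page
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        MySpider.links += 1

print("benching scrappy(python)...")
process = CrawlerProcess()
spider = MySpider
start = time.time()
process.crawl(spider)
process.start()  # blocks until the crawl finishes
end = time.time()
print("pages found " + str(MySpider.links))
print("elapsed duration " + str(end - start))
```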
5 changes: 5 additions & 0 deletions book/src/SUMMARY.md
@@ -18,3 +18,8 @@
 - [Crawl](./crawl.md)
 - [Scrape](./scrape.md)
 - [Cron Job](./cron-job.md)
+- [Storing Data](./storing-data.md)
+
+# Benchmarks
+
+- [Compare](./benchmarks.md)
52 changes: 52 additions & 0 deletions book/src/benchmarks.md
@@ -0,0 +1,52 @@
# Benchmarks

View the latest runs on [GitHub](https://github.com/spider-rs/spider-py/actions/workflows/bench.yml).

```sh
Linux
8-core CPU
32 GB of RAM
-----------------------
```

Test url: `https://choosealicense.com` (small)
32 pages

| `libraries` | `speed` |
| :-------------------------------- | :------ |
| **`spider-rs: crawl 10 samples`** | `76ms` |
| **`scrapy: crawl 10 samples`** | `2.5s` |

Test url: `https://rsseau.fr` (medium)
211 pages

| `libraries` | `speed` |
| :-------------------------------- | :------ |
| **`spider-rs: crawl 10 samples`** | `0.5s` |
| **`scrapy: crawl 10 samples`** | `72s` |

```sh
----------------------
mac Apple M1 Max
10-core CPU
64 GB of RAM
-----------------------
```

Test url: `https://choosealicense.com` (small)
32 pages

| `libraries` | `speed` |
| :-------------------------------- | :------ |
| **`spider-rs: crawl 10 samples`** | `286ms` |
| **`scrapy: crawl 10 samples`** | `2.5s` |

Test url: `https://rsseau.fr` (medium)
211 pages

| `libraries` | `speed` |
| :-------------------------------- | :------ |
| **`spider-rs: crawl 10 samples`** | `2.5s` |
| **`scrapy: crawl 10 samples`** | `10s` |

The performance gap grows with the size of the website and when throttling is needed. Linux benchmarks are about 10x faster than macOS for spider-rs.
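For reference, a minimal sketch of what one sample on the spider_rs side could look like, mirroring the output format of the Scrapy script; the `Website` calls follow the examples elsewhere in this book, so treat the exact shape as an assumption rather than the benchmark's actual code:

```py
import asyncio, time
from spider_rs import Website

async def main():
    print("benching spider-rs(python)...")
    website = Website("https://rsseau.fr")
    start = time.time()
    website.crawl()  # crawl the whole site
    end = time.time()
    print("pages found " + str(len(website.get_links())))
    print("elapsed duration " + str(end - start))

asyncio.run(main())
```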
22 changes: 22 additions & 0 deletions book/src/storing-data.md
@@ -0,0 +1,22 @@
# Storing Data

Storing data lets you collect the raw content of a website, so you can upload and download the content without UTF-8 conversion. The `raw_content` property only appears when the second param of the `Website` class constructor is set to `True`.

```py
import asyncio
from spider_rs import Website

class Subscription:
    def __init__(self):
        print("Subscription Created...")
    def __call__(self, page):
        print(page.url + " - bytes: " + str(page.raw_content))
        # do something with page.raw_content

async def main():
    # the second constructor param enables raw content collection
    website = Website("https://choosealicense.com", True)
    website.crawl(Subscription())

asyncio.run(main())
```
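To actually persist the pages, the subscription can write each page's raw bytes to disk. A small sketch under the assumption that `page.raw_content` is a bytes-like object; the `FileStore` class and the file-naming scheme are illustrative, not part of the library:

```py
import asyncio, hashlib, os
from spider_rs import Website

class FileStore:
    def __init__(self, out_dir="storage"):
        os.makedirs(out_dir, exist_ok=True)
        self.out_dir = out_dir
    def __call__(self, page):
        # hash the url into a stable, filesystem-safe name
        name = hashlib.sha256(page.url.encode()).hexdigest() + ".html"
        with open(os.path.join(self.out_dir, name), "wb") as f:
            f.write(page.raw_content)

async def main():
    website = Website("https://choosealicense.com", True)
    website.crawl(FileStore())

asyncio.run(main())
```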
