feat: add lesson about using the platform #1424

Draft · wants to merge 6 commits into `master`
1 change: 1 addition & 0 deletions .github/styles/config/vocabularies/Docs/accept.txt
@@ -93,3 +93,4 @@ preconfigured
asyncio
parallelization
IMDb
dev
sources/academy/webscraping/scraping_basics_python/13_platform.md

@@ -6,6 +6,91 @@
slug: /scraping-basics-python/platform
---

import Exercises from './_exercises.mdx';

**In this lesson, we'll deploy our application to a scraping platform that automatically runs it daily. We'll also use the platform's API to retrieve and work with the results.**

---

Before starting with a scraping platform, let's highlight a few caveats in our current setup:

- **User-operated:** We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. And if we want alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day.
- **No monitoring:** If we have a spare server or a Raspberry Pi lying around, we could use [cron](https://en.wikipedia.org/wiki/Cron) to schedule it. But even then, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used.
- **Manual data management:** Tracking prices over time means figuring out how to organize the exported data ourselves. Processing the data could also be tricky since different analysis tools often require different formats.
- **Anti-scraping risks:** If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually, they'd block that too—risking seriously annoying the barista.

In this lesson, we'll use a platform to address all of these issues. Generic cloud platforms like [GitHub Actions](https://github.com/features/actions) can work for simple scenarios. But platforms dedicated to scraping, like [Apify](https://apify.com/), offer extra features such as monitoring scrapers, managing retrieved data, and overcoming anti-scraping measures.
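For instance, a scheduled GitHub Actions workflow could run the scraper once a day. The sketch below is hypothetical: the workflow name, the cron expression, and the `main.py` entry point are placeholders rather than anything from this course.

```yaml
name: Daily scrape

on:
  schedule:
    - cron: "0 6 * * *" # every day at 06:00 UTC

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: python main.py # placeholder for the scraper's entry point
```

Even with such a workflow, monitoring, data storage, and anti-scraping protection would remain our problem to solve.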

:::info Why Apify

Scraping platforms come in many varieties, offering a wide range of tools and approaches. As the course authors, we're obviously a bit biased toward Apify—we think it's both powerful and complete.

That said, the main goal of this lesson is to show how deploying to **any platform** can make life easier—it's not Apify-specific. Plus, everything we cover here fits within [Apify's free tier](https://apify.com/pricing).

:::

## Packaging the project

Until now, we've been adding dependencies to our project only by installing them with `pip` inside an activated virtual environment. If we sent our code to a friend, they wouldn't know what to install before they could run the scraper without import errors. The same applies if we send our code to a cloud platform.

In the root of the project, let's create a file called `requirements.txt`, with a single line consisting of a single word:

```text title="requirements.txt"
crawlee
```

Each line in the file represents a single dependency, but so far our program has just one. With `requirements.txt` in place, Apify can run `pip install -r requirements.txt` to download and install all dependencies of the project before starting our program.
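If the project grows later, we'd simply add more lines, optionally pinning versions with [requirement specifiers](https://pip.pypa.io/en/stable/reference/requirement-specifiers/). A purely hypothetical example, with packages that aren't part of our project:

```text
crawlee
httpx>=0.27
beautifulsoup4==4.12.3
```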

:::tip Packaging projects

The [requirements file](https://pip.pypa.io/en/latest/user_guide/#requirements-files) is an obsolete approach to packaging a Python project, but it still works, and it's the simplest option, which is convenient for the purposes of this lesson.

For any serious work, the best and most future-proof approach to packaging is to create a [`pyproject.toml`](https://packaging.python.org/en/latest/guides/writing-pyproject-toml/) configuration file. We recommend the official [Python Packaging User Guide](https://packaging.python.org/) for more info.
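
For illustration only, a minimal `pyproject.toml` covering the same single dependency might look roughly like this. The project name, version, and Python requirement below are placeholders, not values from this course:

```toml
[project]
name = "product-scraper"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "crawlee",
]
```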

:::

## Registering

As a second step, let's [create a new Apify account](https://console.apify.com/sign-up). The process includes several verifications that you're a human being and that your e-mail address is valid. While annoying, these are necessary measures to prevent abuse of the platform.

Apify serves both as an infrastructure where we can privately deploy and run our own scrapers, and as a marketplace where anyone can offer their ready-made scrapers to others for rent. We'll curb our curiosity for now and leave exploring the Apify Store for later.

## Getting access from the command line

To control the platform from our machine and send the code of our program there, we'll need the Apify CLI. On macOS, we can install the CLI using [Homebrew](https://brew.sh); on other systems, we'll first need [Node.js](https://nodejs.org/en/download).
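
Depending on the system, the installation is likely one of these commands; treat them as a hedged example and follow the guide linked below for the authoritative steps:

```text
$ brew install apify-cli        # macOS with Homebrew

$ npm install -g apify-cli      # anywhere with Node.js
```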

After following the [Apify CLI installation guide](https://docs.apify.com/cli/docs/installation), we'll verify that we installed the tool by printing its version:

```text
$ apify --version
apify-cli/0.0.0 system-arch00 node-v0.0.0
```

Now let's connect the CLI with the platform using our account:

```text
$ apify login
...
Success: You are logged in to Apify as user1234!
```
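
If we ever need to double-check which account the CLI is connected to, the `apify info` command should print details about the currently signed-in user; the exact output depends on the CLI version:

```text
$ apify info
...
```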

<!--
it seems apify init won't recognize the project only with requirements.txt
https://crawlee.dev/python/docs/introduction/deployment

https://packaging.python.org/en/latest/tutorials/installing-packages/
https://docs.apify.com/sdk/python/docs/overview/introduction
-->

## Creating an Actor

...

## What's next

---

<Exercises />

:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::
6 changes: 0 additions & 6 deletions sources/academy/webscraping/scraping_basics_python/index.md
@@ -12,12 +12,6 @@ import DocCardList from '@theme/DocCardList';

---

:::danger Work in progress

This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. Comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.

:::

In this course, we'll use Python to create an application for watching prices. It'll be able to scrape all product pages of an e-commerce website and record prices. Data from several runs of such a program would be useful for seeing trends in price changes, detecting discounts, etc.
