diff --git a/.github/styles/config/vocabularies/Docs/accept.txt b/.github/styles/config/vocabularies/Docs/accept.txt
index fb5719ddd..089f3ce7f 100644
--- a/.github/styles/config/vocabularies/Docs/accept.txt
+++ b/.github/styles/config/vocabularies/Docs/accept.txt
@@ -88,3 +88,8 @@ preconfigured
[Mm]ultiselect
+
+[Ss]crapy
+asyncio
+parallelization
+IMDb
diff --git a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
index 677414b31..f2682f55f 100644
--- a/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
+++ b/sources/academy/webscraping/scraping_basics_python/11_scraping_variants.md
@@ -325,7 +325,7 @@ For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprin
Your output should look something like this:
-```text
+```py
{'title': 'Senior Full Stack Developer',
'company': 'Baserow',
'url': 'https://www.python.org/jobs/7705/',
diff --git a/sources/academy/webscraping/scraping_basics_python/12_framework.md b/sources/academy/webscraping/scraping_basics_python/12_framework.md
index 4845c025b..d711bdf39 100644
--- a/sources/academy/webscraping/scraping_basics_python/12_framework.md
+++ b/sources/academy/webscraping/scraping_basics_python/12_framework.md
@@ -6,25 +6,589 @@ sidebar_position: 12
slug: /scraping-basics-python/framework
---
-:::danger Work in progress
+import Exercises from './_exercises.mdx';
-This course is incomplete. As we work on adding new lessons, we would love to hear your feedback. You can comment right here under each page or [file a GitHub Issue](https://github.com/apify/apify-docs/issues) to discuss a problem.
+**In this lesson, we'll rework our application for watching prices so that it builds on top of a scraping framework. We'll use Crawlee to make the program simpler, faster, and more robust.**
+
+---
+
+Before rewriting our code, let's point out several caveats in our current solution:
+
+- **Hard to maintain:** All the data we need from the listing page is also available on the product page. By scraping both, we have to maintain selectors for two HTML documents. Instead, we could scrape links from the listing page and process all data on the product pages.
+- **Slow:** The program runs sequentially, which is generously considerate toward the target website, but extremely inefficient.
+- **No logging:** The scraper gives no sense of progress, making it tedious to use. Debugging issues becomes even more frustrating without proper logs.
+- **Boilerplate code:** We implement downloading and parsing HTML, and exporting data to CSV, even though we're far from the first people to face and solve these problems.
+- **Prone to anti-scraping:** If the target website implemented anti-scraping measures, a bare-bones program like ours would stop working.
+- **Browser means rewrite:** We got lucky extracting variants. If the website didn't include a fallback, we might have had no choice but to spin up a browser instance and automate clicking on buttons. Such a change in the underlying technology would require a complete rewrite of our program.
+- **No error handling:** The scraper stops if it encounters issues. It should allow for skipping problematic products with warnings or retrying downloads when the website returns temporary errors.
+
+In this lesson, we'll tackle all the above issues while keeping the code concise thanks to a scraping framework.
+
+:::info Why Crawlee and not Scrapy
+
+Of the two main open-source options for Python, [Scrapy](https://scrapy.org/) and [Crawlee](https://crawlee.dev/python/), we chose the latter, and not just because we're the company financing its development.
+
+We genuinely believe beginners to scraping will like it more, since it lets you create a scraper with less code and less time spent reading docs. Scrapy's long history ensures it's battle-tested, but it also means its code relies on technologies that aren't really necessary today. Crawlee, on the other hand, builds on modern Python features like asyncio and type hints.
+
+:::
+
+## Installing Crawlee
+
+When starting with the Crawlee framework, we first need to decide which approach to downloading and parsing we prefer. We want the one based on BeautifulSoup, so let's install the `crawlee` package with the `beautifulsoup` extra specified in brackets. The framework has a lot of dependencies, so expect the installation to take a while.
+
+```text
+$ pip install crawlee[beautifulsoup]
+...
+Successfully installed Jinja2-0.0.0 ... ... ... crawlee-0.0.0 ... ... ...
+```
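+
+Depending on your shell, you may need to quote the package specification so the square brackets aren't expanded as a pattern. In zsh, for example, the same command needs quotes:
+
+```text
+$ pip install 'crawlee[beautifulsoup]'
+```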
+
+## Running Crawlee
+
+Now let's use the framework to create a new version of our scraper. In the same project directory where our `main.py` file lives, create a file `newmain.py`. This way, we can keep peeking at the original implementation while working on the new one. The initial content will look like this:
+
+
+```py title="newmain.py"
+import asyncio
+from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+
+async def main():
+ crawler = BeautifulSoupCrawler()
+
+ @crawler.router.default_handler
+ async def handle_listing(context):
+ print(context.soup.title.text.strip())
+
+ await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
+
+if __name__ == '__main__':
+ asyncio.run(main())
+```
+
+
+In the code, we do the following:
+
+1. We perform imports and specify an asynchronous `main()` function.
+2. Inside, we first create a crawler. The crawler object controls the scraping. This particular crawler is of the BeautifulSoup flavor.
+3. In the middle, we give the crawler a nested asynchronous function, `handle_listing()`. Using a Python decorator (the line starting with `@`), we tell the crawler to treat it as the default handler. Handlers take care of processing HTTP responses. This one finds the title of the page in `soup` and prints its text without surrounding whitespace.
+4. The `main()` function ends by running the crawler on the product listing URL. We await the crawler until it finishes its work.
+5. The last two lines ensure that if we run the file as a standalone program, Python's asynchronous machinery will run our `main()` function.
+
+Don't worry if this involves a lot of things you've never seen before. For now, you don't need to know exactly how [`asyncio`](https://docs.python.org/3/library/asyncio.html) works or what decorators do. Let's stick to the practical side and see what the program does when executed:
+
+```text
+$ python newmain.py
+[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Current request statistics:
+┌───────────────────────────────┬──────────┐
+│ requests_finished │ 0 │
+│ requests_failed │ 0 │
+│ retry_histogram │ [0] │
+│ request_avg_failed_duration │ None │
+│ request_avg_finished_duration │ None │
+│ requests_finished_per_minute │ 0 │
+│ requests_failed_per_minute │ 0 │
+│ request_total_duration │ 0.0 │
+│ requests_total │ 0 │
+│ crawler_runtime │ 0.010014 │
+└───────────────────────────────┴──────────┘
+[crawlee._autoscaling.autoscaled_pool] INFO current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
+Sales
+[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
+[crawlee.beautifulsoup_crawler._beautifulsoup_crawler] INFO Final request statistics:
+┌───────────────────────────────┬──────────┐
+│ requests_finished │ 1 │
+│ requests_failed │ 0 │
+│ retry_histogram │ [1] │
+│ request_avg_failed_duration │ None │
+│ request_avg_finished_duration │ 0.308998 │
+│ requests_finished_per_minute │ 185 │
+│ requests_failed_per_minute │ 0 │
+│ request_total_duration │ 0.308998 │
+│ requests_total │ 1 │
+│ crawler_runtime │ 0.323721 │
+└───────────────────────────────┴──────────┘
+```
+
+Where our previous scraper gave us no sense of progress, Crawlee arguably gives us too much information for a program this small. Among all the logging, notice the line `Sales`. That's the page title! We managed to create a Crawlee scraper that downloads the product listing page, parses it with BeautifulSoup, extracts the title, and prints it.
+
+:::tip Asynchronous code and decorators
+
+You don't need to be an expert in asynchronous programming or decorators to finish this lesson, but you might find yourself curious for more details. If so, check out [Async IO in Python: A Complete Walkthrough](https://realpython.com/async-io-python/) and [Primer on Python Decorators](https://realpython.com/primer-on-python-decorators/).
:::
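+
+If you're curious what a decorator does under the hood, here's a minimal, framework-free sketch. The `logged` decorator and the `greet()` coroutine are made-up names for illustration only and have nothing to do with Crawlee:
+
+```py
+import asyncio
+
+def logged(func):
+    # A decorator takes a function and returns a replacement for it
+    async def wrapper(*args, **kwargs):
+        print(f"Calling {func.__name__}()")
+        return await func(*args, **kwargs)
+    return wrapper
+
+@logged  # equivalent to writing: greet = logged(greet)
+async def greet(name):
+    await asyncio.sleep(0.1)  # simulate waiting, e.g. for an HTTP response
+    print(f"Hello, {name}!")
+
+if __name__ == '__main__':
+    asyncio.run(greet("world"))
+```
+
+Crawlee's `@crawler.router.default_handler` works on the same principle: it takes the handler function we define and registers it, so the crawler can call it later for the pages it downloads.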
-
+
+---
+
+<Exercises />
+
+### Use Crawlee to scrape F1 Academy drivers
+
+The [F1 Academy](https://www.f1academy.com/) website lists all [drivers](https://www.f1academy.com/Racing-Series/Drivers) of the racing series. Use Crawlee to scrape information about each driver. Each item you push to Crawlee's default dataset should include the following data:
+
+- URL of the driver's f1academy.com page
+- Name
+- Team
+- Nationality
+- Date of birth (as a `date` object)
+- Instagram URL
+
+If you export the dataset as JSON, it should look something like this:
+
+```json
+[
+ {
+ "url": "https://www.f1academy.com/Racing-Series/Drivers/29/Emely-De-Heus",
+ "name": "Emely De Heus",
+ "team": "MP Motorsport"
+ "nationality": "Dutch",
+ "dob": "2003-02-10",
+ "instagram_url": "https://www.instagram.com/emely.de.heus/",
+ },
+ {
+ "url": "https://www.f1academy.com/Racing-Series/Drivers/28/Hamda-Al-Qubaisi",
+ "name": "Hamda Al Qubaisi",
+ "team": "MP Motorsport"
+ "nationality": "Emirati",
+ "dob": "2002-08-08",
+ "instagram_url": "https://www.instagram.com/hamdaalqubaisi_official/",
+ },
+ ...
+]
+```
+
+Hints (a quick demo of both follows the list):
+
+- Use Python's `datetime.strptime(text, "%d/%m/%Y").date()` to parse dates in the `DD/MM/YYYY` format. Check out the [docs](https://docs.python.org/3/library/datetime.html#datetime.datetime.strptime) for more details.
+- To locate the Instagram URL, use the attribute selector `a[href*='instagram']`. Learn more about attribute selectors in the [MDN docs](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors).
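+
+Here's a quick, self-contained demo of both hints. The sample date string and HTML snippet are made up for illustration:
+
+```py
+from datetime import datetime
+
+from bs4 import BeautifulSoup
+
+# Parse a date in the DD/MM/YYYY format
+dob = datetime.strptime("10/02/2003", "%d/%m/%Y").date()
+print(dob)  # 2003-02-10
+
+# Select a link whose href contains the text 'instagram'
+html = '<p><a href="https://www.instagram.com/example/">Instagram</a></p>'
+soup = BeautifulSoup(html, "html.parser")
+print(soup.select_one("a[href*='instagram']").get("href"))
+```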
+
+<details>
+  <summary>Solution</summary>
+
+ ```py
+ import asyncio
+ from datetime import datetime
+
+ from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+
+ async def main():
+ crawler = BeautifulSoupCrawler()
+
+ @crawler.router.default_handler
+ async def handle_listing(context):
+ await context.enqueue_links(selector=".teams-driver-item a", label="DRIVER")
+
+ @crawler.router.handler("DRIVER")
+ async def handle_driver(context):
+ info = {}
+ for row in context.soup.select(".common-driver-info li"):
+ name = row.select_one("span").text.strip()
+ value = row.select_one("h4").text.strip()
+ info[name] = value
+
+ detail = {}
+ for row in context.soup.select(".driver-detail--cta-group a"):
+ name = row.select_one("p").text.strip()
+ value = row.select_one("h2").text.strip()
+ detail[name] = value
+
+ await context.push_data({
+ "url": context.request.url,
+ "name": context.soup.select_one("h1").text.strip(),
+ "team": detail["Team"],
+ "nationality": info["Nationality"],
+ "dob": datetime.strptime(info["DOB"], "%d/%m/%Y").date(),
+ "instagram_url": context.soup.select_one(".common-social-share a[href*='instagram']").get("href"),
+ })
+
+ await crawler.run(["https://www.f1academy.com/Racing-Series/Drivers"])
+ await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
+
+ if __name__ == '__main__':
+ asyncio.run(main())
+ ```
+
+</details>
+
+### Use Crawlee to find the ratings of the most popular Netflix films
+
+The [Global Top 10](https://www.netflix.com/tudum/top10) page has a table listing the most popular Netflix films worldwide. Scrape the movie names from this page, then search for each movie on [IMDb](https://www.imdb.com/). Assume the first search result is correct and retrieve the film's rating. Each item you push to Crawlee's default dataset should include the following data:
+
+- URL of the film's imdb.com page
+- Title
+- Rating
+
+If you export the dataset as JSON, it should look something like this:
+
+
+```json
+[
+ {
+ "url": "https://www.imdb.com/title/tt32368345/?ref_=fn_tt_tt_1",
+ "title": "The Merry Gentlemen",
+ "rating": "5.0/10"
+ },
+ {
+ "url": "https://www.imdb.com/title/tt32359447/?ref_=fn_tt_tt_1",
+ "title": "Hot Frosty",
+ "rating": "5.4/10"
+ },
+ ...
+]
+```
+
+To scrape IMDb data, you'll need to construct a `Request` object with the appropriate search URL for each movie title. The following code snippet gives you an idea of how to do this:
+
+```py
+...
+from urllib.parse import quote_plus
+
+async def main():
+ ...
+
+ @crawler.router.default_handler
+ async def handle_netflix_table(context):
+ requests = []
+ for name_cell in context.soup.select(...):
+ name = name_cell.text.strip()
+ imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
+ requests.append(Request.from_url(imdb_search_url, label="..."))
+ await context.add_requests(requests)
+
+ ...
+...
+```
+
+When navigating to the first search result, you might find it helpful to know that `context.enqueue_links()` accepts a `limit` keyword argument, which lets you cap the number of links enqueued.
+
+<details>
+  <summary>Solution</summary>
+
+ ```py
+ import asyncio
+ from urllib.parse import quote_plus
+
+ from crawlee import Request
+ from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
+
+ async def main():
+ crawler = BeautifulSoupCrawler()
+
+ @crawler.router.default_handler
+ async def handle_netflix_table(context):
+ requests = []
+ for name_cell in context.soup.select(".list-tbl-global .tbl-cell-name"):
+ name = name_cell.text.strip()
+ imdb_search_url = f"https://www.imdb.com/find/?q={quote_plus(name)}&s=tt&ttype=ft"
+ requests.append(Request.from_url(imdb_search_url, label="IMDB_SEARCH"))
+ await context.add_requests(requests)
-Caveats which could be addressed in the rewrite:
+ @crawler.router.handler("IMDB_SEARCH")
+ async def handle_imdb_search(context):
+ await context.enqueue_links(selector=".find-result-item a", label="IMDB", limit=1)
-- all the info in the listing is already at the product page, so it's a bit redundant to scrape the products in the listing, we could just scrape the links
+ @crawler.router.handler("IMDB")
+ async def handle_imdb(context):
+ rating_selector = "[data-testid='hero-rating-bar__aggregate-rating__score']"
+ rating_text = context.soup.select_one(rating_selector).text.strip()
+ await context.push_data({
+ "url": context.request.url,
+ "title": context.soup.select_one("h1").text.strip(),
+ "rating": rating_text,
+ })
-Caveats which are reasons for framework:
+ await crawler.run(["https://www.netflix.com/tudum/top10"])
+ await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
-- it's slow
-- logging
-- a lot of boilerplate code
-- anti-scraping protection
-- browser crawling support
-- error handling
+ if __name__ == '__main__':
+ asyncio.run(main())
+ ```
--->
+
+</details>
diff --git a/sources/academy/webscraping/scraping_basics_python/images/dataset-item.png b/sources/academy/webscraping/scraping_basics_python/images/dataset-item.png
new file mode 100644
index 000000000..afd19b9f2
Binary files /dev/null and b/sources/academy/webscraping/scraping_basics_python/images/dataset-item.png differ