feat: lesson about using a framework (#1303)
This PR introduces a new lesson to the Python course for beginners in web
scraping. The lesson is about working with a framework. Decisions I made:

- I opted not to use type hints, to keep the examples less cluttered and
to avoid having to explain type hints to people who have never used
them
- The logging section serves two purposes: first, it adds logging :)
and second, it conveniently provides the code of the whole program at
the end of the lesson
- I had a hard time coming up with exercises, because most of the
simple ideas I had were too simple and would result in shorter and
simpler code without the framework 😅
- I decided to have one classic scenario (listing & detail) just to let
the student write their first Crawlee program. It's a bit challenging
in terms of traversing the HTML to get the data, but it shouldn't be
challenging in terms of Crawlee.
- I introduced one scenario where the scraper needs to jump through
several pages (even domains) to get the result. Such a program would be
hard, or at least very annoying, to write without a framework.
- As always, I focused on basing the examples on real-world sites
which are somewhat known and popular globally, but also don't feature
extensive anti-scraping protections.
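
For illustration, the "listing & detail" traversal boils down to two handlers routed by a label — a shape a framework gives you for free. The toy sketch below is *not* Crawlee code; the site map, data, and handler names are made up, and real HTTP requests are replaced by dictionary lookups, so it only shows the control flow:

```python
import asyncio

# Hypothetical stand-ins for a real site: a listing page that links to
# detail pages, and the data found on each detail page.
LISTING = {"/products": ["/products/1", "/products/2"]}
DETAILS = {"/products/1": "Hat", "/products/2": "Scarf"}

async def handle_listing(url, queue):
    # The listing handler only discovers detail URLs and enqueues them.
    for detail_url in LISTING[url]:
        queue.append(("DETAIL", detail_url))

async def handle_detail(url, items):
    # The detail handler extracts the actual data.
    items.append({"url": url, "title": DETAILS[url]})

async def crawl():
    queue = [("LISTING", "/products")]
    items = []
    while queue:
        label, url = queue.pop(0)
        if label == "LISTING":
            await handle_listing(url, queue)
        else:
            await handle_detail(url, items)
    return items

items = asyncio.run(crawl())
print(items)
```

A framework like Crawlee supplies the queue, the label-based routing, and the concurrency; the student only writes the two handlers.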

## Crawlee feedback

Regarding Crawlee, I didn't have much trouble writing this lesson,
apart from the part where I wanted to provide hints on how to do this:

```py
requests = []
for ... in context.soup.select(...):
    ...
    requests.append(Request.from_url(imdb_search_url, label="..."))
await context.add_requests(requests)
```

I couldn't find a good example in the docs, and I was afraid that even if
I provided pointers to all the individual pieces, the student wouldn't
be able to figure it out.
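
For context, the pattern in question collects `Request` objects in a plain list inside the loop and enqueues them in one awaited batch at the end. Here is a self-contained sketch of that shape using stub stand-ins for Crawlee's `Request` and crawling context — the stub classes, the example titles, and the search URL are made up for illustration and are not Crawlee's real implementation:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class Request:
    """Minimal stand-in for Crawlee's Request."""
    url: str
    label: str

    @classmethod
    def from_url(cls, url, label):
        return cls(url=url, label=label)

@dataclass
class StubContext:
    """Stand-in for the crawling context; just records what gets enqueued."""
    queued: list = field(default_factory=list)

    async def add_requests(self, requests):
        # The real framework would deduplicate and schedule these;
        # the stub only stores them so we can inspect the result.
        self.queued.extend(requests)

async def main():
    context = StubContext()
    titles = ["Dune", "Alien"]  # imagine these came from parsing a listing page
    requests = []
    for title in titles:
        # Hypothetical search URL built per item, as in the snippet above.
        imdb_search_url = f"https://www.imdb.com/find/?q={title}"
        requests.append(Request.from_url(imdb_search_url, label="search"))
    # One awaited call enqueues the whole batch.
    await context.add_requests(requests)
    return context

context = asyncio.run(main())
print([r.url for r in context.queued])
```

The point of the batching is that the loop itself stays synchronous and simple; only the final enqueue is awaited.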

Also, I wanted to link to the docs when pointing out that
`enqueue_links()` has a `limit` argument, but I couldn't find
`enqueue_links()` in the docs. I found
[this](https://crawlee.dev/python/api/class/EnqueueLinksFunction#Methods),
which is weird. It's not clear what object is documented, or what it is;
it feels like some internals, not regular docs of a method. I can
probably guess how it ended up this way, but I don't think it's useful
this way, and I decided I don't want to send people from the course to
that page.

One more thing: I do think that Crawlee should log some "progress"
information about requests made or, especially, items scraped. It's
weird to run the program and then just stare at it as if it had hung,
waiting to see whether anything happens. E.g. Scrapy logs how many
items per minute it has scraped, which I personally find super useful.
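
To make the suggestion concrete, here is a generic sketch of the kind of throughput logging meant here, written in plain stdlib Python — this is an illustration of the idea, not an existing Crawlee feature or API:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("progress")

class ProgressLogger:
    """Periodically logs how many items have been scraped and the rate."""

    def __init__(self, interval_s=60):
        self.interval_s = interval_s
        self.count = 0
        self.started = time.monotonic()
        self.last_report = self.started

    def item_scraped(self):
        self.count += 1
        now = time.monotonic()
        if now - self.last_report >= self.interval_s:
            # Guard against division by zero right after startup.
            minutes = max((now - self.started) / 60, 1e-9)
            logger.info("scraped %d items (%.0f items/min)",
                        self.count, self.count / minutes)
            self.last_report = now

# Demo with interval_s=0 so every scraped item triggers a log line.
progress = ProgressLogger(interval_s=0)
for _ in range(3):
    progress.item_scraped()
```

A crawler framework could call something like `item_scraped()` internally whenever a handler pushes data, so the user gets this feedback for free.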

---------

Co-authored-by: Ondra Urban <[email protected]>
honzajavorek and mnmkng authored Jan 21, 2025
1 parent fd3a98d commit 89a564d
Showing 4 changed files with 583 additions and 14 deletions.
`.github/styles/config/vocabularies/Docs/accept.txt` (5 additions, 0 deletions):

```diff
@@ -88,3 +88,8 @@ preconfigured
 
 
 [Mm]ultiselect
+
+[Ss]crapy
+asyncio
+parallelization
+IMDb
```
````diff
@@ -325,7 +325,7 @@ For each job posting found, use [`pp()`](https://docs.python.org/3/library/pprin
 
 Your output should look something like this:
 
-```text
+```py
 {'title': 'Senior Full Stack Developer',
  'company': 'Baserow',
  'url': 'https://www.python.org/jobs/7705/',
````
