Please add a comprehensive example to the documentation #862

Closed
matecsaj opened this issue Jan 5, 2025 · 1 comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@matecsaj
Contributor

matecsaj commented Jan 5, 2025

Please add a new example to the Crawlee Python examples page for users to follow, or somewhere else if there is a better spot.

This example would address complex, real-world scenarios where users need to combine multiple crawling techniques and technologies. By providing a fully functional, extensive example, users can copy-paste it and adapt it to their specific needs, saving them the effort of figuring out how to connect all the pieces for complicated use cases.

Proposed Workflow for the Example:

  1. Login with Playwright: Use Playwright to log in to a site and establish a session (e.g., handling cookies, tokens, or authentication).
  2. Crawl JavaScript-Heavy Pages: Use Playwright to navigate and crawl dynamic, JavaScript-heavy pages using the established session.
  3. Crawl Static Pages: Leverage the session to crawl static pages using a lightweight HTTP crawler for increased speed and efficiency.
  4. Mimic Requests: Use the session to make authenticated requests (e.g., mimicking API calls) and download JSON files.
  5. Use RESTful API: Demonstrate how to use the established session to interact with a REST API to fetch more JSON data.
  6. Use GraphQL: Extend the example further by including authenticated requests to a GraphQL API to fetch additional JSON data.
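A rough sketch of what steps 4 to 6 could look like once a login has produced session cookies. This is not Crawlee's API: it uses only the Python standard library, and the cookie values, the `example.com` URLs, and the helper names are all hypothetical. The cookie dicts are assumed to have the shape returned by Playwright's `BrowserContext.cookies()`.

```python
import json
import urllib.request

# Hypothetical cookies, shaped like the dicts Playwright returns from
# BrowserContext.cookies() after a successful login.
SESSION_COOKIES = [
    {"name": "sessionid", "value": "abc123", "domain": "example.com", "path": "/"},
    {"name": "csrftoken", "value": "xyz", "domain": "example.com", "path": "/"},
]


def cookie_header(cookies):
    """Join cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


def build_rest_request(url, cookies):
    """Build an authenticated GET request for a JSON REST endpoint."""
    headers = {"Cookie": cookie_header(cookies), "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers)


def build_graphql_request(url, query, variables, cookies):
    """Build an authenticated POST request carrying a GraphQL query."""
    body = json.dumps({"query": query, "variables": variables}).encode("utf-8")
    headers = {
        "Cookie": cookie_header(cookies),
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def fetch_json(request, timeout=10.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Against a real site, `fetch_json(build_rest_request(url, cookies))` would perform the call. This helper sends every cookie to whatever URL it is given, so it only makes sense when all requests go to the site that issued the session.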

Value to Users:

  • Efficiency: Users can copy-paste the example and simply delete the sections they don’t need, instead of piecing together solutions from scratch.
  • Real-World Applicability: Many web scraping tasks involve a mix of JavaScript-heavy crawling, lightweight static scraping, and direct API requests. A comprehensive example would address these common, yet complex scenarios.
  • Ease of Learning: Beginners can see how different technologies (Playwright, HTTP crawlers, RESTful APIs, GraphQL) work together in a single project, fostering a better understanding of Crawlee's full capabilities.
  • Customizability: The modular nature of the example makes it adaptable to a wide range of use cases, from crawling e-commerce sites to accessing complex data sources.
  • Session Persistence: The example would demonstrate session persistence and sharing between the different tools, for instance how to share cookies between Playwright and the HTTP crawler.
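To illustrate the cookie-sharing point, here is a minimal sketch of loading browser cookies into a standard-library `http.cookiejar.CookieJar`, so that an HTTP client attaches them only to requests whose domain, path, and scheme match. It assumes the cookie dicts follow the shape of Playwright's `BrowserContext.cookies()` output; `to_cookiejar` is a hypothetical helper, not part of Crawlee or Playwright.

```python
import http.cookiejar
import urllib.request


def to_cookiejar(playwright_cookies):
    """Copy Playwright-style cookie dicts into a stdlib CookieJar."""
    jar = http.cookiejar.CookieJar()
    for c in playwright_cookies:
        domain = c["domain"]
        # Playwright reports -1 for session cookies that have no expiry.
        expires = c.get("expires", -1)
        persistent = expires is not None and expires > 0
        jar.set_cookie(
            http.cookiejar.Cookie(
                version=0,
                name=c["name"],
                value=c["value"],
                port=None,
                port_specified=False,
                domain=domain,
                domain_specified=domain.startswith("."),
                domain_initial_dot=domain.startswith("."),
                path=c.get("path", "/"),
                path_specified=True,
                secure=bool(c.get("secure", False)),
                expires=int(expires) if persistent else None,
                discard=not persistent,
                comment=None,
                comment_url=None,
                rest={"HttpOnly": None} if c.get("httpOnly") else {},
            )
        )
    return jar


def build_opener_with_session(playwright_cookies):
    """Return an opener that sends the shared cookies on matching requests."""
    jar = to_cookiejar(playwright_cookies)
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
```

Unlike a hand-built `Cookie` header, the jar applies the `secure` flag and domain matching, so a cookie marked secure is not sent over plain HTTP or to an unrelated host.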
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 5, 2025
@vdusek vdusek changed the title Please add a comprehensive example to the documentation. Please add a comprehensive example to the documentation Jan 6, 2025
@vdusek
Collaborator

vdusek commented Jan 6, 2025

Hi @matecsaj, thanks for your interest in Crawlee! Have you checked out our Introduction guide? I believe it addresses most of what you are asking for. That said, I am aware we are missing a login example; I'll open a new issue to cover that (#870).

@vdusek vdusek closed this as completed Jan 6, 2025