
NightSling/GCE-Scraper



GCE Scraper

Download past papers, faster.

Built with the tools and technologies:

Rust, TOML



📍 Overview

gce-scraper efficiently downloads past GCE exam papers. Users specify subjects, years, and paper types, and the tool uses multi-threading to quickly acquire and save them. This saves students and educators significant time and effort in accessing vital study materials.


👾 Features

Feature Summary
⚙️ Architecture
  • The project uses a modular architecture, with distinct modules for configuration generation (config_gen.rs), configuration management (configuration.rs), web scraping (scraper.rs), and downloading (download.rs).
  • The main application logic resides in src/main.rs, orchestrating the interaction between these modules.
  • It is built with Cargo, the standard Rust build system and package manager.
  • The application reads its settings from a TOML configuration file, which keeps download parameters out of the code. See src/configuration.rs.
🔩 Code Quality
  • The codebase is written in Rust, known for its focus on memory safety and performance.
  • The use of established crates like clap for command-line argument parsing and log for logging suggests a focus on best practices.
  • Further analysis of the code would be needed to assess aspects like code style consistency and adherence to coding standards.
  • The modular design promotes code reusability and maintainability.
📄 Documentation
  • The project consists of a Cargo.toml file and six .rs files under src/.
  • The primary language is Rust, and the documentation appears to be primarily embedded within the code itself and the Cargo.toml file.
  • The documentation could be improved by adding more detailed comments and, potentially, external documentation.
  • Install, usage, and test commands are provided, indicating some level of documentation for execution.
🔌 Integrations
  • The project uses reqwest for making HTTP requests to scrape data from a website.
  • It leverages tokio for asynchronous operations, likely improving performance, especially during web scraping and downloading.
  • serde is used for serialization, likely for handling the configuration file and potentially the scraped data.
  • The clap crate handles command-line argument parsing, providing a user-friendly interface.
🧩 Modularity
  • The codebase is divided into several modules (config_gen.rs, configuration.rs, scraper.rs, download.rs, lib.rs), promoting code organization and reusability.
  • The lib.rs file acts as a central point for exposing these modules, further enhancing modularity.
  • This modular design improves maintainability and allows for easier testing of individual components.
  • Dependencies are managed with Cargo, further supporting modularity.

📁 Project Structure

└── /
    ├── Cargo.lock
    ├── Cargo.toml
    └── src
        ├── config_gen.rs
        ├── configuration.rs
        ├── download.rs
        ├── lib.rs
        ├── main.rs
        └── scraper.rs

📂 Project Index

/
  • Cargo.toml: Defines the `gce-scraper` Rust project and its crates for command-line argument parsing, logging, HTTP requests, and parallel processing. It uses `reqwest` for web scraping, `clap` for the user interface, and `par-stream` for concurrent operations. The project's core functionality centers on data scraping and likely involves processing the acquired data using `serde` for serialization.
src
  • main.rs: The main entry point, orchestrating the GCE-Guide past paper scraper. It parses command-line arguments to either generate a configuration file specifying download parameters (papers, years, subjects, seasons) or download past papers using a provided configuration. The program uses multi-threading for efficient I/O operations and sets logging verbosity based on user input.
  • config_gen.rs: Generates a TOML configuration file. It retrieves syllabus information and paper details across the specified years and seasons, then consolidates this data, using multi-threading for efficient retrieval. The resulting file is written to the designated output path and is used by other parts of the application to manage and process papers.
  • configuration.rs: Defines a public `Configuration` struct and implements its loading from a TOML configuration file. This struct holds application-wide settings, specifically details about papers (`PaperType`) and subject-year configurations (`YearConfiguration`). It acts as a central point for managing the application's configurable parameters.
  • scraper.rs: Handles web scraping of examination papers from a specific website. It retrieves the available years and papers based on syllabus codes, paper types, and seasons, then downloads and saves the requested papers to the specified file paths. Network and parsing errors are handled explicitly.
  • download.rs: Manages the downloading and saving of papers. It reads configuration data, creates the necessary directories, and then concurrently downloads papers for the specified subjects and years using multiple threads. The module handles errors during configuration parsing and file system operations.
  • lib.rs: Establishes the core library for the project. It initializes logging and exposes the modules responsible for configuration management (`config_gen`, `configuration`), web scraping (`scraper`), and downloading (`download`). These modules collectively form the building blocks for the application's primary functionality.
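
To make the role of the configuration types more concrete, here is a minimal sketch of how a `Configuration` holding `PaperType` and `YearConfiguration` values could be expanded into a list of files to fetch. Only the three type names come from the project index; every field, the `expand_file_names` helper, and the `<syllabus>_<season><yy>_<type>_<variant>.pdf` naming pattern are assumptions made for illustration. The real structs are loaded from TOML, which is omitted here to keep the example dependency-free.

```rust
/// Kind of document to fetch for each paper (hypothetical variants).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PaperType {
    QuestionPaper,
    MarkScheme,
}

impl PaperType {
    /// Short code used inside a file name (assumed convention).
    fn code(self) -> &'static str {
        match self {
            PaperType::QuestionPaper => "qp",
            PaperType::MarkScheme => "ms",
        }
    }
}

/// One subject-year entry (hypothetical fields).
#[derive(Debug, Clone)]
pub struct YearConfiguration {
    pub syllabus: String,   // syllabus code, e.g. "9709"
    pub year: u16,          // four-digit year, e.g. 2019
    pub seasons: Vec<char>, // assumed season codes, e.g. 's' or 'w'
    pub variants: Vec<u8>,  // paper and variant number, e.g. 12
}

/// Application-wide settings (hypothetical layout).
#[derive(Debug, Clone)]
pub struct Configuration {
    pub papers: Vec<PaperType>,
    pub years: Vec<YearConfiguration>,
}

/// Expands the configuration into one file name per
/// (entry, season, paper type, variant) combination.
pub fn expand_file_names(config: &Configuration) -> Vec<String> {
    let mut names = Vec::new();
    for entry in &config.years {
        let yy = entry.year % 100; // two-digit year
        for &season in &entry.seasons {
            for &paper in &config.papers {
                for &variant in &entry.variants {
                    names.push(format!(
                        "{}_{}{:02}_{}_{}.pdf",
                        entry.syllabus,
                        season,
                        yy,
                        paper.code(),
                        variant
                    ));
                }
            }
        }
    }
    names
}
```

Under these assumptions, a configuration with syllabus `9709`, year 2019, season `s`, variant 12, and both paper types expands to `9709_s19_qp_12.pdf` and `9709_s19_ms_12.pdf`.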

🚀 Getting Started

☑️ Prerequisites

Before getting started with gce-scraper, ensure your environment meets the following requirements:

  • Programming Language: Rust
  • Package Manager: Cargo

⚙️ Installation

Install using one of the following methods:

Build from source:

  1. Clone the repository:
❯ git clone https://github.com/NightSling/GCE-Scraper.git
  2. Navigate to the project directory:
❯ cd GCE-Scraper
  3. Build the project with Cargo, which fetches the dependencies automatically:
❯ cargo build

🤖 Usage

Run the tool with Cargo. The following command lists the available options:

❯ cargo run -- --help

🧪 Testing

Run the test suite with Cargo:

❯ cargo test

📌 Project Roadmap

  • Task 1: Parallel config generation.
  • Task 2: Parallel downloading based on config.
  • Task 3: Extend support to other files, such as specimen papers.
  • Task 4: Lift the A-Level-only limitation to cover the other boards supported by GCE Guide.
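
As an illustration of Task 2, the sketch below spreads a list of download jobs across a fixed number of worker threads. It is not the project's implementation: it uses scoped threads from the standard library instead of the crates listed in Cargo.toml, and `download_all`, its parameters, and the `fetch` callback (a stand-in for the real HTTP request) are hypothetical names.

```rust
use std::thread;

/// Runs `fetch` over every job using at most `workers` threads and
/// returns one result per job, in the original job order.
/// `fetch` stands in for the real download and returns the number of
/// bytes written, or an error message.
pub fn download_all<F>(
    jobs: &[String],
    workers: usize,
    fetch: F,
) -> Vec<Result<usize, String>>
where
    F: Fn(&str) -> Result<usize, String> + Sync,
{
    if jobs.is_empty() {
        return Vec::new();
    }
    // Never use more threads than jobs, and always at least one.
    let workers = workers.clamp(1, jobs.len());
    // Ceiling division so every job lands in exactly one chunk.
    let chunk_size = (jobs.len() + workers - 1) / workers;
    let fetch = &fetch;

    thread::scope(|scope| {
        // Spawn one scoped thread per contiguous chunk of jobs.
        let handles: Vec<_> = jobs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|job| fetch(job.as_str()))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Join in spawn order, which preserves the job order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Each job reports its own result, so one failed paper does not abort the rest of the batch. A caller can retry or log the failures afterwards.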

🔰 Contributing

Contributing Guidelines
  1. Fork the Repository: Start by forking the project repository to your GitHub account.
  2. Clone Locally: Clone the forked repository to your local machine using a git client.
    git clone <your-fork-url>
  3. Create a New Branch: Always work on a new branch, giving it a descriptive name.
    git checkout -b new-feature-x
  4. Make Your Changes: Develop and test your changes locally.
  5. Commit Your Changes: Commit with a clear message describing your updates.
    git commit -m 'Implemented new feature x.'
  6. Push to GitHub: Push the changes to your forked repository.
    git push origin new-feature-x
  7. Submit a Pull Request: Create a PR against the original project repository. Clearly describe the changes and their motivations.
  8. Review: Once your PR is reviewed and approved, it will be merged into the main branch. Congratulations on your contribution!


🎗 License

This project is licensed under the Apache License, Version 2.0. For more details, refer to the LICENSE file.


🙌 Acknowledgments

  • Access all past papers easily through GCE-Guide. All papers are the property of Cambridge Assessment International Education (CAIE). The purpose of the software is not to promote piracy or the sharing of proprietary content, but rather to serve as an educational tool.
  • The README.md is generated through readme-ai.