This is the event scraping service for the Cohere platform. It combines browser-based scraping with LLM-powered extraction to pull event information from various online sources and store it in a structured format.
- Automated event scraping from configured sources
- LLM-powered content extraction
- Intelligent deduplication
- Scheduled scraping with configurable intervals
- Admin API for managing scraping sources
- Detailed logging and monitoring
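The deduplication step can be sketched as follows. This is a minimal illustration, not the service's actual implementation: the `dedupe_key` helper, its normalization rules, and the event fields used are all assumptions.

```python
import hashlib
import re

def dedupe_key(title: str, start_date: str, venue: str) -> str:
    """Build a stable fingerprint for an event (hypothetical helper).

    Normalizes whitespace and case so the same event scraped from two
    sources with minor formatting differences collapses to one key.
    """
    parts = [re.sub(r"\s+", " ", p).strip().lower() for p in (title, start_date, venue)]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

def deduplicate(events: list[dict]) -> list[dict]:
    """Keep the first occurrence of each fingerprint, drop the rest."""
    seen: set[str] = set()
    unique = []
    for event in events:
        key = dedupe_key(event["title"], event["start_date"], event.get("venue", ""))
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique
```

Keeping the first occurrence means the earliest-scraped copy of an event wins; a real implementation might instead merge fields from duplicates.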
- Language: Python 3.11+
- Framework: FastAPI
- Scraping: Playwright
- LLM Integration: LangChain
- Database: Supabase
- Task Scheduling: Schedule
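The service uses the `schedule` library for its configurable scraping intervals; as a stdlib-only illustration of the same interval-loop pattern, a sketch might look like this (the `run_every` helper is hypothetical):

```python
import threading

def run_every(interval_seconds: float, job, stop_event: threading.Event) -> threading.Thread:
    """Run `job` repeatedly at a fixed interval until `stop_event` is set.

    Stdlib-only sketch of interval scheduling; the service itself relies
    on the `schedule` library rather than this helper.
    """
    def loop():
        # Event.wait doubles as a sleep that can be interrupted by stop_event
        while not stop_event.wait(interval_seconds):
            job()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```

Using a daemon thread keeps the scheduler from blocking shutdown; the FastAPI process can set `stop_event` in its shutdown handler to end the loop cleanly.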
- Python 3.11 or higher
- Virtual environment tool (venv)
- Playwright browser dependencies
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/cohere.git
  cd cohere-scraper
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install Playwright browsers:

  ```bash
  playwright install
  ```

- Create a `.env` file:

  ```bash
  cp .env.example .env
  ```

  Update the environment variables with your credentials.
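A sketch of how the service might read those credentials at startup. The variable names here are illustrative assumptions; check `.env.example` for the names the service actually expects.

```python
import os

# Illustrative names only -- the real required variables live in .env.example
REQUIRED_VARS = ("SUPABASE_URL", "SUPABASE_KEY")

def load_config() -> dict:
    """Read required settings from the environment, failing fast if any are absent."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast at startup surfaces a missing credential immediately rather than as an opaque error on the first database call.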
- Start the service:

  ```bash
  uvicorn src.main:app --reload
  ```
```
src/
├── scrapers/   # Scraping implementations
├── models/     # Data models and schemas
├── services/   # Business logic and external services
├── utils/      # Utility functions
└── main.py     # Application entry point
tests/          # Test files
```
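The layout above suggests a plugin-style design in which each source gets its own scraper under `src/scrapers/`. A hypothetical sketch of the interface such a scraper might implement (all names here are assumptions, not the repository's actual classes):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Event:
    """Minimal event record; the real schema in src/models/ will differ."""
    title: str
    start_date: str
    url: str

class BaseScraper(ABC):
    """Hypothetical base class for scrapers in src/scrapers/."""

    @abstractmethod
    def fetch(self, source_url: str) -> str:
        """Return raw page content for a source (e.g. via Playwright)."""

    @abstractmethod
    def extract(self, raw_content: str) -> list[Event]:
        """Turn raw content into structured events (e.g. via an LLM chain)."""

    def scrape(self, source_url: str) -> list[Event]:
        """Template method tying fetch and extract together."""
        return self.extract(self.fetch(source_url))
```

Splitting `fetch` from `extract` lets tests stub the network side and exercise extraction logic against saved HTML fixtures.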
- Start development server: `uvicorn src.main:app --reload`
- Run tests: `pytest`
- Run linting: `flake8`
- Run type checking: `mypy .`
- Format code: `black . && isort .`
When the service is running, visit:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.