Stage 1a — Real-time Scraping
Status: 🔴 Not built
Overview
An incremental scrape that runs every 30 minutes. The actor stops as soon as it encounters URLs that are already present in `scrape_cache`, giving clients near-real-time data with minimal redundant scraping.
Planned Classes
| Class | File | Role |
|---|---|---|
| (not yet designed) | — | — |
What It Will Do
- Airflow triggers an Apify actor in incremental mode
- The actor scrapes page by page
- Each result is checked against `scrape_cache` via `INSERT IGNORE`
- When the actor encounters a page where all URLs are already cached, it aborts rather than continuing
- New URLs flow downstream to Stage 2 immediately
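The steps above can be sketched as a minimal loop. This is a self-contained illustration, not the planned actor: it emulates MySQL's `INSERT IGNORE` with SQLite's `INSERT OR IGNORE`, and the table layout, function names, and stop condition are all assumptions for demonstration.

```python
import hashlib
import sqlite3

# Stand-in for the real scrape_cache table (assumed schema: url_hash PK, url).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scrape_cache (url_hash TEXT PRIMARY KEY, url TEXT)")

def url_hash(url: str) -> str:
    # Placeholder hash; the real pipeline would normalise the URL first.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def cache_url(url: str) -> bool:
    """Insert one URL; return True only if it was not already cached."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO scrape_cache (url_hash, url) VALUES (?, ?)",
        (url_hash(url), url),
    )
    return cur.rowcount == 1  # rowcount is 0 when the insert was ignored

def incremental_scrape(pages):
    """Walk pages in scrape order; abort once a page yields no new URLs."""
    fresh = []
    for page in pages:
        new_urls = [u for u in page if cache_url(u)]
        if not new_urls:
            break  # cache saturated: every URL on this page was already known
        fresh.extend(new_urls)  # these flow downstream to Stage 2
    return fresh
```

The key detail is using the insert's row count as the dedup signal, so the cache check and the write are a single statement with no read-then-write race.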
Design Constraints
- Must not block or interfere with the weekly Stage 1b full refresh
- Stop-signal mechanism required — actor needs a way to know the cache is saturated
- Deduplication must be handled at the `url_hash` level, not the raw URL string (normalisation required)
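One way to hash at the `url_hash` level rather than the raw string is to canonicalise the URL before hashing. The rules below (lowercased scheme and host, trailing-slash and fragment stripping, SHA-256) are illustrative assumptions; the real normaliser may differ.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalise(url: str) -> str:
    """Canonicalise a URL so trivially different forms hash identically."""
    parts = urlsplit(url.strip())
    netloc = parts.netloc.lower()          # host is case-insensitive
    path = parts.path.rstrip("/") or "/"   # treat /a and /a/ as the same page
    # Keep the query, drop the fragment (it never reaches the server).
    return urlunsplit((parts.scheme.lower(), netloc, path, parts.query, ""))

def url_hash(url: str) -> str:
    return hashlib.sha256(normalise(url).encode("utf-8")).hexdigest()
```

With this in place, `https://example.com/a/` and `HTTPS://EXAMPLE.com/a#frag` map to the same `url_hash`, so `INSERT IGNORE` deduplicates them correctly.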
Dependencies
- Relies on Stage 1b having populated an initial baseline in `scrape_cache`
- Stage 2 must be running concurrently to consume the queue