Stage 1a — Real-time Scraping

Status: 🔴 Not built


Overview

An incremental scrape that runs every 30 minutes. The actor stops as soon as it reaches a page whose URLs are all already in scrape_cache, giving clients near-real-time data with minimal redundant scraping.


Planned Classes

Class | File | Role
(not yet designed)

What It Will Do

  1. Airflow triggers an Apify actor in incremental mode
  2. The actor scrapes page by page
  3. Each result is checked against scrape_cache via INSERT IGNORE
  4. When the actor encounters a page where all URLs are already cached, it aborts — no need to continue
  5. New URLs flow downstream to Stage 2 immediately
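The flow above can be sketched as a pure-Python simulation. All names here are hypothetical: a plain set stands in for the scrape_cache table, and the page iterator stands in for the actor's pagination.

```python
# Minimal sketch of the incremental stop logic (steps 2-4). A Python set
# stands in for scrape_cache; updating it stands in for INSERT IGNORE.

def scrape_incremental(pages, cache):
    """Scrape page by page; abort when a whole page is already cached.

    pages -- iterable of pages, each a list of URLs (stand-in for pagination)
    cache -- set of already-seen URLs (stand-in for scrape_cache)
    Returns newly discovered URLs, in scrape order.
    """
    new_urls = []
    for page in pages:
        fresh = [u for u in page if u not in cache]
        cache.update(fresh)      # INSERT IGNORE equivalent
        new_urls.extend(fresh)   # these flow downstream to Stage 2
        if not fresh:            # every URL on this page was cached
            break                # cache is saturated: abort early
    return new_urls

pages = [["a", "b"], ["c", "d"], ["e"]]
cache = {"c", "d", "e"}
print(scrape_incremental(pages, cache))  # ["a", "b"] -- stops at page 2, never fetches page 3
```

Note that the early abort is what keeps each 30-minute run cheap: once one fully-cached page is seen, everything older is assumed cached too.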

Design Constraints

  • Must not block or interfere with the weekly Stage 1b full refresh
  • Stop-signal mechanism required — actor needs a way to know the cache is saturated
  • Deduplication must be handled at the url_hash level, not the URL string (normalisation required)
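A minimal sketch of url_hash-level deduplication. The specific normalisation rules below (lower-casing scheme and host, dropping the fragment, sorting query parameters, trimming trailing slashes) are assumptions, not a settled spec:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def url_hash(url: str) -> str:
    """Normalise a URL, then hash it, so trivially different strings dedupe."""
    parts = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    normalised = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # treat /foo and /foo/ as the same
        query,
        "",                             # drop the #fragment
    ))
    return hashlib.sha256(normalised.encode()).hexdigest()
```

With these rules, `HTTP://Example.com/a?b=2&a=1#x` and `http://example.com/a/?a=1&b=2` hash identically, which is exactly the case string-level dedup would miss.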

Dependencies

  • Relies on Stage 1b having populated an initial baseline in scrape_cache
  • Stage 2 must be running concurrently to consume the queue
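The INSERT IGNORE check from step 3 can be exercised against an in-memory SQLite cache (SQLite spells it INSERT OR IGNORE; the table layout is a guess at what scrape_cache might look like):

```python
import sqlite3

# Hypothetical scrape_cache layout: url_hash as PRIMARY KEY enforces dedup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scrape_cache (url_hash TEXT PRIMARY KEY, url TEXT)")

def cache_insert(url_hash: str, url: str) -> bool:
    """Return True if the row was new (MySQL: INSERT IGNORE + affected rows)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO scrape_cache (url_hash, url) VALUES (?, ?)",
        (url_hash, url),
    )
    return cur.rowcount == 1  # 0 means the hash was already cached
```

The boolean return is what the actor's stop-signal logic would consume: a page on which every `cache_insert` returns False is the saturation point.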