Stage 1a — Real-time Scraping

Status: 🔴 Not built


Overview

An incremental scrape that runs every 30 minutes. The actor stops as soon as it reaches a page whose URLs are all already in scrape_cache, giving clients near-real-time data with minimal redundant scraping.


Planned Classes

Class | File | Role
(not yet designed)

What It Will Do

  1. Airflow triggers an Apify actor in incremental mode
  2. The actor scrapes page by page
  3. Each result is checked against scrape_cache via INSERT IGNORE
  4. When the actor encounters a page where all URLs are already cached, it aborts — no need to continue
  5. New URLs flow downstream to Stage 2 immediately
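The flow above can be sketched as a pure-Python simulation. All names here are hypothetical: a plain set stands in for the scrape_cache table, and the page iterator stands in for the actor's pagination.

```python
# Minimal sketch of the incremental stop logic (steps 2-4). A Python set
# stands in for scrape_cache; updating it stands in for INSERT IGNORE.

def scrape_incremental(pages, cache):
    """Scrape page by page; abort when a whole page is already cached.

    pages -- iterable of pages, each a list of URLs (stand-in for pagination)
    cache -- set of already-seen URLs (stand-in for scrape_cache)
    Returns newly discovered URLs, in scrape order.
    """
    new_urls = []
    for page in pages:
        fresh = [u for u in page if u not in cache]
        cache.update(fresh)      # INSERT IGNORE equivalent
        new_urls.extend(fresh)   # these flow downstream to Stage 2
        if not fresh:            # every URL on this page was cached
            break                # cache is saturated: abort early
    return new_urls

pages = [["a", "b"], ["c", "d"], ["e"]]
cache = {"c", "d", "e"}
print(scrape_incremental(pages, cache))  # ["a", "b"] -- stops at page 2, never fetches page 3
```

Note that the early abort is what keeps each 30-minute run cheap: once one fully-cached page is seen, everything older is assumed cached too.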

Design Constraints

  • Must not block or interfere with the weekly Stage 1b full refresh
  • Stop-signal mechanism required — actor needs a way to know the cache is saturated
  • Deduplication must be handled at the url_hash level, not the URL string (normalisation required)
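A minimal sketch of url_hash-level deduplication. The specific normalisation rules below (lower-casing scheme and host, dropping the fragment, sorting query parameters, trimming trailing slashes) are assumptions, not a settled spec:

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def url_hash(url: str) -> str:
    """Normalise a URL, then hash it, so trivially different strings dedupe."""
    parts = urlsplit(url.strip())
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    normalised = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",  # treat /foo and /foo/ as the same
        query,
        "",                             # drop the #fragment
    ))
    return hashlib.sha256(normalised.encode()).hexdigest()
```

With these rules, `HTTP://Example.com/a?b=2&a=1#x` and `http://example.com/a/?a=1&b=2` hash identically, which is exactly the case string-level dedup would miss.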

Dependencies

  • Relies on Stage 1b having populated an initial baseline in scrape_cache
  • Stage 2 must be running concurrently to consume the queue
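The INSERT IGNORE check from step 3 can be exercised against an in-memory SQLite cache (SQLite spells it INSERT OR IGNORE; the table layout is a guess at what scrape_cache might look like):

```python
import sqlite3

# Hypothetical scrape_cache layout: url_hash as PRIMARY KEY enforces dedup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scrape_cache (url_hash TEXT PRIMARY KEY, url TEXT)")

def cache_insert(url_hash: str, url: str) -> bool:
    """Return True if the row was new (MySQL: INSERT IGNORE + affected rows)."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO scrape_cache (url_hash, url) VALUES (?, ?)",
        (url_hash, url),
    )
    return cur.rowcount == 1  # 0 means the hash was already cached
```

The boolean return is what the actor's stop-signal logic would consume: a page on which every `cache_insert` returns False is the saturation point.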