Stage 1ab — Full Refresh

Status: 🟠 Needs improvements
Runs: Every Monday (full scrape of all AMS filters)


Classes

| Class | File |
|---|---|
| `JobsAustriaETLCache` | `jobs_austria_cache_import.py` |

What It Does

  1. Airflow triggers the actor in all-filters mode (no incremental stop condition).
  2. Apify scrapes every page across all AMS search filters.
  3. Results stream back through a producer/consumer async queue.
  4. Each item is inserted into `scrape_cache` with `INSERT IGNORE`, so duplicates are silently skipped.
  5. New rows have `fk_job_id = NULL` and queue up for Stage 2.
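The producer/consumer flow in steps 3–4 can be sketched as follows. This is a minimal, self-contained sketch, not the actual class: it uses an in-memory SQLite table as a stand-in for `scrape_cache` (SQLite's `INSERT OR IGNORE` plays the role of MySQL's `INSERT IGNORE`), and the names `producer`, `consumer`, and `run_pipeline` are illustrative.

```python
import asyncio
import sqlite3


def make_cache() -> sqlite3.Connection:
    # In-memory stand-in for scrape_cache; fk_job_id starts as NULL
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scrape_cache ("
        " item_id TEXT PRIMARY KEY,"
        " payload TEXT,"
        " fk_job_id INTEGER DEFAULT NULL)"
    )
    return conn


async def producer(queue: asyncio.Queue, items) -> None:
    # Streams scraped items into the queue; None is the end-of-stream sentinel
    for item in items:
        await queue.put(item)
    await queue.put(None)


async def consumer(queue: asyncio.Queue, conn: sqlite3.Connection) -> int:
    # Inserts each item; duplicates are silently skipped by INSERT OR IGNORE
    inserted = 0
    while True:
        item = await queue.get()
        if item is None:
            break
        cur = conn.execute(
            "INSERT OR IGNORE INTO scrape_cache (item_id, payload) VALUES (?, ?)",
            (item["id"], item["payload"]),
        )
        inserted += cur.rowcount  # rowcount is 0 when the insert was ignored
    return inserted


async def run_pipeline(items) -> int:
    conn = make_cache()
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    prod = asyncio.create_task(producer(queue, items))
    inserted = await consumer(queue, conn)
    await prod
    return inserted
```

Running this with a duplicate item shows the skip behavior: two items sharing an `item_id` result in a single row, and the duplicate insert reports zero affected rows.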

Known Issues

| Issue | Notes |
|---|---|
| Engine created twice | Created in `__init__` and again in `connect()`; the second creation is redundant |
| `run()` indentation | Inconsistent indentation causes logic to run outside the intended scope |
| No stop-signal | Incremental mode (Stage 1a) is not yet wired in |
| Hardcoded path | The `__main__` block references a local file path, which is not portable |
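One way to resolve the first issue is to make `connect()` the single owner of engine creation, initializing it lazily on first use. This is a hedged sketch: the real class's internals and its engine factory are assumed (the `_create_engine` placeholder stands in for something like `sqlalchemy.create_engine`).

```python
class JobsAustriaETLCache:
    """Sketch: create the engine once, lazily, instead of in two places."""

    def __init__(self, url: str):
        self._url = url
        self._engine = None  # not created here; connect() owns creation

    def connect(self):
        # Create the engine only on first call, then reuse the same object
        if self._engine is None:
            self._engine = self._create_engine(self._url)
        return self._engine

    def _create_engine(self, url: str):
        # Placeholder for the real engine factory (e.g. sqlalchemy.create_engine)
        return {"url": url}
```

Repeated calls to `connect()` then return the same engine instance, which also makes the class cheap to construct when no database work happens.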

Future Extension Points

  • Add a stop-signal handshake so Stage 1a can reuse this class with an abort condition
  • Replace the hardcoded `__main__` path with an environment variable or config file
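The stop-signal handshake could be built around an `asyncio.Event` shared between the scrape loop and a controller: full refresh passes no event and scrapes everything, while incremental mode sets the event to abort early. A minimal sketch under those assumptions (the function names are hypothetical, not from the actual code):

```python
import asyncio
from typing import Optional


async def scrape_pages(pages, stop: Optional[asyncio.Event] = None):
    # Full refresh (Stage 1ab) passes stop=None and scrapes everything.
    # Incremental mode (Stage 1a) sets the event once a known item is
    # seen, aborting the remaining pages.
    collected = []
    for page in pages:
        if stop is not None and stop.is_set():
            break
        collected.append(page)
        await asyncio.sleep(0)  # yield control so the event can be set
    return collected


async def incremental_run(total_pages: int):
    # Hypothetical Stage 1a driver: a controller sets the stop event
    # shortly after the scrape starts (e.g. a known job id was seen).
    stop = asyncio.Event()

    async def controller():
        await asyncio.sleep(0)
        stop.set()

    task = asyncio.create_task(controller())
    pages = await scrape_pages(range(total_pages), stop)
    await task
    return pages
```

With `stop=None` the loop behaves exactly like today's full refresh, so Stage 1ab needs no changes when Stage 1a is wired in.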