Stage 1ab — Full Refresh
Status: 🟠 Needs improvements
Runs: Every Monday (full scrape of all AMS filters)
Classes
What It Does
- Airflow triggers the actor with all-filters mode (no incremental stop condition)
- Apify scrapes every page across all AMS search filters
- Results stream back via a producer/consumer async queue
- Each item is inserted into scrape_cache using INSERT IGNORE, so duplicates are silently skipped
- New rows have fk_job_id = NULL and queue up for Stage 2
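
The flow above can be sketched roughly as follows. This is a minimal illustration, not the actual Stage 1ab code: it uses an in-memory SQLite table as a stand-in for the real database (SQLite spells INSERT IGNORE as INSERT OR IGNORE), and the column names ad_id and payload, the helper make_cache, and the function names are all assumptions made for the example.

```python
import asyncio
import sqlite3

SENTINEL = None  # marks the end of the scraped stream


def make_cache():
    """Create a stand-in scrape_cache table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scrape_cache ("
        "ad_id TEXT PRIMARY KEY, payload TEXT, fk_job_id INTEGER DEFAULT NULL)"
    )
    return conn


async def producer(queue, pages):
    """Push every scraped item onto the queue, then signal completion."""
    for page in pages:
        for item in page:
            await queue.put(item)
    await queue.put(SENTINEL)


async def consumer(queue, conn):
    """Insert items as they arrive; return how many rows were new."""
    inserted = 0
    while True:
        item = await queue.get()
        if item is SENTINEL:
            break
        # Duplicate primary keys are skipped instead of raising an error.
        cur = conn.execute(
            "INSERT OR IGNORE INTO scrape_cache (ad_id, payload) VALUES (?, ?)",
            (item["ad_id"], item["payload"]),
        )
        inserted += cur.rowcount  # 1 for a new row, 0 for a skipped duplicate
    conn.commit()
    return inserted


async def full_refresh(pages, conn):
    """Run producer and consumer concurrently over a bounded queue."""
    queue = asyncio.Queue(maxsize=100)
    _, inserted = await asyncio.gather(
        producer(queue, pages), consumer(queue, conn)
    )
    return inserted
```

Rows inserted this way keep fk_job_id as NULL, which is what marks them as pending for Stage 2.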
Known Issues
| Issue | Notes |
| --- | --- |
| Engine created twice | Created in __init__ and again in connect(), which is redundant |
| run() indentation | Inconsistent indentation causes logic to run outside the intended scope |
| No stop-signal | Incremental mode (Stage 1a) is not yet wired in |
| Hardcoded path | __main__ block references a local file path, so it is not portable |
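
One possible way to address the duplicate engine is to create it lazily in connect() only. The sketch below is an assumption about how that could look, not the existing class: FullRefreshScraper and engine_factory are hypothetical names, and the factory stands in for whatever call currently builds the engine.

```python
from typing import Any, Callable, Optional


class FullRefreshScraper:
    """Hypothetical stand-in for the Stage 1ab class."""

    def __init__(self, engine_factory: Callable[[], Any]) -> None:
        # Store only the factory; no engine is created in __init__.
        self._engine_factory = engine_factory
        self._engine: Optional[Any] = None

    def connect(self) -> Any:
        """Create the engine on first use and reuse it afterwards."""
        if self._engine is None:
            self._engine = self._engine_factory()
        return self._engine
```

With this shape, calling connect() repeatedly returns the same object, so the engine is built exactly once per instance.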
Future Extension Points
- Add stop-signal handshake so Stage 1a can reuse this class with an abort condition
- Replace hardcoded __main__ path with an environment variable or config file
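
Both extension points could be approached along these lines. This is a sketch under stated assumptions: the environment variable name STAGE1_INPUT_PATH, the default filters.json, and the should_stop callback are hypothetical and are not taken from the current code.

```python
import os
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def resolve_input_path(default: str = "filters.json") -> Path:
    """Read the path from the environment instead of hardcoding it."""
    return Path(os.environ.get("STAGE1_INPUT_PATH", default))


def scrape_pages(
    pages: Iterable[List[dict]],
    should_stop: Optional[Callable[[dict], bool]] = None,
) -> List[dict]:
    """Collect items page by page.

    With should_stop=None this behaves like the full refresh (no stop
    condition). Stage 1a could pass a predicate that returns True once a
    previously seen item is reached, which aborts the scrape early.
    """
    collected: List[dict] = []
    for page in pages:
        for item in page:
            if should_stop is not None and should_stop(item):
                return collected
            collected.append(item)
    return collected
```

Passing the abort condition as a callback keeps the full-refresh path unchanged while letting the incremental stage decide when to stop.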