Stage 3: Detail Enrichment
Status: 🟠 In progress
Classes
| Class | File | Role |
|---|---|---|
| `JobsAustriaDetailsETL` | `jobs_austria_details_scraping.py` | Main ETL: fetches pending URLs, fires Apify actors, writes results |
| `PortalRouter` | `jobs_austria_details_scraping.py` | Routes URLs to the correct Apify actor `run_input` by portal |
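
The routing role described in the table can be sketched roughly as follows. This is a minimal illustration, not the project's actual `PortalRouter`: the host mapping (`jobs.ams.at`), the helper `extract_portal`, the methods `group_by_portal` and `build_run_input`, and the `run_input` field names are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical host-to-portal mapping; the real detection rule is not shown here.
_PORTAL_HOSTS = {"jobs.ams.at": "ams"}


def extract_portal(url: str) -> str:
    """Return a portal key for a URL, or "unknown" if the host is not mapped."""
    host = (urlparse(url).hostname or "").lower()
    return _PORTAL_HOSTS.get(host, "unknown")


class PortalRouter:
    # Hypothetical run_input templates keyed by portal; field names are illustrative.
    _PORTAL_INPUTS = {"ams": {"maxConcurrency": 5}}

    def group_by_portal(self, urls):
        """Group URLs by portal key, preserving input order within each group."""
        groups = {}
        for url in urls:
            groups.setdefault(extract_portal(url), []).append(url)
        return groups

    def build_run_input(self, portal, urls):
        """Return a run_input dict for a supported portal, or None otherwise."""
        template = self._PORTAL_INPUTS.get(portal)
        if template is None:
            return None
        return {**template, "startUrls": [{"url": u} for u in urls]}
```

Callers only go through `group_by_portal` and `build_run_input`, so `_PORTAL_INPUTS` never needs to be read from outside the class.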
What It Does
- Polls `jobs` for rows where `order_number IS NULL`; these have not been detail-scraped yet
- Routes each URL by portal via `PortalRouter` (currently AMS only; others are grouped as `unknown` and skipped)
- Fires Apify detail actors in batches of 100 URLs, max 3 concurrent actors
- Streams results back via a producer/consumer async queue
- Writes enriched fields to `jobs`: `order_number`, `education`, `salary`, `employment_relationship`
- Inserts full job descriptions into the `descriptions` table
- Repeats until the queue is empty
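
The batching and streaming steps above could look roughly like the sketch below. It uses only `asyncio` from the standard library. The function names (`chunked`, `run_detail_actor`, `producer`, `consumer`, `enrich`) are hypothetical, and `run_detail_actor` is a placeholder that returns fake rows instead of calling Apify; only the batch size of 100 and the limit of 3 concurrent actors come from this page.

```python
import asyncio

BATCH_SIZE = 100      # URLs per actor run, as stated above
MAX_CONCURRENT = 3    # actor runs allowed in flight at once, as stated above


def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def run_detail_actor(batch):
    """Hypothetical stand-in for one Apify actor run; returns fake detail rows."""
    await asyncio.sleep(0)  # placeholder for the real network call
    return [{"url": url, "order_number": f"ON-{i}"} for i, url in enumerate(batch)]


async def producer(batch, queue, semaphore):
    # The semaphore caps how many actor runs are active simultaneously.
    async with semaphore:
        for row in await run_detail_actor(batch):
            await queue.put(row)


async def consumer(queue, sink):
    # Drain rows as they arrive; a None sentinel signals that producers are done.
    while True:
        row = await queue.get()
        if row is None:
            break
        sink.append(row)  # placeholder for the real database write


async def enrich(urls):
    queue = asyncio.Queue()
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    sink = []
    consumer_task = asyncio.create_task(consumer(queue, sink))
    batches = chunked(urls, BATCH_SIZE)
    await asyncio.gather(*(producer(b, queue, semaphore) for b in batches))
    await queue.put(None)  # all producers finished; tell the consumer to stop
    await consumer_task
    return sink
```

In this sketch the consumer starts before any producer finishes, so rows can be written while other batches are still being scraped.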
Known Issues
| Issue | File | Notes |
|---|---|---|
| `_fetch_pending_urls()` creates its own engine | `jobs_austria_details_scraping.py` | Should reuse `self.engine` instead |
| `_PORTAL_INPUTS` accessed from outside the class | `jobs_austria_details_scraping.py` | Should be private; access via a method |
| `_extract_portal()` duplicated | Multiple files | Same function as in `CacheSynchronizer`; move to `utils/parsing.py` |
| Only AMS supported | `PortalRouter` | Non-AMS URLs silently fall through as `unknown` and are never scraped |
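
As an illustration of the first issue, the sketch below shows a polling method that reuses a connection held by the instance instead of opening a new one on each call. It is not the project's code: `DetailsETL` is a hypothetical stand-in for `JobsAustriaDetailsETL`, an in-memory `sqlite3` connection stands in for `self.engine`, and the `id` and `url` columns are assumed. Only the `jobs` table, the `order_number` column, and the `IS NULL` filter come from this page.

```python
import sqlite3


class DetailsETL:
    """Hypothetical stand-in for JobsAustriaDetailsETL, reduced to the polling step."""

    def __init__(self, conn):
        # One shared connection plays the role of self.engine.
        self.conn = conn

    def _fetch_pending_urls(self, limit=100):
        """Return URLs of jobs that still lack an order_number."""
        cursor = self.conn.execute(
            "SELECT url FROM jobs WHERE order_number IS NULL ORDER BY id LIMIT ?",
            (limit,),
        )
        return [row[0] for row in cursor.fetchall()]


# Demo data: one pending job and one that has already been detail-scraped.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT, order_number TEXT)")
conn.executemany(
    "INSERT INTO jobs (url, order_number) VALUES (?, ?)",
    [("https://example.com/a", None), ("https://example.com/b", "ON-1")],
)
etl = DetailsETL(conn)
pending = etl._fetch_pending_urls()
```

Here `pending` contains only the URL whose `order_number` is still `NULL`, and every call goes through the same connection object.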
Future Extension Points
- Add `crawl4ai_jobs` `run_input` to `PortalRouter._PORTAL_INPUTS` for non-AMS portals
- Add salary scraping from external salary benchmarking sites
- Extract shared parsing utilities to `utils/parsing.py`