Stage 3 — Detail Enrichment

Status: 🟠 In progress


Classes

| Class | File | Role |
|---|---|---|
| JobsAustriaDetailsETL | jobs_austria_details_scraping.py | Main ETL: fetches pending URLs, fires Apify actors, writes results |
| PortalRouter | jobs_austria_details_scraping.py | Routes URLs to the correct Apify actor run_input by portal |
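
As an illustration of the routing role, here is a minimal sketch that groups URLs by portal and builds a run_input per group. It is not the project's code: the helper names (`extract_portal`, `group_by_portal`, `build_run_input`), the `ams.at` host suffix, and the `startUrls` input shape are all assumptions made for the example, and the real actor's input schema may differ.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical mapping from host suffix to portal key ("ams.at" is an assumption).
_PORTAL_HOSTS = {"ams.at": "ams"}


def extract_portal(url: str) -> str:
    """Return a portal key for the URL, or "unknown" if no host suffix matches."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, portal in _PORTAL_HOSTS.items():
        if host == suffix or host.endswith("." + suffix):
            return portal
    return "unknown"


def group_by_portal(urls):
    """Group URLs into {portal: [urls]}, keeping unmatched URLs under "unknown"."""
    groups = defaultdict(list)
    for url in urls:
        groups[extract_portal(url)].append(url)
    return dict(groups)


def build_run_input(portal, urls):
    """Build an actor input for a supported portal, or None if unsupported."""
    if portal == "ams":
        # Hypothetical input shape; check the actor's schema before relying on it.
        return {"startUrls": [{"url": u} for u in urls]}
    return None
```

Returning None for unsupported portals makes the skip explicit to the caller, which can then decide whether to log or count those URLs.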

What It Does

  1. Polls the jobs table for rows where order_number IS NULL; these have not been detail-scraped yet
  2. Routes each URL by portal via PortalRouter (currently AMS only; URLs from other portals are grouped as unknown and skipped)
  3. Fires Apify detail actors in batches of 100 URLs, max 3 concurrent actors
  4. Streams results back via a producer/consumer async queue
  5. Writes enriched fields to the jobs table: order_number, education, salary, employment_relationship
  6. Inserts full job descriptions into the descriptions table
  7. Repeats until the queue is empty
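
The batching and concurrency in steps 3 and 4 can be sketched with asyncio alone. This is a minimal illustration under stated assumptions, not the actual ETL: `fake_actor_run` stands in for the real Apify actor call, the result fields are demo values, and all function names are hypothetical. Only the batch size of 100 and the limit of 3 concurrent runs come from the steps above.

```python
import asyncio

BATCH_SIZE = 100      # URLs per actor run (from step 3)
MAX_CONCURRENT = 3    # actor runs in flight at once (from step 3)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def fake_actor_run(batch):
    """Stand-in for an Apify actor call; returns one demo result per URL."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [{"url": url, "order_number": f"demo-{i}"} for i, url in enumerate(batch)]


async def producer(batch, semaphore, queue):
    """Run one batch under the concurrency limit and push its results."""
    async with semaphore:
        for item in await fake_actor_run(batch):
            await queue.put(item)


async def consumer(queue, sink):
    """Drain results until the None sentinel arrives."""
    while True:
        item = await queue.get()
        if item is None:
            break
        sink.append(item)  # the real pipeline would write to the database here


async def enrich(urls):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    queue = asyncio.Queue()
    sink = []
    consumer_task = asyncio.create_task(consumer(queue, sink))
    await asyncio.gather(
        *(producer(batch, semaphore, queue) for batch in chunked(urls, BATCH_SIZE))
    )
    await queue.put(None)  # signal that all producers are done
    await consumer_task
    return sink
```

Calling `asyncio.run(enrich(urls))` with 250 URLs creates three batches (100, 100, 50). The semaphore caps how many runs are in flight, while the queue lets the consumer start handling results before every batch has finished.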

Known Issues

| Issue | File | Notes |
|---|---|---|
| _fetch_pending_urls() creates its own engine | jobs_austria_details_scraping.py | Should reuse self.engine instead |
| _PORTAL_INPUTS accessed from outside the class | jobs_austria_details_scraping.py | Should be private; access via a method |
| _extract_portal() duplicated | Multiple files | Same function as in CacheSynchronizer; move to utils/parsing.py |
| Only AMS supported | PortalRouter | Non-AMS URLs silently fall through as unknown and are never scraped |
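
For the first issue, the intended shape is roughly the following: the class receives one database handle and every helper reuses it. This sketch uses the standard-library sqlite3 module as a stand-in for the project's engine so that it runs on its own. The class name `DetailsETLSketch`, the `url` column, and the `_write_result` helper are assumptions for illustration; only the jobs table and the order_number IS NULL condition come from this page.

```python
import sqlite3


class DetailsETLSketch:
    """Owns one connection and reuses it in every helper."""

    def __init__(self, conn):
        self.conn = conn  # plays the role of self.engine in the real class

    def _fetch_pending_urls(self, limit=100):
        # Reuses self.conn instead of opening a new connection per call.
        rows = self.conn.execute(
            "SELECT url FROM jobs WHERE order_number IS NULL LIMIT ?", (limit,)
        ).fetchall()
        return [row[0] for row in rows]

    def _write_result(self, url, order_number):
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "UPDATE jobs SET order_number = ? WHERE url = ?", (order_number, url)
            )


# Demo data in an in-memory database (the "url" column is an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (url TEXT PRIMARY KEY, order_number TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)", [("u1", None), ("u2", "A-1"), ("u3", None)]
)
etl = DetailsETLSketch(conn)
```

Once a row receives an order_number it drops out of the pending query, which is what allows the polling loop to terminate.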

Future Extension Points

  • Add crawl4ai_jobs run_input to PortalRouter._PORTAL_INPUTS for non-AMS portals
  • Add salary scraping from external salary benchmarking sites
  • Extract shared parsing utilities to utils/parsing.py