
Stage 2 — Payload Sync

Status: 🟠 In progress


Classes

| Class | File | Role |
| --- | --- | --- |
| JobsAustriaCacheProcess | jobs_austria_cache_process_data_payload.py | Orchestration loop — polls and calls the synchronizer |
| JobsAustriaCacheSynchronizer | jobs_austria_cache_synchronizer.py | Business logic — unpacks the payload, writes to jobs, updates FKs |

What It Does

  1. Polls scrape_cache every 30 seconds for rows where fk_job_id IS NULL
  2. Fetches a batch and unpacks the data_payload JSON column
  3. Extracts: url, url_hash, position, company, location, publication_date, portal
  4. Inserts new rows into jobs (deduplicates via url_hash unique constraint)
  5. Writes jobs.id back into scrape_cache.fk_job_id — marks the row as processed
  6. Enriches jobs with company_id, location_id, publication_date, portal
  7. Repeats until the queue is empty
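The loop above can be sketched roughly as follows. This is a minimal illustration, assuming a SQLite backing store with the scrape_cache and jobs tables named in this doc; the real classes may use a different database or ORM, and the function name sync_once and the exact column set are assumptions, not the project's actual API.

```python
import json
import sqlite3


def sync_once(conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Process one batch of unprocessed cache rows; return the rows handled."""
    rows = conn.execute(
        "SELECT id, data_payload FROM scrape_cache "
        "WHERE fk_job_id IS NULL LIMIT ?",
        (batch_size,),
    ).fetchall()
    for cache_id, payload in rows:
        data = json.loads(payload)
        # INSERT OR IGNORE relies on the url_hash unique constraint for dedup:
        # a second row with the same hash is silently skipped.
        conn.execute(
            "INSERT OR IGNORE INTO jobs "
            "(url, url_hash, position, company, location, publication_date, portal) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (
                data.get("url"),
                data["url_hash"],
                data.get("position"),
                data.get("company"),
                data.get("location"),
                data.get("publication_date"),
                data.get("portal"),
            ),
        )
        job_id = conn.execute(
            "SELECT id FROM jobs WHERE url_hash = ?", (data["url_hash"],)
        ).fetchone()[0]
        # Writing the FK back is what marks the cache row as processed.
        conn.execute(
            "UPDATE scrape_cache SET fk_job_id = ? WHERE id = ?",
            (job_id, cache_id),
        )
    conn.commit()
    return len(rows)
```

A caller would invoke this every 30 seconds until it returns 0, which corresponds to "repeats until the queue is empty" above.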

Known Issues

| Issue | File | Notes |
| --- | --- | --- |
| Overlapping responsibilities | Both files | JobsAustriaCacheProcess and JobsAustriaCacheSynchronizer overlap — should be consolidated |
| process_once() does too much | cache_process_data_payload.py | Needs to be split into focused, single-responsibility helpers |
| _extract_portal() duplicated | Multiple files | Same function exists in CacheSynchronizer and DetailsETL — move to utils/parsing.py |
| _parse_date() duplicated | Multiple files | Same as above |
| Leftover draft class | jobs_austria_cache_key_sync.py | JobsAustriaCacheProcessRework is an unused draft — safe to delete |

Future Extension Points

  • Extract _extract_portal(), _parse_date(), _str_or_none() into a shared utils/parsing.py
  • Consolidate the two classes once the refactor is stable