# Stage 2 — Payload Sync
Status: 🟠 In progress
## Classes

| Class | File | Role |
|---|---|---|
| `JobsAustriaCacheProcess` | `jobs_austria_cache_process_data_payload.py` | Orchestration loop — polls and calls the synchronizer |
| `JobsAustriaCacheSynchronizer` | `jobs_austria_cache_synchronizer.py` | Business logic — unpacks payload, writes to `jobs`, updates FKs |
## What It Does

- Polls `scrape_cache` every 30 seconds for rows where `fk_job_id IS NULL`
- Fetches a batch and unpacks the `data_payload` JSON column
- Extracts: `url`, `url_hash`, `position`, `company`, `location`, `publication_date`, `portal`
- Inserts new rows into `jobs` (deduplicates via the `url_hash` unique constraint)
- Writes `jobs.id` back into `scrape_cache.fk_job_id` — marks the row as processed
- Enriches `jobs` with `company_id`, `location_id`, `publication_date`, `portal`
- Repeats until the queue is empty
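One pass of the loop above can be sketched as follows. The table and column names (`scrape_cache`, `data_payload`, `fk_job_id`, `jobs`, `url_hash`) come from this document; the simplified schema, the `process_once` name taken from the issue list, and the use of an in-memory SQLite database are assumptions for illustration only.

```python
import json
import sqlite3

# Minimal sketch of one sync pass; the real tables have more columns
# (company, location, publication_date, ...) and live in the project DB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scrape_cache (
    id INTEGER PRIMARY KEY,
    data_payload TEXT,
    fk_job_id INTEGER          -- NULL until the row is processed
);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    url TEXT,
    url_hash TEXT UNIQUE,      -- deduplication constraint
    position TEXT,
    portal TEXT
);
""")

# Seed one unprocessed cache row (hypothetical payload).
conn.execute(
    "INSERT INTO scrape_cache (data_payload) VALUES (?)",
    (json.dumps({"url": "https://example.at/job/1", "url_hash": "abc123",
                 "position": "Data Engineer", "portal": "example"}),),
)

def process_once(conn, batch_size=100):
    """Process one batch: unpack payloads, insert jobs, backfill fk_job_id."""
    rows = conn.execute(
        "SELECT id, data_payload FROM scrape_cache "
        "WHERE fk_job_id IS NULL LIMIT ?", (batch_size,)
    ).fetchall()
    for cache_id, payload in rows:
        data = json.loads(payload)
        # INSERT OR IGNORE leans on the url_hash UNIQUE constraint to dedupe.
        conn.execute(
            "INSERT OR IGNORE INTO jobs (url, url_hash, position, portal) "
            "VALUES (?, ?, ?, ?)",
            (data["url"], data["url_hash"], data["position"], data["portal"]),
        )
        job_id = conn.execute(
            "SELECT id FROM jobs WHERE url_hash = ?", (data["url_hash"],)
        ).fetchone()[0]
        # Writing fk_job_id back marks the cache row as processed.
        conn.execute(
            "UPDATE scrape_cache SET fk_job_id = ? WHERE id = ?",
            (job_id, cache_id),
        )
    conn.commit()
    return len(rows)

process_once(conn)
```

A caller would invoke `process_once` on a timer (every 30 seconds per the description above) until it returns 0, meaning the queue is empty.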
## Known Issues

| Issue | File | Notes |
|---|---|---|
| Overlapping responsibilities | Both files | `JobsAustriaCacheProcess` and `JobsAustriaCacheSynchronizer` overlap — should be consolidated |
| `process_once()` does too much | `cache_process_data_payload.py` | Needs to be split into focused, single-responsibility helpers |
| `_extract_portal()` duplicated | Multiple files | Same function exists in `CacheSynchronizer` and `DetailsETL` — move to `utils/parsing.py` |
| `_parse_date()` duplicated | Multiple files | Same as above — move to `utils/parsing.py` |
| Leftover draft class | `jobs_austria_cache_key_sync.py` | `JobsAustriaCacheProcessRework` is an unused draft — safe to delete |
## Future Extension Points

- Extract `_extract_portal()`, `_parse_date()`, `_str_or_none()` into a shared `utils/parsing.py`
- Consolidate the two classes once the refactor is stable