Stage 2 — Payload Sync: Database Schema & Data Lifecycle¶
This document describes the four MySQL tables involved in Stage 2, their real schema, foreign key relationships, and how data flows through them during the sync process.
Entity Relationship Diagram¶
```mermaid
erDiagram
    scrape_cache {
        int id PK "auto_increment"
        varchar255 url UK "raw AMS URL"
        char64 url_hash "STORED GENERATED from url"
        json data_payload "full scraped item JSON"
        datetime scraped_at
        int fk_job_id FK "NULL until synced to jobs"
    }
    jobs {
        int id PK "auto_increment"
        varchar255 url UK
        char64 url_hash "STORED GENERATED from url"
        varchar45 position
        varchar45 order_number "NULL until Stage 3"
        varchar45 portal "e.g. ams"
        varchar45 employment_relationship
        varchar45 contract_type
        varchar45 detailed_location
        varchar45 education
        varchar45 original_salary
        varchar45 start_timeline
        varchar45 occupational_group
        date publication_date
        datetime created_at
        datetime updated_at
        int company_id FK "NULL until synced"
        int location_id FK "NULL until synced"
    }
    companies {
        int id PK "auto_increment"
        varchar255 company_crawler_name UK "raw name from scraper"
        varchar45 company_url
        varchar45 company_info2
    }
    locations {
        int id PK "auto_increment"
        varchar45 zipcode "match key from payload"
        varchar45 location
        varchar45 city
        varchar45 municipality
    }
    scrape_cache ||--o| jobs : "fk_job_id → jobs.id"
    jobs }o--|| companies : "company_id → companies.id"
    jobs }o--|| locations : "location_id → locations.id"
```
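The char64 width of url_hash suggests a hex-encoded SHA-256 digest. A minimal sketch of the equivalent computation in Python, assuming the generated column is defined as SHA2(url, 256) (the exact DDL is not shown in this document):

```python
import hashlib

def url_hash(url: str) -> str:
    # Assumption: the generated column is SHA2(url, 256); a hex-encoded
    # SHA-256 digest is exactly 64 characters, matching char64.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# Hypothetical AMS URL, for illustration only.
digest = url_hash("https://jobs.ams.at/public/emps/jobs/12345")
print(digest)
```

Because the column is deterministic in url, the same URL always yields the same hash, which is what makes it usable as a join key.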
Data Lifecycle — what gets written when¶
Step 1 — Cache Import (Stage 1b)¶
A scrape_cache row is created. At this point:

| Column | Value |
|---|---|
| url | AMS job URL |
| url_hash | auto-generated from url |
| data_payload | full JSON blob from Apify |
| scraped_at | timestamp of scrape |
| fk_job_id | NULL — not yet linked |
Step 2 — INSERT into jobs (JobsAustriaCacheProcess)¶
Reads scrape_cache rows where fk_job_id IS NULL, unpacks data_payload, and inserts a minimal row into jobs:
| Column | Value |
|---|---|
| url | from payload |
| url_hash | auto-generated |
| position | from payload employment_type field |
| company_id | NULL |
| location_id | NULL |
| order_number | NULL — Stage 3 fills this |
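A hedged sketch of this step, again using an in-memory SQLite stand-in with a trimmed-down schema (the payload fields and the placeholder hash value are illustrative):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url TEXT, url_hash TEXT,
                               data_payload TEXT, fk_job_id INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT UNIQUE, url_hash TEXT,
                       position TEXT, company_id INTEGER, location_id INTEGER,
                       order_number TEXT);
""")
payload = {"url": "https://jobs.ams.at/public/emps/jobs/12345",
           "employment_type": "Koch/Köchin"}
con.execute(
    "INSERT INTO scrape_cache (url, url_hash, data_payload) VALUES (?, 'h1', ?)",
    (payload["url"], json.dumps(payload)))

# Only rows not yet linked to a job are picked up.
pending = con.execute("SELECT url, url_hash, data_payload FROM scrape_cache "
                      "WHERE fk_job_id IS NULL").fetchall()
for url, h, raw in pending:
    data = json.loads(raw)
    # OR IGNORE makes re-runs harmless on the UNIQUE url column
    # (an assumption; the real process may handle duplicates differently).
    con.execute("INSERT OR IGNORE INTO jobs (url, url_hash, position) "
                "VALUES (?, ?, ?)",
                (url, h, data.get("employment_type")))

job = con.execute("SELECT position, company_id, order_number FROM jobs").fetchone()
print(job)  # company_id and order_number stay NULL at this stage
```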
Step 3 — Link scrape_cache → jobs (JobsAustriaCacheSynchronizer Queue 1)¶
Matches scrape_cache.url_hash to jobs.url_hash. Writes the FK back:
| Column | Value |
|---|---|
| scrape_cache.fk_job_id | set to jobs.id — row is now done, never re-processed |
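The back-link can be sketched as a single correlated UPDATE (SQLite stand-in with a trimmed schema; the id and hash values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url_hash TEXT,
                               fk_job_id INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, url_hash TEXT);
    INSERT INTO scrape_cache (url_hash) VALUES ('abc');
    INSERT INTO jobs (id, url_hash) VALUES (7, 'abc');
""")

# Write the matching job's id back into the cache row that produced it;
# rows already linked (fk_job_id set) are untouched.
con.execute("""
    UPDATE scrape_cache
    SET fk_job_id = (SELECT id FROM jobs
                     WHERE jobs.url_hash = scrape_cache.url_hash)
    WHERE fk_job_id IS NULL
      AND url_hash IN (SELECT url_hash FROM jobs)
""")
linked = con.execute("SELECT fk_job_id FROM scrape_cache").fetchone()[0]
print(linked)  # → 7
```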
Step 4 — Enrich jobs (JobsAustriaCacheSynchronizer Queue 2)¶
Unpacks the full data_payload JSON and updates jobs using COALESCE, so a NULL payload value never overwrites existing data:
| jobs column | payload field |
|---|---|
| company_id | company → looked up / inserted into companies |
| location_id | url_location (zipcode) → looked up in locations |
| publication_date | inserted_updated parsed from DD.MM.YYYY |
| contract_type | contract_type |
| detailed_location | locations |
| original_salary | salary |
| education | education |
| portal | extracted from url hostname |
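A minimal sketch of the COALESCE-based enrichment, including the DD.MM.YYYY date normalization (SQLite stand-in; the column subset and sample values are illustrative):

```python
import sqlite3
from datetime import datetime

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, company_id INTEGER, "
            "contract_type TEXT, publication_date TEXT)")
# Simulate a job that already has a contract_type from an earlier run.
con.execute("INSERT INTO jobs (id, contract_type) VALUES (1, 'Vollzeit')")

# inserted_updated arrives as DD.MM.YYYY and is normalized to ISO format.
pub_date = datetime.strptime("03.02.2024", "%d.%m.%Y").date().isoformat()

# COALESCE(new, existing): a NULL parameter keeps the current value.
con.execute("""
    UPDATE jobs SET
        company_id       = COALESCE(?, company_id),
        contract_type    = COALESCE(?, contract_type),
        publication_date = COALESCE(?, publication_date)
    WHERE id = 1
""", (42, None, pub_date))

enriched = con.execute("SELECT company_id, contract_type, publication_date "
                       "FROM jobs WHERE id = 1").fetchone()
print(enriched)  # contract_type 'Vollzeit' survived the NULL parameter
```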
Key design notes¶
url_hash is a STORED GENERATED column — MySQL computes it automatically from url, so never insert it directly. It serves as the join key between scrape_cache and jobs because a fixed-width hash can be indexed more efficiently than full URL strings.
fk_job_id IS NULL is the processing trigger — once it's filled the row is considered done and never re-processed by any stage.
COALESCE on all job updates — UPDATE jobs SET company_id = COALESCE(:company_id, company_id) means if the new value is NULL the existing value is kept. Safe to re-run without data loss.
locations is read-only in Stage 2 — pre-populated with Austrian zipcodes. Stage 2 only reads to map url_location → location_id. No inserts.
companies gets INSERT IGNORE — new company names inserted if not already present. Lookup by company_crawler_name (raw scraped name, not cleaned). Deduplication is exact-match only at this stage.
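The lookup-or-insert for companies can be sketched as follows (SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE; the helper name and company names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "company_crawler_name TEXT UNIQUE)")

def company_id_for(con, name):
    # Insert is silently skipped when the UNIQUE raw name already exists,
    # then the (new or existing) id is read back.
    con.execute("INSERT OR IGNORE INTO companies (company_crawler_name) "
                "VALUES (?)", (name,))
    return con.execute("SELECT id FROM companies "
                       "WHERE company_crawler_name = ?", (name,)).fetchone()[0]

a = company_id_for(con, "ACME GmbH")
b = company_id_for(con, "ACME GmbH")  # exact match: same row, no duplicate
c = company_id_for(con, "Beta AG")    # different raw name: new row
print(a, b, c)
```

Note that MySQL's default collations are case-insensitive, so whether two names that differ only in casing dedupe depends on the column's collation; the exact-match guarantee here is only "no fuzzy or cleaned-name matching".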
Source files¶
src/pipelines/JobsAustria/jobs_austria_cache_synchronizer.py
src/pipelines/JobsAustria/jobs_austria_cache_process_data_payload.py