Stage 2 — Payload Sync: Database Schema & Data Lifecycle¶
This document describes the four MySQL tables involved in Stage 2, their real schema, foreign key relationships, and how data flows through them during the sync process.
Entity Relationship Diagram¶
```mermaid
erDiagram
    scrape_cache {
        int id PK "auto_increment"
        varchar255 url UK "raw AMS URL"
        char64 url_hash "STORED GENERATED from url"
        json data_payload "full scraped item JSON"
        datetime scraped_at
        int fk_job_id FK "NULL until synced to jobs"
    }
    jobs {
        int id PK "auto_increment"
        varchar255 url UK
        char64 url_hash "STORED GENERATED from url"
        varchar45 position
        varchar45 order_number "NULL until Stage 3"
        varchar45 portal "e.g. ams"
        varchar45 employment_relationship
        varchar45 contract_type
        varchar45 detailed_location
        varchar45 education
        varchar45 original_salary
        varchar45 start_timeline
        varchar45 occupational_group
        date publication_date
        datetime created_at
        datetime updated_at
        int company_id FK "NULL until synced"
        int location_id FK "NULL until synced"
    }
    companies {
        int id PK "auto_increment"
        varchar255 company_crawler_name UK "raw name from scraper"
        varchar45 company_url
        varchar45 company_info2
    }
    locations {
        int id PK "auto_increment"
        varchar45 zipcode "match key from payload"
        varchar45 location
        varchar45 city
        varchar45 municipality
    }
    scrape_cache ||--o| jobs : "fk_job_id → jobs.id"
    jobs }o--|| companies : "company_id → companies.id"
    jobs }o--|| locations : "location_id → locations.id"
```
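The char64 width of url_hash suggests a hex-encoded SHA-256 digest. A minimal sketch of the equivalent computation in Python, assuming the generated column is defined as SHA2(url, 256) (the exact DDL is not shown in this document):

```python
import hashlib

def url_hash(url: str) -> str:
    # Assumption: the generated column is SHA2(url, 256); a hex-encoded
    # SHA-256 digest is exactly 64 characters, matching char64.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

# Hypothetical AMS URL, for illustration only.
digest = url_hash("https://jobs.ams.at/public/emps/jobs/12345")
print(digest)
```

Because the column is deterministic in url, the same URL always yields the same hash, which is what makes it usable as a join key.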
Data Lifecycle — what gets written when¶
Step 1 — Cache Import (Stage 1b)¶
A scrape_cache row is created. At this point:

| Column | Value |
|---|---|
| url | AMS job URL |
| url_hash | auto-generated from url |
| data_payload | full JSON blob from Apify |
| scraped_at | timestamp of scrape |
| fk_job_id | NULL — not yet linked |
Step 2 — INSERT into jobs (JobsAustriaCacheProcess)¶
Reads scrape_cache rows where fk_job_id IS NULL, unpacks data_payload, and inserts a minimal row into jobs:
| Column | Value |
|---|---|
| url | from payload |
| url_hash | auto-generated |
| position | from payload employment_type field |
| company_id | NULL |
| location_id | NULL |
| order_number | NULL — Stage 3 fills this |
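A hedged sketch of this step, again using an in-memory SQLite stand-in with a trimmed-down schema (the payload fields and the placeholder hash value are illustrative):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url TEXT, url_hash TEXT,
                               data_payload TEXT, fk_job_id INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT UNIQUE, url_hash TEXT,
                       position TEXT, company_id INTEGER, location_id INTEGER,
                       order_number TEXT);
""")
payload = {"url": "https://jobs.ams.at/public/emps/jobs/12345",
           "employment_type": "Koch/Köchin"}
con.execute(
    "INSERT INTO scrape_cache (url, url_hash, data_payload) VALUES (?, 'h1', ?)",
    (payload["url"], json.dumps(payload)))

# Only rows not yet linked to a job are picked up.
pending = con.execute("SELECT url, url_hash, data_payload FROM scrape_cache "
                      "WHERE fk_job_id IS NULL").fetchall()
for url, h, raw in pending:
    data = json.loads(raw)
    # OR IGNORE makes re-runs harmless on the UNIQUE url column
    # (an assumption; the real process may handle duplicates differently).
    con.execute("INSERT OR IGNORE INTO jobs (url, url_hash, position) "
                "VALUES (?, ?, ?)",
                (url, h, data.get("employment_type")))

job = con.execute("SELECT position, company_id, order_number FROM jobs").fetchone()
print(job)  # company_id and order_number stay NULL at this stage
```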
Step 3 — Link scrape_cache → jobs (JobsAustriaCacheSynchronizer Queue 1)¶
Matches scrape_cache.url_hash to jobs.url_hash. Writes the FK back:
| Column | Value |
|---|---|
| scrape_cache.fk_job_id | set to jobs.id — row is now done, never re-processed |
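The back-link can be sketched as a single correlated UPDATE (SQLite stand-in with a trimmed schema; the id and hash values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url_hash TEXT,
                               fk_job_id INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, url_hash TEXT);
    INSERT INTO scrape_cache (url_hash) VALUES ('abc');
    INSERT INTO jobs (id, url_hash) VALUES (7, 'abc');
""")

# Write the matching job's id back into the cache row that produced it;
# rows already linked (fk_job_id set) are untouched.
con.execute("""
    UPDATE scrape_cache
    SET fk_job_id = (SELECT id FROM jobs
                     WHERE jobs.url_hash = scrape_cache.url_hash)
    WHERE fk_job_id IS NULL
      AND url_hash IN (SELECT url_hash FROM jobs)
""")
linked = con.execute("SELECT fk_job_id FROM scrape_cache").fetchone()[0]
print(linked)  # → 7
```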
Step 4 — Enrich jobs (JobsAustriaCacheSynchronizer Queue 2)¶
Unpacks the full data_payload JSON and updates jobs using COALESCE, so a NULL payload value never overwrites existing data:
| jobs column | payload field |
|---|---|
| company_id | company → looked up / inserted into companies |
| location_id | url_location (zipcode) → looked up in locations |
| publication_date | inserted_updated parsed from DD.MM.YYYY |
| contract_type | contract_type |
| detailed_location | locations |
| original_salary | salary |
| education | education |
| portal | extracted from url hostname |
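A minimal sketch of the COALESCE-based enrichment, including the DD.MM.YYYY date normalization (SQLite stand-in; the column subset and sample values are illustrative):

```python
import sqlite3
from datetime import datetime

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, company_id INTEGER, "
            "contract_type TEXT, publication_date TEXT)")
# Simulate a job that already has a contract_type from an earlier run.
con.execute("INSERT INTO jobs (id, contract_type) VALUES (1, 'Vollzeit')")

# inserted_updated arrives as DD.MM.YYYY and is normalized to ISO format.
pub_date = datetime.strptime("03.02.2024", "%d.%m.%Y").date().isoformat()

# COALESCE(new, existing): a NULL parameter keeps the current value.
con.execute("""
    UPDATE jobs SET
        company_id       = COALESCE(?, company_id),
        contract_type    = COALESCE(?, contract_type),
        publication_date = COALESCE(?, publication_date)
    WHERE id = 1
""", (42, None, pub_date))

enriched = con.execute("SELECT company_id, contract_type, publication_date "
                       "FROM jobs WHERE id = 1").fetchone()
print(enriched)  # contract_type 'Vollzeit' survived the NULL parameter
```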
Key design notes¶
url_hash is a STORED GENERATED column — MySQL computes it automatically from url, so never insert it directly. It serves as the join key between scrape_cache and jobs because a fixed-width hash can be indexed more efficiently than full URL strings.
fk_job_id IS NULL is the processing trigger — once it's filled the row is considered done and never re-processed by any stage.
COALESCE on all job updates — UPDATE jobs SET company_id = COALESCE(:company_id, company_id) means if the new value is NULL the existing value is kept. Safe to re-run without data loss.
locations is read-only in Stage 2 — pre-populated with Austrian zipcodes. Stage 2 only reads to map url_location → location_id. No inserts.
companies gets INSERT IGNORE — new company names inserted if not already present. Lookup by company_crawler_name (raw scraped name, not cleaned). Deduplication is exact-match only at this stage.
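The lookup-or-insert for companies can be sketched as follows (SQLite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE; the helper name and company names are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "company_crawler_name TEXT UNIQUE)")

def company_id_for(con, name):
    # Insert is silently skipped when the UNIQUE raw name already exists,
    # then the (new or existing) id is read back.
    con.execute("INSERT OR IGNORE INTO companies (company_crawler_name) "
                "VALUES (?)", (name,))
    return con.execute("SELECT id FROM companies "
                       "WHERE company_crawler_name = ?", (name,)).fetchone()[0]

a = company_id_for(con, "ACME GmbH")
b = company_id_for(con, "ACME GmbH")  # exact match: same row, no duplicate
c = company_id_for(con, "Beta AG")    # different raw name: new row
print(a, b, c)
```

Note that MySQL's default collations are case-insensitive, so whether two names that differ only in casing dedupe depends on the column's collation; the exact-match guarantee here is only "no fuzzy or cleaned-name matching".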
Source files¶
src/pipelines/JobsAustria/jobs_austria_cache_synchronizer.py
src/pipelines/JobsAustria/jobs_austria_cache_process_data_payload.py