Stage 2 — Payload Sync: Database Schema & Data Lifecycle

This document describes the four MySQL tables involved in Stage 2: their actual schema, their foreign-key relationships, and how data flows through them during the sync process.


Entity Relationship Diagram

erDiagram

    scrape_cache {
        int         id              PK  "auto_increment"
        varchar255  url             UK  "raw AMS URL"
        char64      url_hash            "STORED GENERATED from url"
        json        data_payload        "full scraped item JSON"
        datetime    scraped_at
        int         fk_job_id       FK  "NULL until synced to jobs"
    }

    jobs {
        int         id              PK  "auto_increment"
        varchar255  url             UK
        char64      url_hash            "STORED GENERATED from url"
        varchar45   position
        varchar45   order_number        "NULL until Stage 3"
        varchar45   portal              "e.g. ams"
        varchar45   employment_relationship
        varchar45   contract_type
        varchar45   detailed_location
        varchar45   education
        varchar45   original_salary
        varchar45   start_timeline
        varchar45   occupational_group
        date        publication_date
        datetime    created_at
        datetime    updated_at
        int         company_id      FK  "NULL until synced"
        int         location_id     FK  "NULL until synced"
    }

    companies {
        int         id              PK  "auto_increment"
        varchar255  company_crawler_name UK  "raw name from scraper"
        varchar45   company_url
        varchar45   company_info2
    }

    locations {
        int         id              PK  "auto_increment"
        varchar45   zipcode             "match key from payload"
        varchar45   location
        varchar45   city
        varchar45   municipality
    }

    scrape_cache    ||--o|   jobs        : "fk_job_id → jobs.id"
    jobs            }o--||   companies   : "company_id → companies.id"
    jobs            }o--||   locations   : "location_id → locations.id"

Data Lifecycle — what gets written when

Step 1 — Cache Import (Stage 1b)

A scrape_cache row is created. At this point:

Column        Value
url           AMS job URL
url_hash      auto-generated from url
data_payload  full JSON blob from Apify
scraped_at    timestamp of scrape
fk_job_id     NULL — not yet linked
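As a sketch, the cache-import insert might look like this in the pipeline's Python (the parameter style and payload structure are illustrative assumptions; url_hash is deliberately left out of the column list because MySQL generates it):

```python
import json
from datetime import datetime, timezone

# Hypothetical scraped item as delivered by Apify (field names illustrative).
item = {"url": "https://ams.at/jobs/123", "position": "Software Engineer"}

# url_hash is NOT in the column list: it is a STORED GENERATED column.
insert_sql = """
    INSERT INTO scrape_cache (url, data_payload, scraped_at, fk_job_id)
    VALUES (%s, %s, %s, NULL)
"""
params = (item["url"], json.dumps(item), datetime.now(timezone.utc))
# cursor.execute(insert_sql, params)  # run against MySQL in the real pipeline
```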

Step 2 — INSERT into jobs (JobsAustriaCacheProcess)

Reads scrape_cache rows where fk_job_id IS NULL, unpacks data_payload, and inserts a minimal row into jobs:

Column        Value
url           from payload
url_hash      auto-generated
position      from payload employment_type field
company_id    NULL
location_id   NULL
order_number  NULL — Stage 3 fills this
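A minimal sketch of the selection step, using an in-memory SQLite table as a stand-in for MySQL (table trimmed to the relevant columns):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, data_payload TEXT, fk_job_id INTEGER)"
)
conn.execute(
    "INSERT INTO scrape_cache (data_payload, fk_job_id) VALUES (?, NULL)",
    (json.dumps({"url": "https://ams.at/jobs/123"}),),
)

# Only rows not yet linked to a jobs row are picked up.
pending = conn.execute(
    "SELECT id, data_payload FROM scrape_cache WHERE fk_job_id IS NULL"
).fetchall()
for row_id, payload_json in pending:
    payload = json.loads(payload_json)  # unpack before inserting into jobs
    print(row_id, payload["url"])
```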

Step 3 — Link scrape_cache back to jobs

Matches scrape_cache.url_hash to jobs.url_hash and writes the foreign key back:

Column                  Value
scrape_cache.fk_job_id  set to jobs.id; the row is now done and never re-processed
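The write-back can be sketched as a single UPDATE keyed on the shared hash (SQLite syntax with a correlated subquery is shown here as a stand-in; MySQL could use an UPDATE with a JOIN instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (id INTEGER PRIMARY KEY, url_hash TEXT UNIQUE);
CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url_hash TEXT UNIQUE, fk_job_id INTEGER);
INSERT INTO jobs (id, url_hash) VALUES (42, 'abc');
INSERT INTO scrape_cache (id, url_hash, fk_job_id) VALUES (1, 'abc', NULL);
""")

# Fill fk_job_id for every cache row whose hash matches a jobs row;
# rows with a non-NULL fk_job_id are already done and are skipped.
conn.execute("""
UPDATE scrape_cache
   SET fk_job_id = (SELECT j.id FROM jobs j WHERE j.url_hash = scrape_cache.url_hash)
 WHERE fk_job_id IS NULL
""")

print(conn.execute("SELECT fk_job_id FROM scrape_cache").fetchone())  # (42,)
```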

Step 4 — Enrich jobs (JobsAustriaCacheSynchronizer Queue 2)

Unpacks full data_payload JSON. Updates jobs using COALESCE — never overwrites existing data:

jobs column        payload field
company_id         company → looked up / inserted into companies
location_id        url_location (zipcode) → looked up in locations
publication_date   inserted_updated, parsed from DD.MM.YYYY
contract_type      contract_type
detailed_location  locations
original_salary    salary
education          education
portal             extracted from url hostname
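Two of these mappings involve parsing. A minimal sketch (the function names are hypothetical, and deriving portal as the first hostname label is an assumption about the actual rule):

```python
from datetime import date, datetime
from urllib.parse import urlparse

def parse_publication_date(inserted_updated: str) -> date:
    # inserted_updated arrives as DD.MM.YYYY in the payload.
    return datetime.strptime(inserted_updated, "%d.%m.%Y").date()

def extract_portal(url: str) -> str:
    # portal comes from the URL hostname, e.g. "ams"; taking the first
    # hostname label is an assumed simplification.
    host = urlparse(url).hostname or ""
    return host.split(".")[0]

print(parse_publication_date("03.01.2025"))       # 2025-01-03
print(extract_portal("https://ams.at/jobs/123"))  # ams
```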

Key design notes

url_hash is a STORED GENERATED column — MySQL computes it automatically from url. Never insert it directly. It is used as the join key between scrape_cache and jobs because a fixed-width hash can be indexed efficiently, unlike full-length URL strings.
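Assuming the generated column is SHA2(url, 256) (an assumption; the actual DDL is not shown here), its Python equivalent produces a 64-character hex digest, which is why the column is char(64):

```python
import hashlib

def url_hash(url: str) -> str:
    # Mirror of a SHA2(url, 256) STORED GENERATED column (an assumption;
    # the real hash function lives in the MySQL DDL).
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

print(len(url_hash("https://ams.at/jobs/123")))  # 64
```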

fk_job_id IS NULL is the processing trigger — once it's filled the row is considered done and never re-processed by any stage.

COALESCE on all job updates: UPDATE jobs SET company_id = COALESCE(:company_id, company_id) means that if the new value is NULL the existing value is kept. Safe to re-run without data loss.
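A quick illustration of that behaviour against an in-memory SQLite table (COALESCE works identically in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, company_id INTEGER)")
conn.execute("INSERT INTO jobs (id, company_id) VALUES (1, 7), (2, NULL)")

# A NULL parameter keeps row 1's existing company_id; row 2, which has
# no value yet, picks up the new one.
conn.execute("UPDATE jobs SET company_id = COALESCE(?, company_id) WHERE id = 1", (None,))
conn.execute("UPDATE jobs SET company_id = COALESCE(?, company_id) WHERE id = 2", (9,))

print(conn.execute("SELECT id, company_id FROM jobs ORDER BY id").fetchall())
# -> [(1, 7), (2, 9)]
```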

locations is read-only in Stage 2 — pre-populated with Austrian zipcodes. Stage 2 only reads it to map url_location → location_id. No inserts.

companies gets INSERT IGNORE — new company names inserted if not already present. Lookup by company_crawler_name (raw scraped name, not cleaned). Deduplication is exact-match only at this stage.
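The get-or-create pattern can be sketched like this, with SQLite's INSERT OR IGNORE standing in for MySQL's INSERT IGNORE (the helper name is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE companies (id INTEGER PRIMARY KEY, company_crawler_name TEXT UNIQUE)"
)

def company_id(conn, raw_name: str) -> int:
    # Exact-match dedup on the raw scraped name: insert only if absent,
    # then look the id up.
    conn.execute(
        "INSERT OR IGNORE INTO companies (company_crawler_name) VALUES (?)",
        (raw_name,),
    )
    return conn.execute(
        "SELECT id FROM companies WHERE company_crawler_name = ?", (raw_name,)
    ).fetchone()[0]

a = company_id(conn, "ACME GmbH")
b = company_id(conn, "ACME GmbH")  # second call hits the existing row
print(a == b)  # True
```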


Source files

  • src/pipelines/JobsAustria/jobs_austria_cache_synchronizer.py
  • src/pipelines/JobsAustria/jobs_austria_cache_process_data_payload.py