
# JobsAustria Pipeline — Data Flow Diagram

> Participants marked 🗄️ in the diagram are MySQL tables — the tables themselves are the message bus. No stage talks directly to another.

## Implementation Status

| Stage | Status | Description |
| --- | --- | --- |
| 1 — Real-time scraping (every 30 min) | 🔴 Not built | Incremental scrape; stops once cached URLs are detected. Gives clients a real-time edge. |
| 1b — Full refresh (every Monday) | 🟠 Needs improvements | Full scrape to detect stale/removed listings. Keeps the DB clean. |
| 2 — Payload Sync | 🟠 In progress | Imports payload JSON into `jobs`, `companies`, `locations`. Syncs FKs. |
| 3 — Detail Enrichment | 🟠 In progress | Scrapes full AMS detail pages via Apify. Writes to `jobs` + `descriptions`. |
| 4 — Additional Info | 🔴 Not built | LinkedIn data, company firmographics, salary benchmarks. |
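The dedup in Stages 1 and 1b hinges on `url_hash`, which is what makes `INSERT IGNORE` into `scrape_cache` skip already-seen listings. A minimal sketch of deriving such a key — the actual hashing scheme isn't specified here, so SHA-256 over a normalized URL is an assumption:

```python
import hashlib

def url_hash(url: str) -> str:
    """Derive a fixed-width dedup key for scrape_cache.url_hash.

    Assumption: the pipeline hashes a normalized URL with SHA-256; a
    UNIQUE index on this column is what lets INSERT IGNORE silently
    skip already-cached listings.
    """
    normalized = url.strip().rstrip("/").lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

With a UNIQUE KEY on `url_hash`, the write path becomes idempotent, which both the 30-minute incremental scrape and the Monday full refresh rely on.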

## Data Flow Diagram

```mermaid
sequenceDiagram
    autonumber

    actor Airflow
    participant AMS
    participant Apify
    participant SC as scrape_cache 🗄️
    participant J as jobs 🗄️
    participant PS as Payload Sync
    participant DE as Detail Enrichment
    participant AI as Additional Info

    note over Airflow,SC: Stage 1 🔴 — Real-time scraping (every 30 min)
    Airflow->>Apify: Start actor — incremental mode
    loop Each page of results
        Apify->>AMS: Scrape page
        AMS-->>Apify: Return listings
        Apify-->>SC: INSERT IGNORE — url, url_hash, data_payload
        SC-->>Apify: ⚠️ NOT BUILT — all URLs cached, abort
    end

    note over Airflow,SC: Stage 1b 🟠 — Full refresh (every Monday)
    Airflow->>Apify: Start actor — all filters mode
    Apify->>AMS: Scrape everything
    AMS-->>Apify: Return all listings
    Apify-->>SC: INSERT IGNORE — duplicates skipped

    note over PS,J: Stage 2 🟠 — Payload Sync
    loop Poll every 30s until queue empty
        PS->>SC: Fetch batch where fk_job_id IS NULL
        SC-->>PS: Return rows + data_payload JSON
        PS->>J: INSERT new jobs — url, url_hash, position
        PS->>SC: UPDATE fk_job_id for matched rows
        PS->>J: UPDATE company_id, location_id, publication_date, portal
    end

    note over DE,J: Stage 3 🟠 — Detail Enrichment
    loop Poll until queue empty
        DE->>J: Fetch URLs where order_number IS NULL
        J-->>DE: Return pending URLs
        DE->>Apify: Fire detail actor — batches of 100, max 3 concurrent
        Apify->>AMS: Scrape full job detail pages
        AMS-->>Apify: Return details
        Apify-->>J: UPDATE order_number, education, salary, employment_relationship
        Apify-->>J: INSERT into descriptions
    end

    note over AI,J: Stage 4 🔴 — Additional Info
    loop Poll until queue empty
        AI->>J: Fetch jobs missing LinkedIn or company details
        J-->>AI: Return pending records
        AI->>Apify: Fire LinkedIn / company info actors
        Apify-->>J: UPDATE enriched fields
    end
```
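Stage 3's fan-out (batches of 100 URLs, at most 3 actor runs in flight) can be sketched with an asyncio semaphore. `run_detail_actor` below is a placeholder, not a real Apify client call:

```python
import asyncio

BATCH_SIZE = 100      # URLs per detail-actor run (from the diagram)
MAX_CONCURRENT = 3    # actor runs allowed in flight at once

def chunk(urls, size=BATCH_SIZE):
    """Split the pending-URL queue into actor-sized batches."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

async def run_detail_actor(batch):
    # Placeholder for the real Apify detail-actor call.
    await asyncio.sleep(0)
    return len(batch)

async def enrich(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(batch):
        async with sem:  # caps in-flight actor runs at MAX_CONCURRENT
            return await run_detail_actor(batch)

    return await asyncio.gather(*[guarded(b) for b in chunk(urls)])
```

The semaphore keeps pressure on Apify bounded regardless of how many URLs Stage 3's poll returns.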

## Key Design Decisions

⚠️ **Stop signal — not yet built:** detect mid-scrape that every URL on a page is already cached, then abort the actor cleanly. Options: Python aborts the run through the Apify API (webhook-triggered), or the actor itself checks the DB, enabled by an external input param.
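One way the stop signal could work, assuming `INSERT IGNORE` is the write path: after each page, compare rows attempted vs. rows actually inserted — when a whole page was skipped as duplicates, everything from that point on is already cached. `fetch_page` and `insert_batch` are stand-ins for the actor and the DB cursor, not real code from this repo:

```python
def page_fully_cached(attempted: int, inserted: int) -> bool:
    """True when INSERT IGNORE skipped every row on the page,
    i.e. all of the page's URLs were already in scrape_cache."""
    return attempted > 0 and inserted == 0

def scrape_incremental(fetch_page, insert_batch):
    """Pull pages until one comes back fully cached, then stop.

    fetch_page(n)      -> list of listings (stand-in for the actor)
    insert_batch(rows) -> rows actually inserted, e.g. the affected-row
                          count a MySQL driver reports after INSERT IGNORE
    """
    page = 0
    while True:
        rows = fetch_page(page)
        if not rows:
            return page  # no more results at all
        inserted = insert_batch(rows)
        if page_fully_cached(len(rows), inserted):
            return page  # stop signal: the rest is already cached
        page += 1
```

This keeps the abort decision in the scraping loop itself, so neither a webhook round-trip nor an extra DB query from inside the actor is strictly required.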

**MySQL as message bus:** all stages are fully decoupled — each polls for its own trigger condition (a null column, a missing FK). Stages 2, 3, and 4 can all start as soon as the first rows land in `scrape_cache`.
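The message-bus pattern reduces to each stage polling its own trigger predicate. A minimal, DB-agnostic sketch — `fetch_batch` and `process` are injected so each stage supplies its own query (`fk_job_id IS NULL` for Stage 2, `order_number IS NULL` for Stage 3):

```python
import time

def poll_until_empty(fetch_batch, process, interval=30, max_idle=1):
    """Generic stage loop: drain work selected by the stage's own
    trigger predicate; sleep `interval` seconds between polls and
    exit after `max_idle` consecutive empty polls.
    """
    idle = 0
    while idle < max_idle:
        batch = fetch_batch()
        if batch:
            idle = 0
            process(batch)
        else:
            idle += 1
            if idle < max_idle:
                time.sleep(interval)
```

Because the predicate lives in the stage's own `fetch_batch` query, no stage needs to know any other stage exists — only the table columns it watches.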

**Airflow role:** schedules when each stage starts. It does not manage communication between stages.

**`cache_key_sync.py` — safe to delete:** its logic is fully covered by Stage 2.