# JobsAustria Pipeline — Data Flow Diagram
Participants marked 🗄️ in the diagram are MySQL tables: they are the message bus. No stage talks directly to another.
## Implementation Status
| Stage | Status | Description |
|---|---|---|
| 1 — Real-time scraping (every 30 min) | 🔴 Not built | Incremental scrape that stops when cached URLs are detected. Gives clients a real-time edge. |
| 1b — Full refresh (every Monday) | 🟠 Needs improvements | Full scrape to detect stale/removed listings. Keeps DB clean. |
| 2 — Payload Sync | 🟠 In progress | Imports payload JSON into jobs, companies, locations. Syncs FKs. |
| 3 — Detail Enrichment | 🟠 In progress | Scrapes full AMS detail pages via Apify. Writes to jobs + descriptions. |
| 4 — Additional Info | 🔴 Not built | LinkedIn data, company firmographics, salary benchmarks. |
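The stages above hand work to each other through a handful of shared tables. As a point of reference, here is a minimal in-memory sqlite stand-in for the two queue tables; every column name is inferred from the stage descriptions and the diagram below, not from the real MySQL schema, and the companies, locations, and descriptions tables are omitted:

```python
import sqlite3

# Hedged sketch: columns are inferred from the stage descriptions above;
# the real MySQL schema may differ.
DDL = """
CREATE TABLE scrape_cache (
    id            INTEGER PRIMARY KEY,
    url           TEXT NOT NULL,
    url_hash      TEXT NOT NULL UNIQUE,  -- dedup key for INSERT IGNORE
    data_payload  TEXT NOT NULL,         -- raw listing JSON from the actor
    fk_job_id     INTEGER                -- NULL until Stage 2 syncs the row
);
CREATE TABLE jobs (
    id            INTEGER PRIMARY KEY,
    url           TEXT NOT NULL,
    url_hash      TEXT NOT NULL UNIQUE,
    position      TEXT,
    company_id    INTEGER,
    location_id   INTEGER,
    order_number  TEXT                   -- NULL until Stage 3 enriches the row
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The two nullable columns (`fk_job_id`, `order_number`) are what make the tables usable as queues: a NULL means "not yet processed by the next stage".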
## Data Flow Diagram
```mermaid
sequenceDiagram
    autonumber
    actor Airflow
    participant AMS
    participant Apify
    participant SC as scrape_cache 🗄️
    participant J as jobs 🗄️
    participant PS as Payload Sync
    participant DE as Detail Enrichment
    participant AI as Additional Info

    note over Airflow,SC: Stage 1 🔴 — Real-time scraping (every 30 min)
    Airflow->>Apify: Start actor — incremental mode
    loop Each page of results
        Apify->>AMS: Scrape page
        AMS-->>Apify: Return listings
        Apify-->>SC: INSERT IGNORE — url, url_hash, data_payload
        SC-->>Apify: ⚠️ NOT BUILT — all URLs cached, abort
    end

    note over Airflow,SC: Stage 1b 🟠 — Full refresh (every Monday)
    Airflow->>Apify: Start actor — all filters mode
    Apify->>AMS: Scrape everything
    AMS-->>Apify: Return all listings
    Apify-->>SC: INSERT IGNORE — duplicates skipped

    note over PS,J: Stage 2 🟠 — Payload Sync
    loop Poll every 30s until queue empty
        PS->>SC: Fetch batch where fk_job_id IS NULL
        SC-->>PS: Return rows + data_payload JSON
        PS->>J: INSERT new jobs — url, url_hash, position
        PS->>SC: UPDATE fk_job_id for matched rows
        PS->>J: UPDATE company_id, location_id, publication_date, portal
    end

    note over DE,J: Stage 3 🟠 — Detail Enrichment
    loop Poll until queue empty
        DE->>J: Fetch URLs where order_number IS NULL
        J-->>DE: Return pending URLs
        DE->>Apify: Fire detail actor — batches of 100, max 3 concurrent
        Apify->>AMS: Scrape full job detail pages
        AMS-->>Apify: Return details
        Apify-->>J: UPDATE order_number, education, salary, employment_relationship
        Apify-->>J: INSERT into descriptions
    end

    note over AI,J: Stage 4 🔴 — Additional Info
    loop Poll until queue empty
        AI->>J: Fetch jobs missing LinkedIn or company details
        J-->>AI: Return pending records
        AI->>Apify: Fire LinkedIn / company info actors
        Apify-->>J: UPDATE enriched fields
    end
```
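Stage 2 is the most mechanical step in the diagram. A runnable sketch of one sync pass, using in-memory sqlite as a stand-in for MySQL — table and column names are taken from the diagram, and the `company_id`/`location_id`/`publication_date` updates are omitted for brevity:

```python
import json
import sqlite3

def sync_batch(conn, batch_size=100):
    """One Stage 2 pass: copy unsynced scrape_cache rows into jobs and
    link them back via fk_job_id. Returns the number of rows handled so
    the caller can stop polling once the queue is empty."""
    rows = conn.execute(
        "SELECT id, url, url_hash, data_payload FROM scrape_cache "
        "WHERE fk_job_id IS NULL LIMIT ?",
        (batch_size,),
    ).fetchall()
    for cache_id, url, url_hash, payload in rows:
        data = json.loads(payload)
        # sqlite's INSERT OR IGNORE stands in for MySQL's INSERT IGNORE:
        # the UNIQUE url_hash makes re-runs idempotent.
        conn.execute(
            "INSERT OR IGNORE INTO jobs (url, url_hash, position) VALUES (?, ?, ?)",
            (url, url_hash, data.get("position")),
        )
        job_id = conn.execute(
            "SELECT id FROM jobs WHERE url_hash = ?", (url_hash,)
        ).fetchone()[0]
        conn.execute(
            "UPDATE scrape_cache SET fk_job_id = ? WHERE id = ?",
            (job_id, cache_id),
        )
    conn.commit()
    return len(rows)

# Demo against an in-memory stand-in for the MySQL tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scrape_cache (id INTEGER PRIMARY KEY, url TEXT, url_hash TEXT,
                           data_payload TEXT, fk_job_id INTEGER);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, url TEXT, url_hash TEXT UNIQUE,
                   position TEXT);
""")
conn.execute(
    "INSERT INTO scrape_cache (url, url_hash, data_payload) VALUES (?, ?, ?)",
    ("https://jobs.example/1", "h1", json.dumps({"position": "Tischler"})),
)
synced = sync_batch(conn)
```

Because the batch is selected by `fk_job_id IS NULL` and the insert is deduplicated on `url_hash`, a crashed or repeated run simply re-picks the same rows without creating duplicates.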
## Key Design Decisions
**⚠️ Stop signal — not yet built:** detecting mid-scrape that every URL on a page is already cached, then aborting the actor cleanly. Options: the Python side aborts the run through the Apify API (triggered via webhook), or the actor checks the DB itself, switched on by an external input param.
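Whichever option is chosen, the abort condition itself is the same. A hedged in-memory sketch — names are illustrative, the set stands in for `scrape_cache.url_hash`, and there are no real DB or Apify calls:

```python
def scrape_incremental(pages, cached):
    """Sketch of the not-yet-built stop signal. `pages` is an iterable of
    URL lists (one list per AMS results page). A page whose URLs are all
    already cached means the run has reached previously seen listings."""
    scraped = []
    for page in pages:
        new = [url for url in page if url not in cached]
        cached.update(new)
        scraped.extend(new)
        if not new:  # entire page already cached -> abort the actor cleanly
            break
    return scraped
```

Aborting only when a *whole* page is cached (rather than on the first cached URL) tolerates listings that get re-posted out of order.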
**MySQL as message bus:** all stages are fully decoupled — each polls for its own trigger condition (a NULL column, a missing FK). Stages 2, 3, and 4 can all start as soon as the first rows land in scrape_cache.
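The decoupling pattern is the same skeleton for every stage; a minimal sketch (function names are illustrative):

```python
import time

def run_stage(fetch_batch, process, interval=30.0):
    """Shared stage skeleton: poll the stage's own trigger condition
    (a NULL column, a missing FK), process what is there, and exit once
    the queue drains. Stages never call each other; the tables mediate."""
    while True:
        batch = fetch_batch()
        if not batch:
            return
        process(batch)
        time.sleep(interval)  # e.g. Stage 2 polls every 30 s
```

Each stage only needs its own `fetch_batch` query and `process` function; adding Stage 4 later changes nothing in Stages 1–3.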
**Airflow role:** schedules when each stage starts; it does not manage communication between stages.
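A sketch of what that scheduling side could look like, assuming Airflow 2.x — the DAG id, task id, and `trigger_stage_1` callable are illustrative; only the "every 30 min" cadence comes from the stage table:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def trigger_stage_1():
    # Illustrative placeholder: start the Apify actor in incremental mode.
    pass

# Airflow only decides *when* a stage runs; the stage finds its own work
# by polling the tables, so no cross-DAG dependencies are declared.
with DAG(
    dag_id="jobsaustria_stage1_incremental",
    start_date=datetime(2024, 1, 1),
    schedule="*/30 * * * *",  # `schedule_interval` on Airflow < 2.4
    catchup=False,
):
    PythonOperator(
        task_id="start_incremental_scrape",
        python_callable=trigger_stage_1,
    )
```

The Monday full refresh (Stage 1b) would be a second DAG of the same shape with a weekly cron such as `0 6 * * 1`.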
**cache_key_sync.py — safe to delete:** its logic is fully covered by Stage 2.