JobsAustria Pipeline
Implementation Status
| Stage | Status | Description |
|---|---|---|
| 1a — Real-time scraping (every 30 min) | 🔴 Not built | Incremental scrape that stops once a results page contains only URLs already in scrape_cache. Gives clients a real-time edge. |
| 1b — Full refresh (every Monday) | 🟠 Needs improvements | Full scrape to detect stale/removed listings. Keeps DB clean. |
| 2 — Payload Sync | 🟠 In progress | Imports payload JSON into jobs, companies, locations. Syncs FKs. |
| 3 — Detail Enrichment | 🟠 In progress | Scrapes full AMS detail pages via Apify. Writes to jobs + descriptions. |
| 4 — Additional Info | 🔴 Not built | LinkedIn data, company firmographics, salary benchmarks. |
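
The stop rule for Stage 1a is not built yet, so the following is only a minimal sketch of how it might work, not the project's implementation. It assumes results arrive newest-first, that a page with zero new URLs means everything further back is already cached, and that `url_hash` is a SHA-256 of the URL. An in-memory SQLite table stands in for the real `scrape_cache`, so the statement is `INSERT OR IGNORE` (SQLite's spelling of MySQL's `INSERT IGNORE`). The helper names `make_cache`, `store_page`, and `scrape_incremental` are hypothetical.

```python
import hashlib
import json
import sqlite3


def make_cache() -> sqlite3.Connection:
    """Create an in-memory stand-in for scrape_cache (schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scrape_cache ("
        "url TEXT, url_hash TEXT PRIMARY KEY, "
        "data_payload TEXT, fk_job_id INTEGER)"
    )
    return conn


def url_hash(url: str) -> str:
    """Stable hash of a listing URL (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def store_page(conn: sqlite3.Connection, listings: list) -> int:
    """Insert one page of listings and return how many rows were new."""
    new_rows = 0
    for listing in listings:
        cur = conn.execute(
            "INSERT OR IGNORE INTO scrape_cache (url, url_hash, data_payload) "
            "VALUES (?, ?, ?)",
            (listing["url"], url_hash(listing["url"]), json.dumps(listing)),
        )
        # rowcount is 1 for a fresh insert and 0 when the hash already existed.
        new_rows += cur.rowcount
    conn.commit()
    return new_rows


def scrape_incremental(conn: sqlite3.Connection, pages) -> int:
    """Read pages in order and stop after the first page with no new URLs.

    Returns the number of pages read, including the page that triggered the stop.
    """
    pages_read = 0
    for listings in pages:
        pages_read += 1
        if store_page(conn, listings) == 0:
            break
    return pages_read
```

In this sketch `pages` can be any iterable of listing lists, so a lazy generator that requests the next AMS page only when asked would stop issuing requests as soon as the loop breaks.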
Data Flow Diagram
sequenceDiagram
autonumber
actor Airflow
participant AMS
participant Apify
participant SC as scrape_cache 🗄️
participant J as jobs 🗄️
participant PS as Payload Sync
participant DE as Detail Enrichment
participant AI as Additional Info
note over Airflow,SC: Stage 1a 🔴 — Real-time scraping (every 30 min)
Airflow->>Apify: Start actor — incremental mode
loop Each page of results
Apify->>AMS: Scrape page
AMS-->>Apify: Return listings
Apify-->>SC: INSERT IGNORE — url, url_hash, data_payload
SC-->>Apify: ⚠️ NOT BUILT — all URLs cached, abort
end
note over Airflow,SC: Stage 1b 🟠 — Full refresh (every Monday)
Airflow->>Apify: Start actor — all filters mode
Apify->>AMS: Scrape everything
AMS-->>Apify: Return all listings
Apify-->>SC: INSERT IGNORE — duplicates skipped
note over PS,J: Stage 2 🟠 — Payload Sync
loop Poll every 30s until queue empty
PS->>SC: Fetch batch where fk_job_id IS NULL
SC-->>PS: Return rows + data_payload JSON
PS->>J: INSERT new jobs — url, url_hash, position
PS->>SC: UPDATE fk_job_id for matched rows
PS->>J: UPDATE company_id, location_id, publication_date, portal
end
note over DE,J: Stage 3 🟠 — Detail Enrichment
loop Poll until queue empty
DE->>J: Fetch URLs where order_number IS NULL
J-->>DE: Return pending URLs
DE->>Apify: Fire detail actor — batches of 100, max 3 concurrent
Apify->>AMS: Scrape full job detail pages
AMS-->>Apify: Return details
Apify-->>J: UPDATE order_number, education, salary, employment_relationship
Apify-->>J: INSERT into descriptions
end
note over AI,J: Stage 4 🔴 — Additional Info
loop Poll until queue empty
AI->>J: Fetch jobs missing LinkedIn or company details
J-->>AI: Return pending records
AI->>Apify: Fire LinkedIn / company info actors
Apify-->>J: UPDATE enriched fields
end
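
The Stage 2 loop in the diagram can be sketched as below. This is a hedged illustration, not the actual Payload Sync code. It assumes simplified `scrape_cache` and `jobs` schemas, a `position` key inside `data_payload`, and a unique `url_hash` on `jobs`. SQLite stands in for the production database, and the function names `sync_batch` and `run_payload_sync` are hypothetical. The follow-up update of `company_id`, `location_id`, `publication_date`, and `portal` is left out for brevity.

```python
import json
import sqlite3
import time

# Simplified schemas; the real tables have more columns (assumption).
SCHEMA = """
CREATE TABLE scrape_cache (
    url TEXT, url_hash TEXT PRIMARY KEY, data_payload TEXT, fk_job_id INTEGER
);
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY, url TEXT, url_hash TEXT UNIQUE, position TEXT
);
"""


def sync_batch(conn: sqlite3.Connection, batch_size: int = 500) -> int:
    """Link one batch of unlinked cache rows to jobs; return rows handled."""
    rows = conn.execute(
        "SELECT url, url_hash, data_payload FROM scrape_cache "
        "WHERE fk_job_id IS NULL LIMIT ?",
        (batch_size,),
    ).fetchall()
    for url, h, payload in rows:
        position = json.loads(payload).get("position")
        # Create the job if it does not exist yet; duplicates are skipped.
        conn.execute(
            "INSERT OR IGNORE INTO jobs (url, url_hash, position) "
            "VALUES (?, ?, ?)",
            (url, h, position),
        )
        # Point the cache row at the matching job so it leaves the queue.
        conn.execute(
            "UPDATE scrape_cache "
            "SET fk_job_id = (SELECT id FROM jobs WHERE url_hash = ?) "
            "WHERE url_hash = ?",
            (h, h),
        )
    conn.commit()
    return len(rows)


def run_payload_sync(conn, batch_size=500, poll_seconds=30.0, sleep=time.sleep):
    """Poll until no unlinked rows remain; return the total rows synced."""
    total = 0
    while True:
        handled = sync_batch(conn, batch_size)
        if handled == 0:
            return total
        total += handled
        sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the 30-second poll interval from the diagram while letting a test or a one-off backfill run without waiting.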