Founding Applied AI Engineer (Eval-Driven)

Airtree

Software Engineering, Data Science
Sydney, NSW, Australia
Posted on Jan 19, 2026

This role is for one of our portfolio companies, not internally at Airtree. Your application will be reviewed by the founder, not an Airtree employee.


This early-stage company, still operating in stealth, is on a mission to eradicate cost overruns in construction, a $3 trillion problem that slows down cities, destroys margins, and erodes trust between builders and clients.

Over 90% of construction projects go over budget, stalling progress across the industry. Our platform helps builders stay on budget by detecting and preventing costly variations before they spiral. Our AI-powered early warning system gives builders control over project costs, protecting time, margin, and reputation on every build.

We are at the start of something big, and we’re looking for an Applied AI Engineer (Eval-Driven) to build and ship design-audit workflows that consistently meet measurable quality bars. This role blends ML engineering and data science, with a heavy emphasis on problem definition, evaluation, and reliability in real customer workflows.

Responsibilities

  • Define evaluation problems: success criteria, failure modes, datasets, labelling guidelines, and score functions.
  • Build and maintain an evaluation harness: regression tests, edge-case suites, and quality dashboards to prevent backsliding.
  • Implement workflow systems end-to-end (data → model/LLM components → post-processing → acceptance testing) until they pass eval thresholds.
  • Partner with product and domain stakeholders to translate messy real-world requirements into testable specs.

Requirements

  • Strong Python skills and practical experience shipping ML/AI systems (not just experimentation).
  • Demonstrated experience designing evals for ML/LLM systems (offline metrics, gold sets, error analysis, regression testing, monitoring).
  • Comfort working across data science + engineering tasks: data wrangling, feature/label design, model/LLM iteration, and productionization.
  • High ownership and intensity: persistence in closing the loop from “fails eval” to “passes consistently.”

Nice to have

  • Experience with document understanding (OCR, parsing, classification/extraction) and structured outputs (schemas, validators).
  • Familiarity with AEC/construction workflows (design coordination, QA/compliance, BIM concepts like IFC/Revit).
  • Experience building human-in-the-loop review systems and adjudication processes to improve training/eval data.