
You can code.
You've used the AI tools.
You're still waiting.

Built for final-year CS students and junior developers who want their first AI engineering role — not just AI hype.

  • 20 weeks
  • 12 hrs / week
  • 8 modules free
  • GitHub-based

The honest picture

Your GitHub looks great.
So does everyone else's.

AI-generated code floods every portfolio. Hiring managers can't tell who actually understands anything. The ones who get hired are proving judgment, not just output.

School taught correct. Industry needs reliable.

CS programmes train you to build systems where correctness is provable — write code, run tests, deploy, done. Production AI systems are probabilistic. The same input gives different outputs. Quality drifts without any code change. Failures are silent. That is a different engineering problem.

The interview question has changed.

Hiring managers are asking: when the AI gives you wrong code, do you catch it? When your AI feature starts degrading three months after launch, can you diagnose it? Being able to prompt ChatGPT is not what they are hiring for.

Your portfolio can't show what they're asking for.

You have repos and tutorials. They want proof you catch bad AI output before it ships, diagnose drift months after launch, and build systems that fail loud — not demos that worked once on your laptop.

What the industry needs

The bar moved.
Your proof hasn't caught up.

The honest picture above is the anxiety. Here is the shift behind it: teams aren't hiring for “can use AI tools.” They're hiring for engineers who can make AI systems reliable in production — and show evidence they've done it before.

Judgment over output

Anyone can paste model-generated code. Hiring managers need engineers who know when to reject it, override it, and explain why.

Reliability over correctness

Production AI fails silently — quality drifts, retries double-charge, retrieval misses edge cases. The job is making probabilistic systems observable and bounded.

Proof over claims

A GitHub full of green checkmarks isn't enough anymore. They want linked PRs, recorded reasoning, and artifacts they can verify in ten minutes.

That is a specific, learnable skill set — not a personality trait and not something you pick up from a video playlist. Pukkaship is built so you practise it on real broken code and leave with proof.

What Pukkaship teaches

Six skills.
One production codebase, week by week.

Each skill maps directly to what hiring managers are asking for above — not as theory, but as work you ship. Sixteen modules on one evolving AI service: fix a broken repo, open a PR that passes CI, explain your judgment. Modules 1–8 build the foundation (free); 9–16 add measurement, observability, prompt engineering, RAG, and the capstone (paid).

20 weeks · 12 hours per week · GitHub-based · every merge updates your verified skills profile

Modules 1–4 · free

Production Foundations Before AI

Your API returned 200. Nothing was saved. Catch that before you add a model.

Fix code that hides wrong assumptions instead of failing loud. Turn it into a real HTTP service with persistence. Harden external calls with timeouts and bounded retries. Then wire your first LLM — same discipline as any other unreliable network dependency, not a separate magic step.

fail-loud debugging · API + persistence · retries & timeouts · first LLM integration · streaming responses
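To make "timeouts and bounded retries" concrete, here is a minimal sketch, not the course's actual code. The helper name `callWithRetry` and the option names are hypothetical. It assumes the wrapped operation (an HTTP call, an LLM request) accepts an `AbortSignal` and treats every failure as retryable, which a real service would narrow down.

```typescript
// Hypothetical helper: names and defaults are illustrative only.
interface RetryOptions {
  maxAttempts: number; // hard upper bound on total calls
  timeoutMs: number;   // deadline for each individual attempt
  baseDelayMs: number; // first backoff delay, doubled after each failure
}

async function callWithRetry<T>(
  operation: (signal: AbortSignal) => Promise<T>,
  { maxAttempts, timeoutMs, baseDelayMs }: RetryOptions,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const controller = new AbortController();
    let timer: ReturnType<typeof setTimeout> | undefined;
    // The deadline rejects and aborts the in-flight call when time runs out.
    const deadline = new Promise<never>((_, reject) => {
      timer = setTimeout(() => {
        controller.abort();
        reject(new Error(`attempt ${attempt} timed out after ${timeoutMs} ms`));
      }, timeoutMs);
    });
    try {
      return await Promise.race([operation(controller.signal), deadline]);
    } catch (error) {
      lastError = error;
    } finally {
      clearTimeout(timer);
    }
    if (attempt < maxAttempts) {
      // Exponential backoff between attempts: base, 2x base, 4x base, ...
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  // Fail loud: surface the failure instead of returning a default value.
  throw new Error(`gave up after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

The same wrapper can sit in front of a database write or a model call, which is the point of treating the LLM as one more unreliable network dependency.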

Modules 5–8 · free

Harden Before You Ship

Validate at the edge. Defend against adversarial input. Queue slow work. Make retries safe.

Schema validation at every boundary, guardrails against prompt injection and data egress, async decoupling of LLM work from the hot path, and signed idempotent webhook delivery — the foundation arc that makes the system production-grade.

Zod at boundaries · adversarial input defense · queue + webhook · HMAC verification · idempotency keys
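A signed, idempotent webhook receiver can be sketched in a few lines. This is an illustration under stated assumptions, not the module's implementation: the function names, the hex-encoded HMAC-SHA256 signature, and the in-memory `Set` (a stand-in for a durable store) are all hypothetical choices. Only `createHmac` and `timingSafeEqual` are real Node.js APIs.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compute a hex HMAC-SHA256 signature over the raw request body.
function signBody(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Constant-time comparison; timingSafeEqual throws on unequal lengths,
// so the length check comes first.
function verifySignature(body: string, signatureHex: string, secret: string): boolean {
  const expected = Buffer.from(signBody(body, secret), "hex");
  const received = Buffer.from(signatureHex, "hex");
  return received.length === expected.length && timingSafeEqual(received, expected);
}

type DeliveryResult = "processed" | "duplicate" | "rejected";

// Assumption: an in-memory set stands in for a durable idempotency store.
const processedKeys = new Set<string>();

function handleWebhook(
  body: string,
  signatureHex: string,
  idempotencyKey: string,
  secret: string,
  apply: (body: string) => void,
): DeliveryResult {
  if (!verifySignature(body, signatureHex, secret)) return "rejected";
  // A retried delivery with a known key must not apply its effect twice.
  if (processedKeys.has(idempotencyKey)) return "duplicate";
  apply(body);
  processedKeys.add(idempotencyKey);
  return "processed";
}
```

Recording the key only after `apply` succeeds means a crashed delivery can be retried safely. A production version would make the check and the write atomic.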

Modules 9–10 · paid

Evals That Catch Regressions Before Users Do

Replace 'it seems better' with a number you can track. Automate judgment about AI output quality.

Fixture libraries and a harness that turn classifier output into a measured pass rate. An LLM-as-judge with rubric anchors — and the distinction between a hard gate that blocks release and a soft signal that informs only.

LLM-as-judge · hard vs soft gates · regression harnesses · fixture design · score drift detection
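The idea of a measured pass rate, and of hard gates versus soft signals, can be sketched as follows. Everything here is an assumption made for illustration: the `Fixture` and `Gate` shapes, the exact-match comparison, and the thresholds. An LLM-as-judge score would feed into `applyGates` the same way a fixture pass rate does.

```typescript
interface Fixture {
  input: string;
  expected: string;
}

interface EvalReport {
  passRate: number;    // fraction of fixtures whose output matched
  failures: Fixture[]; // kept so a regression points at concrete cases
}

function runFixtures(classify: (input: string) => string, fixtures: Fixture[]): EvalReport {
  if (fixtures.length === 0) {
    throw new Error("no fixtures: a pass rate over nothing is meaningless");
  }
  const failures = fixtures.filter((f) => classify(f.input) !== f.expected);
  return { passRate: (fixtures.length - failures.length) / fixtures.length, failures };
}

interface Gate {
  name: string;          // key into the scores record
  kind: "hard" | "soft"; // hard blocks release, soft only informs
  threshold: number;     // minimum acceptable score
}

interface GateDecision {
  blockRelease: boolean;
  warnings: string[];
}

function applyGates(scores: Record<string, number>, gates: Gate[]): GateDecision {
  const decision: GateDecision = { blockRelease: false, warnings: [] };
  for (const gate of gates) {
    const score = scores[gate.name];
    if (score === undefined) throw new Error(`missing score for gate "${gate.name}"`);
    if (score >= gate.threshold) continue;
    decision.warnings.push(`${gate.kind} gate "${gate.name}": ${score} < ${gate.threshold}`);
    if (gate.kind === "hard") decision.blockRelease = true;
  }
  return decision;
}
```

In CI, a harness like this would exit non-zero only when `blockRelease` is true and print the soft warnings for a human to read.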

Module 11 · paid

Seeing What Your AI Does After It Ships

Quality drifted three months after launch. No code changed. You need to see it before users report it.

Structured tracing of each hop (prompt, completion, tokens, latency), written so you can debug and evaluate in production what CI alone cannot show you.

trace pipelines · prompt/completion logging · token cost monitoring · latency trends · silent failure patterns
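One way to picture a structured trace of a single hop is a wrapper that records prompt, completion, token counts, and latency whether the call succeeds or fails. This is a sketch under assumptions: the `TraceRecord` fields, the `tracedCall` name, and the injected `sink` and `now` parameters are hypothetical, and the model function is a placeholder for whatever client the service uses.

```typescript
interface Completion {
  text: string;
  promptTokens: number;
  completionTokens: number;
}

interface TraceRecord {
  traceId: string;
  hop: string; // which step of the pipeline this call belongs to
  prompt: string;
  completion: string | null;
  promptTokens: number | null;
  completionTokens: number | null;
  latencyMs: number;
  error: string | null;
}

type TraceSink = (record: TraceRecord) => void;

async function tracedCall(
  traceId: string,
  hop: string,
  prompt: string,
  model: (prompt: string) => Promise<Completion>,
  sink: TraceSink,
  now: () => number = Date.now, // injectable clock keeps the wrapper testable
): Promise<Completion> {
  const startedAt = now();
  const base = { traceId, hop, prompt };
  let result: Completion;
  try {
    result = await model(prompt);
  } catch (error) {
    // Record the failure, then rethrow: tracing must never swallow errors.
    sink({
      ...base,
      completion: null,
      promptTokens: null,
      completionTokens: null,
      latencyMs: now() - startedAt,
      error: error instanceof Error ? error.message : String(error),
    });
    throw error;
  }
  sink({
    ...base,
    completion: result.text,
    promptTokens: result.promptTokens,
    completionTokens: result.completionTokens,
    latencyMs: now() - startedAt,
    error: null,
  });
  return result;
}
```

Because every record has the same shape, token cost and latency trends become a query over the sink's output.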

Modules 12–16 · paid

Engineer the AI Layer — Through Capstone

Navigate unfamiliar code. Stabilise prompts. Budget context. Measure retrieval. Ship the viewer.

Read and extend a system you did not write, stabilise prompts with regression fixtures, enforce context budgets with token math, measure retrieval per query — then deploy the access-controlled eval-results viewer as the capstone.

unfamiliar codebase · prompt regression fixtures · context budgeting · retrieval harness · capstone viewer
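"Token math" for a context budget is mostly subtraction followed by a selection rule. The sketch below assumes each retrieved chunk already carries a token count and a relevance score. The type names, the `minScore` cutoff, and the greedy highest-score-first strategy are illustrative assumptions, not a prescribed method.

```typescript
interface Chunk {
  id: string;
  tokens: number; // token count from whatever tokenizer the model uses
  score: number;  // retrieval relevance, higher is better
}

interface Budget {
  contextWindow: number;     // total tokens the model accepts
  reservedForOutput: number; // tokens held back for the completion
  promptOverhead: number;    // system prompt, instructions, user question
}

function availableForContext(budget: Budget): number {
  const available = budget.contextWindow - budget.reservedForOutput - budget.promptOverhead;
  if (available <= 0) {
    throw new Error("budget leaves no room for retrieved context");
  }
  return available;
}

function selectChunks(
  chunks: Chunk[],
  budget: Budget,
  minScore: number,
): { selected: Chunk[]; usedTokens: number } {
  const limit = availableForContext(budget);
  // Drop weak matches first: fitting in the window is not a reason to include.
  const ranked = chunks
    .filter((chunk) => chunk.score >= minScore)
    .sort((a, b) => b.score - a.score);
  const selected: Chunk[] = [];
  let usedTokens = 0;
  for (const chunk of ranked) {
    // Skip a chunk that would overflow; a smaller, lower-ranked one may still fit.
    if (usedTokens + chunk.tokens > limit) continue;
    selected.push(chunk);
    usedTokens += chunk.tokens;
  }
  return { selected, usedTokens };
}
```

With an 8,000-token window, 1,000 tokens reserved for output, and 500 tokens of prompt overhead, 6,500 tokens remain for retrieved context. The `minScore` cutoff is what separates choosing what matters from including everything that fits.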

Also woven through every module

architecture principles · prompt engineering best practices · postmortem writing · AI-pair programming discipline · context window design · system design for probabilistic outputs

The credential

Not a certificate.
A verified skills profile.

Every claim links to a specific PR, a Loom, or a CI run. The areas of growth section is not optional — it is what makes the rest of it credible to a hiring manager.

Alex R. · @alexr

20 weeks complete · Cohort Jun 2026

20 PRs · 17 Looms · 1 postmortem · capstone deployed · ✓ verified

Technical foundation

Strong

Debugging, reliability & validation all proficient or above across 8 modules

Measurement & evals

Strong

Fixture-first from week 9 · judge correctly separated gates from signals

AI collaboration

Strong

Documented AI rejections in every PR from week 3 · hypothesis before prompt

Prompt & RAG

Developing

Retrieval at 71% vs 80% target · context budgeting mechanical, not design-level

Per-dimension · what the evidence supports

Technical

Measurement & evals

PRs #9, #10, #19 · Loom wk 10

Strong

Debugging discipline

Bug journals wks 1–3 · hypothesis in every PR

Proficient

Reliability engineering

PRs #3, #4 · retry & timeout patterns cited

Proficient

Boundary validation

PRs #5, #6 · Zod schemas at every edge

Proficient

Async & systems

PR #7 · pattern applied, rationale thin

Developing

Prompt, context & RAG

PRs #13–15 · retrieval at 71% vs 80% target

Developing

Behavioural

AI-collaboration judgment

Strong

Growth trajectory

Clear upward

Communication (why not what)

Proficient

Areas of growth · required in every shared profile

Prompt, context & RAG — Developing

“Retrieval fixture pass rate reached 71% (target: 80%). Token budget reasoning was mechanical — included everything that fit rather than choosing what mattered. Recommend more practice before taking on production RAG work independently.”

Detailed findings · linked to artifacts

Measurement & evals

Strong

PRs #9, #10, #19 (capstone) · Loom wk 10

“Fixture-first thinking applied consistently from week 9 onward. Week 10 judge correctly separated hard gates from soft signals without prompting. Capstone eval results viewer deployed, 94% CI pass rate.”

AI-collaboration judgment

Strong

“Every PR from week 3 includes a documented AI rejection with stated reasoning. Week 11 Loom shows hypothesis stated before AI consulted in 6 of 7 sessions observed.”

How it works

Fix broken code. Explain your fix.
Unlock the next module.

01

Clone a broken production codebase

Each repo is a real production pattern turned into a broken codebase — a type system gap, an async boundary that fails under load, a retrieval pipeline that passes CI but breaks at edge cases. Not puzzles. Real patterns.

02

Form a hypothesis — then fix it and explain it

Before you open the AI, write your hypothesis. Before you merge, open a PR — CI verifies tests pass — and write why the bug existed, why your fix works, and where your judgment overrode the model. The methodology is the proof.

03

Get your weekly snapshot

After each merge, your skills profile updates. You get a personal report — what moved, what to focus on next, with specific guidance.

Pukkaship Weekly Snapshot · Alex R. · Week 10 of 20

This week: pe-10-judge — LLM-as-judge · PR #10 merged

Current standing

Measurement & evals · Strong (↑ from Proficient)

AI-collaboration · Strong (↑ from Proficient)

Debugging discipline · Proficient (→ unchanged)

What moved this week

Your journal named the distinction between a hard gate and a soft signal without being prompted. That moved both dimensions.

Focus for pe-11

Evals: Your judge rubric has three criteria but no fixture for when two conflict. Add one before you merge.

AI workflow: Before asking what a function does, write your hypothesis first. Loom should show that sequence.

— Sudi

Built by a practitioner

Designed by someone who shipped GenAI at AWS.

Not a course creator who learned from courses. Every module reflects real production failures from twenty years of shipping software at scale.

Sudi Bhattacharya

Senior Software Development Manager · Amazon Web Services

GenAI Infrastructure & Applications · IIT Kanpur · Chicago Booth

  • AWS: Shipped GenAI infrastructure and applied science systems at scale (2022–present)
  • Deloitte: Led big data and AI/ML transformation programmes for Fortune 500 clients as Managing Director
  • Microsoft: Launched SQL Server 2005 BI — one of Microsoft's most-subscribed product releases
linkedin.com/in/sudibhattacharya

Your first real
AI engineering proof.

Clone Module 1. Fix the first bug. Open a PR that passes CI and explain your reasoning. One artifact. Verified. Yours. Nineteen more to go.

Register with GitHub — it's free

Requires a GitHub account · 20 weeks · 12 hrs/week