It's been two years since ChatGPT went viral.
But despite being avid users of 4o, Claude, Copilot, Cursor, etc., when we started building AI features a few months ago, it took us a while to get up to speed. Building AI applications today feels a bit like web development in 2012 - the primitives are powerful, but frameworks and best practices are still emerging.
Local Environment, Models & Prompts
If you're doing AI development, you should use an editor with AI chat built in. Even if you're married to your vim keybindings or you hate code completion, AI models write pretty good prompts, and they know the provider SDKs well.
Want to get to “hello world” in five minutes? Download Cursor, back it with claude-3.5-sonnet-new, and ask the chat to write a script that uses OpenAI's Node.js SDK to classify support tickets (or whatever your use-case is). If you need sample data, ask the AI to generate some.
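For a sense of what that script looks like, here's a minimal sketch using the OpenAI Node.js SDK (the categories and the sample ticket are made up for illustration):

```typescript
import OpenAI from "openai";

// Assumes OPENAI_API_KEY is set in your environment.
const client = new OpenAI();

const CATEGORIES = ["billing", "bug report", "feature request", "other"];

async function classifyTicket(ticket: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Classify the support ticket into exactly one of: ${CATEGORIES.join(", ")}. Reply with the category only.`,
      },
      { role: "user", content: ticket },
    ],
  });
  return response.choices[0].message.content?.trim() ?? "other";
}

classifyTicket("I was charged twice for my subscription last month.")
  .then(console.log); // e.g. "billing"
```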
Claude's models have been better than OpenAI's for the last few months, so you probably want an Anthropic API key. If cost is an issue, Google's Gemini model has a generous free tier. Also, if you bring your own API key, Cursor is free.
Sequences & Workflows
The first generation of AI features was single-prompt transformations (a transcript summary or an image classifier), followed by simple agents (a chatbot with some functions to call).
But LLMs generate short responses and don't perform well over long context windows. So people started decomposing long, complex prompts into multiple prompts, then joining the answers together. They started using one model to judge the results of other models. They added error handling and retries to boost reliability.
As these sorts of improvements stacked, AI applications started looking more like graph-based data workflows, and researchers began calling this “compound AI”. There are a few graph-based frameworks around (and we're building one!), so if you think your application will end up looking like this, it's worth giving them a try.
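To make that concrete, here's a rough sketch of a decomposed pipeline with an LLM judge and a retry; the `llm` helper is a hypothetical wrapper around whatever provider SDK you're using:

```typescript
// Hypothetical helper that wraps a single LLM call; wire up your provider SDK here.
async function llm(prompt: string): Promise<string> {
  throw new Error("replace with a call to OpenAI, Anthropic, etc.");
}

async function summarizeTranscript(transcript: string): Promise<string> {
  // 1. Decompose: summarize each chunk separately instead of one huge prompt.
  const chunks = transcript.match(/[\s\S]{1,4000}/g) ?? [];
  const partials = await Promise.all(
    chunks.map((c) => llm(`Summarize this portion of a call transcript:\n\n${c}`))
  );

  // 2. Join: merge the partial answers in a second pass.
  let summary = await llm(`Merge these partial summaries into one:\n\n${partials.join("\n---\n")}`);

  // 3. Judge + retry: have a model grade the result and retry once if it's weak.
  const verdict = await llm(`Rate this summary 1-5 for completeness. Reply with a number only:\n\n${summary}`);
  if (parseInt(verdict, 10) < 4) {
    summary = await llm(`Improve this summary; it was rated incomplete:\n\n${summary}`);
  }
  return summary;
}
```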
Agents
The best definition of an AI agent is a system where the language model chooses a sequence of actions. Six to twelve months ago, agents weren't very good, but there have been fairly rapid advances in this space.
AI agents, like self-driving cars, have different levels of autonomy. At low levels, agents are “decider” nodes in a pre-defined workflow graph. At higher levels, agents own the control flow – they can break down tasks into subtasks, execute them, and check their work.
A basic agent – something you could write in a day, like an agent to get stock prices – might have a single prompt, do some query parsing, make a function call hitting an external service, and lightly track internal state and conversation history (or offload that to a service like OpenAI Assistants).
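Here's a sketch of that kind of basic agent using the OpenAI chat completions tool-calling API; `getStockPrice` is a placeholder for a real market-data integration:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

// Placeholder: in a real agent this would hit an external market-data API.
async function getStockPrice(symbol: string): Promise<number> {
  return 123.45;
}

async function stockAgent(question: string): Promise<string> {
  const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
    { role: "user", content: question },
  ];

  // Let the model decide whether to call the tool.
  const first = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
    tools: [
      {
        type: "function",
        function: {
          name: "getStockPrice",
          description: "Get the latest price for a stock ticker symbol",
          parameters: {
            type: "object",
            properties: { symbol: { type: "string" } },
            required: ["symbol"],
          },
        },
      },
    ],
  });

  const toolCall = first.choices[0].message.tool_calls?.[0];
  if (!toolCall) return first.choices[0].message.content ?? "";

  // The model chose to call the tool; run it and feed the result back.
  const { symbol } = JSON.parse(toolCall.function.arguments);
  const price = await getStockPrice(symbol);

  messages.push(first.choices[0].message);
  messages.push({ role: "tool", tool_call_id: toolCall.id, content: String(price) });

  const second = await client.chat.completions.create({ model: "gpt-4o-mini", messages });
  return second.choices[0].message.content ?? "";
}
```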
The best open-source agents today (OpenHands, Composio SWEKit) have been trained on tasks like code generation with public benchmarks. They delegate to other agents in a project-specific team structure – say, a code analyzer, a code writer, and a code reviewer who coordinates the other two.
They break down tasks into subtasks, browse the web, run sandboxed code, retry with modification on errors, and prompt the user for clarification. They keep a memory of event history, current task context, and the runtime environment and include it in LLM context windows. They store task status in a state machine. A controller runs a while loop until a task reaches a finished state.
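A rough sketch of that controller loop, with simplified task states and a stubbed `step` function:

```typescript
type TaskState = "pending" | "running" | "awaiting_user" | "finished" | "failed";

interface Task {
  goal: string;
  state: TaskState;
  history: string[]; // event history fed back into the LLM context window
  attempts: number;
}

// Stub: asks the model for the next action, executes it (run code, browse,
// ask the user), appends to history, and returns the new state.
async function step(task: Task): Promise<TaskState> {
  throw new Error("replace with your planning + execution logic");
}

async function runController(goal: string): Promise<Task> {
  const task: Task = { goal, state: "pending", history: [], attempts: 0 };

  // The controller loops until the task reaches a terminal state.
  while (task.state !== "finished" && task.state !== "failed") {
    task.attempts += 1;
    if (task.attempts > 20) { // guardrail against infinite loops
      task.state = "failed";
      break;
    }
    task.state = await step(task);
  }
  return task;
}
```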
Knowledge & RAG
Building agents and workflows usually requires both general knowledge (from base models) and domain- or user-specific knowledge from particular documents, web scrapers, or internal SaaS data.
You get this from retrieval-augmented generation (RAG), basically an ETL pipeline with specific querying techniques. The ETL part is “chunking” documents and other content into smaller pieces, “embedding” each chunk (transforming it into a vector), and loading it into a vector DB.
Then comes the querying part, known as “retrieval”: you embed your query text into a vector, search the DB for similar vectors, and feed the results into an LLM call. (You can also take your top results and use a more computationally intensive search method to “rerank” them.)
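Here's a minimal sketch of the whole pipeline, using OpenAI embeddings and an in-memory store with cosine similarity standing in for a real vector DB:

```typescript
import OpenAI from "openai";

const client = new OpenAI();

type Chunk = { text: string; vector: number[] };
const store: Chunk[] = []; // stand-in for a real vector DB

async function embed(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return res.data.map((d) => d.embedding);
}

// ETL: chunk the document, embed each chunk, load it into the store.
async function ingest(document: string) {
  const chunks = document.match(/[\s\S]{1,1000}/g) ?? [];
  const vectors = await embed(chunks);
  chunks.forEach((text, i) => store.push({ text, vector: vectors[i] }));
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Retrieval: embed the query, find similar chunks, feed them to the LLM.
async function answer(question: string): Promise<string> {
  const [queryVector] = await embed([question]);
  const topChunks = store
    .map((c) => ({ ...c, score: cosine(queryVector, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, 3);

  const completion = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: `Answer using only this context:\n\n${topChunks
          .map((c) => c.text)
          .join("\n---\n")}\n\nQuestion: ${question}`,
      },
    ],
  });
  return completion.choices[0].message.content ?? "";
}
```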
Once you build an initial version, there are a dozen ways to tweak and optimize: add overlap between chunks, query smaller chunks to get more precise answers, feed surrounding context to the LLM synthesizer, use domain-specific embedding models, combine multiple query algorithms, filter on metadata, and so on.
Observability: Tracing & Evals
The three magic properties of AI applications are accuracy, latency, and cost. So once you get a basic application working, the next step is to add tracing and hook it up to an observability service, then write tests and hook them up to evals.
Tracing
Tracing is the gold standard for LLM observability. It gives you function-level insight into execution times, plus the inputs and outputs at each stage. Most frameworks come with tracing out of the box (ours does), which saves you from hand-instrumenting your code. There are a bunch of different providers, and some only accept traces in non-standard formats; ideally, prefer a provider that takes plain OpenTelemetry logs.
The UIs for these will feel fairly familiar if you've used a provider like Datadog or Honeycomb; they help you zoom in on surprisingly long requests or unexpected LLM responses. Look for unexpected responses whenever you're calling an LLM.
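If you do end up hand-instrumenting, a minimal OpenTelemetry sketch around an LLM call looks something like this (it assumes you've set up an OTel SDK and exporter elsewhere; the attribute names are illustrative, not a standard):

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";
import OpenAI from "openai";

const tracer = trace.getTracer("my-ai-app"); // tracer name is arbitrary
const client = new OpenAI();

async function tracedCompletion(prompt: string): Promise<string> {
  return tracer.startActiveSpan("llm.chat_completion", async (span) => {
    try {
      span.setAttribute("llm.model", "gpt-4o-mini"); // illustrative attributes
      span.setAttribute("llm.prompt.length", prompt.length);

      const res = await client.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
      });

      const output = res.choices[0].message.content ?? "";
      span.setAttribute("llm.completion.length", output.length);
      return output;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```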
Evals
Repeat three times after me. Evals are just tests. Evals are just tests. Evals are just tests.
Okay, evals are tests of non-deterministic systems, so you need to write more of them. Evals can return fractional values, not just binary pass/fail. And unlike the rest of your CI suite, your evals may not all pass all the time.
You may write five evals for a single (input, output, expected) triple from a single LLM call, checking it for accuracy, relevance, factual consistency, and length adherence. You might write evals that check semantic distance, search the output for a particular string, or have a different LLM evaluate your first LLM's response. Short, well-tested workflows can amass hundreds of evals.
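For a flavor of what those look like in plain TypeScript, here are a few toy scorers; the length budget and the judge prompt are just examples, and `llm` is a hypothetical wrapper around your provider SDK:

```typescript
type EvalCase = { input: string; output: string; expected: string };

// Binary eval: does the output mention the expected answer at all?
function containsExpected({ output, expected }: EvalCase): number {
  return output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0;
}

// Fractional eval: penalize outputs that blow past a length budget.
function lengthAdherence({ output }: EvalCase, maxChars = 500): number {
  return Math.min(1, maxChars / Math.max(output.length, 1));
}

// LLM-as-judge eval: ask a second model to grade factual consistency.
async function factualConsistency(
  c: EvalCase,
  llm: (prompt: string) => Promise<string>
): Promise<number> {
  const verdict = await llm(
    `On a scale of 0 to 1, how factually consistent is this answer with the expected answer?\n` +
      `Expected: ${c.expected}\nActual: ${c.output}\nReply with a number only.`
  );
  return Math.max(0, Math.min(1, parseFloat(verdict) || 0));
}
```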
After you write your evals, you'll probably go back and change your prompts, or your pipeline, or some other part of your application, and then you'll be able to see the impact of those changes on your eval set.
For your own and your team's sanity, you may want to use the same service for tracing and evals (Braintrust is good here).
Putting it all together
The reality of building AI projects in most companies is that it usually takes a day or two to get yourself to “wow”, a week or two to build a demo that shows everyone else the value, and a month or two (or more) to get to something you can ship.
Happy building, and do check out Mastra if you're building in TypeScript.