Understanding AI costs: tokens, context windows, and pricing
AI pricing confuses almost everyone because it's not priced like normal software. There's no flat monthly seat for API usage — you pay by the "token," and costs scale with how much text goes in and out. Understanding a few concepts turns a surprising bill into a predictable one. Here's the plain-English version.
What a token actually is
Models don't read words or characters. They read tokens: chunks of text that average roughly three-quarters of an English word. A short, common word like "Hello" is typically one token; a long technical word might be several, and exact counts vary by tokenizer. As a rough rule, 1,000 tokens is about 750 words. You're billed for tokens in (your prompt plus any context or documents you include) and tokens out (the model's response), almost always at different rates, with output usually the pricier side.
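To make the billing concrete, here is a minimal sketch of the arithmetic for a single call. It uses the 750-words-per-1,000-tokens rule of thumb rather than a real tokenizer, and the prices are made-up placeholders, not any provider's actual rates.

```python
# Rule of thumb from the text: 1,000 tokens is about 750 words.
WORDS_PER_TOKEN = 0.75

# Hypothetical prices in dollars per million tokens (illustration only).
INPUT_PRICE_PER_MILLION = 3.00
OUTPUT_PRICE_PER_MILLION = 15.00


def estimate_tokens(word_count: int) -> int:
    """Approximate a token count from a word count. Not a real tokenizer."""
    return round(word_count / WORDS_PER_TOKEN)


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call; input and output are billed at separate rates."""
    return (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000


prompt_tokens = estimate_tokens(750)  # a 750-word prompt, about 1,000 tokens
reply_tokens = estimate_tokens(300)   # a 300-word answer, about 400 tokens
cost = request_cost(prompt_tokens, reply_tokens)
print(f"{prompt_tokens} tokens in, {reply_tokens} tokens out: ${cost:.4f}")
```

With these placeholder prices the shorter answer costs twice as much as the longer prompt, which is why output length matters more than its size suggests.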
The context window — and why long prompts cost
The context window is the maximum number of tokens a model can consider at once: your instructions, the conversation so far, any pasted documents, and the answer all share that budget. Two consequences for cost: a big pasted document is billed on every call that includes it, and in a long chat the whole history is usually re-sent each turn, so the cost of each new turn creeps up as the conversation grows. Bigger context isn't free; it's billed.
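The creep is easier to see with numbers. The sketch below assumes a simplified chat in which every user message and every reply is the same size and the full history is re-sent on each turn; real conversations are messier, but the shape of the curve is the same.

```python
def cumulative_input_tokens(turns: int, tokens_per_message: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens billed over a chat that re-sends its full history.

    Simplifying assumption: each turn adds one user message and one
    assistant reply, both tokens_per_message long.
    """
    total_billed = 0
    history = system_tokens
    for _ in range(turns):
        history += tokens_per_message  # the new user message joins the history
        total_billed += history        # the whole history is billed as input
        history += tokens_per_message  # the reply joins the history for next turn
    return total_billed


for turns in (5, 10, 20):
    print(turns, "turns:", cumulative_input_tokens(turns, 200), "input tokens")
```

Under these assumptions, doubling the number of turns roughly quadruples the input tokens billed, because each turn pays again for everything that came before it.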
Where bills actually blow up
Surprise costs almost always come from volume times size, not unit price. Re-sending long histories or large documents on every request, retrieval that stuffs too much text into the prompt, agent loops that make many calls per task, and high request volume all multiply token usage fast. The fix is usually architectural: send less context, summarize histories, retrieve only what's needed, and cap agent steps.
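One of those fixes, sending less history, can be sketched in a few lines. This is an illustration under stated assumptions: `rough_token_count` is a hypothetical helper based on the words-per-token rule of thumb, and a production system would use the provider's own token counter and might summarize dropped messages rather than discard them.

```python
import math


def rough_token_count(text: str) -> int:
    """Rule-of-thumb estimate (about 0.75 words per token). Not a tokenizer."""
    return math.ceil(len(text.split()) / 0.75)


def trim_history(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages that fit in max_tokens, oldest first."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        size = rough_token_count(message)
        if used + size > max_tokens:
            break                       # everything older is dropped
        kept.append(message)
        used += size
    kept.reverse()                      # restore chronological order
    return kept


history = [
    "word " * 300,  # oldest message, about 400 tokens
    "word " * 150,  # about 200 tokens
    "word " * 75,   # newest message, about 100 tokens
]
recent = trim_history(history, max_tokens=350)
print(len(recent), "of", len(history), "messages kept")
```

The same budget idea applies to retrieval (cap the tokens of retrieved text) and to agents (cap the number of steps per task).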
Match the model to the job to control spend
The biggest lever is using the right-sized model. Frontier models can cost many times more per token than smaller ones, so routing routine tasks (classification, extraction, simple replies) to a cheaper model and reserving the expensive one for genuinely hard work can cut bills dramatically, with little or no loss of quality on the routine tasks. Many teams run a tiered setup for exactly this reason.
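A tiered setup can be as simple as a lookup on task type. In the sketch below the model names, prices, and task labels are all hypothetical, and the blended price assumes every task uses about the same number of tokens.

```python
# Task types treated as routine (labels are illustrative assumptions).
ROUTINE_TASKS = {"classification", "extraction", "simple_reply"}

# Hypothetical model names and prices in dollars per million tokens.
PRICE_PER_MILLION = {"small-model": 0.50, "frontier-model": 10.00}


def pick_model(task_type: str) -> str:
    """Send routine work to the cheap model, everything else to the big one."""
    return "small-model" if task_type in ROUTINE_TASKS else "frontier-model"


def blended_price(task_mix: dict[str, float]) -> float:
    """Average price per million tokens for a task mix whose shares sum to 1."""
    return sum(
        share * PRICE_PER_MILLION[pick_model(task)]
        for task, share in task_mix.items()
    )


mix = {"classification": 0.5, "extraction": 0.3, "complex_reasoning": 0.2}
tiered = blended_price(mix)
untiered = PRICE_PER_MILLION["frontier-model"]
print(f"tiered: ${tiered:.2f} vs. frontier only: ${untiered:.2f} per million tokens")
```

With this made-up mix, where 80 percent of traffic is routine, the blended price is roughly a quarter of sending everything to the frontier model.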
Cut costs without cutting quality
Several levers reduce spend with little downside: cache or reuse responses for repeated queries, summarize long chat histories instead of resending them verbatim, trim retrieved context to what's actually relevant, and set sensible maximum output lengths so the model doesn't ramble on your dime. Batch requests where you can, and pay for the largest model only on the calls that genuinely need it. Teams often see their first scary bill drop sharply once they stop sending more tokens than the task requires.
Estimate before you commit
Before building, do the napkin math: tokens per request times requests per day times the per-token price, computed separately for input and output and then added together. That single estimate tells you whether a feature is viable at scale and where the crossover point is against running a model yourself. Watch tail cases too: the longest inputs and busiest hours, not the average, are what produce the scary invoice.
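The napkin math fits in one small function. The traffic figures and prices below are invented for illustration; substitute your own measurements and your provider's current rates.

```python
def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                 input_price_per_million: float, output_price_per_million: float,
                 days: int = 30) -> float:
    """Estimated monthly spend in dollars, pricing input and output separately."""
    per_request = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    ) / 1_000_000
    return per_request * requests_per_day * days


# Hypothetical workload: 10,000 requests a day at $3 / $15 per million tokens.
average_case = monthly_cost(10_000, 1_500, 300, 3.00, 15.00)
# Same traffic, but every request carries a long 6,000-token input.
tail_case = monthly_cost(10_000, 6_000, 300, 3.00, 15.00)

print(f"average inputs: ${average_case:,.0f} per month")
print(f"long inputs:    ${tail_case:,.0f} per month")
```

Running the estimate for both the typical and the longest realistic input shows how far the bill can move if usage drifts toward the tail.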