Save the Tokens: Why “Just Add Another Feature” Breaks the Bank

2026-06-29 · 9 min read

Photo by Zulfugar Karimov on Unsplash

Why traditional development mindsets fail in the AI era, and how to use compact tool schemas, smart delegation and semantic caching to stop the bleeding.

The Most Important Thing in AI-Powered Services

Recently, I worked with a team on developing an agentic chatbot. During the process, I noticed several interesting trends, along with a few hidden blind spots.

It made me think more deeply about one question: what is the most important thing when developing AI-powered services?

  • Is it making sure AI can discover our services through an MCP server or other AI tooling?

  • Is it helping AI understand customer intent and personalise content for each user?

  • Is it enabling AI to provide different insights and conduct deep research?

Well, all of these things are important. But the most important thing is cost.

The Disaster of Developing AI-Powered Chatbots

If you have worked with AI before, you probably know that its pricing model is very simple: pay as you go and pay per token.

The problem is that with AI chatbots, the cost can grow frighteningly fast.

Of course, we can always restrict conversation length or set up a five-hour usage window, just like some tech giants do.

But before we add restrictions or only allow users to ask three questions, let’s talk about a few interesting things I saw while working with one team.

The Main Differences in AI-Powered Service Development

Here’s the story. The team was trying to build a chatbot service for a high-traffic website. They created a UI, set up an MCP server with multiple tools, plugged the MCP server into Next.js using the Vercel AI SDK and dynamically loaded all the tools.

It worked. Everything was running smoothly.

Then they got excited and started adding more and more tools to the MCP server. And it still worked.

Woohoo!

At first, they made a rough estimate and thought each request would only consume several thousand tokens, which still seemed acceptable.

Spoiler: it was not.

I dug into the PRs and realised that the system prompt alone was already using around 1,000 tokens. Every tool they added contributed thousands more. Before the user even asked a question, the request was already consuming around 10,000 tokens. Oh my God.

This is one of the main differences between traditional software development and AI-powered service development.

In traditional development, developers can usually add or remove UI components and features quite freely. Adding another button or page does not immediately multiply the cost of every user interaction.

AI service development is different.

Every tool we add contributes to the input prompt. Every instruction, schema, description and capability takes up space in the model context. Slowly and invisibly, the context becomes bloated.

If we simply follow the old pattern and keep adding tools blindly without doing the calculations, the company might go bankrupt tomorrow.
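
To make that arithmetic concrete, here is a rough sketch. The 10,000-token overhead comes from the story above; the traffic and the price per million input tokens are made-up assumptions, so plug in your own numbers.

```typescript
// Fixed prompt overhead is paid on every request, before the user types a word.
// ASSUMPTION: the traffic and price used below are illustrative, not real figures.
interface OverheadInputs {
  overheadTokens: number;           // system prompt + tool definitions
  requestsPerDay: number;           // hypothetical traffic
  usdPerMillionInputTokens: number; // hypothetical price
}

function dailyOverheadCostUsd(inputs: OverheadInputs): number {
  const tokensPerDay = inputs.overheadTokens * inputs.requestsPerDay;
  return (tokensPerDay / 1000000) * inputs.usdPerMillionInputTokens;
}

// Same traffic and price, different amount of context bloat.
const bloatedDailyUsd = dailyOverheadCostUsd({
  overheadTokens: 10000,
  requestsPerDay: 100000,
  usdPerMillionInputTokens: 1,
});
const trimmedDailyUsd = dailyOverheadCostUsd({
  overheadTokens: 2000,
  requestsPerDay: 100000,
  usdPerMillionInputTokens: 1,
});
```

With these invented numbers, the bloated setup costs 1,000 USD a day in overhead alone, while the trimmed one costs 200 USD. The absolute figures are not the point; the multiplication is.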

So, is there a way to fix it?

Evaluation-Driven Development

I love running experiments.

Not the kind of A/B testing we usually do in product development, but experiments during development itself. In software engineering, it is always better to separate different concerns so we can measure their impact independently.

The same principle applies to AI development.

To keep the blast radius small and make sure we evaluate the impact correctly, every tool we add to the MCP server should meet a few basic criteria:

  • Compact tool definitions
  • Clean and minimal tool schemas
  • Pruned responses that return only the important fields (data pruning)

Every tool schema, definition and type is passed to the LLM so it can understand which tools are available and decide which one to use. That means we need to optimise these definitions and limit the number of tokens they consume.

You may be tempted to pass the entire backend response to the LLM.

Don’t!

Do not pollute the model context with internal garbage. Only return the fields that actually matter for your service.

For example, if the chatbot only needs to show map pins, maybe all it needs is the latitude, longitude, name and address. It does not need the entire database object, internal IDs, metadata, timestamps and so on.
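
A minimal sketch of that kind of pruning for the map pin case. The BackendPlaceRecord shape and its field names are hypothetical; only latitude, longitude, name and address come from the example above.

```typescript
// ASSUMPTION: a made-up backend record, standing in for whatever your API returns.
interface BackendPlaceRecord {
  id: string;
  name: string;
  address: string;
  latitude: number;
  longitude: number;
  createdAt: string;
  updatedAt: string;
  internalNotes: string;
}

// The only fields the model needs in order to talk about a map pin.
type MapPin = Pick<BackendPlaceRecord, "name" | "address" | "latitude" | "longitude">;

// Copy the useful fields explicitly so new backend fields never leak into the context.
function toMapPin(record: BackendPlaceRecord): MapPin {
  const { name, address, latitude, longitude } = record;
  return { name, address, latitude, longitude };
}
```

Building the result field by field, instead of deleting the fields you do not want, means a new backend column stays out of the model context unless someone deliberately adds it.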

For the Vercel AI SDK, the onStepFinish callback is extremely useful here (see official docs). It reports token usage for each step, including input and output tokens, which makes investigation much easier.
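
Here is a small sketch of a tracker you could feed from that callback, for example with something like (step) => tracker.record(step.usage). The usage field names below are an assumption, since they differ between SDK versions, so check the docs for the version you are on.

```typescript
// ASSUMPTION: the usage shape is illustrative; real field names vary by SDK version.
interface StepUsageLike {
  inputTokens?: number;
  outputTokens?: number;
}

interface UsageTotals {
  steps: number;
  inputTokens: number;
  outputTokens: number;
}

// Accumulates token usage across all steps of one request.
function createUsageTracker() {
  const totals: UsageTotals = { steps: 0, inputTokens: 0, outputTokens: 0 };
  return {
    record(usage: StepUsageLike): void {
      totals.steps += 1;
      totals.inputTokens += usage.inputTokens ?? 0;   // missing counts are treated as 0
      totals.outputTokens += usage.outputTokens ?? 0;
    },
    totals(): UsageTotals {
      return { ...totals }; // return a copy so callers cannot mutate the tracker
    },
  };
}
```

Logging the totals per request, next to the list of enabled tools, is usually enough to see which tool made the input side jump.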

Back to the story I mentioned earlier.

The team quickly added 5-10 tools to the MCP server and dynamically loaded all of them. Once that happened, it became extremely difficult to untangle the problem and optimise each tool one by one.

This feels like a new kind of technical debt. Or maybe we should call it token debt? Haha.

So how do we fix it?

Honestly, I guess the most realistic way is to comment out all the tools locally, then manually tune and re-enable each one step by step.

But to avoid all this doom and tragedy over the next six months, it is much better to handle this earlier during development.

I would love to see tools that can analyse token usage for each tool definition and reject a PR when it introduces too much token overhead, just like a linter. A token linter! We need this!
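
Until such a tool exists, a very rough version is not hard to sketch. Everything here is an assumption: the ToolDefinitionLike shape is made up, and the "four characters per token" rule is only a crude heuristic, so a real check should use the tokenizer of the model you actually call.

```typescript
// ASSUMPTION: a simplified tool definition; real ones depend on your SDK.
interface ToolDefinitionLike {
  toolName: string;
  description: string;
  inputSchema: object; // JSON Schema for the tool input
}

interface BudgetViolation {
  toolName: string;
  estimatedTokens: number;
}

// ASSUMPTION: roughly 4 characters per token. This is a heuristic, not a tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Returns every tool whose serialised definition exceeds the per-tool budget.
function lintToolBudget(
  tools: ToolDefinitionLike[],
  maxTokensPerTool: number
): BudgetViolation[] {
  return tools
    .map((tool) => ({
      toolName: tool.toolName,
      estimatedTokens: estimateTokens(JSON.stringify(tool)),
    }))
    .filter((entry) => entry.estimatedTokens > maxTokensPerTool);
}
```

Run in CI, a non-empty result could fail the build, which is about as close to a token linter as we can get with a few lines of code.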

Frontend Tools and Backend Tools

In the Vercel AI SDK, there are frontend tools and backend tools.

The backend tools are part of AI SDK Core, while the frontend tools sit in AI SDK UI.

So what is the difference between them?

Backend tools are useful for handling more complex tasks, such as calling APIs, querying databases, or running business logic on the server side.

Frontend tools, on the other hand, are better suited for simple frontend or UI interactions, such as panning a map, opening a panel, highlighting an item or updating part of the interface.

When we use backend tools, more tokens are usually involved. The LLM needs to select the tool, call it, analyse the response and then pass the result back to the frontend. In many cases, all of this becomes part of the model context.

By contrast, frontend tools run on the client side. The LLM chooses the tool, sends a simple instruction to the frontend, and the frontend performs the action. Done.

That means choosing where to place a tool is not just an architectural decision. It is also a cost decision.

If a tool only needs to trigger a UI interaction, keep it on the frontend. Do not send unnecessary work back to the model or the server just to burn tokens.
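
As a sketch of what "keep it on the frontend" can look like, here is a tiny client-side dispatcher. The tool names and the MapViewState shape are hypothetical; in AI SDK UI you would typically call something like this from the client-side tool call handler, but check the docs for the exact wiring.

```typescript
// ASSUMPTION: made-up UI state and tool names, for illustration only.
interface MapViewState {
  centerLat: number;
  centerLng: number;
  panelOpen: boolean;
}

type ClientToolCall =
  | { toolName: "panMap"; input: { lat: number; lng: number } }
  | { toolName: "openPanel"; input: Record<string, never> };

// Applies a tool call directly to UI state. No server round trip, no extra model call.
function applyClientTool(state: MapViewState, call: ClientToolCall): MapViewState {
  switch (call.toolName) {
    case "panMap":
      return { ...state, centerLat: call.input.lat, centerLng: call.input.lng };
    case "openPanel":
      return { ...state, panelOpen: true };
  }
}
```

The model only has to emit a tool name and a couple of numbers; everything else happens in the browser.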

By choosing the right place to add each tool, you might save your product. (Or at least save your tokens…)

Smart Delegation

There is another way to reduce this problem: delegation.

In chatbot services, we do not need to use the most powerful model for everything.

Sometimes, we can ask a lighter model to handle the boring tasks first, such as summarising long text, extracting key points or cleaning up noisy data. Then we can pass the summarised result to a smarter model for deeper analysis.

For example, imagine we have a long essay of one million tokens. A lightweight model might process that text very cheaply, while a more capable model could cost significantly more for the same input. Instead of asking the expensive model to read everything from scratch, we can ask the cheaper model to summarise the text first.

Then the smarter model only needs to analyse the summarised content. That alone could save a large portion of the cost.
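
A back-of-the-envelope sketch of that saving. All three prices and the 20,000-token summary length are invented for illustration, and the final answer written by the smarter model is left out because it costs about the same in both setups.

```typescript
// ASSUMPTION: hypothetical prices in USD per million tokens. Use your provider's real ones.
const LIGHT_INPUT_USD_PER_M = 0.1;
const LIGHT_OUTPUT_USD_PER_M = 0.4;
const STRONG_INPUT_USD_PER_M = 3;

function tokenCostUsd(tokens: number, usdPerMillion: number): number {
  return (tokens / 1000000) * usdPerMillion;
}

// The strong model reads the whole essay itself.
function directCostUsd(essayTokens: number): number {
  return tokenCostUsd(essayTokens, STRONG_INPUT_USD_PER_M);
}

// The light model reads the essay and writes a summary;
// the strong model then reads only the summary.
function delegatedCostUsd(essayTokens: number, summaryTokens: number): number {
  return (
    tokenCostUsd(essayTokens, LIGHT_INPUT_USD_PER_M) +
    tokenCostUsd(summaryTokens, LIGHT_OUTPUT_USD_PER_M) +
    tokenCostUsd(summaryTokens, STRONG_INPUT_USD_PER_M)
  );
}

const directUsd = directCostUsd(1000000);
const delegatedUsd = delegatedCostUsd(1000000, 20000);
```

With these made-up prices, reading the essay directly costs about 3 USD, while the delegated route costs roughly 0.17 USD.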

Of course, this does not mean we should blindly summarise everything. Summarisation can lose details.

The key is to use the right model for the right job, instead of sending every tiny task to the most expensive brain in the room.

Tool Routing

Another pattern worth considering is tool routing.

Instead of giving the main agent access to every tool from the beginning, we can add a routing step before the actual task starts. The router first analyses the user’s intent, then decides which group of tools should be made available to the main agent.

For example, if the user asks about map navigation, the main agent may only need map-related tools. If the user asks about account details, it may need account-related backend tools. If the user asks a general question, it may not need any tools at all.
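
A minimal sketch of that routing step. The tool names are hypothetical, and plain keyword matching stands in for the router here; a real router would more likely be a small, cheap model call that classifies the intent.

```typescript
type ToolGroup = "map" | "account" | "none";

// ASSUMPTION: made-up tool names, grouped by the kind of request they serve.
const TOOLS_BY_GROUP: Record<ToolGroup, string[]> = {
  map: ["searchPlaces", "getDirections"],
  account: ["getAccountDetails", "listOrders"],
  none: [],
};

// ASSUMPTION: keyword matching is a stand-in for real intent classification.
function routeIntent(message: string): ToolGroup {
  const text = message.toLowerCase();
  if (/\b(map|route|directions|nearby)\b/.test(text)) return "map";
  if (/\b(account|billing|password|order)\b/.test(text)) return "account";
  return "none";
}

// Only the returned tools are handed to the main agent for this request.
function toolsForMessage(message: string): string[] {
  return TOOLS_BY_GROUP[routeIntent(message)];
}
```

The main agent then sees two tool definitions, or none, instead of the whole catalogue.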

If you are interested in learning more, Vercel has a good introduction in their official documentation.

Semantic Caching

Semantic caching is another useful technique.

Instead of caching responses based only on exact string matches, semantic caching stores responses keyed by the meaning of the prompt and looks them up by similarity. If a new user query is semantically similar to a previous one, the system may return a cached response instead of calling the LLM again.

This means we can check the cache first before hitting the model, which can reduce both latency and cost.

However, there is a caveat.

If the similarity threshold is too loose, the cache might return the answer to a different question. If it is too strict, the cache rarely hits. So semantic caching can be powerful, but it should be used with care.
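
To show where the threshold sits, here is a small in-memory sketch. The embed function is a hypothetical one you supply (in practice a call to an embedding model), and a real setup would use a vector store instead of a linear scan.

```typescript
// ASSUMPTION: embed is supplied by the caller and always returns same-length vectors.
type EmbedFn = (text: string) => number[];

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);
  const normA = Math.sqrt(a.reduce((sum, value) => sum + value * value, 0));
  const normB = Math.sqrt(b.reduce((sum, value) => sum + value * value, 0));
  return normA === 0 || normB === 0 ? 0 : dot / (normA * normB);
}

function createSemanticCache(embed: EmbedFn, threshold: number) {
  const entries: { vector: number[]; response: string }[] = [];
  return {
    // Returns the most similar cached response, or undefined if nothing clears the threshold.
    get(query: string): string | undefined {
      const vector = embed(query);
      let bestScore = -Infinity;
      let bestResponse: string | undefined = undefined;
      for (const entry of entries) {
        const score = cosineSimilarity(vector, entry.vector);
        if (score >= threshold && score > bestScore) {
          bestScore = score;
          bestResponse = entry.response;
        }
      }
      return bestResponse;
    },
    set(query: string, response: string): void {
      entries.push({ vector: embed(query), response });
    },
  };
}
```

Call get before hitting the model and set after a fresh answer comes back. The threshold is the knob to tune against real queries.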

If you are interested in learning more, Redis has a good introduction in their official documentation.

Final Thoughts

In AI-powered services, architecture decisions are cost decisions.

Every tool, every schema, every response, and every unnecessary field can quietly add weight to your request.

Money matters. Tokens matter. Before adding more tools and celebrating your success, remember to check the bill.



Written by Yuki Cheung

Hey there, I am Yuki! I'm a front-end software engineer who's passionate about both coding and gaming. When I'm not coding, you'll often find me playing video games like Fire Emblem or The Legend of Zelda! Lately, I've been working on AI-related topics too.