Building a Production MCP Server with Spring AI

Architecture decisions and implementation patterns for a Spring Boot MCP server backing a fleet of autonomous Claude agents.

May 02, 2026 · AI Reference

The Model Context Protocol (MCP) is Anthropic's open standard for connecting AI models to external tools and data sources. Instead of hard-coding tool definitions into each agent, an MCP server centralizes them: any compliant client (Claude Code, Claude Desktop, custom agents) connects to the server and gets access to its full tool catalog.

This article covers the architecture of a production MCP server built with Spring AI's MCP server support - one that has been running daily since early 2025, backing a fleet of autonomous Claude agents handling CRM sync, document processing, analytics pipelines, and vault maintenance.

Tech Stack

Java 21, Spring Boot 3.5
Spring AI 1.1.2 (MCP server / webmvc transport)
Quartz Scheduler (job scheduling and agent dispatch)
JGit (git operations)
dav4jvm (CardDAV / WebDAV contacts)
Apache Commons Email, SendGrid (email send/receive)
AWS SDK v2 (S3 object storage)
OpenAI API (image generation)
SQLite (local persistence)
Gradle

Architecture: One Server, Multiple Domains

The server exposes tools across 12 functional domains:

Email / EmailStore - IMAP/SMTP message retrieval, search, metadata, send, and structured CRM-oriented email queries
Filesystem / Obsidian vault - Read, write, move, search, frontmatter operations, section-level edits on a local markdown vault
Git - add, commit, log, diff, show, and worktree management on the vault repository
S3 object storage - List, upload, download, delete, and presign across multiple providers (AWS S3, Backblaze B2, Cloudflare R2)
OpenAI image generation - Text-to-image via the OpenAI API, callable by any agent
CalDAV - Calendar event upload, fetch, and delete via dav4jvm
CardDAV / WebDAV - Contact card upload and vCard management
System utilities - UUID generation, date/time conversion, epoch math, date arithmetic
HTTP - Outbound GET requests with response capture
Push notifications - Pushover delivery for agent alerts
Agent queue - Marks queued agent jobs as processed
SMTP - Targeted outbound email delivery

All 12 domains run in a single Spring Boot deployment. One server, one connection for agent clients.

Why one server? A single deployment means one connection for clients, a shared Spring context (beans, configuration, and database connections shared across domains), and a single monitoring surface. The tradeoff is a single point of failure - acceptable for personal automation infrastructure, but a concern for multi-tenant or high-availability deployments where domain isolation would matter more.

Defining Tools in Spring AI

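Spring AI discovers tools via annotations. A method annotated with @Tool on a @Component is added to the server's tool catalog once the bean is exposed through a ToolCallbackProvider (for example, one built with MethodToolCallbackProvider), which the MCP server auto-configuration picks up.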

The tool description matters more than the implementation. Claude uses descriptions to decide which tool to call and what arguments to pass. A few patterns that hold up in production:

Be explicit about side effects. Read tools and write tools need clear differentiation in their descriptions. Tools that modify state should say so - Claude is cautious about side effects and will use that information.

Make error returns structured. Return a consistent error object rather than throwing exceptions. Agents handle structured error responses better than stack traces, and they can make retry or fallback decisions based on error content.

Use natural-language parameter names. vaultRootPath and vaultSubPath communicate their purpose. p1 and p2 make the LLM guess. This is especially important for parameters with non-obvious constraints (e.g., a path that must be relative to the vault root, not an absolute filesystem path).

Test descriptions before wiring into agents. Drop into Claude.ai and simulate the tool selection decision in a message: "I have a tool called X that does Y. Given this situation, would you call it?" If Claude's reasoning doesn't match your expectation, the description needs work before you depend on it in a production workflow.
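The structured-error and parameter-naming patterns above can be sketched in plain Java. This is a minimal illustration, not the server's actual code: ToolResult, VaultReadTool, the error codes, and the in-memory map standing in for the vault are all hypothetical.

```java
import java.util.Map;

/** Hypothetical result envelope; field names are illustrative, not the article's schema. */
record ToolResult(boolean ok, Object data, String errorCode, String message, boolean retryable) {

    static ToolResult success(Object data) {
        return new ToolResult(true, data, null, null, false);
    }

    static ToolResult failure(String errorCode, String message, boolean retryable) {
        return new ToolResult(false, null, errorCode, message, retryable);
    }
}

class VaultReadTool {
    // Stand-in for the real vault: relative path -> file content.
    private final Map<String, String> files;

    VaultReadTool(Map<String, String> files) {
        this.files = files;
    }

    /** Returns a structured result instead of throwing, so the agent can branch on errorCode. */
    ToolResult readTextFile(String vaultSubPath) {
        if (vaultSubPath == null || vaultSubPath.isBlank()) {
            return ToolResult.failure("INVALID_ARGUMENT", "vaultSubPath must not be blank", false);
        }
        if (vaultSubPath.startsWith("/")) {
            // The constraint is spelled out in the message so the model can correct its next call.
            return ToolResult.failure("INVALID_ARGUMENT",
                    "vaultSubPath must be relative to the vault root, not an absolute path", false);
        }
        String content = files.get(vaultSubPath);
        if (content == null) {
            return ToolResult.failure("NOT_FOUND", "No file at " + vaultSubPath, false);
        }
        return ToolResult.success(content);
    }
}
```

The retryable flag is one possible way to let an agent distinguish a transient failure from a bad argument; whether that distinction is worth carrying depends on the tools involved.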

Agents as Markdown Files

Rather than hard-coding agent logic in Java, agents are defined as markdown files. Each file contains YAML frontmatter (agent ID, schedule, allowed tools) and a system prompt in the body.

A minimal agent definition looks like:

---
agent_id: crm-email-sync
schedule: 0 0 * * * ?
tools:
  - EmailStore_findAllMessageMetadataSinceDateReceived
  - EmailStore_findMessagesByEmailIds
  - Obsidian_readTextFile
  - Obsidian_writeTextFile
  - Obsidian_updateFrontmatterProperties
  - System_getTimeAsFormat
---

You are a CRM email sync agent. Your job is to find new emails
involving CRM contacts since the last sync date...

The scheduler reads these files and dispatches agents according to their configured Quartz cron expression. Agents can also be triggered on demand via a remote trigger endpoint.
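As a rough sketch of the loading step, the following splits an agent file into frontmatter and system prompt. It is hand-rolled and handles only the flat layout shown above (key: value lines plus one tools list); a real loader would more likely use a YAML library. AgentDefinition and AgentFileParser are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory form of an agent file; fields mirror the frontmatter keys. */
record AgentDefinition(String agentId, String schedule, List<String> tools, String systemPrompt) {}

class AgentFileParser {

    /** Parses the flat frontmatter subset: "key: value" lines and a "tools:" list. */
    static AgentDefinition parse(String fileContent) {
        String[] lines = fileContent.strip().split("\\R");
        if (!lines[0].strip().equals("---")) {
            throw new IllegalArgumentException("Missing opening --- frontmatter delimiter");
        }
        String agentId = null;
        String schedule = null;
        List<String> tools = new ArrayList<>();
        boolean inTools = false;
        int i = 1;
        for (; i < lines.length && !lines[i].strip().equals("---"); i++) {
            String line = lines[i].strip();
            if (line.isEmpty()) {
                continue;
            }
            if (inTools && line.startsWith("- ")) {
                tools.add(line.substring(2).strip());
                continue;
            }
            inTools = false;
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a key: value line; ignored in this sketch
            }
            String key = line.substring(0, colon).strip();
            String value = line.substring(colon + 1).strip();
            switch (key) {
                case "agent_id" -> agentId = value;
                case "schedule" -> schedule = value;
                case "tools" -> inTools = true;
                default -> { /* unknown keys are ignored */ }
            }
        }
        if (i >= lines.length) {
            throw new IllegalArgumentException("Missing closing --- frontmatter delimiter");
        }
        // Everything after the closing delimiter is the system prompt.
        String body = String.join("\n", Arrays.copyOfRange(lines, i + 1, lines.length)).strip();
        return new AgentDefinition(agentId, schedule, List.copyOf(tools), body);
    }
}
```

Because the parser reads the file fresh each time it is called, edits to an agent file take effect on the next dispatch without a redeploy, which matches the workflow described here.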

Defining agents as files pays off quickly: they are easy to version-control, modify without a redeploy, and inspect when something goes wrong. Every agent change is a git commit. git log shows exactly what changed when an agent starts misbehaving.

Scheduler Integration

Quartz handles job scheduling. Each agent definition maps to a Quartz job. At fire time, the scheduler resolves the agent file, builds a Claude API request with the system prompt and tool access list, and submits it to the Anthropic API.

The variable ${START_EPOCH} is interpolated into the system prompt at fire time. Its value is the current Unix epoch in seconds with 000 appended, which yields a millisecond-format timestamp at one-second precision. This gives agents a reliable "now" reference without requiring a time tool call at the start of every run. Agents use it to calculate cutoff dates for incremental sync operations.
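A minimal sketch of that interpolation, assuming a plain string substitution at fire time. PromptInterpolator and its method names are hypothetical.

```java
import java.time.Instant;

class PromptInterpolator {

    /** Epoch seconds with "000" appended: millisecond-format digits, one-second precision. */
    static String startEpochMillis(Instant fireTime) {
        return fireTime.getEpochSecond() + "000";
    }

    /** Replaces every literal ${START_EPOCH} placeholder in the system prompt. */
    static String interpolate(String systemPrompt, Instant fireTime) {
        // String.replace is a literal replacement, so "$" and braces need no escaping.
        return systemPrompt.replace("${START_EPOCH}", startEpochMillis(fireTime));
    }

    public static void main(String[] args) {
        String prompt = "The current time is ${START_EPOCH} (epoch milliseconds).";
        System.out.println(interpolate(prompt, Instant.now()));
    }
}
```

Passing the fire time in as a parameter, rather than reading the clock inside the method, keeps the substitution easy to test with a fixed instant.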

Prompt Caching for Scheduled Agents

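Scheduled agents have stable system prompts - the same prompt fires on every run. Anthropic's prompt caching stores a prompt prefix with a default TTL of 5 minutes, refreshed each time the cached prefix is read; a 1-hour TTL is available at a higher cache-write price.

At the default TTL, an agent that fires every 60 minutes starts each run with a cold cache, and so does a checker agent that fires every 10-15 minutes. Caching still pays off in two cases: within a single run, where every tool-call round trip resends the same system prompt and tool definitions, and across runs of frequent checker agents when the 1-hour TTL is used. Mark the system prompt as cacheable in the API request. Cache reads cost roughly 10% of the base input token price at current pricing, while cache writes carry a premium over it - meaningful when a long system prompt is sent dozens of times per day.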
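The caching arithmetic can be sketched as follows. The multipliers are assumptions based on Anthropic's published pricing at the time of writing (1.25x the base input price for a 5-minute cache write, 0.1x for a cache read) and should be checked against current rates. The sketch also assumes every request after the first arrives while the cache entry is still live. CacheCostEstimate and its method names are hypothetical.

```java
class CacheCostEstimate {
    // Multipliers relative to the base input-token price (assumed, see lead-in).
    static final double CACHE_WRITE_5M = 1.25;
    static final double CACHE_READ = 0.10;

    /** System-prompt cost with caching: one write, then reads. Unit: base-price tokens. */
    static double cachedCost(int promptTokens, int requests) {
        if (requests <= 0) {
            return 0.0;
        }
        return promptTokens * (CACHE_WRITE_5M + CACHE_READ * (requests - 1));
    }

    /** System-prompt cost without caching: full price on every request. */
    static double uncachedCost(int promptTokens, int requests) {
        return (double) promptTokens * Math.max(requests, 0);
    }

    public static void main(String[] args) {
        int promptTokens = 4000;
        int requests = 8;
        System.out.printf("cached=%.0f uncached=%.0f%n",
                cachedCost(promptTokens, requests), uncachedCost(promptTokens, requests));
    }
}
```

Under these assumptions, a 4,000-token system prompt sent 8 times within the TTL costs about 7,800 token-equivalents cached versus 32,000 uncached. With a single request, the write premium makes caching cost more than it saves.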

Tool Count Grows Fast

12 domains sounds manageable. In practice, each domain has 5-15 tools. The total catalog is 60+ tools.

Claude handles large tool catalogs, but each agent should receive only the tools it needs - not the full catalog. This keeps the context window focused, reduces the chance of a wrong tool selection, and makes the agent's behavior easier to reason about. The agent definition file's tools list enforces this at dispatch time.
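One way to enforce the tools list at dispatch time is a simple allowlist filter over the catalog. This is a sketch under assumptions: ToolCatalogFilter and ToolSpec are hypothetical, and the catalog is modeled as a map from tool name to spec rather than whatever structure the server actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ToolCatalogFilter {

    /** Hypothetical catalog entry: a tool name plus the description sent to the model. */
    record ToolSpec(String name, String description) {}

    /**
     * Returns only the tools named in the agent definition, in the order listed.
     * Unknown names fail fast so a typo in an agent file surfaces at dispatch time.
     */
    static List<ToolSpec> select(Map<String, ToolSpec> catalog, List<String> allowedTools) {
        List<ToolSpec> selected = new ArrayList<>();
        List<String> unknown = new ArrayList<>();
        for (String name : allowedTools) {
            ToolSpec spec = catalog.get(name);
            if (spec == null) {
                unknown.add(name);
            } else {
                selected.add(spec);
            }
        }
        if (!unknown.isEmpty()) {
            throw new IllegalArgumentException("Unknown tools in agent definition: " + unknown);
        }
        return List.copyOf(selected);
    }
}
```

Failing on unknown names is a design choice; silently dropping them would let an agent run without a tool its prompt expects, which tends to be harder to diagnose.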

Logging Tool Call Chains

The most useful debugging data is the sequence of tool calls an agent made and what each one returned. Log this at the server level. When an agent produces unexpected output, the tool call chain tells you exactly where reasoning diverged - whether it was a bad tool selection, an unexpected return value, or a multi-step logic error.

Without this log, debugging a complex multi-step agent run requires re-firing it and hoping the same inputs are still available.
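A tool call chain log can be as simple as a wrapper around each invocation. The sketch below keeps entries in memory for illustration; a production server would presumably write them to persistent storage. ToolCallLog, Entry, and the runId parameter are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ToolCallLog {

    /** One link in the chain: what was called, with what, what came back, how long it took. */
    record Entry(int sequence, String runId, String toolName,
                 Map<String, Object> arguments, String outcome, Duration elapsed) {}

    private final List<Entry> entries = new ArrayList<>();

    /** Invokes the tool and records its result or runtime failure before returning. */
    <R> R call(String runId, String toolName, Map<String, Object> arguments,
               Function<Map<String, Object>, R> tool) {
        Instant start = Instant.now();
        String outcome = "UNKNOWN"; // stays UNKNOWN only if a non-RuntimeException Throwable escapes
        try {
            R result = tool.apply(arguments);
            outcome = "OK: " + result;
            return result;
        } catch (RuntimeException e) {
            outcome = "ERROR: " + e;
            throw e;
        } finally {
            Duration elapsed = Duration.between(start, Instant.now());
            synchronized (entries) {
                entries.add(new Entry(entries.size() + 1, runId, toolName, arguments, outcome, elapsed));
            }
        }
    }

    /** The ordered chain of calls for one agent run. */
    List<Entry> chainFor(String runId) {
        synchronized (entries) {
            return entries.stream().filter(entry -> entry.runId().equals(runId)).toList();
        }
    }
}
```

Recording arguments and return values together is what makes a run reconstructible after the fact, even when the original inputs are no longer available.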
