AI Agent

The AI Agent turns natural language into ClickHouse insights. Ask a question in plain English and the agent plans a series of tool calls — running read-only SQL, inspecting system.* tables, comparing hosts, spotting issues — then summarizes the findings and can render charts inline.

The agent connects through the same configured ClickHouse host and user as the rest of the dashboard, so its visibility is limited by the grants you give that user. By default it only runs read-only queries.

Open the agent at /agents (use the host selector or ?host=0 to pick a host). It is available on any deployment that has an LLM provider configured.

How It Works

  1. You send a message (optionally targeting a specific hostId).
  2. The agent calls one or more tools to gather evidence — each tool is a purpose-built, version-aware query against ClickHouse system tables.
  3. When a question needs deeper expertise, the agent loads a skill — a bundled expert guide — to inform its analysis before answering.
  4. It streams back a concise answer, shows the SQL it ran, and suggests follow-ups or visualizations.

The agent runs in a bounded tool loop: it keeps calling tools until it has enough to answer, then stops. Destructive actions (kill query, optimize) are disabled unless explicitly enabled (see Configuration).

Workflow Harness

For multi-step tasks — incident investigations, health reports, cross-host comparisons, or any "find and fix" request — the agent runs a lightweight planning harness. It lays out an ordered checklist of steps up front, then updates it after each step so you can follow along live in the chat.

Dynamic workflows

The agent ships a set of dynamic workflow templates — data-driven runbooks it can launch with start_workflow and then adapt at runtime:

WorkflowPurpose
incident-investigationTriage a reported problem to a root cause.
health-checkFull cluster health sweep.
query-optimizationAnalyze and speed up a slow query.
capacity-planningForecast storage and resource needs.
replication-triageDiagnose replication lag or failover.
migration-safetyAssess a schema change before applying it.

When a request matches a template, the agent instantiates it into a plan (tailoring or extending the steps for the specific table/host) and proceeds. When nothing fits, it authors the plan directly with update_plan. Workflows are a registry, so cluster-specific runbooks can be registered at runtime. As the agent learns from intermediate results, it revises the plan — adding, dropping, or reordering steps — keeping the workflow checklist live and accurate.

Each step has a status: pending, in_progress, or completed. Exactly one step is in progress at a time, rendered as a checklist with a progress bar:

  • ✓ completed steps are struck through,
  • the in-progress step is highlighted with a spinner,
  • pending steps wait below.

This makes the agent's reasoning transparent without slowing it down — simple single-step questions skip the plan entirely. The harness is a transparency aid layered on top of the tool loop; it does not change which tools run or the read-only safety guarantees.

Context Management

On an unfamiliar host the agent can call get_context once to orient itself with a single snapshot instead of rediscovering everything query by query. The snapshot gathers, in parallel and with graceful degradation:

  • ClickHouse version & uptime — so queries target the right schema and avoid version-specific column mistakes,
  • Current user — the connected ClickHouse user,
  • Keeper/ZooKeeper availability — whether replication coordination is configured,
  • Memory pressure — tracked vs. OS total/available memory,
  • Changed settings — how many settings differ from defaults (with notable names),
  • Capabilities — the agent's own tool count, loaded skills, and available workflows.

Any unavailable source (e.g. a host without Keeper) is reported as such rather than failing the whole snapshot.

Configuration

The agent uses an OpenAI-compatible API. At minimum set LLM_API_KEY.

LLM_API_KEY=your-provider-key
LLM_API_BASE=https://openrouter.ai/api/v1   # default
LLM_MODEL=openrouter:openrouter/free        # default
VariableDefaultPurpose
LLM_API_KEYProvider API key (required).
LLM_API_BASEhttps://openrouter.ai/api/v1OpenAI-compatible base URL.
LLM_MODELopenrouter:openrouter/freeModel identifier.
LLM_EXTRA_MODELSExtra models to surface in the model picker (see below).
AGENT_API_TOKENShared Bearer token for the agent API.
AGENT_ENABLE_CONTROL_TOOLSfalseEnables destructive tools (kill query, optimize).
AGENT_CONVERSATION_PERSISTENCEfalseEnables server-side conversation persistence.
AGENT_CONVERSATION_STOREautoStore backend: auto, agentstate, d1, durable-object, clickhouse, postgres, memory, or local.

Adding extra models

LLM_EXTRA_MODELS lets you inject additional models into the picker without modifying code. It is a comma-separated list of entries:

provider:modelId[|contextLength][|description]
  • provider — a known provider ID (openrouter, nvidia, anyrouter).
  • modelId — the model ID passed to the provider (may contain colons, e.g. qwen/qwen3-coder:free).
  • contextLength — optional integer token count; defaults to 128000.
  • description — optional display label; defaults to the model ID.

Example:

LLM_EXTRA_MODELS="nvidia:meta/llama-3.3-70b|131072|Llama 3.3 70B,openrouter:x-ai/grok-2"

Extra models are appended after the built-in registry. If an extra entry uses the same provider:modelId key as a built-in entry, the built-in entry wins. The model picker only shows models whose provider has an API key configured; if no provider is configured, all models are shown so the UI is never empty.

For public deployments, require authentication for the agent:

CHM_FEATURE_AGENT_ACCESS=authenticated

See Environment Variables and Feature Permissions for the full list.

Conversation history

The default history backend is browser localStorage. Enable server persistence with:

AGENT_CONVERSATION_PERSISTENCE=true
AGENT_CONVERSATION_STORE=auto

Supported stores are AgentState, Cloudflare D1, SQLite-backed Durable Objects, ClickHouse, Postgres, memory, and localStorage. Server stores require authenticated user identity; unauthenticated sessions keep local browser history. See Agent Conversation Storage for backend-specific setup and permissions.

Tools

Tools are the functions the agent calls to gather data. Every tool accepts an optional hostId (a numeric, 0-based index — 0, 1, 2) and defaults to the current host. The agent picks tools automatically; you never call them directly.

Schema & Exploration

ToolDescription
queryExecute a read-only SQL query (SELECT, WITH/CTE, DESCRIBE).
list_databasesList all databases.
list_tablesList tables in a database with sizes and row counts.
get_table_schemaColumn definitions for a table.
explore_table_schemaSchema exploration with relationship discovery.
discover_data_sourcesFind tables/columns relevant to a topic.

Query Analysis

ToolDescription
get_running_queriesCurrently executing queries with elapsed time.
get_slow_queriesSlowest completed queries.
get_failed_queriesRecent failed queries with error details.
get_expensive_queriesTop queries by memory, bytes read, or duration.
get_query_patternsAggregated query fingerprints with frequency.
explain_queryEXPLAIN plan / pipeline / indexes for a query.
analyze_query_optimizationOptimization opportunities for a SQL query.
repair_querySelf-fix a broken or slow read-only query.
spot_issuesScan recent history for likely problems.

System Health

ToolDescription
get_metricsServer version, uptime, connections, memory.
get_system_resourcesCPU, memory, disk, network, thread metrics.
get_disk_usagePer-disk free/total/used percentage.
get_errorsRecent system errors with counts.
get_crash_logServer crash history.
detect_anomaliesCompare recent vs baseline metrics for anomalies.
generate_health_reportCollect all key metrics into one report.
investigate_incidentAutomated root-cause analysis for an incident.

Findings (persistence)

Findings are short, structured records of noteworthy conditions, persisted to an app-owned monitoring_findings table (MergeTree, 30-day TTL). Writes are best-effort and silently no-op on read-only clusters. Also exposed at GET /api/v1/findings.

ToolDescription
record_findingPersist a finding (severity, category, source, title, metric, value).
list_recent_findingsList recent findings, newest first; filter by severity and time window.

Storage, Merges & Mutations

ToolDescription
get_table_partsPart-level sizes, rows, compression ratio.
get_detached_partsDetached parts needing attention.
get_top_tables_by_sizeTop tables by compressed size.
get_merge_statusActive merge operations with progress.
get_mutationsPending and stuck mutations.
get_merge_performanceHistorical merge throughput.

Replication & Cluster

ToolDescription
get_replication_statusPer-table lag, queue size, leader/readonly.
get_replication_queuePending replication tasks.
get_clustersCluster topology: shards, replicas, hosts.
get_distributed_ddl_queuePending/failed distributed DDL operations.
get_zookeeper_infoZooKeeper/Keeper node data.

Security & Audit

ToolDescription
get_active_sessionsCurrent sessions with user/client info.
get_login_attemptsRecent login successes/failures.
get_users_and_rolesUsers, roles, and access grants.

Schema Migration

ToolDescription
analyze_schema_changeAssess impact of a proposed ALTER TABLE.
get_column_usageFind queries referencing a column (blast radius).
recommend_table_designSuggest ORDER BY, types, indexes, MVs.

Comparison, Insights & Visualization

ToolDescription
compare_time_periodsCompare metrics between two time periods.
compare_hostsCompare two hosts side-by-side.
get_query_insightsHighlight stats: largest scans, peak memory, etc.
get_table_insightsTable-level insights: size, rows, compression.
forecast_capacityForecast days-until-disk-full from 30-day trends.
query_and_visualizeRun SQL and return an interactive chart config.

Settings, Logs & Navigation

ToolDescription
get_settingsServer settings changed from defaults.
get_mergetree_settingsMergeTree settings changed from defaults.
get_text_logServer log entries by level and pattern.
get_stack_tracesCurrent thread stack traces.
get_dashboard_pagesList dashboard pages and routes.
get_chart_dataFetch data from a specific dashboard chart.

Control Actions (destructive)

Disabled unless AGENT_ENABLE_CONTROL_TOOLS=true. The agent always confirms before using them.

ToolDescription
optimize_tableTrigger OPTIMIZE TABLE.
kill_queryKill a running query by id.
kill_mutationCancel a running mutation on a table.

Interaction & Knowledge

ToolDescription
ask_userAsk a structured clarifying question.
load_skillLoad a specialized expert guide (see Skills below).
get_contextRuntime context snapshot: version, user, Keeper, memory, settings, capabilities.
list_workflowsList the available dynamic workflow templates.
start_workflowStart a workflow template as a live plan (see Workflow Harness above).
update_planCreate and update a live workflow checklist (see Workflow Harness above).

Skills

Skills are bundled expert guides the agent loads on demand to deepen its answers. They live in .agents/skills/<name>/SKILL.md and are compiled into a registry. The agent loads the relevant skill before answering domain questions.

SkillCovers
clickhouse-best-practicesSchema design, query tuning, operational guidelines.
query-optimizationPREWHERE, JOIN patterns, materialized views, EXPLAIN, indexes.
system-tables-referenceExact columns of system.processes, query_log, parts, merges, replicas, metrics; when to prefer dedicated tools over raw SQL.
troubleshootingOOM, slow merges, stuck mutations, error-code diagnosis.
replication-guideReplicatedMergeTree, failover, lag diagnosis, Keeper.
cluster-operationsDistributed tables, resharding, node management, topology.
storage-optimizationCompression codecs, TTL, tiered storage, part management.
migration-patternsSchema migrations, ALTER patterns, zero-downtime changes.
security-hardeningRBAC, row policies, quotas, audit logging.

The available skills are exposed at GET /api/v1/agent/skills.

Examples

Ask questions the way you would ask a teammate:

  • "Which queries are running right now and how long have they been executing?" The agent calls get_running_queries and lists query id, user, elapsed time, and memory, sorted by duration.

  • "What were the 10 slowest queries in the last 24 hours?" Uses get_slow_queries, then offers to explain_query on any of them.

  • "Why is the cluster slow right now?" Loads the troubleshooting skill, runs spot_issues and detect_anomalies, and correlates merges, errors, and expensive queries into a root-cause summary.

  • "Show me query volume per hour over the last day as a chart." Uses query_and_visualize to run the aggregation and render a line chart.

  • "Compare disk usage and query load between host 0 and host 1." Calls compare_hosts with hostId1: 0 and hostId2: 1.

  • "Is it safe to drop the event_payload column from analytics.events?" Runs get_column_usage to assess blast radius and analyze_schema_change to classify the ALTER risk before recommending.

  • "Forecast when our biggest disk will fill up." Uses forecast_capacity on 30-day storage trends.

API

The agent is also available over HTTP for integrations.

curl -X POST https://your-host/api/v1/agent \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AGENT_API_TOKEN" \
  -d '{
    "message": "Which queries are running right now?",
    "hostId": 0
  }'

hostId may be a number or a numeric string; it defaults to 0. List loadable skills with GET /api/v1/agent/skills.

Security Notes

  • The agent inherits the configured ClickHouse user's grants — scope that user to read-only system access for safe exposure.
  • Keep destructive control tools off (AGENT_ENABLE_CONTROL_TOOLS=false) unless the ClickHouse user is trusted and you intend to allow kill/optimize.
  • Require authentication on public deployments with CHM_FEATURE_AGENT_ACCESS=authenticated.
  • Keep LLM keys server-side — never in NEXT_PUBLIC_ variables.