Semantic Search & Vectorization
Enable agents to search your codebase semantically using AI embeddings. Ask "How does authentication work?" instead of grepping for "auth".
Semantic Search
Natural language queries across code, docs, and database schemas
49% Fewer Failures
Contextual Retrieval improves search accuracy
Database Aware
Agents understand your schema and config tables
What is Vectorization?
Vectorization converts your codebase into embeddings—numerical representations that capture meaning—and stores them in a local vector database (LanceDB). This enables semantic search: finding code by meaning rather than exact text matching.
Traditional Search (grep)
grep -r "auth"
- Finds "auth", "authenticate", "authorization"
- Misses "login", "session", "JWT", "token"
- No understanding of what you actually need
Semantic Search
"How does authentication work?"
- Finds auth middleware, JWT verification
- Finds login handlers and session management
- Finds architecture docs explaining the flow
How It Works
The vectorization system uses a hybrid approach combining vector similarity search with BM25 keyword matching for optimal results.
AST Chunking
Code is split into semantic chunks using Tree-sitter—functions, classes, methods, and exports are kept intact.
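To illustrate the idea (not the actual implementation, which walks the Tree-sitter parse tree), here is a simplified chunker that splits on top-level declaration boundaries so each function or class stays intact:

```typescript
// Simplified stand-in for AST chunking. The real pipeline uses Tree-sitter;
// this sketch only shows the goal: chunk boundaries fall on whole
// declarations, never mid-function.
interface Chunk {
  startLine: number; // 1-based first line of the chunk
  text: string;      // the full declaration, kept intact
}

function chunkByTopLevelDecls(source: string): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  let start = 0;
  for (let i = 1; i < lines.length; i++) {
    // A new top-level declaration starts a new chunk.
    if (/^(export\s+)?(async\s+)?(function|class|const|interface)\b/.test(lines[i])) {
      chunks.push({ startLine: start + 1, text: lines.slice(start, i).join("\n") });
      start = i;
    }
  }
  chunks.push({ startLine: start + 1, text: lines.slice(start).join("\n") });
  return chunks;
}
```

A real AST chunker also handles nested scopes, comments, and oversized functions; the point here is only that chunk boundaries align with semantic units.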
Contextual Retrieval (Optional)
Claude Haiku adds a brief description to each chunk explaining its role in the codebase. This reduces retrieval failures by 49%.
"This function is the main authentication middleware. It extracts the JWT token from the Authorization header and verifies it..."
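The exact prompt used for this step is internal to the tool; a minimal sketch of how such a context prompt might be assembled (the wrapper text here is illustrative, not the product's actual prompt):

```typescript
// Hypothetical sketch: build the prompt that asks a small model (e.g.
// Claude Haiku) to describe a chunk's role before embedding it.
function buildContextPrompt(filePath: string, fileSummary: string, chunk: string): string {
  return [
    `<document path="${filePath}">`,
    fileSummary,
    `</document>`,
    `<chunk>`,
    chunk,
    `</chunk>`,
    `Give a short description of this chunk's role in the file, to improve search retrieval.`,
  ].join("\n");
}
```

The generated description is prepended to the chunk before embedding, so queries like "authentication middleware" match even when the chunk itself never uses those words.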
Embedding & Storage
Chunks are converted to vectors using OpenAI's text-embedding-3-small model and stored in LanceDB alongside a BM25 keyword index.
Hybrid Search
Queries combine semantic similarity (70%) with keyword matching (30%) for best results. Agents automatically use this via the semantic_search tool.
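The 70/30 blend described above can be sketched directly — a cosine-similarity score from the embedding side, a BM25 score from the keyword side, combined with the configured weight (both scores assumed normalized to [0, 1]):

```typescript
// Semantic side: cosine similarity between a query vector and a chunk vector.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hybrid score: weight * semantic + (1 - weight) * keyword, 0.7 by default.
function hybridScore(semantic: number, bm25: number, weight = 0.7): number {
  return weight * semantic + (1 - weight) * bm25;
}
```

With the default weight, a chunk that is a perfect semantic match but a poor keyword match still scores 0.7, which is why paraphrased queries like "How does login work?" can outrank exact-string matches.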
Setup Guide
Enable vectorization for your project in a few steps.
Prerequisites
- OPENAI_API_KEY — for embeddings
- ANTHROPIC_API_KEY — for Contextual Retrieval
- DATABASE_URL — for database schema indexing
During Project Bootstrap
When you bootstrap a new project with @planner, you'll be prompted to enable vectorization.
# Bootstrap prompt
Enable semantic search (vectorization)?
[Y] Yes — recommended
[N] No — use grep/glob only
Manual Setup (Existing Projects)
Run the vectorize init command from your project root:
# Initialize vectorization
$ npx @yo-go/vectorize init
# Or via Builder
$ @builder enable vectorization
What vectorize init Does
1. Checks for required API keys in environment
2. Adds vectorization section to project.json
3. Creates .vectorindex/ directory (gitignored)
4. Scans codebase and creates initial index
5. Installs git post-commit hook for automatic updates
6. Optionally indexes database schema
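The first step — the environment check — might look like this (a sketch; the key names come from the prerequisites above, and which keys are hard requirements vs. warnings is an assumption here):

```typescript
// Sketch of the init-time environment check. OPENAI_API_KEY is treated as
// required (embeddings), the other two as optional features.
function checkKeys(env: Record<string, string | undefined>): {
  missing: string[];
  warnings: string[];
} {
  const missing: string[] = [];
  const warnings: string[] = [];
  if (!env.OPENAI_API_KEY) missing.push("OPENAI_API_KEY");        // embeddings
  if (!env.ANTHROPIC_API_KEY) warnings.push("ANTHROPIC_API_KEY"); // Contextual Retrieval
  if (!env.DATABASE_URL) warnings.push("DATABASE_URL");           // schema indexing
  return { missing, warnings };
}
```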
Configuration Reference
The vectorization config lives in docs/project.json:
{
"vectorization": {
"enabled": true,
"storage": "local",
"embeddingModel": "openai",
"contextualRetrieval": "auto",
"codebase": {
"include": ["src/**", "lib/**", "docs/**"],
"exclude": ["node_modules/**", "dist/**", "*.test.ts"],
"chunkStrategy": "ast"
},
"database": {
"enabled": true,
"connection": "env:DATABASE_URL",
"type": "postgres",
"schema": {
"include": ["public.*"],
"exclude": ["public.migrations"]
},
"configTables": [
{
"table": "public.pricing_tiers",
"description": "Subscription pricing and feature limits",
"sampleRows": 10
}
]
},
"search": {
"hybridWeight": 0.7,
"topK": 20
},
"refresh": {
"onGitChange": true,
"onSessionStart": true,
"maxAge": "24h"
}
}
}

| Option | Default | Description |
|---|---|---|
| contextualRetrieval | auto | Enable contextual descriptions. "auto" = enabled if ANTHROPIC_API_KEY is set |
| hybridWeight | 0.7 | Weight for semantic vs. keyword search (0.7 = 70% semantic) |
| topK | 20 | Number of results to return per query |
| maxAge | 24h | How old the index can be before prompting for a refresh |
| onGitChange | true | Auto-refresh the index after git commits via the post-commit hook |
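Applying these defaults when reading the config is straightforward; a sketch for the search block (field names follow the reference above, the loader itself is hypothetical):

```typescript
// Sketch: pull the search settings out of a parsed docs/project.json,
// falling back to the documented defaults.
interface SearchConfig {
  hybridWeight: number; // 0.7 = 70% semantic, 30% keyword
  topK: number;         // results per query
}

function searchConfigFrom(projectJson: any): SearchConfig {
  const s = projectJson?.vectorization?.search ?? {};
  return {
    hybridWeight: s.hybridWeight ?? 0.7,
    topK: s.topK ?? 20,
  };
}
```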
CLI Commands
The vectorize CLI provides commands to manage your index.
vectorize init
Initialize vectorization for the current project. Creates config, builds initial index, installs git hooks.
vectorize refresh
Rebuild the vector index. Use --full for a complete rebuild.
$ vectorize refresh # incremental (only changed files)
$ vectorize refresh --full # full rebuild
vectorize search <query>
Test semantic search from the command line.
vectorize status
Show index statistics, health, and storage usage.
Agent Integration
When vectorization is enabled, agents automatically gain access to the semantic_search tool.
How Agents Use Semantic Search
Before implementing a feature, agents search for existing patterns:
Before implementing, let me search for existing patterns:
semantic_search("How is authentication implemented?")
→ Found middleware in src/auth/middleware.ts
→ Found provider in src/auth/providers/supabase.ts
→ Found architecture docs explaining the flow
Now I can implement consistently with existing patterns.
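The call shape can be sketched in TypeScript against a stub — the real semantic_search tool is provided by the agent runtime, so the types and the in-memory corpus below are illustrative only:

```typescript
// Illustrative result type and a stub implementation, so the call pattern
// agents use can be exercised. A real implementation queries the LanceDB
// index built by vectorize init.
interface SearchResult {
  content: string;
  filePath: string;
  score: number; // relevance, 0-1
}

async function semanticSearch(query: string, topK = 20): Promise<SearchResult[]> {
  const corpus: SearchResult[] = [
    { content: "JWT auth middleware", filePath: "src/auth/middleware.ts", score: 0.92 },
    { content: "CSV export helper", filePath: "src/export/csv.ts", score: 0.11 },
  ];
  // Stub ranking: return the highest-scoring chunks.
  return corpus.sort((a, b) => b.score - a.score).slice(0, topK);
}
```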
Tool Signature
semantic_search({
query: string, // Natural language query
filters?: {
filePatterns?: string[], // e.g., ["src/auth/**", "*.ts"]
languages?: string[], // e.g., ["typescript", "python"]
contentType?: "code" | "schema" | "config" | "docs"
},
topK?: number // Override default (20)
})
// Returns
{
results: [
{
content: string, // Chunk content
filePath: string, // e.g., "src/auth/middleware.ts"
lineRange: [45, 89], // Start and end lines
language: string, // e.g., "typescript"
score: number, // Relevance score (0-1)
type: "code" | "schema" | "config" | "docs"
}
],
indexAge: string, // e.g., "2 hours ago"
queryTime: number // Milliseconds
}

Database Indexing
Vectorization can index your database schema and configuration data, giving agents full awareness of your data model and config-driven behaviors.
Why This Matters
Many applications have configuration-driven behavior where rendering, logic, and features are determined by database values, not just code. Without access to this data, agents can only see the "how" (code) but not the "what" (the actual configurations that drive behavior).
By indexing config tables, agents understand not just your schema structure, but the actual settings, rules, and configurations that make your application work the way it does.
Schema Extraction
The database structure itself — what tables exist and how they relate.
- Table names and descriptions
- Column names, types, constraints
- Foreign key relationships
- Indexes and table comments
Config Data Extraction
Actual row data from configuration tables — the values that drive behavior.
- Sample rows from designated tables
- Table descriptions for context
- Configurable row limits per table
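One plausible way the sampled rows end up searchable is by rendering each config table into a text chunk that gets embedded alongside code — a sketch (the output format is an assumption, not the tool's actual chunk layout):

```typescript
// Sketch: turn a config table's description and sampled rows into a plain
// text chunk for embedding. Row sampling/limits happen upstream.
function configTableChunk(
  table: string,
  description: string,
  rows: Record<string, unknown>[],
): string {
  const lines = [`Table: ${table}`, `Description: ${description}`];
  for (const row of rows) lines.push(JSON.stringify(row));
  return lines.join("\n");
}
```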
Understanding Config-Driven Behavior
In many applications, code is generic and behavior is determined by database configurations. Agents need access to this data to understand how your application actually works.
Feature Flags
feature_flags
Which features are enabled, for which users/tiers, rollout percentages
Pricing & Limits
pricing_tiers, subscription_limits
Plan names, prices, quotas, feature access by tier
Permissions
roles, permissions, role_permissions
What each role can do, permission hierarchies
Dynamic Forms
form_definitions, field_configs
Form fields, validation rules, conditional logic stored in DB
Workflows
workflow_steps, state_machines
State transitions, approval chains, automation rules
UI Configuration
menu_items, dashboard_widgets
Navigation structure, layout configs, theme settings
Example: Code Alone Isn't Enough
The code is generic:
async function canAccessFeature(user, featureKey) {
  const tier = user.subscriptionTier;
  const tierConfig = await db.pricing_tiers
    .findOne({ name: tier });
  return tierConfig.features
    .includes(featureKey);
}

Without config data, an agent doesn't know what tiers exist or what features each tier includes.
The config data explains behavior:
-- pricing_tiers (indexed)
| name | features |
|------------|----------------------------|
| free | ["basic_search"] |
| pro | ["basic_search", "api", |
| | "export", "webhooks"] |
| enterprise | ["*"]                      |

Now an agent knows: "API access requires Pro tier, Enterprise gets everything."
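Putting the two halves together makes the behavior concrete. A self-contained variant of the check, run against in-memory stand-ins for the indexed pricing_tiers rows (treating "*" as a wildcard is an assumption based on the enterprise row):

```typescript
// In-memory stand-in for the indexed pricing_tiers rows.
const pricingTiers: Record<string, string[]> = {
  free: ["basic_search"],
  pro: ["basic_search", "api", "export", "webhooks"],
  enterprise: ["*"], // assumed to mean "all features"
};

// Simplified, synchronous version of the tier check shown above.
function canAccessFeature(tier: string, featureKey: string): boolean {
  const features = pricingTiers[tier] ?? [];
  return features.includes("*") || features.includes(featureKey);
}
```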
Configuring Config Tables
Designate which tables contain configuration data in your project.json:
{
"vectorization": {
"database": {
"enabled": true,
"configTables": [
{
"table": "public.pricing_tiers",
"description": "Subscription pricing and feature limits",
"sampleRows": 10
},
{
"table": "public.feature_flags",
"description": "Feature toggles and rollout config",
"sampleRows": 50
},
{
"table": "public.roles",
"description": "User roles and permission sets",
"sampleRows": 20
},
{
"table": "public.workflow_states",
"description": "State machine definitions for approval flows",
"sampleRows": 30
}
]
}
}
}

Each config table is indexed with its description and sample rows, making the data searchable alongside your code.
How Agents Use Config Data
Agent query:
"What features are available on the Pro plan?"
→ Returns pricing_tiers row for Pro with features array
Agent query:
"How does the approval workflow work?"
→ Returns workflow_states showing state transitions + code that implements them
Agent query:
"What permissions does the editor role have?"
→ Returns roles and role_permissions data for "editor"
Cost Estimates
Initial indexing costs are one-time. Incremental updates cost ~1% of full index per commit.
| Codebase Size | Files | Chunks | Embedding | Contextual | Total |
|---|---|---|---|---|---|
| Small | 500 | 3k | ~$0.01 | ~$1.50 | ~$1.51 |
| Medium | 2k | 12k | ~$0.02 | ~$6.00 | ~$6.02 |
| Large | 10k | 60k | ~$0.10 | ~$30.00 | ~$30.10 |
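A back-of-envelope version of these estimates, for checking your own numbers — the token counts and per-unit rates here are assumptions for illustration, not the product's actual billing:

```typescript
// Rough cost sketch: embedding cost scales with total tokens, contextual
// cost scales with chunk count. All rates are illustrative assumptions.
function estimateCost(
  chunks: number,
  avgTokensPerChunk: number,
  opts: {
    embeddingPerMTok: number;   // e.g. ~$0.02 per 1M tokens
    contextualPerChunk: number; // e.g. ~$0.0005 per chunk with a small model
  },
): { embedding: number; contextual: number; total: number } {
  const tokens = chunks * avgTokensPerChunk;
  const embedding = (tokens / 1_000_000) * opts.embeddingPerMTok;
  const contextual = chunks * opts.contextualPerChunk;
  return { embedding, contextual, total: embedding + contextual };
}
```

As the table shows, the contextual step dominates: embedding is cheap per token, while the per-chunk model calls account for nearly all of the total.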
Reduce Costs
- Disable Contextual Retrieval: contextualRetrieval: "never"
- Reduce include patterns to essential directories only
- Run vectorize init --dry-run to see estimates before committing