Semantic Search & Vectorization

Enable agents to search your codebase semantically using AI embeddings. Ask "How does authentication work?" instead of grepping for "auth".

Semantic Search

Natural language queries across code, docs, and database schemas

49% Fewer Failures

Contextual Retrieval improves search accuracy

Database Aware

Agents understand your schema and config tables

What is Vectorization?

Vectorization converts your codebase into embeddings—numerical representations that capture meaning—and stores them in a local vector database (LanceDB). This enables semantic search: finding code by meaning rather than exact text matching.
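Under the hood, "finding by meaning" is a nearest-neighbor search over those vectors. A minimal sketch using cosine similarity with toy 3-dimensional vectors (real text-embedding-3-small vectors have 1,536 dimensions; the numbers here are made up purely for illustration):

```typescript
// Embeddings map text to vectors; semantically similar texts produce
// vectors with high cosine similarity, which is what the index searches on.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for real embeddings:
const login = [0.9, 0.1, 0.2];    // "login handler"
const auth = [0.85, 0.15, 0.25];  // "authentication middleware"
const css = [0.05, 0.9, 0.1];     // "button styling"

cosineSimilarity(login, auth); // close to 1 — related concepts
cosineSimilarity(login, css);  // much lower — unrelated
```

This is why a query about "authentication" surfaces a login handler even though the two texts share no keywords — grep has no equivalent notion of distance.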

Traditional Search (grep)

grep -r "auth"

  • Finds "auth", "authenticate", "authorization"
  • Misses "login", "session", "JWT", "token"
  • No understanding of what you actually need

Semantic Search

"How does authentication work?"

  • Finds auth middleware, JWT verification
  • Finds login handlers and session management
  • Finds architecture docs explaining the flow

How It Works

The vectorization system uses a hybrid approach combining vector similarity search with BM25 keyword matching for optimal results.

1. AST Chunking

Code is split into semantic chunks using Tree-sitter—functions, classes, methods, and exports are kept intact.

1,247 files → 8,453 chunks
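A hedged sketch of what AST-style chunking produces. The real pipeline uses Tree-sitter to split on true syntax nodes; this simplified version only approximates it by splitting TypeScript source at top-level declaration boundaries:

```typescript
// Simplified sketch of AST-style chunking. Tree-sitter splits on actual
// syntax nodes; this illustrative version uses a regex over top-level
// declarations so each function/class stays in one chunk.
interface Chunk {
  content: string;
  startLine: number;
  endLine: number;
}

function chunkSource(source: string): Chunk[] {
  const lines = source.split("\n");
  const boundary = /^(export\s+)?(async\s+)?(function|class|const|interface)\b/;
  const chunks: Chunk[] = [];
  let start = 0;
  for (let i = 1; i < lines.length; i++) {
    if (boundary.test(lines[i])) {
      chunks.push({
        content: lines.slice(start, i).join("\n"),
        startLine: start + 1,
        endLine: i,
      });
      start = i;
    }
  }
  chunks.push({
    content: lines.slice(start).join("\n"),
    startLine: start + 1,
    endLine: lines.length,
  });
  return chunks;
}
```

The point of keeping declarations intact is that a chunk like a whole middleware function embeds to a much more meaningful vector than an arbitrary 500-character window.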
2. Contextual Retrieval (Optional)

Claude Haiku adds a brief description to each chunk explaining its role in the codebase. This reduces retrieval failures by 49%.

"This function is the main authentication middleware. It extracts the JWT token from the Authorization header and verifies it..."

3. Embedding & Storage

Chunks are converted to vectors using OpenAI's text-embedding-3-small model and stored in LanceDB alongside a BM25 keyword index.

Vector Index: 38MB
BM25 Index: 4MB
4. Hybrid Search

Queries combine semantic similarity (70%) with keyword matching (30%) for best results. Agents automatically use this via the semantic_search tool.
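The 70/30 combination can be sketched as a weighted sum of the two scores. This is a simplification (production hybrid rankers often add rank fusion and score normalization), and all names here are illustrative:

```typescript
// Sketch of hybrid scoring: combine a semantic similarity score with a
// BM25 keyword score using the configured hybridWeight (0.7 by default).
// Both scores are assumed to already be normalized to [0, 1].
interface ScoredChunk {
  id: string;
  vectorScore: number; // cosine similarity against the query embedding
  bm25Score: number;   // keyword relevance from the BM25 index
}

function hybridRank(chunks: ScoredChunk[], hybridWeight = 0.7): ScoredChunk[] {
  return [...chunks].sort((a, b) => {
    const scoreA = hybridWeight * a.vectorScore + (1 - hybridWeight) * a.bm25Score;
    const scoreB = hybridWeight * b.vectorScore + (1 - hybridWeight) * b.bm25Score;
    return scoreB - scoreA;
  });
}
```

Lowering hybridWeight shifts ranking toward exact keyword matches, which can help for queries that contain precise identifiers rather than natural language.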

Setup Guide

Enable vectorization for your project in a few steps.

Prerequisites

Required: OPENAI_API_KEY — for embeddings
Optional: ANTHROPIC_API_KEY — for Contextual Retrieval
Optional: DATABASE_URL — for database schema indexing
1. During Project Bootstrap

When you bootstrap a new project with @planner, you'll be prompted to enable vectorization.

# Bootstrap prompt

Enable semantic search (vectorization)?

[Y] Yes — recommended

[N] No — use grep/glob only

2. Manual Setup (Existing Projects)

Run the vectorize init command from your project root:

# Initialize vectorization

$ npx @yo-go/vectorize init

# Or via Builder

$ @builder enable vectorization

What vectorize init Does

  1. Checks for required API keys in environment
  2. Adds vectorization section to project.json
  3. Creates .vectorindex/ directory (gitignored)
  4. Scans codebase and creates initial index
  5. Installs git post-commit hook for automatic updates
  6. Optionally indexes database schema

Configuration Reference

The vectorization config lives in docs/project.json:

{
  "vectorization": {
    "enabled": true,
    "storage": "local",
    "embeddingModel": "openai",
    "contextualRetrieval": "auto",
    
    "codebase": {
      "include": ["src/**", "lib/**", "docs/**"],
      "exclude": ["node_modules/**", "dist/**", "*.test.ts"],
      "chunkStrategy": "ast"
    },
    
    "database": {
      "enabled": true,
      "connection": "env:DATABASE_URL",
      "type": "postgres",
      "schema": {
        "include": ["public.*"],
        "exclude": ["public.migrations"]
      },
      "configTables": [
        {
          "table": "public.pricing_tiers",
          "description": "Subscription pricing and feature limits",
          "sampleRows": 10
        }
      ]
    },
    
    "search": {
      "hybridWeight": 0.7,
      "topK": 20
    },
    
    "refresh": {
      "onGitChange": true,
      "onSessionStart": true,
      "maxAge": "24h"
    }
  }
}
| Option              | Default | Description                                                               |
|---------------------|---------|---------------------------------------------------------------------------|
| contextualRetrieval | auto    | Enable contextual descriptions. "auto" = enabled if ANTHROPIC_API_KEY set |
| hybridWeight        | 0.7     | Weight for semantic vs. keyword search (0.7 = 70% semantic)               |
| topK                | 20      | Number of results to return per query                                     |
| maxAge              | 24h     | How old the index can be before prompting for a refresh                   |
| onGitChange         | true    | Auto-refresh the index after git commits via the post-commit hook         |

CLI Commands

The vectorize CLI provides commands to manage your index.

vectorize init

Initialize vectorization for the current project. Creates config, builds initial index, installs git hooks.

$ vectorize init

vectorize refresh

Rebuild the vector index. Use --full for complete rebuild.

$ vectorize refresh # incremental (only changed files)

$ vectorize refresh --full # full rebuild

vectorize search <query>

Test semantic search from the command line.

$ vectorize search "How does user authentication work?"

vectorize status

Show index statistics, health, and storage usage.

$ vectorize status

Agent Integration

When vectorization is enabled, agents automatically gain access to the semantic_search tool.

How Agents Use Semantic Search

Before implementing a feature, agents search for existing patterns:

Before implementing, let me search for existing patterns:

semantic_search("How is authentication implemented?")

→ Found middleware in src/auth/middleware.ts

→ Found provider in src/auth/providers/supabase.ts

→ Found architecture docs explaining the flow

Now I can implement consistently with existing patterns.

Tool Signature

semantic_search({
  query: string,           // Natural language query
  filters?: {
    filePatterns?: string[], // e.g., ["src/auth/**", "*.ts"]
    languages?: string[],    // e.g., ["typescript", "python"]
    contentType?: "code" | "schema" | "config" | "docs"
  },
  topK?: number            // Override default (20)
})

// Returns
{
  results: [
    {
      content: string,      // Chunk content
      filePath: string,     // e.g., "src/auth/middleware.ts"
      lineRange: [45, 89],  // Start and end lines
      language: string,     // e.g., "typescript"
      score: number,        // Relevance score (0-1)
      type: "code" | "schema" | "config" | "docs"
    }
  ],
  indexAge: string,         // e.g., "2 hours ago"
  queryTime: number         // Milliseconds
}

Database Indexing

Vectorization can index your database schema and configuration data, giving agents full awareness of your data model and config-driven behaviors.

Why This Matters

Many applications have configuration-driven behavior where rendering, logic, and features are determined by database values, not just code. Without access to this data, agents can only see the "how" (code) but not the "what" (the actual configurations that drive behavior).

By indexing config tables, agents understand not just your schema structure, but the actual settings, rules, and configurations that make your application work the way it does.

Schema Extraction

The database structure itself — what tables exist and how they relate.

  • Table names and descriptions
  • Column names, types, constraints
  • Foreign key relationships
  • Indexes and table comments

Config Data Extraction

Actual row data from configuration tables — the values that drive behavior.

  • Sample rows from designated tables
  • Table descriptions for context
  • Configurable row limits per table

Understanding Config-Driven Behavior

In many applications, code is generic and behavior is determined by database configurations. Agents need access to this data to understand how your application actually works.

Feature Flags

feature_flags

Which features are enabled, for which users/tiers, rollout percentages

Pricing & Limits

pricing_tiers, subscription_limits

Plan names, prices, quotas, feature access by tier

Permissions

roles, permissions, role_permissions

What each role can do, permission hierarchies

Dynamic Forms

form_definitions, field_configs

Form fields, validation rules, conditional logic stored in DB

Workflows

workflow_steps, state_machines

State transitions, approval chains, automation rules

UI Configuration

menu_items, dashboard_widgets

Navigation structure, layout configs, theme settings

Example: Code Alone Isn't Enough

The code is generic:

async function canAccessFeature(user, featureKey) {
  const tier = user.subscriptionTier;
  const tierConfig = await db.pricing_tiers
    .findOne({ name: tier });
  
  return tierConfig.features
    .includes(featureKey);
}

Without config data, an agent doesn't know what tiers exist or what features each tier includes.

The config data explains behavior:

-- pricing_tiers (indexed)
| name       | features                   |
|------------|----------------------------|
| free       | ["basic_search"]           |
| pro        | ["basic_search", "api",    |
|            |  "export", "webhooks"]     |
| enterprise | ["*"]                      |

Now an agent knows: "API access requires Pro tier, Enterprise gets everything."

Configuring Config Tables

Designate which tables contain configuration data in your project.json:

{
  "vectorization": {
    "database": {
      "enabled": true,
      "configTables": [
        {
          "table": "public.pricing_tiers",
          "description": "Subscription pricing and feature limits",
          "sampleRows": 10
        },
        {
          "table": "public.feature_flags",
          "description": "Feature toggles and rollout config",
          "sampleRows": 50
        },
        {
          "table": "public.roles",
          "description": "User roles and permission sets",
          "sampleRows": 20
        },
        {
          "table": "public.workflow_states",
          "description": "State machine definitions for approval flows",
          "sampleRows": 30
        }
      ]
    }
  }
}

Each config table is indexed with its description and sample rows, making the data searchable alongside your code.
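Conceptually, each entry becomes a single text chunk built from the table name, its description, and its serialized sample rows. A hedged sketch (function and type names are illustrative, not the actual implementation):

```typescript
// Sketch: serialize a config table's description plus sample rows into a
// text chunk that can be embedded alongside code chunks, so queries like
// "what does the pro tier include?" can match actual row data.
interface ConfigTable {
  table: string;
  description: string;
  sampleRows: Record<string, unknown>[];
}

function configTableToChunk(cfg: ConfigTable): string {
  const header = `Table ${cfg.table}: ${cfg.description}`;
  const rows = cfg.sampleRows
    .map((row) => JSON.stringify(row))
    .join("\n");
  return `${header}\nSample rows:\n${rows}`;
}
```

Because the human-written description is embedded together with the rows, a natural-language query can land on the right table even when the query shares no tokens with the raw row data.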

How Agents Use Config Data

Agent query:

"What features are available on the Pro plan?"

→ Returns pricing_tiers row for Pro with features array

Agent query:

"How does the approval workflow work?"

→ Returns workflow_states showing state transitions + code that implements them

Agent query:

"What permissions does the editor role have?"

→ Returns roles and role_permissions data for "editor"

Cost Estimates

Initial indexing costs are one-time. Incremental updates cost roughly 1% of a full index rebuild per commit.

| Codebase Size | Files | Chunks | Embedding | Contextual | Total   |
|---------------|-------|--------|-----------|------------|---------|
| Small         | 500   | 3k     | ~$0.01    | ~$1.50     | ~$1.51  |
| Medium        | 2k    | 12k    | ~$0.02    | ~$6.00     | ~$6.02  |
| Large         | 10k   | 60k    | ~$0.10    | ~$30.00    | ~$30.10 |
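Given the ~1%-per-commit incremental figure, per-commit cost works out to roughly 1% of the initial total. A quick back-of-envelope sketch (the function name is illustrative):

```typescript
// Back-of-envelope: incremental updates re-embed roughly 1% of the index
// per commit, so per-commit cost is about 1% of the initial indexing total.
function perCommitCost(initialTotal: number, incrementalFraction = 0.01): number {
  return initialTotal * incrementalFraction;
}

const mediumPerCommit = perCommitCost(6.02);  // ≈ $0.06 per commit (Medium)
const largePerCommit = perCommitCost(30.10);  // ≈ $0.30 per commit (Large)
```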

Reduce Costs

  • Disable Contextual Retrieval: contextualRetrieval: "never"
  • Reduce include patterns to essential directories only
  • Run vectorize init --dry-run to see estimates before committing