ML/AI Integration Fundamentals

Introduction
Understanding LLMs
AI Integration Patterns
RAG Architecture
Prompt Engineering
Model Selection
Training vs Inference
ML Pipelines
Vector Databases
AI Agents & Autonomous Systems
Cost Optimization
Production Best Practices
How SpecWeave Fits In
Common Pitfalls
Next Steps

Introduction

Artificial Intelligence and Machine Learning have transformed from research curiosities to essential tools powering modern applications. From ChatGPT's conversational abilities to GitHub Copilot's code generation, AI is fundamentally changing how we build software.

This guide teaches you:

How Large Language Models (LLMs) work and when to use them
AI integration patterns: APIs, embeddings, and fine-tuning
Retrieval-Augmented Generation (RAG) architecture
Prompt engineering techniques for better AI outputs
Model selection: choosing the right model for your use case
Building production-ready ML pipelines
Vector databases and semantic search
AI agents and autonomous code generation (like SpecWeave!)
Cost optimization strategies

Understanding LLMs

Large Language Models (LLMs) are neural networks trained on massive text datasets to predict the next word in a sequence.

How LLMs Work (Simplified)

Key Concepts:

Tokens: Words or subwords (1 token ≈ 0.75 words)
- "Hello world" = 2 tokens
- "Artificial Intelligence" = 3 tokens
Context Window: Maximum input length
- GPT-5: 128K tokens (~96K words)
- Claude Sonnet 4.5: 200K tokens (~150K words)
- Claude Haiku 4.5: 200K tokens (ultra-fast, simple tasks)
Temperature: Randomness of output
- 0.0 = Deterministic (same output every time)
- 1.0 = Creative (varied outputs)

LLM Capabilities

What LLMs Can Do:

✅ Text generation (articles, code, emails)
✅ Question answering (based on training data)
✅ Text summarization
✅ Language translation
✅ Code generation and debugging
✅ Classification (sentiment analysis, categorization)
✅ Extraction (entities, relationships)

What LLMs Cannot Do:

❌ Access real-time data (unless via API calls)
❌ Browse the internet (without plugins/tools)
❌ Perform precise calculations (use tools for math)
❌ Access private data (unless provided in context)
❌ Guarantee factual accuracy (can hallucinate)

AI Integration Patterns

Pattern 1: Direct API Calls

Simplest approach: Call LLM API directly

// Using OpenAI API
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

async function generateProductDescription(product) {
  const response = await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [
      {
        role: 'system',
        content: 'You are a marketing expert creating product descriptions.'
      },
      {
        role: 'user',
        content: `Create a compelling description for: ${product.name}`
      }
    ],
    temperature: 0.7,
    max_tokens: 200
  });

  return response.choices[0].message.content;
}

// Usage
const description = await generateProductDescription({
  name: 'Wireless Headphones',
  features: ['Noise canceling', '30-hour battery', 'Bluetooth 5.0']
});

Using Anthropic Claude:

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

async function analyzeCodeQuality(code) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: `Analyze this code for quality and suggest improvements:\n\n${code}`
      }
    ]
  });

  return response.content[0].text;
}

Pattern 2: Embeddings & Semantic Search

Embeddings: Convert text to vectors (numerical representations)

// Generate embeddings
async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: text
  });

  return response.data[0].embedding; // Array of 1536 numbers
}

// Calculate similarity
function cosineSimilarity(vec1, vec2) {
  const dotProduct = vec1.reduce((sum, val, i) => sum + val * vec2[i], 0);
  const mag1 = Math.sqrt(vec1.reduce((sum, val) => sum + val * val, 0));
  const mag2 = Math.sqrt(vec2.reduce((sum, val) => sum + val * val, 0));
  return dotProduct / (mag1 * mag2);
}

// Find similar documents
async function findSimilarDocs(query, documents) {
  const queryEmbedding = await getEmbedding(query);

  const results = await Promise.all(
    documents.map(async (doc) => {
      const docEmbedding = await getEmbedding(doc.text);
      const similarity = cosineSimilarity(queryEmbedding, docEmbedding);
      return { doc, similarity };
    })
  );

  return results
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, 5); // Top 5 results
}

Pattern 3: Function Calling (Tool Use)

LLMs can call functions to access external data:

const tools = [
  {
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get current weather for a location',
      parameters: {
        type: 'object',
        properties: {
          location: {
            type: 'string',
            description: 'City name, e.g. San Francisco'
          },
          unit: {
            type: 'string',
            enum: ['celsius', 'fahrenheit']
          }
        },
        required: ['location']
      }
    }
  }
];

async function chatWithTools(userMessage) {
  const response = await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [{ role: 'user', content: userMessage }],
    tools: tools,
    tool_choice: 'auto'
  });

  const message = response.choices[0].message;

  // Check if LLM wants to call a function
  if (message.tool_calls) {
    const toolCall = message.tool_calls[0];
    const functionName = toolCall.function.name;
    const args = JSON.parse(toolCall.function.arguments);

    // Execute function
    let result;
    if (functionName === 'get_weather') {
      result = await getWeather(args.location, args.unit);
    }

    // Send function result back to LLM
    const finalResponse = await openai.chat.completions.create({
      model: 'gpt-5',
      messages: [
        { role: 'user', content: userMessage },
        message, // Original assistant message with tool call
        {
          role: 'tool',
          tool_call_id: toolCall.id,
          content: JSON.stringify(result)
        }
      ]
    });

    return finalResponse.choices[0].message.content;
  }

  return message.content;
}

// Usage
const response = await chatWithTools("What's the weather in London?");
// LLM calls get_weather('London'), then generates natural response

RAG Architecture

Retrieval-Augmented Generation (RAG) combines information retrieval with LLM generation.

Why RAG?

Problem: LLMs don't know about your private data

Company documentation
Customer support tickets
Internal knowledge bases

Solution: Retrieve relevant context, then ask LLM to answer

RAG Flow

RAG Implementation

import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// 1. Index documents (one-time setup)
async function indexDocuments(documents) {
  const index = pinecone.index('knowledge-base');

  for (const doc of documents) {
    // Generate embedding
    const embedding = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: doc.text
    });

    // Store in Pinecone
    await index.upsert([
      {
        id: doc.id,
        values: embedding.data[0].embedding,
        metadata: { text: doc.text, title: doc.title }
      }
    ]);
  }
}

// 2. RAG query
async function answerQuestion(question) {
  const index = pinecone.index('knowledge-base');

  // Step 1: Retrieve relevant documents
  const queryEmbedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question
  });

  const searchResults = await index.query({
    vector: queryEmbedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  });

  // Step 2: Build context from retrieved docs
  const context = searchResults.matches
    .map(match => match.metadata.text)
    .join('\n\n');

  // Step 3: Generate answer using context
  const response = await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [
      {
        role: 'system',
        content: 'Answer questions based on the provided context. If the answer is not in the context, say so.'
      },
      {
        role: 'user',
        content: `Context:\n${context}\n\nQuestion: ${question}`
      }
    ]
  });

  return {
    answer: response.choices[0].message.content,
    sources: searchResults.matches.map(m => ({
      title: m.metadata.title,
      similarity: m.score
    }))
  };
}

// Usage
const result = await answerQuestion('How do I reset my password?');
console.log(result.answer);
console.log('Sources:', result.sources);

RAG Best Practices

1. Chunk documents strategically:

function chunkDocument(text, maxTokens = 500) {
  const paragraphs = text.split('\n\n');
  const chunks = [];
  let currentChunk = '';

  for (const paragraph of paragraphs) {
    const tokens = countTokens(currentChunk + paragraph);

    if (tokens > maxTokens) {
      chunks.push(currentChunk.trim());
      currentChunk = paragraph;
    } else {
      currentChunk += '\n\n' + paragraph;
    }
  }

  if (currentChunk) chunks.push(currentChunk.trim());
  return chunks;
}

2. Add metadata for filtering:

await index.upsert([
  {
    id: doc.id,
    values: embedding,
    metadata: {
      text: doc.text,
      title: doc.title,
      category: 'support', // Filter by category
      dateCreated: '2025-01-15',
      language: 'en'
    }
  }
]);

// Query with filters
const results = await index.query({
  vector: queryEmbedding,
  topK: 5,
  filter: {
    category: { $eq: 'support' },
    language: { $eq: 'en' }
  }
});

3. Hybrid search (keywords + semantic):

// Combine traditional search with vector search
const keywordResults = await elasticsearchClient.search({
  index: 'docs',
  body: {
    query: { match: { text: question } }
  }
});

const vectorResults = await pinecone.query({
  vector: queryEmbedding,
  topK: 5
});

// Merge and re-rank results
const combinedResults = mergeAndRerank(keywordResults, vectorResults);

Prompt Engineering

Prompt engineering is the art of crafting effective instructions for LLMs.

Principles

1. Be Specific:

// ❌ Vague
const prompt = 'Write a function';

// ✅ Specific
const prompt = `Write a JavaScript function that:
1. Takes an array of numbers
2. Filters out negative numbers
3. Returns the sum of remaining numbers
4. Includes JSDoc comments`;

2. Provide Examples (Few-Shot Learning):

const prompt = `Extract product names and prices from receipts.

Examples:
Input: "MacBook Pro - $1,999.00"
Output: { product: "MacBook Pro", price: 1999.00 }

Input: "iPhone 15 Pro (256GB) $999.99"
Output: { product: "iPhone 15 Pro", price: 999.99 }

Now extract from: "${receiptText}"`;

3. Use System Prompts:

const messages = [
  {
    role: 'system',
    content: 'You are a senior software architect. Provide detailed, production-ready code with error handling and [tests](/docs/glossary/terms/unit-testing).'
  },
  {
    role: 'user',
    content: 'Build a REST API for user authentication'
  }
];

4. Constrain Output Format:

const prompt = `Analyze the sentiment of this review: "${review}"

Respond in JSON format:
{
  "sentiment": "positive" | "neutral" | "negative",
  "confidence": 0.0 to 1.0,
  "keywords": ["word1", "word2"]
}`;

Advanced Techniques

Chain of Thought (CoT):

const prompt = `Solve this problem step by step:

Problem: A store has 15 apples. They sell 40% of them. How many apples remain?

Let's think step by step:
1. Calculate 40% of 15: 0.40 × 15 = 6 apples sold
2. Subtract from total: 15 - 6 = 9 apples remain
Answer: 9 apples

Now solve this problem step by step:
Problem: ${userProblem}`;

Self-Consistency:

// Generate multiple answers, pick most common
async function selfConsistentAnswer(question) {
  const answers = await Promise.all(
    Array(5).fill(null).map(() =>
      openai.chat.completions.create({
        model: 'gpt-5',
        messages: [{ role: 'user', content: question }],
        temperature: 0.8 // Higher temperature for variety
      })
    )
  );

  const results = answers.map(r => r.choices[0].message.content);
  return mostCommonAnswer(results); // Voting mechanism
}

Model Selection

Model Comparison

Model	Provider	Context	Strengths	Cost (per 1M tokens)
Claude Sonnet 4.5	Anthropic	200K	Best coding, long context, research	$3 in / $15 out
GPT-5	OpenAI	128K	General purpose, reasoning, creative	$10 in / $30 out
Claude Haiku 4.5	Anthropic	200K	Ultra-fast, cheap, simple tasks	$0.25 in / $1.25 out
o1	OpenAI	128K	Complex reasoning, math, science	$15 in / $60 out
Llama 3.1 405B	Meta	128K	Open source, self-hosted	Free (compute costs)

When to Use Each Model

Claude Sonnet 4.5: Best for coding, long documents, technical writing, research GPT-5: Complex reasoning, creative content, general-purpose tasks Claude Haiku 4.5: Real-time chat, simple classifications, high-volume processing o1: Advanced reasoning tasks, mathematical proofs, scientific analysis Llama 3.1: Privacy-sensitive data, no API costs, offline usage, customization

Model Selection Strategy

function selectModel(taskType, contextLength, budget) {
  // Ultra-cheap tasks
  if (budget === 'low' && taskType === 'simple') {
    return 'claude-haiku-4-5'; // Ultra-fast and cheap
  }

  // Long context
  if (contextLength > 100000) {
    return 'claude-sonnet-4-5'; // 200K context
  }

  // Complex reasoning
  if (taskType === 'complex' || taskType === 'reasoning') {
    return 'o1'; // Best for complex reasoning
  }

  // Creative tasks
  if (taskType === 'creative') {
    return 'gpt-5';
  }

  // Coding tasks
  if (taskType === 'code') {
    return 'claude-sonnet-4-5'; // Best for code
  }

  // Default
  return 'claude-sonnet-4-5';
}

Training vs Inference

Inference (Most Common)

Using pre-trained models via API:

// No training required - just call API
const response = await openai.chat.completions.create({
  model: 'gpt-5',
  messages: [{ role: 'user', content: 'Explain quantum computing' }]
});

Pros:

✅ No training data required
✅ No GPU infrastructure needed
✅ Instant availability
✅ Always up-to-date models

Cons:

❌ Recurring API costs
❌ Rate limits
❌ No control over model behavior

Fine-Tuning

Training a model on custom data:

// 1. Prepare training data
const trainingData = [
  {
    messages: [
      { role: 'system', content: 'You are a customer support bot for TechCorp.' },
      { role: 'user', content: 'How do I reset my password?' },
      { role: 'assistant', content: 'To reset your password:\n1. Visit techcorp.com/reset\n2. Enter your email\n3. Check your email for reset link' }
    ]
  },
  // ... hundreds more examples
];

// 2. Create fine-tuning job
const fineTune = await openai.fineTuning.jobs.create({
  training_file: trainingFileId,
  model: 'gpt-3.5-turbo',
  hyperparameters: {
    n_epochs: 3
  }
});

// 3. Use fine-tuned model
const response = await openai.chat.completions.create({
  model: fineTune.fine_tuned_model,
  messages: [{ role: 'user', content: 'Reset password help' }]
});

When to Fine-Tune:

✅ Specific domain language (medical, legal)
✅ Consistent tone/style
✅ High volume (cost savings)
✅ Specialized task (classification, extraction)

When NOT to Fine-Tune:

❌ Small datasets (<100 examples)
❌ Rapidly changing requirements
❌ General-purpose tasks (prompt engineering is enough)

Self-Hosting (Advanced)

Run open-source models on your infrastructure:

# Using Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("Explain machine learning", return_tensors="pt")
outputs = model.generate(**inputs, max_length=200)
print(tokenizer.decode(outputs[0]))

Pros:

✅ No API costs
✅ Full control
✅ Data privacy
✅ No rate limits

Cons:

❌ Requires GPU infrastructure ($$$)
❌ Model management overhead
❌ Slower than API (unless massive scale)

ML Pipelines

Typical ML Pipeline

Example: Text Classification Pipeline

// 1. Data preparation
async function prepareTrainingData() {
  const rawData = await db.supportTickets.findMany();

  const trainingData = rawData.map(ticket => ({
    text: ticket.message,
    label: ticket.category // 'billing', 'technical', 'sales'
  }));

  // Split into train/test
  const trainSize = Math.floor(trainingData.length * 0.8);
  const trainData = trainingData.slice(0, trainSize);
  const testData = trainingData.slice(trainSize);

  return { trainData, testData };
}

// 2. Training (fine-tuning GPT-3.5)
async function trainClassifier(trainData) {
  const formattedData = trainData.map(item => ({
    messages: [
      { role: 'system', content: 'Classify support tickets into: billing, technical, or sales' },
      { role: 'user', content: item.text },
      { role: 'assistant', content: item.label }
    ]
  }));

  const file = await openai.files.create({
    file: createJSONL(formattedData),
    purpose: 'fine-tune'
  });

  const fineTune = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: 'gpt-3.5-turbo'
  });

  return fineTune.fine_tuned_model;
}

// 3. Evaluation
async function evaluateModel(modelName, testData) {
  let correct = 0;

  for (const item of testData) {
    const response = await openai.chat.completions.create({
      model: modelName,
      messages: [
        { role: 'system', content: 'Classify support tickets into: billing, technical, or sales' },
        { role: 'user', content: item.text }
      ]
    });

    const prediction = response.choices[0].message.content;
    if (prediction === item.label) correct++;
  }

  const accuracy = correct / testData.length;
  console.log(`Accuracy: ${(accuracy * 100).toFixed(2)}%`);
  return accuracy;
}

// 4. Production deployment
async function classifyTicket(text) {
  const response = await openai.chat.completions.create({
    model: PRODUCTION_MODEL, // Fine-tuned model
    messages: [
      { role: 'system', content: 'Classify support tickets into: billing, technical, or sales' },
      { role: 'user', content: text }
    ]
  });

  return response.choices[0].message.content;
}

// 5. Monitoring & feedback loop
async function logPrediction(text, prediction, wasCorrect) {
  await db.predictions.create({
    data: {
      input: text,
      prediction,
      wasCorrect,
      timestamp: new Date()
    }
  });

  // If accuracy drops below threshold, trigger retraining
  const recentAccuracy = await calculateRecentAccuracy();
  if (recentAccuracy < 0.85) {
    await triggerRetraining();
  }
}

Vector Databases

Vector databases store embeddings for fast similarity search.

Popular Vector Databases

Database	Type	Strengths
Pinecone	Cloud	Managed, scalable, easy setup
Weaviate	Self-hosted	Open source, GraphQL API
Chroma	Embedded	Local development, simple
Qdrant	Self-hosted	Fast, Rust-based
Milvus	Self-hosted	Enterprise-scale

Vector Database Operations

import { ChromaClient } from 'chromadb';

const client = new ChromaClient();

// Create collection
const collection = await client.createCollection({
  name: 'documents',
  metadata: { description: 'Company knowledge base' }
});

// Add documents
await collection.add({
  ids: ['doc1', 'doc2', 'doc3'],
  documents: [
    'How to reset password...',
    'Billing FAQ...',
    'Technical support guide...'
  ],
  metadatas: [
    { category: 'auth', date: '2025-01-15' },
    { category: 'billing', date: '2025-01-20' },
    { category: 'technical', date: '2025-02-01' }
  ]
});

// Query (semantic search)
const results = await collection.query({
  queryTexts: ['password reset help'],
  nResults: 3,
  where: { category: 'auth' } // Filter by metadata
});

console.log(results.documents); // Relevant docs
console.log(results.distances); // Similarity scores

AI Agents & Autonomous Systems

AI agents use LLMs to make decisions and take actions autonomously.

How SpecWeave Uses AI Agents

SpecWeave demonstrates production AI agent architecture:

Agent Implementation Pattern

class BaseAgent {
  constructor(llmClient, role, expertise) {
    this.llm = llmClient;
    this.role = role;
    this.expertise = expertise;
  }

  async execute(task) {
    const systemPrompt = this.buildSystemPrompt();
    const userPrompt = this.buildUserPrompt(task);

    const response = await this.llm.chat.completions.create({
      model: 'claude-sonnet-4-5',
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: userPrompt }
      ]
    });

    return this.parseResponse(response.choices[0].message.content);
  }

  buildSystemPrompt() {
    return `You are a ${this.role} with expertise in ${this.expertise}.
    Your responsibilities:
    ${this.getResponsibilities().join('\n')}

    Guidelines:
    ${this.getGuidelines().join('\n')}`;
  }

  abstract getResponsibilities();
  abstract getGuidelines();
  abstract buildUserPrompt(task);
  abstract parseResponse(response);
}

// Concrete agent
class PMAgent extends BaseAgent {
  constructor(llmClient) {
    super(llmClient, 'Product Manager', 'user needs, requirements analysis');
  }

  getResponsibilities() {
    return [
      '- Understand user needs and business goals',
      '- Define clear acceptance criteria',
      '- Prioritize features based on value',
      '- Ensure specifications are testable'
    ];
  }

  buildUserPrompt(task) {
    return `Create a specification for: ${task.description}

    User Story Format:
    As a [user type], I want [feature] so that [benefit]

    Include:
    1. User stories with acceptance criteria
    2. Functional requirements
    3. Non-functional requirements (performance, security)
    4. Success metrics`;
  }

  parseResponse(response) {
    // Extract structured spec from LLM response
    return {
      userStories: this.extractUserStories(response),
      acceptanceCriteria: this.extractAC(response),
      requirements: this.extractRequirements(response)
    };
  }
}

// Multi-agent orchestration
class IncrementPlanner {
  constructor(pmAgent, architectAgent, techLeadAgent) {
    this.pmAgent = pmAgent;
    this.architectAgent = architectAgent;
    this.techLeadAgent = techLeadAgent;
  }

  async planIncrement(userRequest) {
    // Phase 1: PM creates spec
    const spec = await this.pmAgent.execute({
      description: userRequest
    });

    // Phase 2: Architect creates plan (uses spec as input)
    const plan = await this.architectAgent.execute({
      spec: spec,
      constraints: this.getArchitecturalConstraints()
    });

    // Phase 3: Tech Lead creates tasks (uses spec + plan)
    const tasks = await this.techLeadAgent.execute({
      spec: spec,
      plan: plan,
      testStrategy: this.getTestStrategy()
    });

    return { spec, plan, tasks };
  }
}

Cost Optimization

Token Usage Patterns

// Expensive: Including entire codebase in context
const prompt = `Here's my entire 50,000-line codebase:\n${entireCodebase}\n\nAdd a login feature`;
// Cost: ~$5 per request (50K tokens input)

// Optimized: Relevant context only
const relevantFiles = extractRelevantFiles(codebase, 'login');
const prompt = `Relevant files:\n${relevantFiles}\n\nAdd a login feature`;
// Cost: ~$0.10 per request (1K tokens input)

Caching Strategies

// Cache embeddings (don't regenerate on every query)
async function getCachedEmbedding(text) {
  const cacheKey = `embedding:${hash(text)}`;
  let embedding = await redis.get(cacheKey);

  if (!embedding) {
    embedding = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text
    });
    await redis.set(cacheKey, JSON.stringify(embedding), 'EX', 86400); // 24h
  }

  return embedding;
}

// Cache LLM responses for identical queries
async function getCachedCompletion(prompt) {
  const cacheKey = `completion:${hash(prompt)}`;
  let response = await redis.get(cacheKey);

  if (!response) {
    response = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [{ role: 'user', content: prompt }]
    });
    await redis.set(cacheKey, JSON.stringify(response), 'EX', 3600); // 1h
  }

  return response;
}

Model Cascading

// Try cheap model first, fallback to expensive if needed
async function generateWithCascade(prompt) {
  // Try Haiku 4.5 first (cheap and fast)
  const cheapResponse = await anthropic.messages.create({
    model: 'claude-haiku-4-5',
    messages: [{ role: 'user', content: prompt }]
  });

  // Check quality
  const quality = await assessQuality(cheapResponse);

  if (quality > 0.8) {
    return cheapResponse; // Good enough!
  }

  // Fallback to Sonnet 4.5 (more capable)
  return await anthropic.messages.create({
    model: 'claude-sonnet-4-5',
    messages: [{ role: 'user', content: prompt }]
  });
}

Production Best Practices

Error Handling

async function robustLLMCall(prompt, maxRetries = 3) {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await openai.chat.completions.create({
        model: 'gpt-5',
        messages: [{ role: 'user', content: prompt }],
        timeout: 30000 // 30 seconds
      });
    } catch (error) {
      if (error.code === 'rate_limit_exceeded') {
        await sleep(2000 * (i + 1)); // Exponential backoff
        continue;
      }

      if (error.code === 'context_length_exceeded') {
        prompt = truncatePrompt(prompt, 0.8); // Reduce by 20%
        continue;
      }

      throw error; // Unrecoverable error
    }
  }

  throw new Error('Max retries exceeded');
}

Monitoring

// Track token usage
function trackUsage(response, context) {
  await db.llmUsage.create({
    data: {
      model: response.model,
      promptTokens: response.usage.prompt_tokens,
      completionTokens: response.usage.completion_tokens,
      totalTokens: response.usage.total_tokens,
      cost: calculateCost(response.usage, response.model),
      context: context, // e.g., 'user-chat', 'code-generation'
      timestamp: new Date()
    }
  });
}

// Monitor quality
function trackQuality(input, output, userFeedback) {
  await db.llmQuality.create({
    data: {
      input,
      output,
      userRating: userFeedback.rating, // 1-5
      wasHelpful: userFeedback.helpful,
      issueReported: userFeedback.issue || null
    }
  });
}

Rate Limiting

import Bottleneck from 'bottleneck';

// Limit: 500 requests per minute
const limiter = new Bottleneck({
  minTime: 120, // 120ms between requests
  maxConcurrent: 10 // Max 10 parallel requests
});

const rateLimitedCompletion = limiter.wrap(async (prompt) => {
  return await openai.chat.completions.create({
    model: 'gpt-5',
    messages: [{ role: 'user', content: prompt }]
  });
});

How SpecWeave Fits In

AI-Powered Development Workflow

SpecWeave uses AI throughout the development lifecycle:

1. Requirements → Specifications (PM Agent)

User: "Build user authentication with OAuth"
PM Agent: Generates spec.md with user stories, AC, success criteria

2. Specifications → Architecture (Architect Agent)

Architect Agent: Reviews spec, generates plan.md with system design, database schema, API contracts

3. Architecture → Implementation Plan (Tech Lead Agent)

Tech Lead Agent: Creates tasks.md with embedded BDD tests, coverage targets

4. Implementation → Code (Developer + AI)

Developer: Implements tasks, AI assists with code generation

5. Code → Living Docs (Documentation Agent)

Hooks: Auto-update architecture docs, ADRs, API contracts after task completion

AI Integration Documentation

SpecWeave increments document AI usage:

# .specweave/increments/0042-ai-code-review/spec.md

## US1: Automated Code Review

**Acceptance Criteria**:
- AC-US1-01: AI analyzes pull requests for code quality
- AC-US1-02: AI suggests improvements (security, performance, style)
- AC-US1-03: AI provides explanations for suggestions

**AI Integration**:
- Model: Claude Sonnet 4.5 (200K context for large PRs)
- Prompt: "Analyze this code for quality issues..."
- Fallback: GPT-5 if Claude unavailable
- Cost estimate: $0.50 per PR review
- Caching: Cache identical file reviews for 24h

**Test Plan**:
- Given PR with security vulnerability → When AI reviews → Then flag vulnerability
- Given PR with performance issue → When AI reviews → Then suggest optimization

Common Pitfalls

1. Not Handling Hallucinations

❌ Wrong: Trust LLM output blindly

const code = await generateCode(prompt);
deployToProduction(code); // Dangerous!

✅ Correct: Validate output

const code = await generateCode(prompt);
await runTests(code); // Verify correctness
await securityScan(code); // Check for vulnerabilities
if (allChecksPassed) {
  deployToProduction(code);
}

2. Exceeding Context Limits

❌ Wrong: Send entire database

const data = await db.findAll(); // 1M records
const prompt = `Analyze this data: ${JSON.stringify(data)}`;
// Error: Context length exceeded

✅ Correct: Sample or chunk data

const sample = await db.findMany({ take: 100 });
const prompt = `Analyze this sample: ${JSON.stringify(sample)}`;

3. Ignoring Costs

❌ Wrong: No cost tracking

// Called 1000x/day with 50K token context
await openai.chat.completions.create({ /* large prompt */ });
// Cost: $500/day = $15K/month!

✅ Correct: Optimize context

const relevantContext = extractRelevant(largeContext);
await openai.chat.completions.create({ /* small prompt */ });
// Cost: $50/day = $1.5K/month

Next Steps

Deepen Your Knowledge:

Backend Fundamentals - Build APIs for AI integration
Frontend Fundamentals - Create UIs for AI features
Testing Fundamentals - Test AI systems effectively

Hands-On Practice:

Build RAG system for your docs
Fine-tune a model for classification
Create an AI agent for task automation
Implement semantic search with vector DB
Optimize token usage and costs

SpecWeave Integration:

Create AI increment: /specweave:increment "ai-code-review"
Document model selection in ADRs
Track AI costs in increment reports
Use BDD tests for AI behavior validation

Further Reading:

Table of Contents​

Introduction​

Understanding LLMs​

How LLMs Work (Simplified)​

LLM Capabilities​

AI Integration Patterns​

Pattern 1: Direct API Calls​

Pattern 2: Embeddings & Semantic Search​

Pattern 3: Function Calling (Tool Use)​

RAG Architecture​

Why RAG?​

RAG Flow​

RAG Implementation​

RAG Best Practices​

Prompt Engineering​

Principles​

Advanced Techniques​

Model Selection​

Model Comparison​

When to Use Each Model​

Model Selection Strategy​

Training vs Inference​

Inference (Most Common)​

Fine-Tuning​

Self-Hosting (Advanced)​

ML Pipelines​

Typical ML Pipeline​

Example: Text Classification Pipeline​

Vector Databases​

Popular Vector Databases​

Vector Database Operations​

AI Agents & Autonomous Systems​

How SpecWeave Uses AI Agents​

Agent Implementation Pattern​

Cost Optimization​

Token Usage Patterns​

Caching Strategies​

Model Cascading​

Production Best Practices​

Error Handling​

Monitoring​

Rate Limiting​

How SpecWeave Fits In​

AI-Powered Development Workflow​

AI Integration Documentation​

Common Pitfalls​

1. Not Handling Hallucinations​

2. Exceeding Context Limits​

3. Ignoring Costs​

Next Steps​

Table of Contents