Infrastructure as Code Fundamentals

Target Audience: Developers learning DevOps and infrastructure management Reading Time: 25 minutes Prerequisites: Basic understanding of cloud computing, command line, and version control

What is Infrastructure?

Before diving into Infrastructure as Code (IaC), let's understand what infrastructure means in modern software development.

Infrastructure encompasses all the resources needed to run your application:

Compute: Servers (EC2, Azure VMs, GCP Compute Engine)
Storage: Databases (RDS, DynamoDB), object storage (S3, Azure Blob)
Networking: VPCs, subnets, load balancers, firewalls
Security: IAM roles, security groups, SSL certificates
Content Delivery: CDNs (CloudFront, Cloudflare)
Monitoring: Logging, metrics, alerting systems

Traditional Infrastructure Management

The Old Way: Manual setup through web consoles

Problems with Manual Infrastructure:

No version control: Changes aren't tracked in Git
Human error: Typos, missed configurations, inconsistent settings
Snowflake servers: Each environment is slightly different
No documentation: Setup lives in people's heads, not code
Hard to replicate: Creating identical staging environment takes hours
No disaster recovery: Can't rebuild infrastructure from scratch
Team bottlenecks: Only ops team can provision resources

The Modern Way: Infrastructure as Code

The New Way: Infrastructure defined in code files

What is Infrastructure as Code (IaC)?

Definition: Infrastructure as Code (IaC) is the practice of managing and provisioning infrastructure through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools.

In Simple Terms: Your servers, databases, and networks are defined in code files (like .tf, .yaml, .json) that you can version, review, and deploy just like application code.

Before vs After: Real-World Example

Scenario: Deploy a web application with database and load balancer

Before IaC (Manual Setup - 4 Hours)

Total Time: 8+ hours for staging + production Documentation: Outdated wiki page with screenshots Consistency: Staging slightly different from production (human error) Recovery Time: 8+ hours to rebuild from scratch

After IaC (Automated - 10 Minutes)

Total Time: 10 minutes for staging + production Documentation: The code IS the documentation Consistency: 100% identical environments Recovery Time: 10 minutes to rebuild from code

The Problem: Manual Infrastructure Creates Chaos

Let's look at real-world problems that IaC solves.

Problem 1: The "Works on My Machine" Problem

With IaC: Same code → Same infrastructure everywhere

Problem 2: The Bus Factor

Day 1: Senior DevOps Engineer sets up production
      (All knowledge in their head)

Day 30: Only they know how to deploy

Day 60: They go on vacation

Day 61: Production crashes
        Team can't rebuild infrastructure
        Business loses $10,000/hour

With IaC: Infrastructure knowledge is in Git, not people's heads

Problem 3: The Compliance Nightmare

Auditor: "Show me your production security group rules from 6 months ago."

Without IaC: "Uh... let me check AWS Console... I think it was... maybe...?"
❌ FAILED AUDIT

With IaC: "Here's the Git commit from June 15th."
✅ PASSED AUDIT

Benefits of Infrastructure as Code

1. Version Control - Infrastructure in Git

# Your infrastructure has full Git history
git log infrastructure/production/

commit 3f5a2b1c
Author: DevOps Team
Date:   2025-10-15
feat: add production load balancer

commit 8d2e9f4a
Author: DevOps Team
Date:   2025-10-10
fix: increase RDS instance size from t3.micro to t3.medium

commit 1a4b7c3d
Author: DevOps Team
Date:   2025-10-01
chore: initial production infrastructure

Benefits:

See who changed what and when
Rollback to previous versions
Code review infrastructure changes
Blame/praise specific team members

2. Reproducibility - Same Code = Same Infrastructure

# This code creates IDENTICAL infrastructure every time
resource "aws_instance" "web" {
  count         = 3
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server-${count.index}"
  }
}

Benefits:

Staging is exact replica of production
Disaster recovery: Rebuild in minutes
Multi-region deployment: Copy code to new region
Testing: Spin up temporary environments

3. Documentation - Code IS the Documentation

Traditional Documentation:

# Production Setup Guide (Last updated: ???)

Login to AWS Console
Go to EC2 → Launch Instance
Select Ubuntu 22.04 (which AMI? 🤷)
Choose t3.micro (or was it t3.small? 🤔)
Configure security group (can't remember the rules 😅)
...15 more steps with screenshots that are now outdated

IaC Documentation:

# This IS the documentation - always up to date!
resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"  # Ubuntu 22.04
  instance_type = "t3.micro"                # 1 vCPU, 1 GB RAM

  vpc_security_group_ids = [aws_security_group.web.id]

  tags = {
    Name        = "production-web-server"
    Environment = "production"
  }
}

4. Automation - CI/CD for Infrastructure

Benefits:

Infrastructure changes deploy automatically
No manual steps = no human error
Consistent deployment process
Fast rollbacks if issues occur

5. Cost Management - Track Infrastructure Changes

# Git diff shows cost impact
git diff main infrastructure/

- instance_type = "t3.micro"   # $7.50/month
+ instance_type = "t3.large"   # $60/month

💡 This change increases monthly cost by $52.50

Benefits:

See cost impact before applying changes
Track who approved expensive resources
Optimize costs over time (Git history shows trends)
Budget alerts based on infrastructure code

6. Disaster Recovery - Rebuild from Code

# Production datacenter destroyed by fire? 🔥
# No problem - rebuild in 10 minutes:

cd infrastructure/production
terraform init
terraform apply -auto-approve

# Infrastructure recreated from code!

Benefits:

Minutes to recover (not days)
No knowledge loss
Automatic backup (Git is the backup)

IaC Tools Comparison

The Four Main Players

Comparison Table

Feature	Terraform	CloudFormation	Pulumi	Ansible
Language	HCL (declarative)	JSON/YAML (declarative)	TypeScript/Python/Go	YAML (imperative)
Cloud Support	AWS, Azure, GCP, 100+	AWS only	AWS, Azure, GCP, K8s	Any (via SSH)
State Management	Required (S3, Terraform Cloud)	AWS-managed	Required (S3, Pulumi Service)	Optional
Learning Curve	Medium (new language)	Medium (AWS-specific)	Low (use familiar language)	Low (YAML)
Community	Large (most popular)	AWS-specific	Growing	Large (config mgmt)
Best For	Multi-cloud, team collaboration	AWS-only projects	Developers preferring code	Configuration management
Type Safety	Limited	None	Full (TypeScript, Go)	None
Tooling	Excellent (IDE support, linters)	Good (AWS-specific)	Excellent (native IDE support)	Good
Cost	Free (Terraform Cloud paid)	Free	Free (Pulumi Service paid)	Free (Tower paid)

When to Choose Each Tool

Choose Terraform if:

✅ Multi-cloud or hybrid cloud
✅ Team needs to learn one tool for all clouds
✅ Strong ecosystem (modules, providers)
✅ Declarative approach preferred

Choose CloudFormation if:

✅ AWS-only infrastructure
✅ Deep AWS integration required
✅ Using AWS-managed state (no external state storage)

Choose Pulumi if:

✅ Team prefers TypeScript/Python/Go over HCL
✅ Need strong type safety
✅ Sharing code between app and infrastructure
✅ Complex logic (loops, conditionals easier in real languages)

Choose Ansible if:

✅ Configuration management (install software on servers)
✅ Existing servers (not creating new infrastructure)
✅ Simple automation tasks
❌ Not recommended for infrastructure provisioning

Tool Combination Strategy

Most teams use multiple tools:

Example Workflow:

Terraform: Provision AWS infrastructure (VPC, EC2, RDS)
Ansible: Configure servers (install Docker, setup monitoring)
Kubernetes: Deploy and orchestrate containers

Terraform Deep Dive

Why Terraform is the Industry Standard

Market Share (2024):

Terraform: 76% (most popular)
CloudFormation: 12%
Pulumi: 8%
Others: 4%

Why Terraform Won:

Multi-cloud support (AWS, Azure, GCP, 3000+ providers)
Declarative syntax (describe what you want, not how)
Strong community (100K+ modules)
Mature tooling (IDE support, CI/CD integrations)

Basic Concepts

1. Providers - Connect to Cloud APIs

# Providers are plugins that interact with APIs
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"  # Use version 5.x.x
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

# Configure AWS provider
provider "aws" {
  region = "us-west-2"

  # Best practice: Use AWS profiles, not hardcoded credentials
  profile = "production"
}

# Configure Azure provider
provider "azurerm" {
  features {}
  subscription_id = var.azure_subscription_id
}

2. Resources - Infrastructure Components

# Resources are the actual infrastructure components
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name        = "production-vpc"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_subnet" "public" {
  count = 2  # Create 2 subnets

  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "public-subnet-${count.index + 1}"
  }
}

3. State - Current Infrastructure State

# Terraform tracks infrastructure in state file
# NEVER edit manually! Use terraform commands.

# Backend configuration for remote state
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # Prevents concurrent modifications
  }
}

Why State Matters:

Maps real infrastructure to code
Tracks resource dependencies
Improves performance (caches attributes)
Enables team collaboration

4. Modules - Reusable Components

# modules/web-service/main.tf - Reusable web service module
variable "service_name" {
  description = "Name of the web service"
  type        = string
}

variable "instance_count" {
  description = "Number of instances"
  type        = number
  default     = 2
}

resource "aws_instance" "web" {
  count         = var.instance_count
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"

  tags = {
    Name = "${var.service_name}-${count.index}"
  }
}

output "instance_ids" {
  value = aws_instance.web[*].id
}

# -----------------------------------------------
# main.tf - Using the module
module "api_service" {
  source = "./modules/web-service"

  service_name   = "api"
  instance_count = 3
}

module "admin_service" {
  source = "./modules/web-service"

  service_name   = "admin"
  instance_count = 1
}

Complete AWS Infrastructure Example

Let's build a production-ready infrastructure with Terraform:

# main.tf - Complete AWS infrastructure for web application

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  # Remote state in S3
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/web-app/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "web-app"
      Environment = var.environment
      ManagedBy   = "terraform"
      Increment   = "0012-aws-infrastructure"
    }
  }
}

# -----------------------------------------------
# NETWORKING
# -----------------------------------------------

# VPC - Virtual Private Cloud
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.environment}-vpc"
  }
}

# Internet Gateway - Allows internet access
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.environment}-igw"
  }
}

# Public Subnets - For load balancer
resource "aws_subnet" "public" {
  count = 2

  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.${count.index}.0/24"
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.environment}-public-subnet-${count.index + 1}"
    Type = "public"
  }
}

# Private Subnets - For application servers
resource "aws_subnet" "private" {
  count = 2

  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 10}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name = "${var.environment}-private-subnet-${count.index + 1}"
    Type = "private"
  }
}

# Route Table - Public subnets route to internet
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "${var.environment}-public-rt"
  }
}

resource "aws_route_table_association" "public" {
  count = 2

  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

# -----------------------------------------------
# SECURITY GROUPS
# -----------------------------------------------

# Load Balancer Security Group - Allow HTTP/HTTPS from internet
resource "aws_security_group" "alb" {
  name_prefix = "${var.environment}-alb-"
  description = "Security group for Application Load Balancer"
  vpc_id      = aws_vpc.main.id

  ingress {
    description = "HTTP from internet"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-alb-sg"
  }
}

# Application Security Group - Allow traffic from ALB only
resource "aws_security_group" "app" {
  name_prefix = "${var.environment}-app-"
  description = "Security group for application servers"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "HTTP from ALB"
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-app-sg"
  }
}

# Database Security Group - Allow PostgreSQL from app servers only
resource "aws_security_group" "db" {
  name_prefix = "${var.environment}-db-"
  description = "Security group for RDS database"
  vpc_id      = aws_vpc.main.id

  ingress {
    description     = "PostgreSQL from app servers"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
  }

  egress {
    description = "Allow all outbound"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "${var.environment}-db-sg"
  }
}

# -----------------------------------------------
# COMPUTE (EC2)
# -----------------------------------------------

# EC2 Instances - Application servers
resource "aws_instance" "app" {
  count = var.app_instance_count

  ami           = data.aws_ami.ubuntu.id
  instance_type = var.app_instance_type

  subnet_id              = aws_subnet.private[count.index % 2].id
  vpc_security_group_ids = [aws_security_group.app.id]

  user_data = templatefile("${path.module}/user-data.sh", {
    environment = var.environment
  })

  root_block_device {
    volume_size = 20
    volume_type = "gp3"
    encrypted   = true
  }

  tags = {
    Name = "${var.environment}-app-${count.index + 1}"
  }
}

# -----------------------------------------------
# LOAD BALANCER
# -----------------------------------------------

# Application Load Balancer
resource "aws_lb" "main" {
  name               = "${var.environment}-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id

  enable_deletion_protection = var.environment == "production"

  tags = {
    Name = "${var.environment}-alb"
  }
}

# Target Group - App servers
resource "aws_lb_target_group" "app" {
  name     = "${var.environment}-app-tg"
  port     = 3000
  protocol = "HTTP"
  vpc_id   = aws_vpc.main.id

  health_check {
    enabled             = true
    path                = "/health"
    port                = "3000"
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
  }

  tags = {
    Name = "${var.environment}-app-tg"
  }
}

# Register EC2 instances with target group
resource "aws_lb_target_group_attachment" "app" {
  count = var.app_instance_count

  target_group_arn = aws_lb_target_group.app.arn
  target_id        = aws_instance.app[count.index].id
  port             = 3000
}

# Listener - HTTP (redirect to HTTPS)
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type = "redirect"

    redirect {
      port        = "443"
      protocol    = "HTTPS"
      status_code = "HTTP_301"
    }
  }
}

# Listener - HTTPS
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = aws_acm_certificate.main.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# -----------------------------------------------
# DATABASE (RDS)
# -----------------------------------------------

# DB Subnet Group
resource "aws_db_subnet_group" "main" {
  name       = "${var.environment}-db-subnet-group"
  subnet_ids = aws_subnet.private[*].id

  tags = {
    Name = "${var.environment}-db-subnet-group"
  }
}

# RDS PostgreSQL Instance
resource "aws_db_instance" "postgres" {
  identifier = "${var.environment}-postgres"

  engine         = "postgres"
  engine_version = "15.3"
  instance_class = var.db_instance_class

  allocated_storage     = var.db_allocated_storage
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = var.db_name
  username = var.db_username
  password = var.db_password  # Better: Use AWS Secrets Manager

  vpc_security_group_ids = [aws_security_group.db.id]
  db_subnet_group_name   = aws_db_subnet_group.main.name

  backup_retention_period = var.environment == "production" ? 7 : 1
  backup_window           = "03:00-04:00"
  maintenance_window      = "mon:04:00-mon:05:00"

  multi_az                   = var.environment == "production"
  deletion_protection        = var.environment == "production"
  skip_final_snapshot        = var.environment != "production"
  final_snapshot_identifier  = "${var.environment}-postgres-final-snapshot-${timestamp()}"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Name = "${var.environment}-postgres"
  }
}

# -----------------------------------------------
# OUTPUTS
# -----------------------------------------------

output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.main.id
}

output "alb_dns_name" {
  description = "Load balancer DNS name"
  value       = aws_lb.main.dns_name
}

output "db_endpoint" {
  description = "RDS endpoint"
  value       = aws_db_instance.postgres.endpoint
  sensitive   = true
}

output "app_instance_ips" {
  description = "Private IPs of app instances"
  value       = aws_instance.app[*].private_ip
}

Terraform Workflow

Step-by-Step Workflow:

# 1. Initialize Terraform (first time only)
terraform init
# Downloads providers (AWS, Azure, etc.)
# Configures backend (S3 state storage)

# 2. Validate syntax
terraform validate
# Checks .tf files for errors
# Output: Success! The configuration is valid.

# 3. Preview changes (ALWAYS do this!)
terraform plan
# Shows what will be created/modified/destroyed
# Output:
#   Plan: 15 to add, 0 to change, 0 to destroy.

# 4. Apply changes
terraform apply
# Creates actual infrastructure
# Prompts: Do you want to perform these actions? yes

# 5. View state
terraform show
# Shows current infrastructure state

# 6. List resources
terraform state list
# Lists all managed resources

# 7. Format code
terraform fmt
# Auto-formats .tf files

# 8. Destroy infrastructure (cleanup)
terraform destroy
# Removes all resources
# Prompts: Do you really want to destroy? yes

Workspaces - Multiple Environments

# Workspaces allow different environments with same code
terraform workspace new development
terraform workspace new staging
terraform workspace new production

# Switch workspace
terraform workspace select production

# List workspaces
terraform workspace list
#   default
#   development
#   staging
# * production  ← Current

# Use workspace in code
resource "aws_instance" "app" {
  instance_type = terraform.workspace == "production" ? "t3.large" : "t3.micro"

  tags = {
    Name        = "app-${terraform.workspace}"
    Environment = terraform.workspace
  }
}

Kubernetes Basics

What is Kubernetes?

Definition: Kubernetes (K8s) is an open-source platform for automating deployment, scaling, and management of containerized applications across clusters of machines.

In Simple Terms: Kubernetes is like an orchestra conductor for containers - it ensures the right containers are running, restarts them if they crash, scales them up/down based on load, and handles networking between them.

Why Kubernetes?

Without Kubernetes (Docker only):

# Manual management - painful at scale!
docker run app1  # Crashed? You must manually restart
docker run app2  # Need 10 replicas? Run command 10 times
docker run db    # Networking? Configure manually

# Load balancing? Write custom scripts
# Auto-scaling? Not happening
# Self-healing? Hope and pray

With Kubernetes:

# Declarative - describe what you want
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 10  # K8s ensures 10 pods always running
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1
        resources:
          requests:
            cpu: 100m
            memory: 128Mi

# K8s automatically:
# ✅ Restarts crashed containers
# ✅ Maintains 10 replicas
# ✅ Load balances traffic
# ✅ Scales up/down based on CPU
# ✅ Rolling updates with zero downtime

Key Concepts

Pods - Smallest Deployable Unit

# pod.yaml - A Pod wraps one or more containers
apiVersion: v1
kind: Pod
metadata:
  name: web-app
  labels:
    app: web
    version: v1
spec:
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80
    resources:
      requests:
        memory: "128Mi"  # Minimum required
        cpu: "100m"      # 0.1 CPU cores
      limits:
        memory: "256Mi"  # Maximum allowed
        cpu: "200m"      # 0.2 CPU cores

    # Health checks - K8s restarts if unhealthy
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10

    # Readiness check - K8s routes traffic when ready
    readinessProbe:
      httpGet:
        path: /ready
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 5

Deployments - Manage Replicas

# deployment.yaml - Deployment manages Pods
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  namespace: production
spec:
  replicas: 3  # Always maintain 3 Pods

  selector:
    matchLabels:
      app: api

  strategy:
    type: RollingUpdate  # Zero-downtime updates
    rollingUpdate:
      maxSurge: 1        # Max 4 Pods during update (3 + 1)
      maxUnavailable: 1  # Min 2 Pods running during update (3 - 1)

  template:
    metadata:
      labels:
        app: api
        version: v1.2.0
    spec:
      containers:
      - name: api
        image: myregistry/api:v1.2.0
        ports:
        - containerPort: 8080

        # Environment variables
        env:
        - name: ENVIRONMENT
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url

Services - Networking and Load Balancing

# service.yaml - Stable endpoint for Pods
apiVersion: v1
kind: Service
metadata:
  name: api-service
spec:
  type: LoadBalancer  # Exposes to internet (AWS creates ELB)

  selector:
    app: api  # Routes to Pods with label app=api

  ports:
  - protocol: TCP
    port: 80        # External port
    targetPort: 8080  # Container port

  sessionAffinity: ClientIP  # Sticky sessions

Complete Microservices Deployment Example

# Complete K8s setup for web application with database

# -----------------------------------------------
# NAMESPACE - Logical isolation
# -----------------------------------------------
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    environment: production

---
# -----------------------------------------------
# CONFIGMAP - Non-sensitive configuration
# -----------------------------------------------
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  app.properties: |
    server.port=8080
    logging.level=INFO
    feature.dark-mode=true

  nginx.conf: |
    server {
      listen 80;
      location / {
        proxy_pass http://api-service:80;
      }
    }

---
# -----------------------------------------------
# SECRET - Sensitive data (base64 encoded)
# -----------------------------------------------
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: production
type: Opaque
data:
  username: YWRtaW4=  # echo -n "admin" | base64
  password: cGFzc3dvcmQxMjM=  # echo -n "password123" | base64
  url: cG9zdGdyZXM6Ly9hZG1pbkBwb3N0Z3Jlczpwb3J0L2FwcGRi

---
# -----------------------------------------------
# DEPLOYMENT - Frontend (Web)
# -----------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: production
spec:
  replicas: 2

  selector:
    matchLabels:
      app: web
      tier: frontend

  template:
    metadata:
      labels:
        app: web
        tier: frontend
        version: v1.0.0
    spec:
      containers:
      - name: web
        image: myregistry/web:v1.0.0
        ports:
        - containerPort: 80

        volumeMounts:
        - name: config
          mountPath: /etc/nginx/nginx.conf
          subPath: nginx.conf

        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"

      volumes:
      - name: config
        configMap:
          name: app-config

---
# -----------------------------------------------
# SERVICE - Frontend (LoadBalancer)
# -----------------------------------------------
apiVersion: v1
kind: Service
metadata:
  name: web-service
  namespace: production
spec:
  type: LoadBalancer
  selector:
    app: web
    tier: frontend
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80

---
# -----------------------------------------------
# DEPLOYMENT - Backend (API)
# -----------------------------------------------
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-backend
  namespace: production
spec:
  replicas: 3

  selector:
    matchLabels:
      app: api
      tier: backend

  template:
    metadata:
      labels:
        app: api
        tier: backend
        version: v2.1.0
    spec:
      containers:
      - name: api
        image: myregistry/api:v2.1.0
        ports:
        - containerPort: 8080

        env:
        - name: NODE_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: app-config
              key: logging.level

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10

        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

---
# -----------------------------------------------
# SERVICE - Backend (ClusterIP - internal only)
# -----------------------------------------------
apiVersion: v1
kind: Service
metadata:
  name: api-service
  namespace: production
spec:
  type: ClusterIP  # Internal only (not exposed to internet)
  selector:
    app: api
    tier: backend
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

---
# -----------------------------------------------
# STATEFULSET - Database (PostgreSQL)
# -----------------------------------------------
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: production
spec:
  serviceName: postgres-service
  replicas: 1

  selector:
    matchLabels:
      app: postgres
      tier: database

  template:
    metadata:
      labels:
        app: postgres
        tier: database
    spec:
      containers:
      - name: postgres
        image: postgres:15
        ports:
        - containerPort: 5432

        env:
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
        - name: POSTGRES_DB
          value: "appdb"

        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data

        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

---
# -----------------------------------------------
# SERVICE - Database (ClusterIP)
# -----------------------------------------------
apiVersion: v1
kind: Service
metadata:
  name: postgres-service
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: postgres
    tier: database
  ports:
  - protocol: TCP
    port: 5432
    targetPort: 5432

---
# -----------------------------------------------
# HORIZONTAL POD AUTOSCALER - Auto-scaling
# -----------------------------------------------
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-backend

  minReplicas: 2
  maxReplicas: 10

  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale up at 70% CPU
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80  # Scale up at 80% memory

Kubernetes Architecture

Docker Fundamentals

What is Docker?

Definition: Docker is a platform for building, shipping, and running applications in lightweight, portable containers that package software with all dependencies.

In Simple Terms: Docker containers are like shipping containers for software - they package your app with everything it needs (code, libraries, system tools) so it runs identically everywhere (laptop, server, cloud).

Containers vs Virtual Machines

Key Differences:

Aspect	Virtual Machines	Docker Containers
Size	GBs (4-8 GB)	MBs (50-500 MB)
Startup	Minutes (boot OS)	Seconds
Performance	Slower (full OS overhead)	Near-native (shared kernel)
Isolation	Complete (separate OS)	Process-level (shared OS)
Density	10-20 VMs per server	100+ containers per server
Best For	Running different OS	Microservices, consistency

Dockerfile - Blueprint for Images

# Multi-stage build - Best practice for production

# -----------------------------------------------
# STAGE 1: Build (includes build tools)
# -----------------------------------------------
FROM node:18-alpine AS builder

# Set working directory
WORKDIR /app

# Copy package files first (leverage Docker cache)
COPY package*.json ./

# Install ALL dependencies (dev + prod)
RUN npm ci

# Copy source code
COPY . .

# Build TypeScript to JavaScript
RUN npm run build

# -----------------------------------------------
# STAGE 2: Production (minimal runtime)
# -----------------------------------------------
FROM node:18-alpine

# Install dumb-init (proper signal handling)
RUN apk add --no-cache dumb-init

# Create non-root user (security best practice)
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Set working directory
WORKDIR /app

# Copy ONLY production files from builder stage
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package*.json ./

# Switch to non-root user (DON'T RUN AS ROOT!)
USER nodejs

# Expose port (documentation only, doesn't actually publish)
EXPOSE 3000

# Health check - Docker can monitor container health
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => process.exit(r.statusCode === 200 ? 0 : 1))"

# Use dumb-init to handle signals properly
ENTRYPOINT ["dumb-init", "--"]

# Start application
CMD ["node", "dist/main.js"]

Size Comparison:

Builder stage: 800 MB (includes npm, TypeScript, build tools)
Final image: 150 MB (only Node.js runtime + compiled code)
81% size reduction!

Docker Compose - Multi-Container Applications

# docker-compose.yml - Complete development environment

version: '3.8'

services:
  # -----------------------------------------------
  # PostgreSQL Database
  # -----------------------------------------------
  postgres:
    image: postgres:15-alpine
    container_name: app-db

    environment:
      POSTGRES_DB: appdb
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: ${DB_PASSWORD}  # From .env file

    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro

    ports:
      - "5432:5432"

    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin"]
      interval: 10s
      timeout: 5s
      retries: 5

    networks:
      - app-network

  # -----------------------------------------------
  # Redis Cache
  # -----------------------------------------------
  redis:
    image: redis:7-alpine
    container_name: app-cache

    ports:
      - "6379:6379"

    volumes:
      - redis-data:/data

    command: redis-server --appendonly yes

    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

    networks:
      - app-network

  # -----------------------------------------------
  # API Service
  # -----------------------------------------------
  api:
    build:
      context: .
      dockerfile: Dockerfile
      args:
        NODE_ENV: development

    container_name: api-server

    environment:
      NODE_ENV: development
      DATABASE_URL: postgres://admin:${DB_PASSWORD}@postgres:5432/appdb
      REDIS_URL: redis://redis:6379
      PORT: 3000

    ports:
      - "3000:3000"

    volumes:
      # Hot-reload in development
      - ./src:/app/src:ro
      - ./logs:/app/logs

    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy

    restart: unless-stopped

    networks:
      - app-network

  # -----------------------------------------------
  # NGINX Reverse Proxy
  # -----------------------------------------------
  nginx:
    image: nginx:alpine
    container_name: nginx-proxy

    ports:
      - "80:80"
      - "443:443"

    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
      - ./ssl:/etc/nginx/ssl:ro

    depends_on:
      - api

    networks:
      - app-network

# -----------------------------------------------
# Volumes - Persistent data storage
# -----------------------------------------------
volumes:
  postgres-data:
    driver: local
  redis-data:
    driver: local

# -----------------------------------------------
# Networks - Container communication
# -----------------------------------------------
networks:
  app-network:
    driver: bridge

Usage:

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Restart API service
docker-compose restart api

# Execute command in container
docker-compose exec api npm test

# Stop all services
docker-compose down

# Stop and delete volumes
docker-compose down -v

CI/CD Pipelines

What is CI/CD?

CI (Continuous Integration): Automatically build and test code on every commit

CD (Continuous Deployment): Automatically deploy to production after CI passes

Complete CI/CD Pipeline with GitHub Actions

# .github/workflows/deploy.yml

name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  AWS_REGION: us-west-2
  ECR_REPOSITORY: myapp
  EKS_CLUSTER: production

jobs:
  # -----------------------------------------------
  # JOB 1: Build and Test
  # -----------------------------------------------
  build-and-test:
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Setup Node.js
      uses: actions/setup-node@v3
      with:
        node-version: '18'
        cache: 'npm'

    - name: Install dependencies
      run: npm ci

    - name: Lint code
      run: npm run lint

    - name: Run unit tests
      run: npm test

    - name: Run integration tests
      run: npm run test:integration

    - name: Check test coverage
      run: npm run test:coverage
      # Fails if coverage < 80%

    - name: Build application
      run: npm run build

    - name: Upload build artifacts
      uses: actions/upload-artifact@v3
      with:
        name: dist
        path: dist/

  # -----------------------------------------------
  # JOB 2: Build and Push Docker Image
  # -----------------------------------------------
  docker-build:
    needs: build-and-test
    runs-on: ubuntu-latest

    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Login to Amazon ECR
      id: login-ecr
      uses: aws-actions/amazon-ecr-login@v1

    - name: Build Docker image
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
        docker tag $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
                   $ECR_REGISTRY/$ECR_REPOSITORY:latest

    - name: Scan image for vulnerabilities
      run: |
        docker scan $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG

    - name: Push to ECR
      env:
        ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        IMAGE_TAG: ${{ github.sha }}
      run: |
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
        docker push $ECR_REGISTRY/$ECR_REPOSITORY:latest

  # -----------------------------------------------
  # JOB 3: Deploy to Staging
  # -----------------------------------------------
  deploy-staging:
    needs: docker-build
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'

    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Update kubeconfig
      run: |
        aws eks update-kubeconfig \
          --region ${{ env.AWS_REGION }} \
          --name ${{ env.EKS_CLUSTER }}-staging

    - name: Deploy to Kubernetes (staging)
      env:
        IMAGE_TAG: ${{ github.sha }}
      run: |
        kubectl set image deployment/api-server \
          api=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
          -n staging

        kubectl rollout status deployment/api-server -n staging

    - name: Run smoke tests (staging)
      run: |
        npm run test:e2e:staging

  # -----------------------------------------------
  # JOB 4: Deploy to Production (Manual Approval)
  # -----------------------------------------------
  deploy-production:
    needs: [docker-build, deploy-staging]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment:
      name: production
      url: https://api.example.com

    steps:
    - name: Checkout code
      uses: actions/checkout@v3

    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: ${{ env.AWS_REGION }}

    - name: Update kubeconfig
      run: |
        aws eks update-kubeconfig \
          --region ${{ env.AWS_REGION }} \
          --name ${{ env.EKS_CLUSTER }}

    - name: Deploy to Kubernetes (production)
      env:
        IMAGE_TAG: ${{ github.sha }}
      run: |
        kubectl set image deployment/api-server \
          api=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG \
          -n production

        kubectl rollout status deployment/api-server -n production

    - name: Run smoke tests (production)
      run: |
        npm run test:e2e:production

    - name: Notify deployment success
      uses: 8398a7/action-slack@v3
      with:
        status: ${{ job.status }}
        text: 'Production deployment successful! 🚀'
        webhook_url: ${{ secrets.SLACK_WEBHOOK }}

CI/CD Pipeline Visualization

Best Practices

1. Version Control Everything

# Infrastructure code belongs in Git
git add terraform/
git add k8s/
git add docker-compose.yml
git commit -m "feat: add production infrastructure"
git push origin main

# DON'T:
# ❌ Manual AWS Console changes (not tracked)
# ❌ kubectl apply without committing YAML first
# ❌ Docker run with unversioned images

2. Modularize Code

# BAD - Monolithic file (1000+ lines)
# infrastructure.tf
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "public1" { ... }
resource "aws_subnet" "public2" { ... }
# ...997 more lines

# GOOD - Modular structure
# modules/networking/main.tf
resource "aws_vpc" "main" { ... }

# modules/compute/main.tf
resource "aws_instance" "app" { ... }

# main.tf - Compose modules
module "networking" {
  source = "./modules/networking"
}
module "compute" {
  source = "./modules/compute"
  vpc_id = module.networking.vpc_id
}

3. Separate Environments

# Directory structure for multiple environments
terraform/
├── modules/          # Reusable modules
├── environments/
│   ├── development/
│   │   ├── main.tf
│   │   └── terraform.tfvars  # dev-specific values
│   ├── staging/
│   │   ├── main.tf
│   │   └── terraform.tfvars  # staging values
│   └── production/
│       ├── main.tf
│       └── terraform.tfvars  # prod values

4. Use Remote State

# NEVER store state locally for teams!

# BAD - Local state (not shareable)
# terraform.tfstate in .git (merge conflicts!)

# GOOD - Remote state in S3
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "us-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"  # Prevents concurrent modifications
  }
}

5. Implement State Locking

# Prevents multiple people running terraform apply simultaneously

# Create DynamoDB table for locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name = "Terraform State Lock Table"
  }
}

6. Tag All Resources

# Tags enable cost tracking, ownership, automation

locals {
  common_tags = {
    Project     = "web-app"
    Environment = var.environment
    ManagedBy   = "terraform"
    Increment   = "0012-aws-infrastructure"
    Team        = "platform"
    CostCenter  = "engineering"
    Owner       = "devops@company.com"
  }
}

resource "aws_instance" "app" {
  tags = merge(local.common_tags, {
    Name = "app-server-${var.environment}"
    Role = "application"
  })
}

7. Use Variables and Secrets Management

# variables.tf - Define variables
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "db_password" {
  description = "Database password"
  type        = string
  sensitive   = true  # Won't be shown in logs
}

# NEVER hardcode secrets!
# ❌ BAD
resource "aws_db_instance" "db" {
  password = "hardcoded123"  # TERRIBLE!
}

# ✅ GOOD - Use AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/db/password"
}

resource "aws_db_instance" "db" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

8. Document Everything

# Use comments to explain WHY, not WHAT

# BAD - Obvious comment
resource "aws_instance" "app" {
  instance_type = "t3.micro"  # Sets instance type to t3.micro
}

# GOOD - Explains rationale
resource "aws_instance" "app" {
  # t3.micro chosen over t3.small to save $15/month
  # Sufficient for current traffic (2K req/day)
  # Re-evaluate if traffic exceeds 10K req/day
  instance_type = "t3.micro"
}

9. Test Infrastructure Code

# Unit tests - Validate syntax
terraform validate

# Plan tests - Simulate changes
terraform plan -out=tfplan

# Integration tests - Apply to test environment
terraform workspace select test
terraform apply -auto-approve

# E2E tests - Verify functionality
curl https://test-api.example.com/health
# Expected: HTTP 200 OK

10. Implement Cost Controls

# Set resource limits to prevent cost overruns

variable "max_instance_count" {
  description = "Maximum number of instances"
  type        = number
  default     = 10

  validation {
    condition     = var.max_instance_count <= 10
    error_message = "Cannot exceed 10 instances (cost limit)"
  }
}

# Use lifecycle policies
resource "aws_instance" "app" {
  lifecycle {
    prevent_destroy = true  # Prevent accidental deletion

    # Ignore changes to tags (allow external tagging)
    ignore_changes = [tags]
  }
}

Common Pitfalls

1. Hardcoded Credentials

# ❌ TERRIBLE - Credentials in code!
provider "aws" {
  access_key = "AKIAIOSFODNN7EXAMPLE"
  secret_key = "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
}

# ✅ CORRECT - Use AWS profiles or environment variables
provider "aws" {
  profile = "production"  # From ~/.aws/credentials
}

# Or use IAM roles (best for CI/CD)
provider "aws" {
  # Uses IAM role attached to EC2/CodeBuild/GitHub Actions
}

2. No State Locking

# Problem: Two developers run terraform apply simultaneously
# Developer A: terraform apply  # Starts
# Developer B: terraform apply  # Also starts (no lock!)
# Result: Corrupted state, conflicting changes, DISASTER!

# Solution: DynamoDB state locking (see Best Practices #5)

3. Large Monolithic Files

# ❌ BAD - Everything in one file (10,000 lines!)
# main.tf
resource "aws_vpc" "main" { ... }
resource "aws_subnet" "public1" { ... }
# ...9,995 more lines
# Impossible to maintain!

# ✅ GOOD - Organized structure
terraform/
├── main.tf          # Entry point
├── variables.tf     # Variable definitions
├── outputs.tf       # Output values
├── networking.tf    # VPC, subnets, etc.
├── compute.tf       # EC2, ASG, etc.
├── database.tf      # RDS, DynamoDB, etc.
└── security.tf      # Security groups, IAM, etc.

4. No Disaster Recovery Plan

# ❌ BAD - No backup, no recovery plan
# Production destroyed by: rm -rf .terraform/
# Result: 8+ hours to rebuild manually

# ✅ GOOD - Infrastructure as Code
# Production destroyed? No problem!
git clone infrastructure-repo
cd terraform/production
terraform init
terraform apply -auto-approve
# Result: 10 minutes to rebuild automatically

5. Ignoring Drift Detection

# Drift = Manual changes made outside Terraform

# ❌ BAD - Ignore drift (state doesn't match reality)
# Someone manually changed security group in AWS Console
# Terraform doesn't know about it
# Next terraform apply might break things!

# ✅ GOOD - Regular drift detection
terraform plan -detailed-exitcode
# Exit codes:
# 0 = No drift
# 1 = Error
# 2 = Drift detected

# Import manually created resources
terraform import aws_instance.example i-1234567890abcdef0

How SpecWeave Tracks Infrastructure

Infrastructure Changes in Increments

## Increment 0012: AWS Infrastructure Setup

### User Stories
- US-001: Provision AWS VPC and subnets (P1, testable)
- US-002: Deploy RDS PostgreSQL database (P1, testable)
- US-003: Set up Application Load Balancer (P1, testable)
- US-004: Configure auto-scaling (P2, testable)

### Acceptance Criteria
- [ ] **AC-US1-01**: VPC created with CIDR 10.0.0.0/16
- [ ] **AC-US1-02**: 2 public + 2 private subnets across AZs
- [ ] **AC-US2-01**: RDS instance accessible from app subnets only
- [ ] **AC-US2-02**: Daily backups enabled (7-day retention)
- [ ] **AC-US3-01**: ALB routes traffic to healthy instances
- [ ] **AC-US4-01**: Auto-scaling maintains 2-10 instances based on CPU

Test Plan (Embedded in tasks)

## T-001: Create VPC and Subnets with Terraform

**AC**: AC-US1-01, AC-US1-02

**Test Plan** (BDD format):
- **Given** AWS account with Terraform configured
- **When** terraform apply executed
- **Then** VPC and subnets created with correct CIDR blocks

**Test Cases**:
- Unit: `terraform validate` passes (syntax validation)
- Integration: `terraform plan` shows expected 5 resources
- E2E: VPC and subnets visible in AWS Console after apply

**Implementation**:
1. Create `main.tf` with VPC resource
2. Define 2 public + 2 private subnets
3. Configure route tables and internet gateway
4. Run `terraform apply`
5. Verify in AWS Console

**Coverage**: 100% (infrastructure validation)

Living Documentation

After infrastructure deployment, SpecWeave updates architecture docs:

# Sync infrastructure to living docs
/specweave:sync-docs update

# Results in:
# 1. Updated HLD with infrastructure diagram
# 2. ADR documenting infrastructure decisions
# 3. Runbook for production operations
# 4. Cost estimate documentation

Example ADR:

# ADR-012: AWS Infrastructure for Production

## Status
Accepted

## Context
Need scalable, reliable infrastructure for web application with 10K users.

## Decision
Use AWS with Terraform for infrastructure as code.

## Architecture

### Components
- VPC with public + private subnets (multi-AZ)
- Application Load Balancer (internet-facing)
- EC2 Auto Scaling (2-10 instances)
- RDS PostgreSQL (Multi-AZ, automated backups)

### Cost Estimate
- EC2 (t3.micro × 2): $15/month
- ALB: $16/month
- RDS (db.t3.micro): $13/month
- Data transfer: ~$5/month
- **Total**: ~$49/month

## Alternatives Considered
- **Heroku**: Simpler but 3× cost ($150/month)
- **DigitalOcean**: Cheaper but less reliable
- **GCP**: Similar cost but team lacks experience

## Rationale
- Industry standard (AWS = 33% market share)
- Terraform enables multi-cloud future migration
- Auto-scaling handles traffic spikes
- Multi-AZ for high availability

## Consequences
- Learning curve for Terraform (2 weeks)
- AWS lock-in (mitigated by Terraform abstraction)
- Monthly cost ~$50 (acceptable for MVP)

Real-World Example: Full Stack Deployment

Let's deploy a complete production application using IaC best practices.

Architecture Overview

Complete Deployment Pipeline

Next Steps

Congratulations! You now understand Infrastructure as Code fundamentals.

To deepen your knowledge, explore these topics:

Terraform

Terraform guide - Complete Terraform reference
Terraform Registry - Browse modules and providers
Terraform Best Practices - Community best practices

Kubernetes

Kubernetes guide - Complete K8s reference
Helm guide - K8s package manager
Kubernetes Patterns - Design patterns for K8s

Docker

Docker guide - Complete Docker reference
Docker Hub - Public container registry
Docker Best Practices

CI/CD

CI/CD guide - Continuous integration and deployment
GitHub Actions guide - GitHub CI/CD
GitOps - Git-based deployment strategy

Microservices Architecture
Service Mesh
Observability
Site Reliability Engineering (SRE)

Hands-On Practice

Free Learning Resources:

Terraform Tutorial - Official HashiCorp tutorials
Kubernetes the Hard Way - Learn K8s internals
Play with Docker - Free Docker playground
Killercoda - Interactive K8s scenarios

SpecWeave Integration:

Use SpecWeave to track IaC increments
Document infrastructure decisions in ADRs
Embed infrastructure tests in tasks.md
Maintain living docs for runbooks

Last Updated: 2025-11-04 Category: Infrastructure & DevOps Reading Time: 25 minutes

What is Infrastructure?​

Traditional Infrastructure Management​

The Modern Way: Infrastructure as Code​

What is Infrastructure as Code (IaC)?​

Before vs After: Real-World Example​

Before IaC (Manual Setup - 4 Hours)​

After IaC (Automated - 10 Minutes)​

The Problem: Manual Infrastructure Creates Chaos​

Problem 1: The "Works on My Machine" Problem​

Problem 2: The Bus Factor​

Problem 3: The Compliance Nightmare​

Benefits of Infrastructure as Code​

1. Version Control - Infrastructure in Git​

2. Reproducibility - Same Code = Same Infrastructure​

3. Documentation - Code IS the Documentation​

4. Automation - CI/CD for Infrastructure​

5. Cost Management - Track Infrastructure Changes​

6. Disaster Recovery - Rebuild from Code​

IaC Tools Comparison​

The Four Main Players​

Comparison Table​

When to Choose Each Tool​

Tool Combination Strategy​

Terraform Deep Dive​

Why Terraform is the Industry Standard​

Basic Concepts​

1. Providers - Connect to Cloud APIs​

2. Resources - Infrastructure Components​

3. State - Current Infrastructure State​

4. Modules - Reusable Components​

Complete AWS Infrastructure Example​

Terraform Workflow​

Workspaces - Multiple Environments​

Kubernetes Basics​

What is Kubernetes?​

Why Kubernetes?​

Key Concepts​

Pods - Smallest Deployable Unit​

Deployments - Manage Replicas​

Services - Networking and Load Balancing​

Complete Microservices Deployment Example​

Kubernetes Architecture​

Docker Fundamentals​

What is Docker?​

Containers vs Virtual Machines​

Dockerfile - Blueprint for Images​

Docker Compose - Multi-Container Applications​

CI/CD Pipelines​

What is CI/CD?​

Complete CI/CD Pipeline with GitHub Actions​

CI/CD Pipeline Visualization​

Best Practices​

1. Version Control Everything​

2. Modularize Code​

3. Separate Environments​

4. Use Remote State​

5. Implement State Locking​

6. Tag All Resources​

7. Use Variables and Secrets Management​

8. Document Everything​

9. Test Infrastructure Code​

10. Implement Cost Controls​

Common Pitfalls​

1. Hardcoded Credentials​

2. No State Locking​

3. Large Monolithic Files​

4. No Disaster Recovery Plan​

5. Ignoring Drift Detection​

How SpecWeave Tracks Infrastructure​

Infrastructure Changes in Increments​

Test Plan (Embedded in tasks)​

Living Documentation​

Real-World Example: Full Stack Deployment​

Architecture Overview​

Complete Deployment Pipeline​

Next Steps​

Terraform​

Kubernetes​

Docker​

CI/CD​

What is Infrastructure?

Traditional Infrastructure Management

The Modern Way: Infrastructure as Code

What is Infrastructure as Code (IaC)?

Before vs After: Real-World Example

Before IaC (Manual Setup - 4 Hours)

After IaC (Automated - 10 Minutes)

The Problem: Manual Infrastructure Creates Chaos

Problem 1: The "Works on My Machine" Problem

Problem 2: The Bus Factor

Problem 3: The Compliance Nightmare

Benefits of Infrastructure as Code

1. Version Control - Infrastructure in Git

2. Reproducibility - Same Code = Same Infrastructure

3. Documentation - Code IS the Documentation

4. Automation - CI/CD for Infrastructure

5. Cost Management - Track Infrastructure Changes

6. Disaster Recovery - Rebuild from Code

IaC Tools Comparison

The Four Main Players

Comparison Table

When to Choose Each Tool

Tool Combination Strategy

Terraform Deep Dive

Why Terraform is the Industry Standard

Basic Concepts

1. Providers - Connect to Cloud APIs

2. Resources - Infrastructure Components

3. State - Current Infrastructure State

4. Modules - Reusable Components

Complete AWS Infrastructure Example

Terraform Workflow

Workspaces - Multiple Environments

Kubernetes Basics

What is Kubernetes?

Why Kubernetes?

Key Concepts

Pods - Smallest Deployable Unit

Deployments - Manage Replicas

Services - Networking and Load Balancing

Complete Microservices Deployment Example

Kubernetes Architecture

Docker Fundamentals

What is Docker?

Containers vs Virtual Machines

Dockerfile - Blueprint for Images

Docker Compose - Multi-Container Applications

CI/CD Pipelines

What is CI/CD?

Complete CI/CD Pipeline with GitHub Actions

CI/CD Pipeline Visualization

Best Practices

1. Version Control Everything

2. Modularize Code

3. Separate Environments

4. Use Remote State

5. Implement State Locking

6. Tag All Resources

7. Use Variables and Secrets Management

8. Document Everything

9. Test Infrastructure Code

10. Implement Cost Controls

Common Pitfalls

1. Hardcoded Credentials

2. No State Locking

3. Large Monolithic Files

4. No Disaster Recovery Plan

5. Ignoring Drift Detection

How SpecWeave Tracks Infrastructure

Infrastructure Changes in Increments

Test Plan (Embedded in tasks)

Living Documentation

Real-World Example: Full Stack Deployment

Architecture Overview

Complete Deployment Pipeline

Next Steps

Terraform

Kubernetes

Docker

CI/CD