This is part of the series Startup that Failed.

Infrastructure as Code: Why It’s Non-Negotiable

Start with this rule: If you’re touching the AWS console to create resources, you’re doing it wrong.

Every resource goes through Terraform. IAM roles, security groups, RDS instances, S3 buckets, EKS clusters — everything. No exceptions.

This isn’t about following best practices. It’s about three concrete problems:

1. Reproducibility. You will need to rebuild your infrastructure. Whether it’s disaster recovery, creating new environments, or (in my case) rebranding, you’ll need to recreate everything. With Terraform, it’s terraform apply. Without it, you’re clicking through AWS console trying to remember what you configured six months ago.

2. Change tracking. Infrastructure changes should go through code review like application code. Someone shouldn’t be able to open a security group to 0.0.0.0/0 in production without anyone noticing. With Terraform in Git, every change is reviewed, documented, and traceable.

3. Drift detection. Terraform knows what should exist. When someone manually changes something (and they will), Terraform detects the drift and restores the intended state.
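
A cheap way to catch drift continuously is a scheduled plan that uses Terraform’s detailed exit code. A minimal sketch (how you schedule it is up to you; exit code 2 means the plan found changes):

# Nightly drift check (e.g., a scheduled CI job)
terraform plan -detailed-exitcode -input=false -lock=false > /dev/null
case $? in
  0) echo "no drift" ;;
  2) echo "drift detected; review and apply" ;;
  *) echo "plan failed" ;;
esac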

The Practical Implementation

Use Atlantis for Terraform automation. When you open a PR that changes Terraform configs, Atlantis automatically runs terraform plan and comments on the PR with what will change. After review and merge, it runs terraform apply.

This means:

  • No one needs Terraform installed locally
  • No one needs AWS credentials with admin access
  • All changes are peer-reviewed
  • The audit trail is automatic
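
If you go the Atlantis route, the repo-level config stays small. A minimal atlantis.yaml sketch (the project names, directory, and apply requirement reflect the layout below and are assumptions, not anything Atlantis mandates):

# atlantis.yaml (repo root)
version: 3
projects:
  - name: staging
    dir: terraform
    workspace: staging
    autoplan:
      when_modified: ["**/*.tf", "environments/*.tfvars"]
  - name: production
    dir: terraform
    workspace: production
    apply_requirements: [approved]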

Structure your Terraform with modules and workspaces:

terraform/
├── modules/
│   ├── eks-cluster/
│   ├── rds-instance/
│   ├── s3-backend/
│   └── vpc-networking/
├── environments/
│   ├── variables-staging.tfvars
│   ├── variables-production.tfvars
│   └── backend.tf
└── main.tf

Modules encapsulate reusable infrastructure components. Define your EKS cluster configuration once, use it across all environments.
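
In main.tf that mostly means a handful of module blocks wired to workspace-aware names. A sketch (the input and output names are illustrative; use whatever your modules actually expose):

# main.tf (sketch)
module "vpc_networking" {
  source      = "./modules/vpc-networking"
  environment = terraform.workspace
}

module "eks_cluster" {
  source          = "./modules/eks-cluster"
  cluster_name    = "app-${terraform.workspace}"
  cluster_version = var.cluster_version
  vpc_id          = module.vpc_networking.vpc_id
  subnet_ids      = module.vpc_networking.private_subnet_ids
}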

Workspaces manage environment-specific state without code duplication. Each workspace maintains its own state file, isolating changes:

# Create and use workspace for staging
terraform workspace new staging
terraform apply -var-file=environments/variables-staging.tfvars

# Switch to production
terraform workspace select production
terraform apply -var-file=environments/variables-production.tfvars

The benefits:

  • Single codebase for all environments—no duplicate configuration files
  • State isolation per environment—changes in staging can’t accidentally affect production
  • Consistent deployments—same infrastructure definition with environment-specific variables
  • Easier testing—test infrastructure changes in staging workspace before applying to production

This pattern prevents the common anti-pattern of copy-pasting Terraform directories for each environment, which inevitably leads to configuration drift and maintenance nightmares.

For more details, see Terraform Workspaces Best Practices.

EKS Version Upgrades: The Right Way

Kubernetes ships a new minor version roughly every four months. Upgrades are mandatory: EKS standard support for each version ends about 14 months after release.

With Terraform managing your EKS cluster, upgrading looks like:

cluster_version = "1.33"  # was "1.32"

Terraform handles:

  • Control plane upgrade
  • Node group updates with rolling deployments
  • Addon compatibility updates
  • API version migrations
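
Inside the eks-cluster module, that one variable fans out to the control plane and the node groups. A rough sketch of the relevant resources (values and variable names are illustrative):

# modules/eks-cluster/main.tf (sketch)
resource "aws_eks_cluster" "this" {
  name     = var.cluster_name
  version  = var.cluster_version
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

resource "aws_eks_node_group" "default" {
  cluster_name    = aws_eks_cluster.this.name
  node_group_name = "default"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.subnet_ids
  version         = var.cluster_version  # nodes follow the control plane

  scaling_config {
    desired_size = 3
    min_size     = 2
    max_size     = 6
  }
}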

In 18 months, I upgraded through five EKS versions. Each upgrade was a single PR, tested in staging first, then applied to production. Zero manual steps.

The alternative? Clicking through AWS console, coordinating node pool updates manually, praying you didn’t miss something. Hard pass.

Kubernetes Production Patterns on EKS

Resource Management: Requests, Limits, and Right-Sizing

Every container gets resource requests and limits. Not as a best practice, but because one runaway pod will consume all cluster resources and take everything down.

resources:
  requests:
    memory: "128Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Why both requests and limits matter:

  • Requests determine scheduling—Kubernetes places pods on nodes with sufficient available resources
  • Limits prevent resource exhaustion—containers can’t exceed their limits, protecting other workloads

Every namespace should have ResourceQuotas and LimitRanges. This prevents:

  • A single team/service consuming all cluster resources
  • Pods being scheduled without resource specifications
  • Memory leaks taking down unrelated services
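
Both are ordinary namespaced objects. A starting point (the numbers are placeholders to tune per team):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi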

Right-sizing is critical for cost optimization. AWS EKS cost optimization best practices emphasize analyzing actual resource usage and adjusting requests/limits accordingly. Over-provisioning wastes money; under-provisioning causes performance issues and evictions.

Use the Vertical Pod Autoscaler (VPA) to get recommendations:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # Recommendation mode only

VPA will analyze actual usage and recommend optimal requests/limits. Start in recommendation mode, validate the numbers, then apply them to your deployments.
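
Reading the recommendations is a describe away (this assumes the VPA CRDs are installed and expose the vpa short name):

# Inspect what VPA thinks the workload actually needs
kubectl describe vpa api-vpa
# Under Status > Recommendation > Container Recommendations, look at
# Target (and Lower/Upper Bound) for cpu and memory, then fold those
# numbers back into the Deployment's requests and limits.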

Cluster Autoscaling: Karpenter vs Cluster Autoscaler

For node autoscaling, you have two primary options on EKS:

Cluster Autoscaler (traditional approach):

  • Works with Auto Scaling Groups
  • Requires pre-configured node groups for different instance types
  • Scaling decisions based on pending pods and node utilization
  • Tightly coupled to Kubernetes versions

Karpenter (modern approach):

  • Provisions nodes directly based on pod requirements
  • No need to pre-configure multiple node groups
  • Faster scaling—launches right-sized instances in seconds
  • Considers spot instances, pricing, availability zones automatically

I went with traditional managed node groups and Cluster Autoscaler. Worked fine. But if I were starting today? Karpenter, no question.

Why:

  1. Application-first provisioning: Karpenter looks at pending pod requirements (CPU, memory, architecture, zone constraints) and provisions instances that fit exactly. No need to maintain dozens of node groups for different workload types.

  2. Consolidation: Karpenter automatically consolidates workloads onto fewer nodes when utilization is low, terminating empty nodes. This reduces costs without manual intervention.

  3. Spot instance handling: Built-in spot instance support with automatic diversification across instance types and AZs. When spot instances are interrupted, Karpenter provisions replacements before termination.

Example Karpenter NodePool configuration:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h

This tells Karpenter: Use spot or on-demand instances, AMD64 architecture, compute/memory/general-purpose instance families, consolidate when underutilized, and replace nodes after 30 days.

Network Policies: Defense in Depth

Default Kubernetes networking is flat—every pod can talk to every other pod. This is terrible for security.

Network policies define allowed traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database

The API service can only receive traffic from the frontend and only send traffic to the database. Everything else is denied by default.

This limits blast radius when something gets compromised. A compromised frontend can’t directly access the database; a compromised API service can’t reach the internet to exfiltrate data.
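
One caveat: a NetworkPolicy only restricts the pods it selects, so the usual companion is a namespace-wide default deny that you then punch holes through (plus an explicit DNS egress rule, which the example above would otherwise block). A minimal default-deny sketch:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}   # selects every pod in the namespace
  policyTypes:
  - Ingress
  - Egress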

Health Checks That Matter

Don’t just check if the HTTP endpoint responds. Check if the service can actually do its job:

func healthCheck(w http.ResponseWriter, r *http.Request) {
    // Check database connection
    if err := db.Ping(); err != nil {
        http.Error(w, "Database unreachable", 503)
        return
    }
    
    // Check external API reachability
    if err := checkExternalAPI(); err != nil {
        http.Error(w, "External API unreachable", 503)
        return
    }
    
    w.WriteHeader(200)
}

Configure both liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Liveness determines if the container should be restarted. Readiness determines if the pod should receive traffic.

When readiness checks fail, Kubernetes stops routing traffic to that pod. This prevents cascading failures where traffic hits pods that can’t process requests.
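
To keep those two jobs separate in the service itself, point the cheap check at /healthz and the dependency checks at /ready. A sketch in Go, reusing the healthCheck handler above (port matches the probe config):

package main

import (
    "log"
    "net/http"
)

func main() {
    // Liveness: no dependency checks; only proves the process is responsive.
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
    })

    // Readiness: the dependency-checking handler from above gates traffic.
    http.HandleFunc("/ready", healthCheck)

    log.Fatal(http.ListenAndServe(":8080", nil))
}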

Observability: You Can’t Fix What You Can’t See

AWS observability best practices emphasize three pillars: metrics, logs, and traces.

The setup I went with was a combo of:

  • Grafana
  • Loki
  • Tempo
  • Mimir (with Prometheus)

That’s an essay of its own, though, so I won’t cover it here.

GitOps with ArgoCD

Git is the source of truth for what’s deployed. ArgoCD watches your Git repository and keeps the cluster in sync.

The workflow:

  1. Push changes to main → ArgoCD deploys to staging automatically
  2. Open PR to production branch → Review what will change
  3. Merge PR → ArgoCD deploys to production

The killer feature: Automatic rollbacks. When a deployment fails health checks, ArgoCD automatically reverts to the previous version. No manual intervention. No debug session at 2am. Just automatic recovery.
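
The per-environment wiring is a single Application object pointing at a path in the repo. A minimal sketch with automated sync (the repo URL and paths are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config  # placeholder
    targetRevision: main
    path: k8s/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true  # manual drift gets reverted to what Git says
    retry:
      limit: 3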

This requires discipline:

  • All changes go through Git
  • No manual kubectl apply in production
  • Health checks must be comprehensive

But the payoff is massive. Production deployments become boring. Which is exactly what you want.

Personally, this was the first time I completely owned, maintained, and leveraged the benefits of ArgoCD. And it can be a ton of fun when set up properly (without too much damn complexity).

Bazel for Monorepo Management

Bazel is overkill for small projects. For anything with multiple services, shared libraries, and complex dependencies, it’s the only sane choice.

The Problem Bazel Solves

You have:

  • 20 microservices in Go
  • Frontend in TypeScript (Next.js)
  • Mobile apps in Swift & Kotlin
  • Dozens of shared libraries
  • Integration tests spanning multiple services
  • Unit tests for each service

Without Bazel, your options:

  1. Multiple repositories → Coordinating releases is hell
  2. Single repository with separate build tools → Dependency hell
  3. Manual dependency tracking → Good luck

With Bazel:

  • One build system for everything
  • Hermetic builds (reproducible everywhere)
  • Incremental builds (only rebuild what changed)
  • Dependency graph awareness (know exactly what breaks when you change something)

Practical Bazel Usage

The key insight: Bazel builds are deterministic. Same inputs → Same outputs. Always.

This means:

  • If it builds locally, it builds in CI
  • Test results are cacheable (same code + same tests = same result)
  • Remote caching works (share build artifacts across developers and CI)

When you change a shared library, bazel build //... shows you exactly which services fail to build. Fix those, run bazel test //..., and you know with certainty that everything still works.
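
The dependency graph is also directly queryable, which is useful before you even build. For example (the library label is hypothetical):

# Which targets depend on the shared auth library?
bazel query "rdeps(//..., //libs/auth:auth)"

# Then confirm nothing broke
bazel build //...
bazel test //...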

Bzlmod: The New Dependency System

Bazel recently introduced Bzlmod, replacing the old WORKSPACE approach. It’s cleaner:

# MODULE.bazel
bazel_dep(name = "rules_go", version = "0.42.0")
bazel_dep(name = "gazelle", version = "0.35.0")
bazel_dep(name = "protobuf", version = "24.4")

Dependencies are explicit. Version resolution is automatic. Conflicts are detected early.

For a monorepo with multiple languages and dependencies everywhere, this changes everything. Upgrade a dependency once, Bazel shows you every place that breaks. No hunting. No guessing.

Why?

Bazel was partly a selfish decision. At Peer, Inc. we were stuck on Bazel 5.4, then 6.3 for what felt like forever. Tech debt kept us from migrating to 7, let alone 8.

So when I was in control? Bleeding edge it is. Went straight to the latest version and stayed there. Did it pay off? Yeah — I got deep expertise with modern Bazel that most teams haven’t touched yet.

Next.js with React Server Components

RSC in Next.js 14+ is actually good. Not just “new shiny framework feature” good, but legitimately better architecture.

Why RSC Matters

Traditional React: Send JavaScript to client → Client fetches data → Client renders

With RSC: Server fetches data → Server renders HTML → Stream to client → Client hydrates only interactive parts

The benefits:

  1. Faster initial render - Users see content immediately
  2. Less JavaScript shipped - Only interactive components need client-side JS
  3. Better SEO - Search engines see fully rendered HTML
  4. Simplified data fetching - No useState/useEffect dance for data

Practical RSC Patterns

Server Components (default):

// app/dashboard/page.tsx
export default async function DashboardPage() {
  const data = await fetchDashboardData(); // Runs on server
  
  return (
    <div>
      <StaticContent data={data} />
      <InteractiveWidget data={data} /> {/* This is a Client Component */}
    </div>
  );
}

Client Components (only when needed):

// components/InteractiveWidget.tsx
'use client';

import { useState } from 'react';

export function InteractiveWidget({ data }) {
  const [state, setState] = useState(data);
  // Interactive logic here, e.g. handlers that call setState
  return <div>{JSON.stringify(state)}</div>;
}

The rule: Server Components by default. Client Components only when you need interactivity.

Streaming SSR

Next.js supports streaming—send parts of the page as they’re ready:

import { Suspense } from 'react';

export default async function Page() {
  return (
    <div>
      <Header /> {/* Renders immediately */}
      <Suspense fallback={<Spinner />}>
        <SlowComponent /> {/* Streams in when ready */}
      </Suspense>
    </div>
  );
}

Users see the header instantly while the slow component loads. The page feels fast even when parts are slow.

gRPC for Service Communication

REST is simpler. gRPC is better for service-to-service communication. And, again, partly selfish reasons: I wanted to do property testing against a gRPC-based primary service.

Type Safety Across the Network

Define your service contract once:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
service PatientService {
  rpc GetPatient(PatientRequest) returns (PatientResponse);
  rpc StreamUpdates(stream UpdateRequest) returns (stream UpdateResponse);
}

message PatientRequest {
  string patient_id = 1;
}

message PatientResponse {
  string patient_id = 1;
  string name = 2;
  repeated Appointment appointments = 3;
}

Generate clients for every language:

  • TypeScript client for frontend
  • Go client for backend services
  • Swift client for iOS
    • Here I leveraged the new grpc-swift v2.x; migration from v1 to v2 was pure f*in insanity…

Change the proto definition? The compiler tells you everywhere that breaks. No runtime surprises.
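
Code generation itself is one command per language. Outside Bazel the Go side looks roughly like this (inside Bazel the equivalent is rules_go’s go_proto_library); the proto path is illustrative, and protoc-gen-go plus protoc-gen-go-grpc must be on your PATH:

protoc \
  --go_out=. --go_opt=paths=source_relative \
  --go-grpc_out=. --go-grpc_opt=paths=source_relative \
  proto/patient_service.proto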

Streaming for Real-Time Features

gRPC supports four types of streaming:

  1. Unary (request/response)
  2. Server streaming (one request, multiple responses)
  3. Client streaming (multiple requests, one response)
  4. Bidirectional streaming (both send multiple messages)

For real-time updates, bidirectional streaming is powerful:

func (s *PatientService) StreamUpdates(stream pb.PatientService_StreamUpdatesServer) error {
    for {
        update, err := stream.Recv()
        if err == io.EOF {
            return nil // client closed the stream cleanly; not an error
        }
        if err != nil {
            return err
        }
        
        // Process update
        response := processUpdate(update)
        
        if err := stream.Send(response); err != nil {
            return err
        }
    }
}

Clients subscribe once and receive updates as they happen. No polling. Built-in backpressure handling. Clean and efficient.
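
The client side mirrors it: open the stream once, run a receive loop, and send whenever something changes. A Go sketch against the generated client (UpdateRequest fields are left empty because the message body isn’t shown above):

// Assumes a dialed *grpc.ClientConn and the generated pb package.
func watchUpdates(ctx context.Context, conn *grpc.ClientConn) error {
    client := pb.NewPatientServiceClient(conn)
    stream, err := client.StreamUpdates(ctx)
    if err != nil {
        return err
    }

    // Receive pushes from the server in the background.
    go func() {
        for {
            resp, err := stream.Recv()
            if err != nil {
                log.Printf("stream closed: %v", err)
                return
            }
            log.Printf("got update: %v", resp)
        }
    }()

    // Send updates as they happen; no polling.
    return stream.Send(&pb.UpdateRequest{})
}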

HTTP/2 Benefits

gRPC runs on HTTP/2, which means:

  • Multiplexing (multiple streams on one connection)
  • Header compression
  • Binary protocol (smaller than JSON)
  • Lower latency than REST over HTTP/1.1

For internal service communication, these benefits add up.

iOS Development: SwiftUI and Offline-First

SwiftUI is the right choice for new iOS apps in 2024+. UIKit is still around, but why would you do that to yourself? Declarative UI is just… better.

Declarative UI

struct PatientListView: View {
    @State private var patients: [Patient] = []
    
    var body: some View {
        List(patients) { patient in
            NavigationLink(destination: PatientDetailView(patient: patient)) {
                PatientRow(patient: patient)
            }
        }
        .onAppear {
            loadPatients()
        }
    }
}

The UI is a function of state. Change state, UI updates automatically. No manual view controller coordination.

Offline-First Architecture

The iOS app had to work without network connectivity. This forced good architectural decisions:

Local-first data storage: Core Data as the source of truth. Network is for sync only.

Background sync: NSURLSession background tasks handle sync when network is available.

Conflict resolution: Last-write-wins with vector clocks for conflict detection.
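
The resolution logic itself stays small. A sketch of the idea in Swift (types are illustrative, not the actual Core Data model):

import Foundation

// Last-write-wins with vector clocks for conflict detection (sketch).
struct SyncRecord {
    let id: UUID
    let updatedAt: Date
    let clock: [String: Int]  // deviceID -> update counter
}

func resolve(local: SyncRecord, remote: SyncRecord) -> SyncRecord {
    // If one clock dominates the other, there is no real conflict.
    let localDominates = remote.clock.allSatisfy { local.clock[$0.key, default: 0] >= $0.value }
    let remoteDominates = local.clock.allSatisfy { remote.clock[$0.key, default: 0] >= $0.value }
    if localDominates && !remoteDominates { return local }
    if remoteDominates && !localDominates { return remote }
    // Concurrent updates: fall back to last write wins.
    return local.updatedAt >= remote.updatedAt ? local : remote
}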

This architecture made the app resilient. Spotty WiFi? App still works. Server down? App still works. Network comes back? Automatic sync.

The pattern applies beyond iOS: Build like the network is unreliable, because it is.

How does Bazel play into this?

Pretty straightforward: wrote a gomobile package that wrapped gRPC for mobile->backend communication. Local data storage, security, the works. Both native apps just import the same package.
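
Outside the Bazel wiring, the manual equivalent is one gomobile bind per platform, producing a framework for Swift and an AAR for Kotlin. Roughly (paths and names are illustrative; double-check the flags against gomobile help bind):

gomobile bind -target=ios -o build/MobileBridge.xcframework ./mobile/bridge
gomobile bind -target=android -o build/mobilebridge.aar ./mobile/bridge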

Why? If the startup actually succeeded, I’d have one place to maintain instead of duplicating logic across Swift and Kotlin. Plus I could guarantee both platforms behaved identically. Spoiler: didn’t matter since we never got there, but the pattern was solid.


Patterns That Actually Matter

Technologies change. Patterns stick around longer:

1. IaC

Manual changes don’t scale. You will forget how you configured things. Automate everything.

2. Automated Deployments with Rollbacks

Humans are bad at deploying safely. Automate deployments. Make rollbacks automatic.

3. Type Safety Across Boundaries

Whether it’s gRPC, GraphQL, or TypeScript across frontend/backend—type-checked contracts prevent entire classes of bugs.

4. Hermetic Builds

If your build depends on “whatever’s on the build machine,” you have a time bomb. Make builds reproducible.

5. Comprehensive Testing in CI

If running tests is hard, you won’t run them. Make testing frictionless: one command, runs everything, fails fast.

6. Observability from Day One

Don’t add logging and monitoring later. Instrument from the start. You can’t debug what you can’t see.

These patterns apply whether you’re using my stack or something completely different.

What to Copy

If you’re starting a new project:

Definitely do:

  • IaC (Terraform + Atlantis)
  • GitOps with automatic rollbacks (ArgoCD or similar)
  • Comprehensive CI/CD from day one
  • Type-safe contracts between services

Consider carefully:

  • Kubernetes (adds complexity, worth it at scale)
  • Bazel (incredible for monorepos, steep learning curve)
  • gRPC (better than REST for service-to-service, more complex to debug)
  • Next.js with RSC (great for web apps, requires understanding new mental model)

Probably skip:

  • Building native mobile apps unless you need platform-specific features
    • React Native/Flutter get you 90% there
    • Or even just going mobile-first web app
    • If I could go back, personally I wouldn’t have gone native
  • Microservices until you actually need them (start with a monolith)

The Transferable Skill

The most valuable thing I learned isn’t any specific technology. It’s understanding how everything fits together:

  • How code actually moves from laptop to production
  • How to change infrastructure without breaking things
  • How distributed services coordinate (and fail)
  • How to debug when the problem could be anywhere
  • How to architect systems that don’t fall over

You only learn this by building something end-to-end and keeping it running. The business outcome is irrelevant for this kind of education. The technical foundation you build? That’s what stays with you.

That’s what transfers.


Next in series: SF: The AI Awakening — When to use AI as a tool and when to think for yourself. The distinction matters more than you think.