This is part of the series Startup that Failed.
Infrastructure as Code: Why It’s Non-Negotiable
Start with this rule: If you’re touching the AWS console to create resources, you’re doing it wrong.
Every resource goes through Terraform. IAM roles, security groups, RDS instances, S3 buckets, EKS clusters — everything. No exceptions.
This isn’t about following best practices. It’s about three concrete problems:
1. Reproducibility. You will need to rebuild your infrastructure. Whether it’s disaster recovery, creating new environments, or (in my case) rebranding, you’ll need to recreate everything. With Terraform, it’s `terraform apply`. Without it, you’re clicking through the AWS console trying to remember what you configured six months ago.
2. Change tracking. Infrastructure changes should go through code review like application code. Someone shouldn’t be able to open a security group to 0.0.0.0/0 in production without anyone noticing. With Terraform in Git, every change is reviewed, documented, and traceable.
3. Drift detection. Terraform knows what should exist. When someone manually changes something (and they will), Terraform detects the drift and restores the intended state.
The Practical Implementation
Use Atlantis for Terraform automation. When you open a PR that changes Terraform configs, Atlantis automatically runs `terraform plan` and comments on the PR with what will change. After review and merge, it runs `terraform apply`.
This means:
- No one needs Terraform installed locally
- No one needs AWS credentials with admin access
- All changes are peer-reviewed
- The audit trail is automatic
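Wiring this up is mostly one file. A minimal `atlantis.yaml` at the repo root might look like this (the project name and directory are examples, not the actual repo):

```yaml
version: 3
automerge: false
projects:
  - name: infra
    dir: terraform
    workspace: default
    autoplan:
      when_modified: ["*.tf", "*.tfvars"]
    apply_requirements: [approved, mergeable]  # apply only after review and a green merge check
```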
Structure your Terraform with modules and workspaces.
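As a sketch (module paths, names, and instance sizes here are assumptions, not the actual repo):

```hcl
# main.tf: one configuration shared by every environment
locals {
  env = terraform.workspace # "staging" or "production"

  db_instance_class = {
    staging    = "db.t3.medium"
    production = "db.r6g.large"
  }
}

# Reusable module, defined once under ./modules/eks (hypothetical path)
module "eks" {
  source       = "./modules/eks"
  cluster_name = "app-${local.env}"
}

resource "aws_db_instance" "main" {
  identifier     = "app-${local.env}"
  instance_class = local.db_instance_class[local.env]
  # ...engine, storage, and networking omitted
}
```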
Modules encapsulate reusable infrastructure components. Define your EKS cluster configuration once, use it across all environments.
Workspaces manage environment-specific state without code duplication. Each workspace maintains its own state file, isolating changes.
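The day-to-day commands are standard Terraform CLI (the workspace names are examples):

```shell
terraform workspace new staging        # creates a fresh, empty state file
terraform workspace new production

terraform workspace select staging
terraform plan                         # plans against staging state only
```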
The benefits:
- Single codebase for all environments—no duplicate configuration files
- State isolation per environment—changes in staging can’t accidentally affect production
- Consistent deployments—same infrastructure definition with environment-specific variables
- Easier testing—test infrastructure changes in staging workspace before applying to production
This pattern prevents the common anti-pattern of copy-pasting Terraform directories for each environment, which inevitably leads to configuration drift and maintenance nightmares.
For more details, see Terraform Workspaces Best Practices.
EKS Version Upgrades: The Right Way
Kubernetes versions change every 3-4 months. Upgrades are mandatory—AWS stops supporting old versions after 14 months.
With Terraform managing your EKS cluster, an upgrade is a version bump in one place.
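Roughly like this (the module interface is an assumption; exact variable names depend on your module):

```hcl
module "eks" {
  source          = "./modules/eks"
  cluster_version = "1.29" # was "1.28": this line is the whole upgrade PR
  # node groups and managed addons pick up the new version and roll
}
```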
Terraform handles:
- Control plane upgrade
- Node group updates with rolling deployments
- Addon compatibility updates
- API version migrations
In 18 months, I upgraded through five EKS versions. Each upgrade was a single PR, tested in staging first, then applied to production. Zero manual steps.
The alternative? Clicking through AWS console, coordinating node pool updates manually, praying you didn’t miss something. Hard pass.
Kubernetes Production Patterns on EKS
Resource Management: Requests, Limits, and Right-Sizing
Every container gets resource requests and limits. Not as a best practice, but because one runaway pod will consume all cluster resources and take everything down.
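A typical container spec, as a sketch (the numbers are illustrative, not recommendations):

```yaml
resources:
  requests:
    cpu: 250m      # the scheduler reserves this much on a node
    memory: 256Mi
  limits:
    cpu: "1"       # CPU is throttled above this
    memory: 512Mi  # the container is OOM-killed above this
```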
Why both requests and limits matter:
- Requests determine scheduling—Kubernetes places pods on nodes with sufficient available resources
- Limits prevent resource exhaustion—containers can’t exceed their limits, protecting other workloads
Every namespace should have ResourceQuotas and LimitRanges. This prevents:
- A single team/service consuming all cluster resources
- Pods being scheduled without resource specifications
- Memory leaks taking down unrelated services
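A per-namespace guardrail sketch (the namespace and quota numbers are illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi } # applied when a pod omits requests
      default:        { cpu: 500m, memory: 512Mi } # applied when a pod omits limits
```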
Right-sizing is critical for cost optimization. AWS EKS cost optimization best practices emphasize analyzing actual resource usage and adjusting requests/limits accordingly. Over-provisioning wastes money; under-provisioning causes performance issues and evictions.
Use the Vertical Pod Autoscaler (VPA) to get recommendations.
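A minimal manifest in recommendation mode (the target names are hypothetical):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # only recommend; never evict or resize pods
```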
VPA will analyze actual usage and recommend optimal requests/limits. Start in recommendation mode, validate the numbers, then apply them to your deployments.
Cluster Autoscaling: Karpenter vs Cluster Autoscaler
For node autoscaling, you have two primary options on EKS:
Cluster Autoscaler (traditional approach):
- Works with Auto Scaling Groups
- Requires pre-configured node groups for different instance types
- Scaling decisions based on pending pods and node utilization
- Tightly coupled to Kubernetes versions
Karpenter (modern approach):
- Provisions nodes directly based on pod requirements
- No need to pre-configure multiple node groups
- Faster scaling—launches right-sized instances in seconds
- Considers spot instances, pricing, availability zones automatically
I went with traditional managed node groups and Cluster Autoscaler. Worked fine. But if I were starting today? Karpenter, no question.
Why:
Application-first provisioning: Karpenter looks at pending pod requirements (CPU, memory, architecture, zone constraints) and provisions instances that fit exactly. No need to maintain dozens of node groups for different workload types.
Consolidation: Karpenter automatically consolidates workloads onto fewer nodes when utilization is low, terminating empty nodes. This reduces costs without manual intervention.
Spot instance handling: Built-in spot instance support with automatic diversification across instance types and AZs. When spot instances are interrupted, Karpenter provisions replacements before termination.
A Karpenter NodePool tells it what to provision: spot or on-demand instances, AMD64 architecture, compute/memory/general-purpose instance families, consolidate when underutilized, and replace nodes after 30 days.
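Sketched against the v1 NodePool API (resource names are arbitrary; double-check the fields against the Karpenter docs for your version):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general purpose, memory optimized
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h             # replace nodes after 30 days
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```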
Network Policies: Defense in Depth
Default Kubernetes networking is flat—every pod can talk to every other pod. This is terrible for security.
Network policies define which traffic is allowed.
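A sketch of such a policy (namespace, labels, and ports are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-policy
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: database
      ports:
        - port: 5432
```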
The API service can only receive traffic from the frontend and only send traffic to the database. Everything else is denied by default.
This limits blast radius when something gets compromised. A compromised frontend can’t directly access the database; a compromised API service can’t reach the internet to exfiltrate data.
Health Checks That Matter
Don’t just check whether the HTTP endpoint responds. Check whether the service can actually do its job.
Configure both liveness and readiness probes.
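For example (paths and timings are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz         # cheap "process is alive" check
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3      # restart after ~45s of failures
readinessProbe:
  httpGet:
    path: /ready           # the dependency-aware check
    port: 8080
  periodSeconds: 5
  failureThreshold: 2      # stop routing traffic quickly
```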
Liveness determines if the container should be restarted. Readiness determines if the pod should receive traffic.
When readiness checks fail, Kubernetes stops routing traffic to that pod. This prevents cascading failures where traffic hits pods that can’t process requests.
Observability: You Can’t Fix What You Can’t See
AWS observability best practices emphasize three pillars: metrics, logs, and traces.
The setup I went with was a combo of:
- Grafana
- Loki
- Tempo
- Mimir (with Prometheus)
That setup is an essay of its own, so I won’t cover it here.
GitOps with ArgoCD
Git is the source of truth for what’s deployed. ArgoCD watches your Git repository and keeps the cluster in sync.
The workflow:
- Push changes to `main` → ArgoCD deploys to staging automatically
- Open a PR to the `production` branch → Review what will change
- Merge the PR → ArgoCD deploys to production
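An ArgoCD Application wiring the staging half of this up might look like the following sketch (repo URL, paths, and names are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deployments
    targetRevision: main          # staging tracks main
    path: apps/api
  destination:
    server: https://kubernetes.default.svc
    namespace: api
  syncPolicy:
    automated:
      prune: true
      selfHeal: true              # revert manual drift automatically
```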
The killer feature: Automatic rollbacks. When a deployment fails health checks, ArgoCD automatically reverts to the previous version. No manual intervention. No debug session at 2am. Just automatic recovery.
This requires discipline:
- All changes go through Git
- No manual `kubectl apply` in production
- Health checks must be comprehensive
But the payoff is massive. Production deployments become boring. Which is exactly what you want.
Personally, this was the first time I completely owned and maintained ArgoCD and leveraged its benefits. It can be a ton of fun when set up properly (without too much damn complexity).
Bazel for Monorepo Management
Bazel is overkill for small projects. For anything with multiple services, shared libraries, and complex dependencies, it’s the only sane choice.
The Problem Bazel Solves
You have:
- 20 microservices in Go
- Frontend in TypeScript (Next.js)
- Mobile apps in Swift & Kotlin
- Dozens of shared libraries
- Integration tests spanning multiple services
- Unit tests for each service
Without Bazel, your options:
- Multiple repositories → Coordinating releases is hell
- Single repository with separate build tools → Dependency hell
- Manual dependency tracking → Good luck
With Bazel:
- One build system for everything
- Hermetic builds (reproducible everywhere)
- Incremental builds (only rebuild what changed)
- Dependency graph awareness (know exactly what breaks when you change something)
Practical Bazel Usage
The key insight: Bazel builds are deterministic. Same inputs → Same outputs. Always.
This means:
- If it builds locally, it builds in CI
- Test results are cacheable (same code + same tests = same result)
- Remote caching works (share build artifacts across developers and CI)
When you change a shared library, `bazel build //...` shows you exactly which services fail to build. Fix those, run `bazel test //...`, and you know with certainty that everything still works.
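You can also see the blast radius before building anything: `bazel query` lists reverse dependencies (the library label here is hypothetical):

```shell
# every target that transitively depends on the shared library
bazel query 'rdeps(//..., //libs/shared:shared)'

# then build and test everything affected
bazel build //...
bazel test //...
```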
Bzlmod: The New Dependency System
Bazel recently introduced Bzlmod, replacing the old WORKSPACE approach. It’s cleaner.
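A sketch of a `MODULE.bazel` for a stack like this (module names come from the Bazel Central Registry; the versions here are illustrative):

```starlark
# MODULE.bazel
module(name = "my_monorepo", version = "0.1.0")

bazel_dep(name = "rules_go", version = "0.50.1")
bazel_dep(name = "gazelle", version = "0.39.1")
bazel_dep(name = "aspect_rules_ts", version = "3.2.1")

# Language toolchains are configured through module extensions
go_sdk = use_extension("@rules_go//go:extensions.bzl", "go_sdk")
go_sdk.download(version = "1.22.5")
```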
Dependencies are explicit. Version resolution is automatic. Conflicts are detected early.
For a monorepo with multiple languages and dependencies everywhere, this changes everything. Upgrade a dependency once, Bazel shows you every place that breaks. No hunting. No guessing.
Why?
Bazel was partly a selfish decision. At Peer, Inc. we were stuck on Bazel 5.4, then 6.3 for what felt like forever. Tech debt kept us from migrating to 7, let alone 8.
So when I was in control? Bleeding edge it is. Went straight to the latest version and stayed there. Did it pay off? Yeah — I got deep expertise with modern Bazel that most teams haven’t touched yet.
Next.js with React Server Components
RSC in Next.js 14+ is actually good. Not just “new shiny framework feature” good, but legitimately better architecture.
Why RSC Matters
Traditional React: Send JavaScript to client → Client fetches data → Client renders
With RSC: Server fetches data → Server renders HTML → Stream to client → Client hydrates only interactive parts
The benefits:
- Faster initial render - Users see content immediately
- Less JavaScript shipped - Only interactive components need client-side JS
- Better SEO - Search engines see fully rendered HTML
- Simplified data fetching - No useState/useEffect dance for data
Practical RSC Patterns
Server Components are the default. They run only on the server, can fetch data directly, and ship no JavaScript to the browser.
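A sketch (route, endpoint, and types are made up):

```tsx
// app/projects/page.tsx — a Server Component; runs only on the server
export default async function ProjectsPage() {
  // Direct data access: no useState, no useEffect, no client-side fetch
  const res = await fetch("https://api.example.com/projects", {
    next: { revalidate: 60 }, // Next.js cache hint
  });
  const projects: { id: string; name: string }[] = await res.json();

  return (
    <ul>
      {projects.map((p) => (
        <li key={p.id}>{p.name}</li>
      ))}
    </ul>
  );
}
```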
Client Components are opted in with the `"use client"` directive, only where you need interactivity (state, event handlers, browser APIs).
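A sketch of a Client Component (the component is made up):

```tsx
// components/counter.tsx — needs state, so it must be a Client Component
"use client";

import { useState } from "react";

export function Counter() {
  const [count, setCount] = useState(0);
  return (
    <button onClick={() => setCount(count + 1)}>Clicked {count} times</button>
  );
}
```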
The rule: Server Components by default. Client Components only when you need interactivity.
Streaming SSR
Next.js supports streaming: send parts of the page as they’re ready.
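A sketch using React Suspense boundaries (the imported components are assumed to exist):

```tsx
// app/dashboard/page.tsx — Suspense boundaries enable streaming
import { Suspense } from "react";

// Assumed components: Header renders instantly; SlowStats awaits a slow query
import { Header, SlowStats, StatsSkeleton } from "./components";

export default function Dashboard() {
  return (
    <>
      <Header />
      <Suspense fallback={<StatsSkeleton />}>
        <SlowStats /> {/* streamed in when its data resolves */}
      </Suspense>
    </>
  );
}
```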
Users see the header instantly while the slow component loads. The page feels fast even when parts are slow.
gRPC for Service Communication
REST is simpler. gRPC is better for service-to-service communication. And, again, selfish reasons: I wanted to do property testing against a gRPC-based primary service.
Type Safety Across the Network
Define your service contract once.
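For example (an illustrative contract, not the real one):

```protobuf
// user.proto — one definition, clients generated for every language
syntax = "proto3";

package api.v1;

service UserService {
  rpc GetUser(GetUserRequest) returns (User);
}

message GetUserRequest {
  string user_id = 1;
}

message User {
  string user_id = 1;
  string email = 2;
  int64 created_at = 3; // unix seconds
}
```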
Generate clients for every language:
- TypeScript client for frontend
- Go client for backend services
- Swift client for iOS
- Here I leveraged the new grpc-swift v2.x; migration from v1 to v2 was pure f*in insanity…
Change the proto definition? The compiler tells you everywhere that breaks. No runtime surprises.
Streaming for Real-Time Features
gRPC supports four types of streaming:
- Unary (request/response)
- Server streaming (one request, multiple responses)
- Client streaming (multiple requests, one response)
- Bidirectional streaming (both send multiple messages)
For real-time updates, bidirectional streaming is powerful.
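Sketched in proto (service and message names are made up):

```protobuf
// Both sides hold one stream open: the client sends events,
// the server pushes updates whenever they happen.
service UpdateService {
  rpc Sync(stream ClientEvent) returns (stream ServerUpdate);
}

message ClientEvent {
  string subscription_id = 1;
}

message ServerUpdate {
  string payload = 1;
}
```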
Clients subscribe once and receive updates as they happen. No polling. Built-in backpressure handling. Clean and efficient.
HTTP/2 Benefits
gRPC runs on HTTP/2, which means:
- Multiplexing (multiple streams on one connection)
- Header compression
- Binary protocol (smaller than JSON)
- Lower latency than REST over HTTP/1.1
For internal service communication, these benefits add up.
iOS Development: SwiftUI and Offline-First
SwiftUI is the right choice for new iOS apps in 2024+. UIKit is still around, but why would you do that to yourself? Declarative UI is just… better.
Declarative UI
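A minimal sketch of the declarative style (the view is made up):

```swift
import SwiftUI

struct CounterView: View {
    // State drives the UI; no manual view updates
    @State private var count = 0

    var body: some View {
        VStack(spacing: 12) {
            Text("Tapped \(count) times")
            Button("Tap me") { count += 1 }
        }
    }
}
```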
The UI is a function of state. Change state, UI updates automatically. No manual view controller coordination.
Offline-First Architecture
The iOS app had to work without network connectivity. This forced good architectural decisions:
Local-first data storage: Core Data as the source of truth. Network is for sync only.
Background sync: NSURLSession background tasks handle sync when network is available.
Conflict resolution: Last-write-wins with vector clocks for conflict detection.
This architecture made the app resilient. Spotty WiFi? App still works. Server down? App still works. Network comes back? Automatic sync.
The pattern applies beyond iOS: Build like the network is unreliable, because it is.
How does Bazel play into this?
Pretty straightforward: I wrote a `gomobile` package that wrapped gRPC for mobile-to-backend communication. Local data storage, security, the works. Both native apps just import the same package.
Why? If the startup actually succeeded, I’d have one place to maintain instead of duplicating logic across Swift and Kotlin. Plus I could guarantee both platforms behaved identically. Spoiler: didn’t matter since we never got there, but the pattern was solid.
Patterns That Actually Matter
Technologies change. Patterns stick around longer:
1. IaC
Manual changes don’t scale. You will forget how you configured things. Automate everything.
2. Automated Deployments with Rollbacks
Humans are bad at deploying safely. Automate deployments. Make rollbacks automatic.
3. Type Safety Across Boundaries
Whether it’s gRPC, GraphQL, or TypeScript across frontend/backend—type-checked contracts prevent entire classes of bugs.
4. Hermetic Builds
If your build depends on “whatever’s on the build machine,” you have a time bomb. Make builds reproducible.
5. Comprehensive Testing in CI
If running tests is hard, you won’t run them. Make testing frictionless: one command, runs everything, fails fast.
6. Observability from Day One
Don’t add logging and monitoring later. Instrument from the start. You can’t debug what you can’t see.
These patterns apply whether you’re using my stack or something completely different.
What to Copy
If you’re starting a new project:
Definitely do:
- IaC (Terraform + Atlantis)
- GitOps with automatic rollbacks (ArgoCD or similar)
- Comprehensive CI/CD from day one
- Type-safe contracts between services
Consider carefully:
- Kubernetes (adds complexity, worth it at scale)
- Bazel (incredible for monorepos, steep learning curve)
- gRPC (better than REST for service-to-service, more complex to debug)
- Next.js with RSC (great for web apps, requires understanding new mental model)
Probably skip:
- Building native mobile apps unless you need platform-specific features
- React Native/Flutter get you 90% there
- Or even just going mobile-first web app
- If I could go back, personally I wouldn’t have gone native
- Microservices until you actually need them (start with a monolith)
The Transferable Skill
The most valuable thing I learned isn’t any specific technology. It’s understanding how everything fits together:
- How code actually moves from laptop to production
- How to change infrastructure without breaking things
- How distributed services coordinate (and fail)
- How to debug when the problem could be anywhere
- How to architect systems that don’t fall over
You only learn this by building something end-to-end and keeping it running. The business outcome is irrelevant for this kind of education. The technical foundation you build? That’s what stays with you.
That’s what transfers.
Next in series: SF: The AI Awakening — When to use AI as a tool and when to think for yourself. The distinction matters more than you think.