Microservices Architecture on GCP: Lessons Learned
After spending months building and deploying a production microservices platform for pharmacy operations on GCP, I want to share the architecture decisions, challenges, and lessons learned along the way.
The Challenge
We needed to build a scalable platform that could handle real-time inventory management, process prescriptions with sub-second latency, scale to support 100+ concurrent users, and maintain 99.9% uptime.
Architecture Overview
Our platform consists of 6 main microservices: API Gateway, Authentication, Inventory, Prescription Processing, Notifications, and Analytics. All services communicate through Google Pub/Sub for event-driven architecture.
Key Architecture Decisions
Service Communication Pattern
Decision: Event-driven architecture using Google Pub/Sub
Why: Decouples services, enables async processing, provides natural retry mechanism, and makes it easy to add new subscribers.
```python
import json
import os

from google.cloud import pubsub_v1

# In our deployments the project ID comes from the environment;
# Cloud Run sets GOOGLE_CLOUD_PROJECT automatically.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

class EventBus:
    def __init__(self):
        self.publisher = pubsub_v1.PublisherClient()

    def publish_event(self, topic: str, event_data: dict):
        topic_path = self.publisher.topic_path(PROJECT_ID, topic)
        message_json = json.dumps(event_data).encode("utf-8")
        future = self.publisher.publish(topic_path, message_json)
        # Block until the message is acknowledged; returns the message ID.
        return future.result()
```
Data Management Strategy
Each service owns its data with database per service pattern. We use PostgreSQL for transactional data, Redis for caching, and implement event sourcing for critical audit trails.
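To make the event-sourcing idea concrete, here is a minimal sketch of an append-only event store for an audit trail. It uses SQLite for self-containment (production would use PostgreSQL), and the table, column, and event names are illustrative, not our actual schema:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical append-only event store; events are only ever inserted,
# never updated or deleted, so the table doubles as an immutable audit log.
class EventStore:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS events (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   aggregate_id TEXT NOT NULL,
                   event_type TEXT NOT NULL,
                   payload TEXT NOT NULL,
                   created_at TEXT NOT NULL
               )"""
        )

    def append(self, aggregate_id: str, event_type: str, payload: dict):
        self.conn.execute(
            "INSERT INTO events (aggregate_id, event_type, payload, created_at)"
            " VALUES (?, ?, ?, ?)",
            (aggregate_id, event_type, json.dumps(payload),
             datetime.now(timezone.utc).isoformat()),
        )
        self.conn.commit()

    def history(self, aggregate_id: str):
        # Replaying events in insertion order reconstructs the full history.
        rows = self.conn.execute(
            "SELECT event_type, payload FROM events"
            " WHERE aggregate_id = ? ORDER BY id",
            (aggregate_id,),
        )
        return [(t, json.loads(p)) for t, p in rows]

store = EventStore()
store.append("rx-123", "PrescriptionReceived", {"drug": "amoxicillin"})
store.append("rx-123", "PrescriptionFilled", {"pharmacist": "jdoe"})
history = store.history("rx-123")
```

Because events are immutable, auditors can replay exactly what happened to any prescription, in order.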
Deployment Strategy
We chose Google Cloud Run for automatic scaling, pay-per-use pricing, built-in TLS termination, and simple deployment.
```yaml
name: Deploy to Cloud Run

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and Push
        run: |
          gcloud builds submit --tag gcr.io/$PROJECT_ID/$SERVICE_NAME
      - name: Deploy
        run: |
          gcloud run deploy $SERVICE_NAME \
            --image gcr.io/$PROJECT_ID/$SERVICE_NAME \
            --platform managed \
            --region us-central1
```
API Gateway Pattern
FastAPI-based API Gateway handles request routing, authentication/authorization, rate limiting, and request/response transformation.
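One gateway responsibility worth illustrating is rate limiting. The sketch below is a standalone token-bucket limiter, not our actual FastAPI middleware; the rate and capacity values are arbitrary:

```python
import time

# Token bucket: tokens refill at a steady rate, and each request spends one.
# A burst up to `capacity` is allowed; beyond that, requests are rejected
# until the bucket refills.
class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(6)]
# The first five requests fit in the burst; the sixth is rejected
# because the bucket has not had time to refill.
```

In the real gateway this check runs per client (keyed by API token or IP) before the request is routed downstream.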
Performance Optimizations
Caching Strategy
We implemented multi-layer caching using Redis with TTL-based invalidation, significantly reducing database load and improving response times.
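As a sketch of the TTL-invalidation idea, here is a minimal in-process cache; the production system backs this with Redis, and the key names and TTL below are illustrative:

```python
import time

# In-process cache with lazy TTL-based invalidation: entries are dropped
# on read once their time-to-live has elapsed.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired: evict and report a miss so the caller refetches.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)
cache.set("inventory:sku-42", 17)
fresh = cache.get("inventory:sku-42")  # hit while TTL is live
time.sleep(0.1)
stale = cache.get("inventory:sku-42")  # miss after expiry
```

A short TTL keeps inventory reads fast while bounding how stale a cached stock level can be.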
Database Connection Pooling
Using SQLAlchemy's connection pooling with proper pool sizing helped us handle concurrent requests efficiently.
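A pool configuration sketch, using SQLAlchemy's standard engine parameters; the sizes and URL below are hypothetical, not our exact production values:

```python
from sqlalchemy import create_engine

# Illustrative pool sizing: tune pool_size and max_overflow to the
# service's expected concurrency and the database's connection limit.
engine = create_engine(
    "postgresql+psycopg2://user:pass@host/db",
    pool_size=10,        # steady-state connections held open
    max_overflow=5,      # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection
    pool_pre_ping=True,  # validate connections before use; avoids stale sockets
)
```

`pool_pre_ping` matters on Cloud Run, where idle containers can outlive their database sockets.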
Async Processing
Leveraging Python's async/await throughout reduced processing time from 30s to 3s for batch operations of 100 items.
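The speed-up comes from overlapping I/O waits. A toy illustration, with `asyncio.sleep` standing in for a network or database call:

```python
import asyncio

# 100 I/O-bound tasks run concurrently with asyncio.gather instead of
# one after another, so wall-clock time is ~one call, not ~100 calls.
async def process_item(item: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for a network or DB call
    return item * 2

async def process_batch(items):
    return await asyncio.gather(*(process_item(i) for i in items))

results = asyncio.run(process_batch(range(100)))
```

Sequentially this would take about one second (100 × 10 ms); gathered, it finishes in roughly the time of a single call.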
Observability
Structured Logging
Implemented structured logging with consistent fields across all services for easier debugging and analysis.
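A minimal sketch of such a formatter using the standard `logging` module; the field names here are an assumption, not our exact log schema:

```python
import json
import logging

# Emit each record as one JSON object with the same fields in every
# service, so logs can be filtered and aggregated by field.
class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

formatter = JsonFormatter(service="inventory")
record = logging.LogRecord("inventory", logging.INFO, __file__, 1,
                           "stock level updated", None, None)
line = formatter.format(record)
```

Attach the formatter to each service's handlers once at startup, and every log line becomes machine-parseable.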
Metrics with Prometheus
Tracked key metrics including request counts, response times, error rates, and business-specific metrics.
Distributed Tracing
Used Google Cloud Trace for end-to-end request tracking across services.
Challenges and Solutions
Cold Starts
Cloud Run cold starts were causing 1-2s latency spikes. We solved this by setting minimum instances to 1 for critical services, lazy loading heavy dependencies, and optimizing startup probes.
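The lazy-loading piece can be sketched as deferring heavy imports until the first request that needs them; `statistics` below is just a stand-in for a genuinely heavy dependency:

```python
import importlib

# Import heavy modules on first use inside the handler rather than at
# module import time, so the container is ready to serve traffic sooner.
_heavy = None

def get_heavy_module():
    global _heavy
    if _heavy is None:
        # Deferred until a request actually needs it; subsequent calls
        # reuse the cached module.
        _heavy = importlib.import_module("statistics")
    return _heavy

mean = get_heavy_module().mean([1, 2, 3])
```

The trade-off is that the first request to touch the dependency pays the import cost instead of the cold start.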
Data Consistency
Eventual consistency across services was challenging. We implemented the Saga pattern for distributed transactions with compensating transactions for failures.
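A minimal sketch of the Saga pattern: each step pairs an action with a compensating action, and on failure the completed steps are undone in reverse order. The step names are illustrative, not our production workflow:

```python
# Each saga step is (action, compensation). If a step fails, we roll back
# only the steps that already succeeded, newest-first; the failed step
# itself never completed, so it needs no compensation.
class Saga:
    def __init__(self):
        self.steps = []

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))

    def run(self) -> bool:
        done = []
        try:
            for action, compensation in self.steps:
                action()
                done.append(compensation)
            return True
        except Exception:
            for compensation in reversed(done):
                compensation()
            return False

log = []

def fail_payment():
    raise RuntimeError("payment failed")

saga = Saga()
saga.add_step(lambda: log.append("reserve_stock"),
              lambda: log.append("release_stock"))
saga.add_step(fail_payment,
              lambda: log.append("refund"))
ok = saga.run()
```

Here the stock reservation succeeds, the payment step fails, and the saga releases the reserved stock, leaving every service in a consistent state.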
Cost Management
Initial costs were high due to over-provisioning. We right-sized instances, implemented aggressive caching, used Cloud Storage for static assets, and optimized database queries. Result: 60% cost reduction.
Key Metrics
After 6 months in production:
- Uptime: 99.94%
- P95 Latency: under 300ms
- Peak RPS: 500+ requests/second
- Monthly Cost: ~$800 for entire platform
- Deployment Frequency: 3-5x per week
Lessons Learned
Start Simple, Evolve: Don't over-engineer from day one. We started with 3 services and added more as needed.
Observability is Critical: You can't debug what you can't see. Invest early in structured logging, distributed tracing, and comprehensive metrics.
Database per Service is Worth It: Despite added complexity, service autonomy paid off when we needed to scale services independently and change schemas without coordination.
Event-Driven Architecture: Pub/Sub gave us natural decoupling, easy replay for debugging, and graceful degradation.
Automate Everything: CI/CD pipeline saved us countless hours and reduced deployment errors.
Conclusion
Building microservices on GCP has been a journey of continuous learning and iteration. The platform is now stable, scalable, and maintainable.
Key takeaways:
- Use managed services when possible (Cloud Run, Pub/Sub)
- Invest in observability early
- Start simple, add complexity as needed
- Automate deployment and testing
- Document architecture decisions
Check out the complete code on GitHub.
Questions? I'd love to discuss microservices architecture. Reach out on LinkedIn!