Microservices Architecture on GCP: Lessons Learned
After spending months building and deploying a production microservices platform for pharmacy operations on GCP, I want to share the architecture decisions, challenges, and lessons learned along the way.
The Challenge
We needed to build a scalable platform that could handle real-time inventory management, process prescriptions with sub-second latency, scale to support 100+ concurrent users, and maintain 99.9% uptime.
Architecture Overview
Our platform consists of 6 main microservices: API Gateway, Authentication, Inventory, Prescription Processing, Notifications, and Analytics. All services communicate through Google Pub/Sub for event-driven architecture.
Key Architecture Decisions
Service Communication Pattern
Decision: Event-driven architecture using Google Pub/Sub
Why: Decouples services, enables async processing, provides natural retry mechanism, and makes it easy to add new subscribers.
```python
import json
import os

from google.cloud import pubsub_v1

# In our deployments the project ID comes from the environment;
# Cloud Run sets GOOGLE_CLOUD_PROJECT automatically.
PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]

class EventBus:
    def __init__(self):
        self.publisher = pubsub_v1.PublisherClient()

    def publish_event(self, topic: str, event_data: dict):
        topic_path = self.publisher.topic_path(PROJECT_ID, topic)
        message_json = json.dumps(event_data).encode("utf-8")
        future = self.publisher.publish(topic_path, message_json)
        # Block until the message is acknowledged; returns the message ID.
        return future.result()
```
Data Management Strategy
Each service owns its data with database per service pattern. We use PostgreSQL for transactional data, Redis for caching, and implement event sourcing for critical audit trails.
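To make the event-sourcing idea concrete, here is a minimal sketch of an append-only event store for an audit trail. It uses SQLite for self-containment (production would use PostgreSQL), and the table, column, and event names are illustrative, not our actual schema:

```python
import json
import sqlite3
from datetime import datetime, timezone

# Hypothetical append-only event store; events are only ever inserted,
# never updated or deleted, so the table doubles as an immutable audit log.
class EventStore:
    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS events (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   aggregate_id TEXT NOT NULL,
                   event_type TEXT NOT NULL,
                   payload TEXT NOT NULL,
                   created_at TEXT NOT NULL
               )"""
        )

    def append(self, aggregate_id: str, event_type: str, payload: dict):
        self.conn.execute(
            "INSERT INTO events (aggregate_id, event_type, payload, created_at)"
            " VALUES (?, ?, ?, ?)",
            (aggregate_id, event_type, json.dumps(payload),
             datetime.now(timezone.utc).isoformat()),
        )
        self.conn.commit()

    def history(self, aggregate_id: str):
        # Replaying events in insertion order reconstructs the full history.
        rows = self.conn.execute(
            "SELECT event_type, payload FROM events"
            " WHERE aggregate_id = ? ORDER BY id",
            (aggregate_id,),
        )
        return [(t, json.loads(p)) for t, p in rows]

store = EventStore()
store.append("rx-123", "PrescriptionReceived", {"drug": "amoxicillin"})
store.append("rx-123", "PrescriptionFilled", {"pharmacist": "jdoe"})
history = store.history("rx-123")
```

Because events are immutable, auditors can replay exactly what happened to any prescription, in order.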
Deployment Strategy
We chose Google Cloud Run for automatic scaling, pay-per-use pricing, built-in TLS termination, and simple deployment.
```yaml
name: Deploy to Cloud Run

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Build and Push
        run: |
          gcloud builds submit --tag gcr.io/$PROJECT_ID/$SERVICE_NAME
      - name: Deploy
        run: |
          gcloud run deploy $SERVICE_NAME \
            --image gcr.io/$PROJECT_ID/$SERVICE_NAME \
            --platform managed \
            --region us-central1
```
API Gateway Pattern
FastAPI-based API Gateway handles request routing, authentication/authorization, rate limiting, and request/response transformation.
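One gateway responsibility worth illustrating is rate limiting. The sketch below is a standalone token-bucket limiter, not our actual FastAPI middleware; the rate and capacity values are arbitrary:

```python
import time

# Token bucket: tokens refill at a steady rate, and each request spends one.
# A burst up to `capacity` is allowed; beyond that, requests are rejected
# until the bucket refills.
class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=5)
results = [bucket.allow() for _ in range(6)]
# The first five requests fit in the burst; the sixth is rejected
# because the bucket has not had time to refill.
```

In the real gateway this check runs per client (keyed by API token or IP) before the request is routed downstream.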
Performance Optimizations
Caching Strategy
We implemented multi-layer caching using Redis with TTL-based invalidation, significantly reducing database load and improving response times.
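As a sketch of the TTL-invalidation idea, here is a minimal in-process cache; the production system backs this with Redis, and the key names and TTL below are illustrative:

```python
import time

# In-process cache with lazy TTL-based invalidation: entries are dropped
# on read once their time-to-live has elapsed.
class TTLCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired: evict and report a miss so the caller refetches.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)
cache.set("inventory:sku-42", 17)
fresh = cache.get("inventory:sku-42")  # hit while TTL is live
time.sleep(0.1)
stale = cache.get("inventory:sku-42")  # miss after expiry
```

A short TTL keeps inventory reads fast while bounding how stale a cached stock level can be.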
Database Connection Pooling
Using SQLAlchemy's connection pooling with proper pool sizing helped us handle concurrent requests efficiently.
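A pool configuration sketch, using SQLAlchemy's standard engine parameters; the sizes and URL below are hypothetical, not our exact production values:

```python
from sqlalchemy import create_engine

# Illustrative pool sizing: tune pool_size and max_overflow to the
# service's expected concurrency and the database's connection limit.
engine = create_engine(
    "postgresql+psycopg2://user:pass@host/db",
    pool_size=10,        # steady-state connections held open
    max_overflow=5,      # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection
    pool_pre_ping=True,  # validate connections before use; avoids stale sockets
)
```

`pool_pre_ping` matters on Cloud Run, where idle containers can outlive their database sockets.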
Async Processing
Leveraging Python's async/await throughout reduced processing time from 30s to 3s for batch operations of 100 items.
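The speed-up comes from overlapping I/O waits. A toy illustration, with `asyncio.sleep` standing in for a network or database call:

```python
import asyncio

# 100 I/O-bound tasks run concurrently with asyncio.gather instead of
# one after another, so wall-clock time is ~one call, not ~100 calls.
async def process_item(item: int) -> int:
    await asyncio.sleep(0.01)  # stand-in for a network or DB call
    return item * 2

async def process_batch(items):
    return await asyncio.gather(*(process_item(i) for i in items))

results = asyncio.run(process_batch(range(100)))
```

Sequentially this would take about one second (100 × 10 ms); gathered, it finishes in roughly the time of a single call.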
Observability
Structured Logging
Implemented structured logging with consistent fields across all services for easier debugging and analysis.
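A minimal sketch of such a formatter using the standard `logging` module; the field names here are an assumption, not our exact log schema:

```python
import json
import logging

# Emit each record as one JSON object with the same fields in every
# service, so logs can be filtered and aggregated by field.
class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

formatter = JsonFormatter(service="inventory")
record = logging.LogRecord("inventory", logging.INFO, __file__, 1,
                           "stock level updated", None, None)
line = formatter.format(record)
```

Attach the formatter to each service's handlers once at startup, and every log line becomes machine-parseable.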
Metrics with Prometheus
Tracked key metrics including request counts, response times, error rates, and business-specific metrics.
Distributed Tracing
Used Google Cloud Trace for end-to-end request tracking across services.
Challenges and Solutions
Cold Starts
Cloud Run cold starts were causing 1-2s latency spikes. We solved this by setting minimum instances to 1 for critical services, lazy loading heavy dependencies, and optimizing startup probes.
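The lazy-loading piece can be sketched as deferring heavy imports until the first request that needs them; `statistics` below is just a stand-in for a genuinely heavy dependency:

```python
import importlib

# Import heavy modules on first use inside the handler rather than at
# module import time, so the container is ready to serve traffic sooner.
_heavy = None

def get_heavy_module():
    global _heavy
    if _heavy is None:
        # Deferred until a request actually needs it; subsequent calls
        # reuse the cached module.
        _heavy = importlib.import_module("statistics")
    return _heavy

mean = get_heavy_module().mean([1, 2, 3])
```

The trade-off is that the first request to touch the dependency pays the import cost instead of the cold start.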
Data Consistency
Eventual consistency across services was challenging. We implemented the Saga pattern for distributed transactions with compensating transactions for failures.
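A minimal sketch of the Saga pattern: each step pairs an action with a compensating action, and on failure the completed steps are undone in reverse order. The step names are illustrative, not our production workflow:

```python
# Each saga step is (action, compensation). If a step fails, we roll back
# only the steps that already succeeded, newest-first; the failed step
# itself never completed, so it needs no compensation.
class Saga:
    def __init__(self):
        self.steps = []

    def add_step(self, action, compensation):
        self.steps.append((action, compensation))

    def run(self) -> bool:
        done = []
        try:
            for action, compensation in self.steps:
                action()
                done.append(compensation)
            return True
        except Exception:
            for compensation in reversed(done):
                compensation()
            return False

log = []

def fail_payment():
    raise RuntimeError("payment failed")

saga = Saga()
saga.add_step(lambda: log.append("reserve_stock"),
              lambda: log.append("release_stock"))
saga.add_step(fail_payment,
              lambda: log.append("refund"))
ok = saga.run()
```

Here the stock reservation succeeds, the payment step fails, and the saga releases the reserved stock, leaving every service in a consistent state.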
Cost Management
Initial costs were high due to over-provisioning. We right-sized instances, implemented aggressive caching, used Cloud Storage for static assets, and optimized database queries. Result: 60% cost reduction.
Key Metrics
After 6 months in production:
- Uptime: 99.94%
- P95 Latency: under 300ms
- Peak RPS: 500+ requests/second
- Monthly Cost: ~$800 for entire platform
- Deployment Frequency: 3-5x per week
Lessons Learned
Start Simple, Evolve: Don't over-engineer from day one. We started with 3 services and added more as needed.
Observability is Critical: You can't debug what you can't see. Invest early in structured logging, distributed tracing, and comprehensive metrics.
Database per Service is Worth It: Despite added complexity, service autonomy paid off when we needed to scale services independently and change schemas without coordination.
Event-Driven Architecture: Pub/Sub gave us natural decoupling, easy replay for debugging, and graceful degradation.
Automate Everything: CI/CD pipeline saved us countless hours and reduced deployment errors.
Conclusion
Building microservices on GCP has been a journey of continuous learning and iteration. The platform is now stable, scalable, and maintainable.
Key takeaways:
- Use managed services when possible (Cloud Run, Pub/Sub)
- Invest in observability early
- Start simple, add complexity as needed
- Automate deployment and testing
- Document architecture decisions
Check out the complete code on GitHub.
Questions? I'd love to discuss microservices architecture. Reach out on LinkedIn!