Monitoring Architecture
This document outlines the comprehensive monitoring strategy for the Stayzr hotel management system, covering application performance, infrastructure health, and business metrics.
Monitoring Overview
The Stayzr monitoring architecture provides observability across three key areas:
- Application Monitoring: Performance, errors, and user experience
- Infrastructure Monitoring: System resources and service health
- Business Monitoring: Key performance indicators and operational metrics
Monitoring Stack
Technology Stack
Metrics Collection:
- Prometheus: Time-series metrics storage
- Node Exporter: System metrics collection
- Application metrics: Custom business metrics
Logging:
- Winston: Application logging library
- Loki: Log aggregation and storage
- Alternative: ELK Stack (Elasticsearch, Logstash, Kibana)
Tracing:
- OpenTelemetry: Distributed tracing instrumentation
- Jaeger: Trace storage and analysis
Visualization:
- Grafana: Dashboards and visualization
- Custom dashboards: Business-specific metrics
Alerting:
- AlertManager: Alert routing and management
- PagerDuty: Incident management
- Slack: Team notifications
Application Monitoring
Key Performance Indicators
Response Time Metrics:
- HTTP request duration (P50, P95, P99)
- Database query execution time
- External API response time
- Background job processing time
Error Rate Metrics:
- HTTP error rates (4xx, 5xx)
- Database connection errors
- External service failures
- Application exceptions
Throughput Metrics:
- Requests per second
- Database queries per second
- Message processing rate
- Concurrent user sessions
Business Metrics:
- Guest check-ins per hour
- Service request response time
- Booking conversion rate
- System availability percentage
Application Metrics Implementation
// Custom metrics using Prometheus client
import { register, Counter, Histogram, Gauge } from 'prom-client';

// HTTP request metrics
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

// Database metrics
export const databaseQueryDuration = new Histogram({
  name: 'database_query_duration_seconds',
  help: 'Duration of database queries in seconds',
  labelNames: ['query_type', 'table'],
  buckets: [0.001, 0.005, 0.015, 0.05, 0.1, 0.5, 1]
});

// Business metrics
export const guestCheckIns = new Counter({
  name: 'guest_checkins_total',
  help: 'Total number of guest check-ins',
  labelNames: ['hotel_id', 'checkin_type']
});

export const activeReservations = new Gauge({
  name: 'active_reservations',
  help: 'Current number of active reservations',
  labelNames: ['hotel_id', 'status']
});
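The error-tracking middleware in the next subsection increments an httpErrorsTotal counter that is not declared above. A minimal sketch of that counter, assuming it lives in the same prom-client metrics module (the metric name and labels are illustrative):

// HTTP error counter used by the error-tracking middleware below (illustrative definition)
export const httpErrorsTotal = new Counter({
  name: 'http_errors_total',
  help: 'Total number of HTTP error responses',
  labelNames: ['method', 'route', 'status_code']
});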
Error Tracking
// Error tracking middleware
export const errorTrackingMiddleware = (error: Error, req: Request, res: Response, next: NextFunction) => {
  // Log structured error
  logger.error('Application error', {
    error: error.message,
    stack: error.stack,
    requestId: req.headers['x-request-id'],
    userId: req.user?.id,
    route: req.route?.path,
    method: req.method,
    timestamp: new Date().toISOString()
  });

  // Increment error counter
  httpErrorsTotal.inc({
    method: req.method,
    route: req.route?.path || 'unknown',
    status_code: res.statusCode.toString()
  });

  next(error);
};
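The Prometheus scrape configuration later in this document expects the application to expose its metrics at app:3000/metrics, which none of the snippets above show. A minimal Express wiring sketch, assuming the prom-client default registry and a hypothetical module path for the middleware above:

import express from 'express';
import { register } from 'prom-client';
import { errorTrackingMiddleware } from './middleware/error-tracking'; // hypothetical module path

const app = express();

// Expose every metric registered with the default prom-client registry
// for the 'stayzr-app' scrape job (app:3000/metrics).
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

// Error-tracking middleware is registered last so it sees the final status code.
app.use(errorTrackingMiddleware);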
Infrastructure Monitoring
System Metrics
Server Metrics:
- CPU utilization (user, system, idle)
- Memory usage (used, available, cached)
- Disk usage and I/O operations
- Network traffic (bytes in/out)
Database Metrics:
- Connection pool usage
- Query performance statistics
- Lock contention and deadlocks
- Replication lag (if applicable)
Redis Metrics:
- Memory usage and hit rates
- Connection count
- Command statistics
- Keyspace information
Container Metrics:
- Container resource usage
- Container restart counts
- Image vulnerabilities
- Registry pull statistics
Infrastructure Configuration
# Docker Compose monitoring services
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'

volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Note: AlertManager and the postgres/redis exporters referenced below are assumed to run
# alongside the services defined in the Compose file above; they are not declared there.
scrape_configs:
  - job_name: 'stayzr-app'
    static_configs:
      - targets: ['app:3000']
    scrape_interval: 5s
    metrics_path: /metrics

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
Logging Strategy
Structured Logging
// Winston logger configuration
import winston from 'winston';

export const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'stayzr-api',
    version: process.env.APP_VERSION
  },
  transports: [
    new winston.transports.Console({
      format: winston.format.combine(
        winston.format.colorize(),
        winston.format.simple()
      )
    }),
    new winston.transports.File({
      filename: 'logs/error.log',
      level: 'error'
    }),
    new winston.transports.File({
      filename: 'logs/combined.log'
    })
  ]
});

// Request logging middleware
export const requestLogger = (req: Request, res: Response, next: NextFunction) => {
  const startTime = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - startTime;
    logger.info('HTTP Request', {
      method: req.method,
      url: req.url,
      statusCode: res.statusCode,
      duration,
      userAgent: req.get('User-Agent'),
      ip: req.ip,
      userId: req.user?.id,
      requestId: req.headers['x-request-id']
    });
  });

  next();
};
Log Aggregation
# Loki configuration for log aggregation
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb:
    directory: /loki/index
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
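The Winston configuration above writes to stdout and local files, and nothing in this section actually ships those logs to Loki; typically Promtail tails the log files, or the application pushes logs directly. A sketch of the push approach, assuming the third-party winston-loki transport package (not part of the stack listed above):

import LokiTransport from 'winston-loki';

// Hypothetical: push structured logs straight to Loki alongside the existing transports.
logger.add(new LokiTransport({
  host: process.env.LOKI_HOST || 'http://loki:3100',
  labels: { service: 'stayzr-api' },
  json: true
}));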
Alerting Rules
Critical Alerts
# alert_rules.yml
groups:
  - name: stayzr.critical
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors per second"

      - alert: DatabaseDown
        expr: up{job="postgres"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Database is down"
          description: "PostgreSQL database is not responding"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Only {{ $value | humanizePercentage }} of disk space remains available"
Business Alerts
  - name: stayzr.business
    rules:
      - alert: CheckInFailureRate
        expr: rate(guest_checkins_failed_total[10m]) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High check-in failure rate"
          description: "Check-in failure rate is {{ $value }} failures per second"

      - alert: ServiceRequestBacklog
        expr: service_requests_pending > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Service request backlog"
          description: "{{ $value }} service requests are pending"

      - alert: ReservationSystemDown
        # Assumes a 'pms-integration' scrape job not shown in the prometheus.yml above
        expr: up{job="pms-integration"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Reservation system integration down"
          description: "PMS integration service is not responding"
Dashboard Configuration
Grafana Dashboards
{
  "dashboard": {
    "title": "Stayzr System Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])",
            "legendFormat": "{{ method }} {{ route }}"
          }
        ]
      },
      {
        "title": "Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"4..|5..\"}[5m])",
            "legendFormat": "{{ status }}"
          }
        ]
      }
    ]
  }
}
Business Metrics Dashboard
Business Dashboard Panels:
- Guest Check-ins (hourly/daily)
- Active Reservations
- Service Request Response Time
- Staff Performance Metrics
- Revenue Metrics
- Guest Satisfaction Scores
- System Availability
- Popular Services
Performance Monitoring
Application Performance Monitoring (APM)
// OpenTelemetry tracing setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';

const jaegerExporter = new JaegerExporter({
  endpoint: process.env.JAEGER_ENDPOINT || 'http://localhost:14268/api/traces'
});

const sdk = new NodeSDK({
  traceExporter: jaegerExporter,
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();
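Auto-instrumentation covers HTTP, database, and Redis calls; business operations can additionally be wrapped in manual spans. A sketch using the @opentelemetry/api tracer, with checkInGuest as a hypothetical service function:

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Hypothetical manual span around a business operation, complementing auto-instrumentation.
export async function checkInGuest(reservationId: string): Promise<void> {
  const tracer = trace.getTracer('stayzr-api');
  await tracer.startActiveSpan('guest.check-in', async (span) => {
    try {
      span.setAttribute('reservation.id', reservationId);
      // ... perform the actual check-in work here ...
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}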
Database Performance Monitoring
-- PostgreSQL monitoring queries (requires the pg_stat_statements extension;
-- on PostgreSQL 13+ the columns are named total_exec_time / mean_exec_time)
SELECT
  query,
  calls,
  total_time,
  mean_time,
  rows,
  100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

-- Connection and activity statistics per database
SELECT
  datname,
  numbackends,
  xact_commit,
  xact_rollback,
  blks_read,
  blks_hit,
  tup_returned,
  tup_fetched,
  tup_inserted,
  tup_updated,
  tup_deleted
FROM pg_stat_database;
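On the application side, the databaseQueryDuration histogram defined earlier is never observed in these snippets. One way to feed it is a sketch like the following, assuming Prisma's $use middleware hook (deprecated in newer Prisma versions in favor of client extensions):

// Hypothetical Prisma middleware feeding the databaseQueryDuration histogram defined earlier.
prisma.$use(async (params, next) => {
  const stopTimer = databaseQueryDuration.startTimer({
    query_type: params.action,
    table: params.model ?? 'unknown'
  });
  try {
    return await next(params);
  } finally {
    stopTimer();
  }
});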
Health Checks
Application Health Endpoints
// Health check implementation
export const healthController = {
  basic: async (req: Request, res: Response) => {
    res.json({
      status: 'healthy',
      timestamp: new Date().toISOString(),
      uptime: process.uptime(),
      version: process.env.APP_VERSION
    });
  },

  detailed: async (req: Request, res: Response) => {
    const checks = await Promise.allSettled([
      checkDatabase(),
      checkRedis(),
      checkExternalServices()
    ]);

    // The check functions resolve to false on failure rather than rejecting,
    // so inspect the resolved value as well as the settlement status.
    const ok = (check: PromiseSettledResult<boolean>) =>
      check.status === 'fulfilled' && check.value === true;

    const healthStatus = {
      status: checks.every(ok) ? 'healthy' : 'unhealthy',
      timestamp: new Date().toISOString(),
      checks: {
        database: ok(checks[0]),
        redis: ok(checks[1]),
        external: ok(checks[2])
      }
    };

    const statusCode = healthStatus.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(healthStatus);
  }
};

async function checkDatabase(): Promise<boolean> {
  try {
    await prisma.$queryRaw`SELECT 1`;
    return true;
  } catch (error) {
    logger.error('Database health check failed', { error });
    return false;
  }
}
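checkRedis and checkExternalServices are referenced above but not shown; they follow the same pattern as checkDatabase. A sketch of checkRedis, assuming an ioredis client exported as redisClient:

// Hypothetical Redis health check; assumes an ioredis client exported as `redisClient`.
async function checkRedis(): Promise<boolean> {
  try {
    const reply = await redisClient.ping();
    return reply === 'PONG';
  } catch (error) {
    logger.error('Redis health check failed', { error });
    return false;
  }
}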
Monitoring Best Practices
Metric Guidelines
- Use appropriate metric types: Counters for totals, gauges for current values, histograms for distributions
- Add meaningful labels: Include relevant dimensions but avoid high cardinality
- Monitor business metrics: Track KPIs that matter to stakeholders
- Set up SLIs/SLOs: Define service level indicators and objectives
Alerting Guidelines
- Alert on symptoms, not causes: Focus on user-facing issues
- Avoid alert fatigue: Ensure alerts are actionable and not noisy
- Use escalation policies: Route alerts to appropriate teams
- Document runbooks: Provide clear response procedures
Dashboard Guidelines
- Keep dashboards focused: One dashboard per service or team
- Use consistent time ranges: Align all panels to the same time period
- Include context: Add annotations and descriptions
- Regular review: Update dashboards as system evolves
Troubleshooting Guide
Common Issues
High Memory Usage
# Check per-container CPU and memory usage
docker stats
# Start the app with the inspector enabled to capture heap snapshots (e.g. via Chrome DevTools)
node --inspect app.js

Database Performance Issues
-- Check slow queries (on PostgreSQL 13+ use mean_exec_time)
SELECT query, mean_time, calls
FROM pg_stat_statements
ORDER BY mean_time DESC;
-- Check lock contention
SELECT * FROM pg_locks WHERE NOT granted;

Network Issues
# Check application connectivity
curl -v http://localhost:3000/health
# List listening ports
netstat -tuln