myMBR-OS — Utility Reliability Standards

Every utility built as part of myMBR-OS must conform to these standards, which are derived from the my_backup system, a proven defense-in-depth architecture.

Reference: D:\FSS\Software\Utils\PythonUtils\my_backup\README.md

The Core Principle: Defense-in-Depth

No utility can be trusted to report its own success. Every critical process must have an independent verifier that runs at a different time and can detect silent failures.

Pattern from my_backup:

Main process runs at 2 AM (Windows Task Scheduler)
Independent verifier runs at 8 AM (WSL cron) — checks snapshot age, not just exit code
If the 2 AM run silently fails, the 8 AM check detects it and fires an alert

Applied to myMBR-OS:

Rate scraper runs at 06:00 ET (cron)
Health checker runs at 09:00 ET (independent cron) — verifies data freshness, record counts, anomaly thresholds
If scraper silently fails, health checker detects stale data and fires alert
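
The freshness check that the health checker performs can be sketched as follows. This is a minimal illustration, not the my_backup implementation: the function name is_stale, the sample timestamps, and the 4-hour tolerance are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the newest data is older than the allowed age."""
    return (now - last_success) > max_age


# Hypothetical timestamps: the health check runs three hours after the scraper.
check_time = datetime(2024, 1, 2, 14, 0, tzinfo=timezone.utc)
todays_scrape = check_time - timedelta(hours=3)
yesterdays_scrape = check_time - timedelta(hours=27)
max_age = timedelta(hours=4)  # assumed tolerance; would normally come from config.yaml

fresh_result = is_stale(todays_scrape, check_time, max_age)
stale_result = is_stale(yesterdays_scrape, check_time, max_age)
```

Because the check looks at the age of the data rather than the scraper's exit code, a run that exited cleanly but wrote nothing is still flagged.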

Required Standards for Every Utility

1. Structured Logging

All output written to logs/[utility_name].log
Log rotation: keep last 5 runs
Format: [YYYY-MM-DD HH:MM:SS] LEVEL : message
Levels: INFO, WARNING, ERROR, CRITICAL
Every run starts with === RUN STARTED === and ends with RESULT: SUCCESS, RESULT: WARNING, or RESULT: FAILED (the three final results defined under Graceful Failure)
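
One way to meet the logging rules above with the standard library is sketched below. The helper name make_run_logger is hypothetical, and rotating once per run through RotatingFileHandler.doRollover() is an assumption about how "keep last 5 runs" could be implemented (the current log plus four rotated copies).

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_run_logger(utility_name: str, log_dir: str = "logs") -> logging.Logger:
    """Create a per-utility logger that keeps the current run plus four older runs."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    log_path = directory / f"{utility_name}.log"
    had_previous_run = log_path.exists() and log_path.stat().st_size > 0

    logger = logging.getLogger(utility_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    # maxBytes=0 disables size-based rotation; we rotate once per run instead.
    handler = RotatingFileHandler(log_path, maxBytes=0, backupCount=4, encoding="utf-8")
    if had_previous_run:
        handler.doRollover()  # previous run becomes <name>.log.1, and so on
    handler.setFormatter(
        logging.Formatter(
            "[%(asctime)s] %(levelname)s : %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S",
        )
    )
    logger.addHandler(handler)
    return logger
```

A utility would call make_run_logger("rate_scanner") once at startup, log === RUN STARTED === as its first line, and log the RESULT line as its last.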

2. Alerts on Failure

All critical failures trigger immediate alerts via notify_manager (email + Telegram)
Alert subjects: tiered by severity
- Calm (INFO/success): [myMBR-OS] [utility]: OK — no issues
- Loud (WARNING): [myMBR-OS] [utility]: WARNING — review required
- Critical (ERROR/CRITICAL): [myMBR-OS] [utility]: CRITICAL — immediate action needed
Notification failures are non-blocking — never abort the main process
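
The non-blocking rule and the severity tiers can be sketched as a thin wrapper. The names TIER_BY_LEVEL, subject_prefix, and send_alert_safely are hypothetical, and the send argument stands in for whatever callable notify_manager actually exposes; its keyword arguments here simply follow the usage pattern shown in this document.

```python
import logging
from typing import Callable

logger = logging.getLogger("alerts")

# Log level -> subject tier, following the three tiers listed above.
TIER_BY_LEVEL = {
    "INFO": "OK",
    "WARNING": "WARNING",
    "ERROR": "CRITICAL",
    "CRITICAL": "CRITICAL",
}


def subject_prefix(utility: str, level: str) -> str:
    """Build the common start of an alert subject for the given log level."""
    return f"[myMBR-OS] {utility}: {TIER_BY_LEVEL[level]}"


def send_alert_safely(send: Callable[..., object], subject: str, body: str, level: str) -> bool:
    """Call the notifier, but never let a notification error abort the main process."""
    try:
        send(subject=subject, body=body, level=level)
        return True
    except Exception:
        # Record the notification failure and carry on with the run.
        logger.exception("Notification failed; continuing main process")
        return False
```

Returning a boolean lets the caller note in its own log that an alert could not be delivered without changing the run's control flow.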

3. Weekly Status Report

Weekly email summarizing the last 7 days of operation for all utilities
Calm subject on clean week; loud subject if any warnings or errors occurred
Delivered Sunday morning (mirror of my_backup’s pattern)

4. Independent Verification

Every utility with scheduled runs must have a separate verification job
Verifier runs at a different scheduled time than the main process
Verifier checks outcomes (data freshness, record counts, expected values) — not just exit codes
Verifier fires its own alert if the main process appears to have failed silently
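
A verifier that checks outcomes rather than exit codes might look like the sketch below. The table name rates, the column rate, and the function name verify_outcomes are assumptions made for illustration; the real schema and thresholds would come from the utility and its config.yaml.

```python
import sqlite3


def verify_outcomes(conn: sqlite3.Connection, min_rows: int, low: float, high: float) -> list[str]:
    """Return a list of problems found in the stored data (empty list means healthy)."""
    problems = []

    # Record count: a silent failure often shows up as too few rows.
    (count,) = conn.execute("SELECT COUNT(*) FROM rates").fetchone()
    if count < min_rows:
        problems.append(f"only {count} row(s) found, expected at least {min_rows}")

    # Expected values: rates outside a plausible range suggest a parsing error.
    (out_of_range,) = conn.execute(
        "SELECT COUNT(*) FROM rates WHERE rate < ? OR rate > ?", (low, high)
    ).fetchone()
    if out_of_range:
        problems.append(f"{out_of_range} rate(s) outside the expected range {low} to {high}")

    return problems
```

The verifier job would send its own alert whenever the returned list is non-empty, independently of anything the main process reported.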

5. Dry-Run Mode

Every utility supports a --dry-run flag
A dry run logs what would happen without modifying any state
A dry run is required before any production deployment or config change
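
A minimal sketch of the flag, using argparse from the standard library. The function save_rates and its file-writing behavior are hypothetical stand-ins for whatever state a real utility modifies.

```python
import argparse
import logging
from pathlib import Path

logger = logging.getLogger("dry_run_demo")


def parse_args(argv=None) -> argparse.Namespace:
    """Parse command-line options; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="Example utility")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log what would happen without modifying any state",
    )
    return parser.parse_args(argv)


def save_rates(rows: list[str], out_path: Path, dry_run: bool) -> bool:
    """Write rows to out_path unless dry_run is set. Returns True if state changed."""
    if dry_run:
        logger.info("DRY RUN: would write %d row(s) to %s", len(rows), out_path)
        return False
    out_path.write_text("\n".join(rows), encoding="utf-8")
    logger.info("Wrote %d row(s) to %s", len(rows), out_path)
    return True
```

Passing the dry_run value down to every function that touches state, as done here, keeps the guard next to the write it protects.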

6. Test Suite

Every utility has a test suite in tests/
Tests cover: config validation, connectivity, data integrity, end-to-end pipeline
Includes at least one “restore/recovery” test that verifies data can actually be used downstream
Run: uv run pytest tests/ before any deployment

7. Config as SSOT

All paths, thresholds, credentials, and feature flags defined in config.yaml
Secrets in .env (never committed)
.env.example uses placeholder values only
No hardcoded values in Python code
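
The split between config.yaml and .env can be enforced at startup with a couple of small checks. The sketch below assumes the YAML file has already been parsed into a dict by a YAML library; the key names in REQUIRED_KEYS, the PLACEHOLDER marker, and both function names are hypothetical.

```python
import os
from typing import Mapping

REQUIRED_KEYS = ("db_path", "max_changed_fraction")  # hypothetical config.yaml keys
PLACEHOLDER = "changeme"  # assumed marker used for values in .env.example


def validate_config(raw: Mapping[str, object]) -> dict:
    """Fail fast if the parsed config.yaml lacks a required key."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise KeyError(f"config.yaml is missing required keys: {missing}")
    return dict(raw)


def require_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Read a secret from the environment; reject empty or placeholder values."""
    value = env.get(name, "")
    if not value or value == PLACEHOLDER:
        raise RuntimeError(f"secret {name} is not set (check .env)")
    return value
```

Failing at startup on a missing key or secret keeps a misconfigured utility from running partway and then falling back to a hardcoded default.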

8. Graceful Failure

Failures are logged and alerted, never silently swallowed
Pipeline continues what it can; accumulates failures
Final result: SUCCESS (all tasks passed), WARNING (some non-critical failures), or FAILED (any critical failure)
A FAILED result always triggers an alert
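
The accumulate-and-continue behavior can be sketched as below. Task and run_pipeline are hypothetical names, and marking each task as critical or non-critical with a boolean is an assumption about how a utility might classify its steps.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("pipeline")


@dataclass
class Task:
    name: str
    run: Callable[[], None]
    critical: bool = False


def run_pipeline(tasks: list[Task]) -> tuple[str, list[str]]:
    """Run every task, collect failures, and derive the final result."""
    failures: list[str] = []
    critical_failed = False
    for task in tasks:
        try:
            task.run()
        except Exception as exc:
            # Log and remember the failure, then keep going with the next task.
            logger.error("Task %s failed: %s", task.name, exc)
            failures.append(f"{task.name}: {exc}")
            critical_failed = critical_failed or task.critical

    if critical_failed:
        result = "FAILED"
    elif failures:
        result = "WARNING"
    else:
        result = "SUCCESS"
    return result, failures
```

The caller logs the returned result as the final RESULT line and sends an alert whenever it is FAILED.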

9. Human Escalation Thresholds

Define in config.yaml for each utility:

What changes are expected and can proceed automatically
What changes are anomalous and require human approval before proceeding
Example (rate scanner): >15% of rates changed since previous run → pause and alert
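
The rate scanner example can be sketched as a comparison between the previous and current runs. The function names and the choice to count added or removed products as changes are assumptions; the 0.15 threshold mirrors the 15% figure above and would be read from config.yaml.

```python
def changed_fraction(previous: dict[str, float], current: dict[str, float]) -> float:
    """Fraction of products whose rate differs between runs (added or removed counts too)."""
    products = set(previous) | set(current)
    if not products:
        return 0.0
    changed = sum(1 for p in products if previous.get(p) != current.get(p))
    return changed / len(products)


def needs_human_approval(
    previous: dict[str, float], current: dict[str, float], max_changed_fraction: float
) -> bool:
    """True when the change is anomalous and the run should pause and alert."""
    return changed_fraction(previous, current) > max_changed_fraction
```

With ten products and a threshold of 0.15, one changed rate (10%) proceeds automatically, while two changed rates (20%) pause the run for review.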

10. Annual Fire Drill

Once per year: manually test the full recovery path for each utility
For rate scanner: delete the SQLite database, verify it can be reconstructed from source
For any downstream consumer: verify it can operate correctly on reconstructed data
Annual reminder email delivered Jan 1 (mirror of my_backup’s send_manual_reminders)
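
For the rate scanner drill, the delete-and-reconstruct step could be scripted along these lines. The schema (a rates table with product and rate columns), the function name rebuild_database, and passing the source data in as a list of tuples are all assumptions for illustration; a real drill would pull the rows from the actual source.

```python
import sqlite3
from pathlib import Path


def rebuild_database(db_path: Path, source_rows: list[tuple[str, float]]) -> int:
    """Delete the database if present, rebuild it from source rows, return the row count."""
    if db_path.exists():
        db_path.unlink()  # simulate total loss of the database

    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE rates (product TEXT PRIMARY KEY, rate REAL NOT NULL)")
        conn.executemany("INSERT INTO rates (product, rate) VALUES (?, ?)", source_rows)
        conn.commit()
        # Read the data back so the drill verifies what was stored, not what was sent.
        (count,) = conn.execute("SELECT COUNT(*) FROM rates").fetchone()
    finally:
        conn.close()
    return count
```

The drill passes only if the returned count matches the source and a downstream consumer can then run against the rebuilt file.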

notify_manager Integration

All myMBR-OS utilities use the existing notify_manager system for alerts.

Reference: D:\FSS\Software\Utils\PythonUtils\notify_manager\

Usage pattern (mirror of my_backup):

from notifications import notify_manager  # or equivalent import

# Critical failure (error_details holds the failure description built by the calling utility)
notify_manager.send_alert(
    subject="[myMBR-OS] rate-scanner: CRITICAL — scraper failed",
    body=error_details,
    level="critical"
)

Operational Cadence Template

Frequency	Time	Task
Daily	06:00 ET	Main process (e.g., rate scraper)
Daily	09:00 ET	Independent health check
Weekly	Sunday 09:00	Status report email
Monthly	1st of month	Full maintenance run
Annually	Jan 1	Manual reminders + fire drill checklist

Each utility adapts this template to its own needs.

Checklist: Before Shipping a New Utility