myMBR-OS — Utility Reliability Standards
Every utility built as part of myMBR-OS must conform to these standards. They are derived from the `my_backup` system, a proven defense-in-depth architecture.
Reference: D:\FSS\Software\Utils\PythonUtils\my_backup\README.md
The Core Principle: Defense-in-Depth
No utility can be trusted to report its own success. Every critical process must have an independent verifier that runs at a different time and can detect silent failures.
Pattern from my_backup:
- Main process runs at 2 AM (Windows Task Scheduler)
- Independent verifier runs at 8 AM (WSL cron) — checks snapshot age, not just exit code
- If the 2 AM run silently fails, the 8 AM check detects it and fires an alert
Applied to myMBR-OS:
- Rate scraper runs at 06:00 ET (cron)
- Health checker runs at 09:00 ET (independent cron) — verifies data freshness, record counts, anomaly thresholds
- If scraper silently fails, health checker detects stale data and fires alert
Required Standards for Every Utility
1. Structured Logging
- All output written to `logs/[utility_name].log`
- Log rotation: keep last 5 runs
- Format: `[YYYY-MM-DD HH:MM:SS] LEVEL : message`
- Levels: `INFO`, `WARNING`, `ERROR`, `CRITICAL`
- Every run starts with `=== RUN STARTED ===` and ends with `RESULT: SUCCESS` or `RESULT: FAILED`
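The logging rules above could be wired up with the standard library roughly as follows. This is a minimal sketch, not the my_backup implementation: `build_run_logger` is a hypothetical helper name, and it assumes that "keep last 5 runs" means the current log file plus four rotated backups.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_run_logger(utility_name, log_dir="logs"):
    """Return a logger writing <log_dir>/<utility_name>.log, rotated once per run."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    log_path = Path(log_dir) / f"{utility_name}.log"

    logger = logging.getLogger(utility_name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # Drop handlers left over from an earlier call so lines are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
        old.close()

    had_previous_run = log_path.exists() and log_path.stat().st_size > 0
    # maxBytes=0 turns off size-based rotation; rotation happens per run below.
    # Assumption: "last 5 runs" means the current file plus 4 backups.
    handler = RotatingFileHandler(log_path, maxBytes=0, backupCount=4, encoding="utf-8")
    if had_previous_run:
        handler.doRollover()  # previous run becomes <name>.log.1, and so on
    handler.setFormatter(logging.Formatter(
        "[%(asctime)s] %(levelname)s : %(message)s", datefmt="%Y-%m-%d %H:%M:%S"))
    logger.addHandler(handler)
    return logger
```

A run would then call `logger = build_run_logger("rate_scanner")`, log `=== RUN STARTED ===` first, and log `RESULT: SUCCESS` or `RESULT: FAILED` last.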
2. Alerts on Failure
- All critical failures trigger immediate alerts via `notify_manager` (email + Telegram)
- Alert subjects: tiered by severity
  - Calm (INFO/success): `[myMBR-OS] [utility]: OK — no issues`
  - Loud (WARNING): `[myMBR-OS] [utility]: WARNING — review required`
  - Critical (ERROR/CRITICAL): `[myMBR-OS] [utility]: CRITICAL — immediate action needed`
- Notification failures are non-blocking and must never abort the main process
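One way to express the tiered subjects and the non-blocking rule is sketched below. The helper names `build_subject` and `send_alert_safely` are hypothetical, and the `send` argument stands in for whatever callable `notify_manager` actually exposes; only the subject templates are taken from the list above.

```python
import logging

logger = logging.getLogger("alerts")

# Subject templates copied from the severity tiers; {utility} is filled in per alert.
SUBJECT_TEMPLATES = {
    "calm": "[myMBR-OS] {utility}: OK — no issues",
    "loud": "[myMBR-OS] {utility}: WARNING — review required",
    "critical": "[myMBR-OS] {utility}: CRITICAL — immediate action needed",
}

# Log level name -> alert tier.
TIER_BY_LEVEL = {
    "INFO": "calm",
    "WARNING": "loud",
    "ERROR": "critical",
    "CRITICAL": "critical",
}


def build_subject(utility, level):
    """Return the alert subject for a utility at the given log level."""
    tier = TIER_BY_LEVEL[level.upper()]
    return SUBJECT_TEMPLATES[tier].format(utility=utility)


def send_alert_safely(send, utility, level, body):
    """Call send(...); return False instead of raising if delivery fails."""
    try:
        send(subject=build_subject(utility, level), body=body, level=level.lower())
        return True
    except Exception:
        # Non-blocking: record the delivery failure and let the main process continue.
        logger.exception("Alert delivery failed for %s", utility)
        return False
```

Returning a boolean rather than raising lets the caller note the delivery problem in its own log without changing the outcome of the run.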
3. Weekly Status Report
- Weekly email summarizing the last 7 days of operation for all utilities
- Calm subject on clean week; loud subject if any warnings or errors occurred
- Delivered Sunday morning (mirror of my_backup’s pattern)
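The calm-versus-loud decision for the weekly report can be derived from the log files themselves. The sketch below assumes log lines follow the `[YYYY-MM-DD HH:MM:SS] LEVEL : message` format from the logging standard; `summarize_week` and the subject wording are hypothetical, since the standard only requires a calm subject for a clean week and a loud one otherwise.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

LINE_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] (\w+) : (.*)$")


def summarize_week(log_lines, now):
    """Count log levels from the 7 days before `now` and pick a calm or loud subject."""
    cutoff = now - timedelta(days=7)
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the standard format
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if cutoff <= stamp <= now:
            counts[match.group(2)] += 1

    problems = counts["WARNING"] + counts["ERROR"] + counts["CRITICAL"]
    # Hypothetical subject wording, chosen only for illustration.
    if problems:
        subject = f"[myMBR-OS] weekly status: WARNING, {problems} issue(s) logged"
    else:
        subject = "[myMBR-OS] weekly status: OK"
    return subject, dict(counts)
```

Passing `now` in explicitly keeps the function deterministic, which makes it easy to cover in the test suite.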
4. Independent Verification
- Every utility with scheduled runs must have a separate verification job
- Verifier runs at a different scheduled time than the main process
- Verifier checks outcomes (data freshness, record counts, expected values) — not just exit codes
- Verifier fires its own alert if the main process appears to have failed silently
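A verifier that checks outcomes rather than exit codes might look like the sketch below. It assumes a hypothetical SQLite table `rates` with a `scraped_at` text column stored as `YYYY-MM-DD HH:MM:SS`; the table name, column name, and default thresholds are illustrative and would come from `config.yaml` in a real utility.

```python
import sqlite3
from datetime import datetime, timedelta


def verify_outcome(conn, now, max_age_hours=4, min_records=1):
    """Return a list of problems; an empty list means the last run looks healthy."""
    problems = []
    # Timestamps in this format sort correctly as plain strings.
    cutoff = (now - timedelta(hours=max_age_hours)).isoformat(sep=" ", timespec="seconds")

    # Hypothetical schema: rates(scraped_at TEXT, ...)
    newest = conn.execute("SELECT MAX(scraped_at) FROM rates").fetchone()[0]
    fresh_count = conn.execute(
        "SELECT COUNT(*) FROM rates WHERE scraped_at >= ?", (cutoff,)
    ).fetchone()[0]

    if newest is None or newest < cutoff:
        problems.append(f"stale data: newest record is {newest}")
    if fresh_count < min_records:
        problems.append(f"too few fresh records: {fresh_count} < {min_records}")
    return problems
```

A non-empty result is what the verifier would turn into its own alert, independently of anything the main process reported.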
5. Dry-Run Mode
- Every utility supports a `--dry-run` flag
- Dry run logs what would happen without modifying any state
- Required before any production deployment or config change
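The `--dry-run` flag can be handled with `argparse` and threaded through every state-changing step. In this sketch `save_rates` and the dict used as storage are hypothetical stand-ins for a real write path.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments, including the --dry-run switch."""
    parser = argparse.ArgumentParser(description="Example utility with a dry-run switch")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="log planned actions without modifying any state",
    )
    return parser.parse_args(argv)


def save_rates(rates, store, dry_run):
    """Write rates into `store` unless dry_run is set; return the log messages."""
    messages = []
    for name, value in rates.items():
        if dry_run:
            # Describe the action but leave the store untouched.
            messages.append(f"DRY RUN: would write {name}={value}")
        else:
            store[name] = value
            messages.append(f"wrote {name}={value}")
    return messages
```

Keeping the dry-run branch next to each write makes it harder for a new code path to skip the check.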
6. Test Suite
- Every utility has a test suite in `tests/`
- Tests cover: config validation, connectivity, data integrity, end-to-end pipeline
- Includes at least one “restore/recovery” test that verifies data can actually be used downstream
- Run: `uv run pytest tests/` before any deployment
7. Config as SSOT
- All paths, thresholds, credentials, and feature flags defined in `config.yaml`
- Secrets in `.env` (never committed)
- `.env.example` uses placeholder values only
- No hardcoded values in Python code
8. Graceful Failure
- Failures are logged and alerted, never silently swallowed
- Pipeline continues what it can; accumulates failures
- Final result: `SUCCESS` (all tasks passed), `WARNING` (some non-critical failures), or `FAILED` (any critical failure)
- A `FAILED` result always triggers an alert
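The accumulate-then-classify behaviour can be sketched as a small runner. `run_pipeline` and the `(name, callable, is_critical)` task shape are assumptions made for illustration, not part of any existing utility.

```python
def run_pipeline(tasks):
    """Run every task, collect failures, and return (result, failures).

    `tasks` is a list of (name, callable, is_critical) tuples.
    """
    failures = []
    for name, func, is_critical in tasks:
        try:
            func()
        except Exception as exc:
            # Record the failure and keep going with the remaining tasks.
            failures.append((name, is_critical, repr(exc)))

    if any(is_critical for _, is_critical, _ in failures):
        result = "FAILED"
    elif failures:
        result = "WARNING"
    else:
        result = "SUCCESS"
    return result, failures
```

The caller would log each entry in `failures` and send an alert whenever the result is `FAILED`.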
9. Human Escalation Thresholds
Define in `config.yaml` for each utility:
- What changes are expected and can proceed automatically
- What changes are anomalous and require human approval before proceeding
- Example (rate scanner): >15% of rates changed since previous run → pause and alert
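The rate scanner example reduces to comparing two snapshots and checking the changed share against a threshold. The function names and the dict-of-rates shape below are hypothetical; the 0.15 default mirrors the 15% figure in the example and would be read from `config.yaml` in practice.

```python
def fraction_changed(previous, current):
    """Fraction of rates in `previous` whose value differs or is missing in `current`."""
    if not previous:
        return 0.0
    changed = sum(1 for key, value in previous.items() if current.get(key) != value)
    return changed / len(previous)


def needs_human_approval(previous, current, max_changed_fraction=0.15):
    """True when more than the allowed share of rates changed since the previous run."""
    return fraction_changed(previous, current) > max_changed_fraction
```

When `needs_human_approval` returns `True`, the utility would pause before writing and send an alert instead of proceeding automatically.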
10. Annual Fire Drill
- Once per year: manually test the full recovery path for each utility
- For rate scanner: delete the SQLite database, verify it can be reconstructed from source
- For any downstream consumer: verify it can operate correctly on reconstructed data
- Annual reminder email delivered Jan 1 (mirror of my_backup’s `send_manual_reminders`)
notify_manager Integration
All myMBR-OS utilities use the existing `notify_manager` system for alerts.
Reference: D:\FSS\Software\Utils\PythonUtils\notify_manager\
Usage pattern (mirror of my_backup):
```python
from notifications import notify_manager  # or equivalent import

# Critical failure (error_details is a string describing what went wrong)
notify_manager.send_alert(
    subject="[myMBR-OS] rate-scanner: CRITICAL — scraper failed",
    body=error_details,
    level="critical",
)
```

Operational Cadence Template

| Frequency | Time | Task |
|---|---|---|
| Daily | 06:00 ET | Main process (e.g., rate scraper) |
| Daily | 09:00 ET | Independent health check |
| Weekly | Sunday 09:00 | Status report email |
| Monthly | 1st of month | Full maintenance run |
| Annually | Jan 1 | Manual reminders + fire drill checklist |
Each utility adapts this template to its own needs.
Checklist: Before Shipping a New Utility
- Structured logging implemented (rotation, format, levels)
- Alerts on failure wired to notify_manager
- Independent verifier job created (different cron time)
- Weekly status report covers this utility
- Dry-run mode implemented and tested
- Test suite written and passing (including end-to-end)
- config.yaml / .env.example created (no hardcoded values)
- Human escalation thresholds defined
- Annual fire drill added to reminder list
- README.md written (mirrors my_backup style)