Handling hardware failure

Hardware fails. Our job is to make that boring. Here's the runbook.

Detection: every dedicated server runs an on-chassis health agent that reports SMART status, ECC error rates, fan speeds, and PSU status to our control plane. Critical alerts open an internal ticket within 60 seconds.

Response: a Vintony engineer reviews the alert and chooses one of three responses: (a) a remote action, such as clearing a soft SMART error; (b) a scheduled component swap during a maintenance window; or (c) an immediate hot-spare swap. For (c), the replacement comes from a pool of hot-spare chassis that we keep on-site in every region.

What you should do: configure dashboard alerts for hardware events. Most customers want to know about a failed drive even if we've already started the replacement; some don't. Both are fine. You can also opt into auto-migration to a spare chassis, which triggers after a set number of critical alerts (N) within a time window. It is off by default because most customers prefer that their servers not be moved involuntarily.
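
To make the "N critical alerts within a window" rule concrete, here is a minimal sketch of how such a trigger could be evaluated. It is an illustration only, not our control-plane code: the class name AutoMigrationRule, its parameters, and the example values (3 alerts, 1 hour) are all hypothetical, and it assumes alerts arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class AutoMigrationRule:
    """Hypothetical sliding-window rule: fire after N critical alerts within a window."""

    def __init__(self, threshold: int, window: timedelta, enabled: bool = False):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold  # the "N" in "N critical alerts"
        self.window = window        # how far back alerts still count
        self.enabled = enabled      # off unless explicitly opted in
        self._alerts = deque()      # timestamps of recent critical alerts

    def record_critical_alert(self, at: datetime) -> bool:
        """Record one critical alert; return True if migration should trigger.

        Assumes alerts are recorded in chronological order.
        """
        self._alerts.append(at)
        # Drop alerts that have aged out of the window ending at `at`.
        cutoff = at - self.window
        while self._alerts and self._alerts[0] < cutoff:
            self._alerts.popleft()
        return self.enabled and len(self._alerts) >= self.threshold
```

With threshold=3, window=timedelta(hours=1), and enabled=True, three critical alerts ten minutes apart trigger on the third. Two alerts two hours apart never trigger a threshold of 2, because the first has aged out by the time the second arrives. With enabled left at its default of False, the rule never triggers, whatever the alert count.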