Wednesday, March 22 • 5:00pm - 5:50pm
Predicting Storage Failures with Machine Learning - Ahmed El-Shimi, Minima

Disk drives fail at an average annual rate of ~2%. Any system with Availability and Durability requirements must mitigate such failures through a redundancy technique such as RAID, Erasure Coding, Replication or Backup.

With the wealth of monitoring data available nowadays and the ability to process the data in near real-time, can we predict such failures? How well can we do it? And how would that impact how we design and operate large distributed systems?

We examine and motivate predictive failure detection in the context of the Availability, Rebuild Times and Recovery Objectives of large systems. We then train and evaluate multiple models, achieving accuracy (97.5%) that compares favorably with common datacenter practices. We demonstrate how our learners can be tuned to meet different Precision and Recall objectives, thus improving Availability, Protection or Operational Efficiency.
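As a rough illustration of the Precision and Recall trade-off mentioned above, the sketch below sweeps a decision threshold over failure scores. It is not the talk's method: the function name `precision_recall`, the scores and the labels are all made up for illustration, and a real model would produce its scores from drive monitoring data.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when drives scoring >= threshold are flagged as failing."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)  # correctly flagged
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)  # false alarms
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)   # missed failures
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical failure scores from some model, and whether each drive
# actually failed (1) or stayed healthy (0). Illustrative data only.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.9, 0.5, 0.2):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, a high threshold flags few drives and raises few false alarms (favoring Operational Efficiency), while a low threshold catches every failing drive at the cost of more unnecessary replacements (favoring Protection).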

Ahmed El-Shimi

Founder, Minima
Ahmed El-Shimi has worked in Storage, Distributed Systems, and Cloud for over 15 years. He built technologies such as Deduplication, Automated Tiering, Hybrid Cloud Storage and Data Awareness. He is currently Co-Founder of Minima Inc., a Cloud Data Governance Startup. Previously, he led Product for Microsoft's StorSimple Appliance and worked at Microsoft Research and on products such as Microsoft Azure and Windows Server. Ahmed has spoken at LinuxCon...

Wednesday March 22, 2017 5:00pm - 5:50pm
Paul Revere C

