no code implementations • 27 Feb 2024 • JunJie Huang, Jinyang Liu, Zhuangbin Chen, Zhihan Jiang, Yichen Li, Jiazhen Gu, Cong Feng, Zengyin Yang, Yongqiang Yang, Michael R. Lyu
To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.
1 code implementation • 10 Jan 2024 • Jinyang Liu, Wenwei Gu, Zhuangbin Chen, Yichen Li, Yuxin Su, Michael R. Lyu
These methods are evaluated with five multivariate KPI datasets that are publicly available.
no code implementations • 19 Aug 2023 • Jinyang Liu, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Cong Feng, Zengyin Yang, Michael R. Lyu
As modern software systems continue to grow in terms of complexity and volume, anomaly detection on multivariate monitoring metrics, which profile systems' health status, becomes more and more critical and challenging.
1 code implementation • 20 Jul 2023 • Wenwei Gu, Jinyang Liu, Zhuangbin Chen, Jianping Zhang, Yuxin Su, Jiazhen Gu, Cong Feng, Zengyin Yang, Michael Lyu
Performance issues permeate large-scale cloud service systems, which can lead to huge revenue losses.
no code implementations • 8 Jun 2023 • Jinyang Liu, JunJie Huang, Yintong Huo, Zhihan Jiang, Jiazhen Gu, Zhuangbin Chen, Cong Feng, Minzhi Yan, Michael R. Lyu
System logs play a critical role in maintaining the reliability of software systems.
2 code implementations • 14 Feb 2023 • Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, Yongqiang Yang, Michael R. Lyu
Our study demonstrates that logs and metrics can manifest system anomalies collaboratively and complementarily, and that neither alone is sufficient.
1 code implementation • 27 Aug 2021 • Zhuangbin Chen, Jinyang Liu, Yuxin Su, Hongyu Zhang, Xuemin Wen, Xiao Ling, Yongqiang Yang, Michael R. Lyu
The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud.
1 code implementation • 13 Jul 2021 • Zhuangbin Chen, Jinyang Liu, Wenwei Gu, Yuxin Su, Michael R. Lyu
To better understand the characteristics of different anomaly detectors, in this paper, we provide a comprehensive review and evaluation of five popular neural networks used by six state-of-the-art methods.
no code implementations • CUHK Course IERG5350 2020 • Zhuangbin Chen
For large-scale systems (e.g., cloud computing systems) with billions of lines of code, the majority of the maintenance effort is code management.