How can we make cloud services more reliable?

Over the last decade, cloud computing technologies have become widely adopted across many different sectors, enabling cloud architects and software developers to provide advanced services at a previously unattainable level.

Built on the concepts of virtual machines and orchestration, Infrastructure as a Service (IaaS) based cloud systems facilitate the automated provisioning and maintenance of virtual research infrastructures. Once they fulfill a set of functional and non-functional requirements, infrastructures built with best practices in mind can be defined and shared as reference architectures.

Reference architectures act as proven blueprints of complex systems consisting of a mesh of software and services laid on top of virtual machines. To fulfill this role, the laborious task of testing and debugging these reference architectures is essential – especially in the various fields of the health sector, including the iToBoS project. The complex process of creating, managing, and terminating even a single virtual machine can give rise to a variety of erroneous states. In practice, reference architectures often consist of a dynamically scaling network of interdependent virtual machines, which increases the difficulty of debugging these architectural blueprints exponentially.

In order to tackle the outlined challenges, the Institute for Computer Science and Control (SZTAKI), in co-operation with Óbuda University, began its research by employing a macrostep-based debugging methodology that enables tracking and active control over the deployment of virtual machines. Facilitated by a cloud orchestrator (Occopus [1]) and a macrostep controller, breakpoints strategically placed along the initialization process allowed us to control the deployment of complex virtual infrastructures macrostep by macrostep. This methodology helps us observe, reproduce, and control the various error situations and inconsistencies caused by the non-deterministic factors of cloud environments, such as abnormal process runtimes or unpredictable communication delays. By expanding the system with a graph database capable of handling large state spaces (Neo4j [2]), it became possible to inspect and analyze the different execution paths of orchestration. While this solution enables the thorough examination of the execution paths of virtual infrastructure deployment and maintenance, the sheer amount of time required to analyze even a single path makes performing the process manually unfeasible. We plan to confront this issue by increasing the level of automation (and, later, reliability) through deep learning methods.
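A minimal sketch of the macrostep idea in Python: the deployment pauses at a breakpoint after every macrostep and snapshots the global state, so each execution path can later be replayed and compared. All names here (`MacrostepController`, the step functions) are illustrative assumptions, not the actual Occopus or controller API.

```python
# Illustrative macrostep controller: run one deployment step at a time and
# snapshot the infrastructure state at each breakpoint. Hypothetical names,
# not the real Occopus interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class MacrostepController:
    """Pauses at a breakpoint after every macrostep and records the global
    state, so that execution paths can be replayed and compared."""
    history: List[Tuple[str, Dict[str, str]]] = field(default_factory=list)

    def run(self, steps: List[Tuple[str, Callable[[dict], dict]]], state: dict) -> dict:
        for name, step in steps:
            state = step(state)                       # one macrostep (e.g. boot a VM)
            self.history.append((name, dict(state)))  # breakpoint: snapshot the state
        return state

# Illustrative macrosteps for a two-VM infrastructure
steps = [
    ("boot-db",  lambda s: {**s, "db": "running"}),
    ("boot-web", lambda s: {**s, "web": "running"}),
]
ctl = MacrostepController()
final = ctl.run(steps, {})
```

In the real system, the snapshots collected at the breakpoints become nodes of the execution-path graph stored in Neo4j.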

Moving towards the involvement of neural networks, our research continued in collaboration with colleagues from the Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics. As a starting point for applying deep learning to the debugging of cloud orchestration, we had to train neural networks to make sense of execution trees and draw conclusions from the data. Our goal for the trained model is the ability to determine which branches of execution might be heading towards an error, based on historical data of labeled erroneous states. To supply the training process, we prepared training data through modeling and simulation. We chose a bottom-up approach and began experimenting by modeling two basic but highly relevant building blocks of virtual infrastructures: buffering and load balancing. We employed Continuous-Time Markov Chain (CTMC) [3] modeling techniques and a probabilistic model checker called PRISM [4] for the automated generation, augmentation, and labeling of training data.
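To illustrate how a CTMC model of buffering can yield labeled training data, here is a hedged sketch of a Gillespie-style simulation of a finite buffer (an M/M/1 queue with bounded capacity). In the actual work PRISM generated and labeled the data; the rates and capacity below are assumptions chosen purely for illustration.

```python
# Sketch: generate one labeled trace from a CTMC model of a finite buffer.
# Arrivals occur at rate lam, services at rate mu; a trace is labeled
# "erroneous" if an arrival ever finds the buffer full (overflow).
import random

def simulate_buffer(lam=5.0, mu=3.0, capacity=4, t_end=50.0, seed=0):
    rng = random.Random(seed)
    t, n, overflow = 0.0, 0, False
    trace = [(0.0, 0)]                         # (time, queue length) samples
    while t < t_end:
        rate = lam + (mu if n > 0 else 0.0)    # total exit rate of this state
        t += rng.expovariate(rate)             # exponential holding time
        if rng.random() < lam / rate:          # next event is an arrival
            if n == capacity:
                overflow = True                # erroneous state reached
            else:
                n += 1
        else:                                  # next event is a service completion
            n -= 1
        trace.append((t, n))
    return trace, overflow

trace, label = simulate_buffer()   # overloaded queue (lam > mu): overflow expected
```

Repeating this with varied seeds and rates yields a corpus of traces with ground-truth error labels, which is the kind of data the neural networks are trained on.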

The ongoing work of fitting neural networks to the generated data began with two main objectives: detecting erroneous states and steering the debugging session towards them within the execution tree. For error detection, we chose to apply an autoencoder [5], a type of feedforward neural network capable of learning an efficient coding of unlabeled data. The autoencoder produced very promising results in the detection of erroneous states, and kept performing well even on training sets produced by more complex models. For steering towards erroneous states, we are looking to apply Long Short-Term Memory (LSTM) [6] networks, a type of recurrent neural network with feedback connections.
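To sketch how reconstruction error can flag erroneous states: a linear autoencoder's optimum coincides with a PCA projection, so the dependency-light example below substitutes an SVD-based projection for a trained network. The two-dimensional state vectors are synthetic stand-ins for real orchestration states, not data from the project.

```python
# Anomaly detection via reconstruction error, using the PCA solution of a
# linear autoencoder instead of gradient training. Synthetic data only.
import numpy as np

def fit_linear_autoencoder(X, k=1):
    """Return a reconstruction map through the top-k principal subspace of X
    (the optimum a linear autoencoder with a k-unit bottleneck converges to)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:k]                                   # k x d bottleneck projection
    def reconstruct(x):
        return mean + (x - mean) @ W.T @ W       # encode, then decode
    return reconstruct

def reconstruction_error(x, reconstruct):
    return float(np.linalg.norm(x - reconstruct(x)))

# "Normal" states lie along the line y = 2x; an erroneous state falls off it.
rng = np.random.default_rng(42)
t = rng.uniform(0, 1, size=(200, 1))
normal = np.hstack([t, 2 * t])
recon = fit_linear_autoencoder(normal, k=1)

ok_err  = reconstruction_error(np.array([0.5, 1.0]), recon)   # on the line
bad_err = reconstruction_error(np.array([0.5, -1.0]), recon)  # off the line
```

Thresholding the reconstruction error then separates the two: states the model reconstructs poorly are the candidates for erroneous branches.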

The current results mark a milestone in our ongoing research towards automated, deep-learning-powered debugging of cloud orchestration. Looking to the future, our aim is to model larger, more complex, and more realistic system architectures, which will serve as a training ground for the applied neural networks. The outcome of this research could be revolutionary for the debugging, testing, and maintenance of virtual research infrastructures such as the iToBoS Cloud, thanks to the increased test coverage and reliability.

The presented ongoing work is partly supported by the ÚNKP-21-5 New National Excellence Program of the Ministry for Innovation and Technology (Hungary) from the source of the National Research, Development and Innovation Fund.

[1] Kovács, J., Kacsuk, P. Occopus: a Multi-Cloud Orchestrator to Deploy and Manage Complex Scientific Infrastructures. J Grid Computing 16, 19–37 (2018).

[2] Neo4j, Neo4j Graph Data Platform: Blazing-Fast Graph, Petabyte Scale.

[3] Aziz, A., Sanwal, K., Singhal, V., and Brayton, R. 2000. Model-checking continuous-time Markov chains. ACM Trans. Comput. Logic 1, 1 (July 2000), 162–170.

[4] Kwiatkowska, M., Norman, G., Parker, D. (2011). PRISM 4.0: Verification of Probabilistic Real-Time Systems. In: Gopalakrishnan, G., Qadeer, S. (eds) Computer Aided Verification. CAV 2011. Lecture Notes in Computer Science, vol 6806. Springer, Berlin, Heidelberg.

[5] Liou, C.-Y., Cheng, W.-C., Liou, J.-W., & Liou, D.-R. (2014). Autoencoder for words. Neurocomputing, 139, 84–96.

[6] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.