The emergence of Infrastructure as a Service framework brings new opportunities, which also accompanies with new challenges in auto scaling, resource allocation, and security. A fundamental challenge underpinning these problems is the continuous tracking and monitoring of resource usage in the system. In this paper, we present ATOM, an efficient and effective framework to automatically track, monitor, and orchestrate resource usage in an Infrastructure as a Service (IaaS) system that is widely used in cloud infrastructure. We use novel tracking method to continuously track important system usage metrics with low overhead, and develop a Principal Component Analysis (PCA) based approach to continuously monitor and automatically find anomalies based on the approximated tracking results. We show how to dynamically set the tracking threshold based on the detection results, and further, how to adjust tracking algorithm to ensure its optimality under dynamic workloads. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on VMs guided by analyzed results from tracking and monitoring to identify malicious behavior inside a VM. We demonstrate the extensibility of ATOM through virtual machine (VM) clustering. The performance of our framework is evaluated in an open source IaaS system
Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at various critical points to help debug system failures and perform root cause analysis. Such log data are universally available in nearly all computer systems. Therefore, log data is an important and valuable data source for understanding system status and performance issues, which means various system logs are naturally excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution, and detect anomalies when log patterns deviate from the model trained from log data under normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that once an anomaly is detected, users can diagnose the detected anomaly and perform root cause analysis effectively. Extensive experimental evaluations over large log data have shown that DeepLog has outperformed other existing log-based anomaly detection methods based on traditional data mining methodologies.
System event logs contain critical information for diagnosis and monitoring purposes with the growing complexity of modern computer systems. They have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs to structured data, and then apply data mining and machine learning techniques and/or build workflow models from the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. As a result, a streaming method to parse unstructured logs is needed. We propose an online streaming method Spell, which utilizes a longest common subsequence based approach, to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. Enhancement to find more accurate message types is also proposed. We compare Spell against two popular offline batched methods to extract patterns from system event logs on large real data. The results demonstrate that, even compared with the offline alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.
We present ATOM, an efficient and effective framework to enable automated tracking, monitoring, and orchestration of resource usage in an Infrastructure as a Service (IaaS) system. We design a novel tracking method to continuously track important performance metrics with low overhead, and develop a Principal Component Analysis (PCA) approach with quality guarantees to continuously monitor and automatically find anomalies based on the approximate tracking results. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on virtual machines (VMs) to identify malicious behavior inside a VM. We deploy ATOM in an IaaS system to monitor VM resource usage, and to detect anomalies. Various attacks are used as an example to demonstrate how ATOM is both effective and efficient to track and monitor resource usage, detect anomalies, and orchestrate system resource usage.