SaTC: CORE: Medium: Large-Scale Data Driven Anomaly Detection and Diagnosis from System Logs


Detecting unusual and anomalous behavior in computer systems is a critical part of ensuring they are secure and trustworthy. System logs, which record actions taken by programs, are a promising source of data for such anomaly detection. However, existing practices and tools for doing log analysis require deep expertise, as well as heavy human involvement in both defining and interpreting possible anomalies, which limits their scalability and effectiveness. This project's goal is to improve the state of the art around log-based anomaly detection by developing a framework called DeepLog through (a) advancing natural language processing techniques to extract structured information from a wide variety of log files to support analysis across different data sources and across time, (b) developing new methods to model legitimate workflows and log event sequences over time, (c) adapting machine learning methods to identify deviations from those workflows that represent potential anomalies, and (d) creating tools for system administrators to help them diagnose possible security issues more effectively and efficiently. The work will be integrated into a freely available software package to benefit both other researchers and practicing system administrators and used to support both classroom and research-based educational activities at the investigators' institutions.

Toward log parsing, the team will adapt named entity recognition methods to parse unstructured logs as well as structured logs where the structure is not pre-defined by, e.g., regular expressions, into structured key-value pairs of log event types and parameters. This data can be seen as a multi-dimensional feature space whose contents are constrained by the execution of the underlying programs and thus reflects a hidden structure that defines the set of valid, non-anomalous execution sequences. To help articulate this hidden structure, the team will develop long short-term memory (LSTM)-based neural network models that use both the key and value elements to extract semantically meaningful subsequences of program behavior from data extracted from system runs known to be normal. Once these models are developed using known-good training data, they can be applied to anomaly detection by flagging for consideration new log entries that are unexpected given the current state of the system, logs, and model.
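
The pipeline described above (parse log lines into event types and parameters, learn which events follow which in normal runs, then flag unexpected entries) can be illustrated with a small Python sketch. It is not the project's implementation: the digit-based parameter heuristic stands in for the named entity recognition parser, and a simple count-based next-event table stands in for the LSTM model. The names parse_line, NextEventModel, window and top_k are hypothetical and chosen only for this example.

```python
import re
from collections import Counter, defaultdict

# Assumption for illustration: any token containing a digit is a parameter.
HAS_DIGIT = re.compile(r"\d")


def parse_line(line):
    """Split a raw log line into (event_type, parameters)."""
    tokens = line.split()
    params = [t for t in tokens if HAS_DIGIT.search(t)]
    event_type = " ".join("<*>" if HAS_DIGIT.search(t) else t for t in tokens)
    return event_type, params


class NextEventModel:
    """Count-based stand-in for a sequence model (NOT an LSTM)."""

    def __init__(self, window=2, top_k=1):
        self.window = window  # number of preceding events used as context
        self.top_k = top_k    # how many likely next events count as normal
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        """Record which event follows each context in known-good runs."""
        for seq in sequences:
            for i in range(self.window, len(seq)):
                context = tuple(seq[i - self.window:i])
                self.counts[context][seq[i]] += 1

    def is_anomalous(self, context, event):
        seen = self.counts.get(tuple(context))
        if not seen:
            return True  # context never observed in normal runs
        likely = [e for e, _ in seen.most_common(self.top_k)]
        return event not in likely

    def flag(self, seq):
        """Return the positions in seq whose event is unexpected."""
        return [
            i
            for i in range(self.window, len(seq))
            if self.is_anomalous(seq[i - self.window:i], seq[i])
        ]


# Train on a made-up normal run, then check a run with a repeated read.
normal_lines = ["open file f1", "read 128 bytes", "close file f1"] * 2
model = NextEventModel(window=2, top_k=1)
model.fit([[parse_line(line)[0] for line in normal_lines]])

test_lines = ["open file f7", "read 64 bytes", "read 64 bytes", "close file f7"]
flagged = model.flag([parse_line(line)[0] for line in test_lines])
print(flagged)  # [2, 3]
```

In this toy run the second read (position 2) is flagged because only a close was ever seen after open and read, and the close (position 3) is flagged because the context of two consecutive reads never occurred in training. A learned sequence model would replace the lookup table but keep the same flagging rule.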

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

This material is based upon work supported by the National Science Foundation under Grant No. 1801446.


Funding


  • NSF SaTC Award 1801446 Official Link

People


    Feifei Li
    Professor


    Yuwei Wang
    Master's student.


    Min Du
    PhD student. Research Interest: machine learning techniques for system mining.


    Guineng Zheng
    PhD student. Research Interest: knowledge base construction and application.



Publications


  • Spell: Streaming Parsing of System Event Logs, Talk
    By Min Du,    Feifei Li
    In Proceedings of 16th IEEE International Conference on Data Mining (ICDM 2016),  pages 859-864,  Barcelona, Spain,  December,  2016.
    Abstract

    System event logs contain critical information for diagnosis and monitoring purposes with the growing complexity of modern computer systems. They have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs to structured data, and then apply data mining and machine learning techniques and/or build workflow models from the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. As a result, a streaming method to parse unstructured logs is needed. We propose an online streaming method Spell, which utilizes a longest common subsequence based approach, to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. Enhancement to find more accurate message types is also proposed. We compare Spell against two popular offline batched methods to extract patterns from system event logs on large real data. The results demonstrate that, even compared with the offline alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.

  • DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
    By Min Du,    Feifei Li,    Guineng Zheng,    Vivek Srikumar
    In Proceedings of 24th ACM Conference on Computer and Communications Security (CCS 2017),  pages 1285-1298,  November,  2017.
    Abstract

    Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at various critical points to help debug system failures and perform root cause analysis. Such log data are universally available in nearly all computer systems. Therefore, log data is an important and valuable data source for understanding system status and performance issues, which means various system logs are naturally an excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution, and detect anomalies when log patterns deviate from the model trained from log data under normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that once an anomaly is detected, users can diagnose the detected anomaly and perform root cause analysis effectively. Extensive experimental evaluations over large log data have shown that DeepLog has outperformed other existing log-based anomaly detection methods based on traditional data mining methodologies.

  • OpenTag: Open Attribute Value Extraction from Product Profiles
    By Guineng Zheng,    Subhabrata Mukherjee,    Xin Luna Dong,    Feifei Li
    In Proceedings of 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2018),  pages 1049-1058,  August,  2018.
    Abstract

    Extraction of missing attribute values is to find values describing an attribute of interest from a free text input. Most past related work on extraction of missing attribute values works with a closed world assumption with the possible set of values known beforehand, or uses dictionaries of values and hand-crafted features. How can we discover new attribute values that we have never seen before? Can we do this with limited human annotation or supervision? We study this problem in the context of product catalogs that often have missing values for many attributes of interest.

    We leverage product profile information such as titles and descriptions to discover missing values of product attributes. We develop a novel deep tagging model OpenTag for this extraction problem with the following contributions: (1) we formalize the problem as a sequence tagging task, and propose a joint model exploiting recurrent neural networks (specifically, bidirectional LSTM) to capture context and semantics, and Conditional Random Fields (CRF) to enforce tagging consistency; (2) we develop a novel attention mechanism to provide interpretable explanation for our model's decisions; (3) we propose a novel sampling strategy exploring active learning to reduce the burden of human annotation. OpenTag does not use any dictionary or hand-crafted features as in prior works. Extensive experiments in real-life datasets show that OpenTag with our active learning strategy discovers new attribute values from as few as 150 annotated samples (reduction in 3.3x amount of annotation effort) with a high f-score of 83%, outperforming state-of-the-art models.

  • Pcard: Personalized Restaurants Recommendation from Card Payment Transaction Records
    By Min Du,    Robert Christensen,    Wei Zhang,    Feifei Li
    In Proceedings of The 15th Web Conference (WWW),  pages 2687-2693,  May,  2019.
    Abstract

    Personalized Point of Interest (POI) recommendation that incorporates users' personal preferences is an important subject of research. However, challenges exist such as dealing with sparse rating data and spatial location factors. As one of the biggest card payment organizations in the United States, our company holds abundant card payment transaction records with numerous features. In this paper, using restaurant recommendation as a demonstrating example, we present a personalized POI recommendation system (Pcard) that learns user preferences based on user transaction history and restaurants' locations. With a novel embedding approach that captures user embeddings and restaurant embeddings, we model pairwise restaurant preferences with respect to each user based on their locations and dining histories. Finally, a ranking list of restaurants within a spatial region is presented to the user. The evaluation results show that the proposed approach is able to achieve high accuracy and present effective recommendations.

  • Spell: Online Streaming Parsing of Large Unstructured System Logs
    By Min Du,    Feifei Li
    Vol.0, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE),  March,  2019.
    Abstract

    System event logs have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs to structured data, and then apply data mining and machine learning techniques and/or build workflow models from the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing. As a result, a streaming method to parse unstructured logs is needed. We propose an online streaming method Spell, which utilizes a longest common subsequence based approach, to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in streaming fashion. Enhancement to find more accurate message types is also proposed. We also propose and evaluate a method to automatically discover semantic meanings for parameter fields identified by Spell. We compare Spell against state-of-the-art methods to extract patterns from system event logs on large real data. The results demonstrate that, compared with other log parsing alternatives, Spell shows its superiority in terms of both efficiency and effectiveness.
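
The longest-common-subsequence idea behind the Spell abstracts above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published algorithm: the StreamingParser class, the merge helper, the "<*>" wildcard token and the 0.5 matching threshold are all choices made for this example, and the sample log lines are made up. Each incoming line is compared with the known message types; if it shares enough tokens, in order, with one of them, the two are merged and the differing positions become parameter wildcards. Otherwise the line starts a new message type.

```python
WILDCARD = "<*>"


def lcs(a, b):
    """Return one longest common subsequence of token lists a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def merge(common, tokens):
    """Keep tokens found in `common` (in order); collapse the rest to wildcards."""
    template, k = [], 0
    for tok in tokens:
        if k < len(common) and tok == common[k]:
            template.append(tok)
            k += 1
        elif not template or template[-1] != WILDCARD:
            template.append(WILDCARD)
    return template


class StreamingParser:
    """Maintains a growing set of message types, one log line at a time."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed fraction of tokens that must match
        self.templates = []         # each template is a list of tokens

    def add(self, line):
        """Assign the line to a message type and return that type's index."""
        tokens = line.split()
        best_idx, best_common = None, []
        for idx, template in enumerate(self.templates):
            constants = [t for t in template if t != WILDCARD]
            common = lcs(constants, tokens)
            if len(common) > len(best_common):
                best_idx, best_common = idx, common
        if best_idx is None or len(best_common) < self.threshold * len(tokens):
            self.templates.append(tokens)
            return len(self.templates) - 1
        self.templates[best_idx] = merge(best_common, tokens)
        return best_idx


parser = StreamingParser()
ids = [
    parser.add("Temperature 41 exceeds warning threshold"),
    parser.add("Temperature 43 exceeds warning threshold"),
    parser.add("Connection from 10.0.0.1 closed"),
]
print(ids)                             # [0, 0, 1]
print(" ".join(parser.templates[0]))   # Temperature <*> exceeds warning threshold
```

The sketch rebuilds a template from the newest line only, so a wildcard can be lost when a later line omits the parameter entirely. It also scans every known template for each line, which a production parser would avoid with an index.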