Detecting unusual and anomalous behavior in computer systems is a critical part of ensuring they are secure and trustworthy. System logs, which record actions taken by programs, are a promising source of data for such anomaly detection. However, existing practices and tools for log analysis require deep expertise, as well as heavy human involvement in both defining and interpreting possible anomalies, which limits their scalability and effectiveness. This project's goal is to improve the state of the art in log-based anomaly detection by developing a framework called DeepLog through (a) advancing natural language processing techniques to extract structured information from a wide variety of log files to support analysis across different data sources and across time, (b) developing new methods to model legitimate workflows and log event sequences over time, (c) adapting machine learning methods to identify deviations from those workflows that represent potential anomalies, and (d) creating tools for system administrators to help them diagnose possible security issues more effectively and efficiently. The work will be integrated into a freely available software package to benefit both other researchers and practicing system administrators, and it will be used to support both classroom and research-based educational activities at the investigators' institutions.
For log parsing, the team will adapt named entity recognition methods to convert unstructured logs, as well as structured logs whose structure is not pre-defined (e.g., by regular expressions), into structured key-value pairs of log event types and parameters. This data can be seen as a multi-dimensional feature space whose contents are constrained by the execution of the underlying programs and which thus reflects a hidden structure that defines the set of valid, non-anomalous execution sequences. To help articulate this hidden structure, the team will develop long short-term memory (LSTM) neural network models that use both the key and value elements to extract semantically meaningful subsequences of program behavior from data extracted from system runs known to be normal. Once these models are trained on known-good data, they can be applied to anomaly detection by flagging for consideration new log entries that are unexpected given the current state of the system, logs, and model.
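The parse-then-predict idea can be illustrated with a minimal Python sketch. It is not the DeepLog implementation and makes two loudly simplifying assumptions: a regular expression that masks numeric tokens stands in for the named entity recognition parser, and a frequency table over recent log keys stands in for the LSTM model. The names `parse`, `NextKeyModel`, `window`, and `top_k`, as well as the sample log lines, are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumption: numeric tokens are parameters; everything else is the log key.
# A real parser (e.g. NER-based, as the project proposes) would be far richer.
PARAM = re.compile(r"\b\d+(?:\.\d+)*\b")


def parse(line):
    """Split a raw log line into (log key, list of parameter values)."""
    params = PARAM.findall(line)
    key = PARAM.sub("<*>", line)
    return key, params


class NextKeyModel:
    """Frequency-table stand-in for a sequence model such as an LSTM.

    It records which log key followed each window of preceding keys in
    known-normal runs, then flags a new key if it is not among the top_k
    most frequent successors of the current window.
    """

    def __init__(self, window=2, top_k=1):
        self.window = window
        self.top_k = top_k
        self.counts = defaultdict(Counter)

    def fit(self, keys):
        # Count successors for every window of length `window`.
        for i in range(self.window, len(keys)):
            history = tuple(keys[i - self.window:i])
            self.counts[history][keys[i]] += 1

    def is_anomalous(self, history, key):
        seen = self.counts.get(tuple(history))
        if not seen:
            # A history never observed in training is itself unexpected.
            return True
        likely = [k for k, _ in seen.most_common(self.top_k)]
        return key not in likely


# Hypothetical known-normal log lines used as training data.
normal_lines = [
    "open file 12", "read 4096 bytes", "close file 12",
    "open file 13", "read 512 bytes", "close file 13",
]
normal_keys = [parse(line)[0] for line in normal_lines]

model = NextKeyModel(window=2, top_k=1)
model.fit(normal_keys)

history = ["open file <*>", "read <*> bytes"]
expected_ok = model.is_anomalous(history, parse("close file 99")[0])
unexpected = model.is_anomalous(history, parse("delete file 99")[0])
```

In this sketch, a `close` event after `open` and `read` matches the learned workflow, while a `delete` event in the same position is flagged. Only the keys are modeled here; the parameter values returned by `parse` are kept but not used, whereas the project describes models that use both keys and values.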
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This material is based upon work supported by the National Science Foundation under Grant No. 1801446.
PhD student. Research Interest: machine learning techniques for system mining.
PhD student. Research Interest: knowledge base construction and application.