SaTC: CORE: Medium: Large-Scale Data Driven Anomaly Detection and Diagnosis from System Logs: Detecting unusual and anomalous behavior in computer systems is a critical part of ensuring they are secure and trustworthy. System logs, which record actions taken by programs, are a promising source of data for such anomaly detection. However, existing practices and tools for doing log analysis require deep expertise, as well as heavy human involvement in both defining and interpreting possible anomalies, which limits their scalability and effectiveness. This project's goal is to improve the state of the art around log-based anomaly detection by developing a framework called DeepLog through (a) advancing natural language processing techniques to extract structured information from a wide variety of log files to support analysis across different data sources and across time, (b) developing new methods to model legitimate workflows and log event sequences over time, (c) adapting machine learning methods to identify deviations from those workflows that represent potential anomalies, and (d) creating tools for system administrators to help them diagnose possible security issues more effectively and efficiently. The work will be integrated into a freely available software package to benefit both other researchers and practicing system administrators and used to support both classroom and research-based educational activities at the investigators' institutions.
Toward log parsing, the team will adapt named entity recognition methods to parse unstructured logs, as well as structured logs whose structure is not pre-defined by, e.g., regular expressions, into structured key-value pairs of log event types and parameters. This data can be seen as a multi-dimensional feature space whose contents are constrained by the execution of the underlying programs and thus reflect a hidden structure that defines the set of valid, non-anomalous execution sequences. To help articulate this hidden structure, the team will develop long short-term memory (LSTM) neural network models that use both the key and value elements to extract semantically meaningful subsequences of program behavior from data extracted from system runs known to be normal. Once these models are trained on known-good data, they can be applied to anomaly detection by flagging for consideration new log entries that are unexpected given the current state of the system, logs, and model.
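The detection idea described above (learn which log event normally follows a given history, then flag events the model did not expect) can be illustrated with a minimal sketch. To stay self-contained it swaps the LSTM for a plain frequency table over fixed-length windows of log keys; this is a simplifying assumption, not the DeepLog model. The function names, the `window` and `top_g` parameters, and the sample log keys are all hypothetical.

```python
from collections import Counter, defaultdict

def train_next_key_model(normal_runs, window=2):
    """Count which log key follows each window of keys in known-good runs."""
    model = defaultdict(Counter)
    for run in normal_runs:
        for i in range(len(run) - window):
            context = tuple(run[i:i + window])
            model[context][run[i + window]] += 1
    return model

def find_unexpected_events(model, run, window=2, top_g=1):
    """Return positions whose log key is not among the top_g predicted keys."""
    flagged = []
    for i in range(len(run) - window):
        context = tuple(run[i:i + window])
        # An unseen context yields no predictions, so the next key is flagged.
        ranked = model.get(context, Counter()).most_common(top_g)
        predicted = [key for key, _ in ranked]
        if run[i + window] not in predicted:
            flagged.append(i + window)
    return flagged

# Hypothetical log keys; real keys would come from the log parsing step.
normal_runs = [["open", "read", "write", "close"]] * 3
model = train_next_key_model(normal_runs)
print(find_unexpected_events(model, ["open", "read", "delete", "close"]))  # prints [2, 3]
```

Position 2 is flagged because "delete" never followed "open, read" in training, and position 3 is flagged because the history "read, delete" was never seen at all. A sequence model such as an LSTM plays the same role as the frequency table but can generalize across longer and noisier histories.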
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
This material is based upon work supported by the National Science Foundation under Grant No. 1801446.
III: Small: Persistent Data Summaries: Temporal Analytics on Big Data Histories: Project and Award Information: NSF IIS 1816149, 08/01/18 - 08/31/21, $499,934. PI: Feifei Li, co-PI: Jeff Phillips.
Abstract: An increasing number of applications require the storage of and access to all historical data to support rich analytics, learning, and mining operations. This project develops a series of methods to summarize data so that it can be queried with respect to not just the full data set, as is standard, but with respect to the state of the data set at any historical time. These summaries integrate with large temporal databases, in both offline batched-processing and online streaming application scenarios. The effectiveness of these methods will be demonstrated on an enormous scientific database of atmospheric data collected for 20 years from over 40,000 weather stations. We will work with industry collaborators to help deploy our new algorithms, and the results will be integrated into education and outreach efforts surrounding the growth of data science initiatives.
More specifically, this project extends and combines approximate query processing with temporal big data. In particular, instead of (or on top of) using a multi-version database, this project designs and implements persistent data summaries (PDSs) that offer interactive temporal analytics with strong theoretical guarantees on their approximation quality. In addition to formalizing these models, this project develops practical PDS implementations for sampling-based summaries, data sketches, and core sets that support advanced analytical queries.
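As a rough illustration of what a sampling-based persistent summary might look like, the sketch below keeps a uniform random sample of size k that can be queried as of any past time. It is one possible construction under stated assumptions, not the project's implementation: each item gets a random priority, the sample at time t is the k lowest-priority items that arrived by t, and every retained item records when it was displaced. The class name, method names, and the use of strictly increasing integer timestamps are hypothetical.

```python
import heapq
import random

class PersistentSample:
    """Size-k uniform sample that can be queried at any historical time."""

    def __init__(self, k, seed=0):
        self.k = k
        self.rng = random.Random(seed)
        self.heap = []     # max-heap via negated priority: (-priority, record index)
        self.records = []  # [item, arrival_time, expiry_time]

    def insert(self, item, t):
        """Process an arrival; timestamps are assumed strictly increasing."""
        priority = self.rng.random()
        if len(self.heap) < self.k:
            self.records.append([item, t, float("inf")])
            heapq.heappush(self.heap, (-priority, len(self.records) - 1))
        elif priority < -self.heap[0][0]:
            # The new item displaces the current highest-priority sample member.
            new_index = len(self.records)
            _, evicted = heapq.heapreplace(self.heap, (-priority, new_index))
            self.records[evicted][2] = t  # the evicted item stops being live at t
            self.records.append([item, t, float("inf")])
        # Otherwise the item never enters the sample and is not stored.

    def sample_at(self, t):
        """Return the sample as it stood right after time t."""
        return [item for item, arrival, expiry in self.records
                if arrival <= t < expiry]

ps = PersistentSample(k=3)
for t in range(100):
    ps.insert(f"reading-{t}", t)
print(ps.sample_at(10))  # three readings drawn from the first 11 arrivals
```

The summary stores only items that were ever in the sample, which is far fewer than the full history, yet `sample_at` answers for any past time without replaying the stream.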
This material is based upon work supported by the National Science Foundation under Grant No. 1816149.
SaTC: CORE: Small: Efficient Hardware-Aware and Hardware-Enabled Algorithms for Secure In-Memory Databases: Many big-data workloads are hosted on the cloud to facilitate sharing and low cost. Many of these workloads deal with sensitive data, e.g., electronic health records. Cloud infrastructures are vulnerable to attacks from untrusted operators with physical access to the computers. These attacks can take many forms and may compromise privacy or integrity. To guarantee the security of sensitive data and ensure the cost-effectiveness of shared cyberinfrastructure, it is vital that these attacks be thwarted while not incurring a significant penalty in terms of performance, energy, or cost. Currently, the above attacks are thwarted with oblivious RAM (ORAM) and integrity trees. These solutions incur significant bandwidth and capacity overheads and can degrade performance by 1-2 orders of magnitude. These solutions also show different behaviors on different workloads, i.e., one cannot ignore system, hardware, and workload effects when evaluating algorithms for security. To address these problems, the PIs are designing novel secure hardware, and leveraging commodity secure hardware to design novel secure systems. The PIs are developing new secure algorithms that require hardware/software co-design because their behavior is a function of workload locality properties and hardware constraints, e.g., a distributed implementation of ORAM enabled by custom memory modules. These new algorithms are being integrated into scalable in-memory database systems that run on a cluster of state-of-the-art secure nodes. The project is thus attempting to secure cyberinfrastructure for sensitive applications, moving well-known approaches (ORAM and integrity trees) into the practical realm. The PIs are also continuing their outreach and education efforts.
III: Small: Towards a Database Engine for Interactive and Online Sampling and Analytics: Existing databases and data management systems are designed and optimized for executing queries and analytical jobs in their entirety. User interactions with such systems are limited to binary decisions: either wait for exact answers in the end, or terminate a job while it is still running and obtain little or even zero knowledge regarding the final output. This model no longer works well with big data, as waiting for the exact query or analytical results may take a long time. The good news is that often users do not need exact results in big data computation
Simba: Spatial In-Memory Big data Analytics: Simba is a distributed in-memory spatial analytics engine based on Apache Spark. It extends the Spark SQL engine across the system stack to support rich spatial queries and analytics through both SQL and DataFrame query interfaces. In addition, Simba introduces native indexing support over RDDs in order to develop efficient spatial operators. It also extends Spark SQL's query optimizer with spatial-aware and cost-based optimizations to make the best use of existing indexes and statistics.
Medium: Collaborative: Seal: Secure Engine for AnaLytics - From Secure Similarity Search to Secure Data Analytics: Many organizations and individuals rely on the cloud to store their data and process their analytical queries. But such data may contain sensitive information. Not only do users want to conceal their data on a cloud, they may also want to hide analytical queries over their data, results of such queries, and data access patterns from a cloud service provider (that may be compromised either from within or by a third party).
CIF21 DIBBs: STORM: Spatio-Temporal Online Reasoning and Management of Large Data: A fundamental challenge for many research projects is the ability to handle large quantities of heterogeneous data. Data collected from different sources and time periods can be inconsistent, or stored in different formats and data management systems. Thus, a critical step in many projects is to develop a customized query and analytical engine to translate inputs. But for each new dataset, or for each new query type or analytic task for an existing dataset, a new query interface or program must be developed, requiring significant investments of time and effort. This project will develop an automatic engine for searching large, heterogeneous data collections for weather and meteorology, particularly from instruments in the western US, in a regional network called MesoWest. This project develops an automatic query and analytical engine for large, heterogeneous spatial and temporal data. This capability allows users to automatically deploy a query and analytical engine instance over their large, heterogeneous data with spatial and temporal dimensions. The system supports a simple search-box and map-like query interface that allows numerous powerful analytical queries. Techniques to make these queries robust, relevant, and highly scalable will be developed. The project also enables users to execute queries over multiple data sources simultaneously and seamlessly. The goal of the work is to dramatically simplify the management and analysis of large spatio-temporal data at different institutions, groups, and corporations.
BIGDATA: Small: DCM: DA: Building a Mergeable and Interactive Distributed Data Layer for Big Data Summarization Systems: Big data today is stored in a distributed fashion across many different machines or data sources. This poses new algorithmic and system challenges to performing efficient analysis on the full data set. To address these difficulties, the PIs are building the MIDDLE (Mergeable and Interactive Distributed Data LayEr) Summarization System and deploying it on large real-world datasets. The MIDDLE system builds and maintains a special class of summaries that can be efficiently constructed and updated while still allowing fine-grained analysis on the heavy tail. Mergeable summaries can represent any data set with a guaranteed tradeoff between size and accuracy, and any two such summaries can be merged to create a new summary with the same size-accuracy tradeoff.
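The merge property described above can be made concrete with a small sketch. It uses the Misra-Gries frequent-items summary as an example of a mergeable summary; the source does not say which summaries MIDDLE implements, so this choice, the function names, and the sample streams are assumptions for illustration only. With at most k counters over n items, each reported count underestimates the true frequency by at most n/(k+1), and the same bound holds after merging.

```python
from collections import Counter

def mg_summary(stream, k):
    """Misra-Gries summary: at most k counters, each an underestimate."""
    counters = Counter()
    for item in stream:
        counters[item] += 1
        if len(counters) > k:
            # Too many counters: decrement every one and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def mg_merge(a, b, k):
    """Merge two summaries into one with the same size-accuracy tradeoff."""
    merged = a + b
    if len(merged) > k:
        # Subtract the (k+1)-th largest count and keep only positive counters.
        cutoff = sorted(merged.values(), reverse=True)[k]
        merged = Counter({x: c - cutoff for x, c in merged.items() if c > cutoff})
    return merged

# Two hypothetical data sources summarized independently, then merged.
site_1 = ["a"] * 60 + [f"x{i}" for i in range(40)]
site_2 = ["a"] * 50 + ["b"] * 30 + [f"y{i}" for i in range(20)]
merged = mg_merge(mg_summary(site_1, 4), mg_summary(site_2, 4), 4)
print(merged["a"])  # within 200 / 5 = 40 of the true count, 110
```

Because the merged result is itself a valid summary of the combined data, summaries can be built on each machine separately and combined in any order.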
TWC: Medium: TCloud: A Self-Defending, Self-Evolving and Self-Accounting Trustworthy Cloud Platform: The use of cloud computing has revolutionized the way in which cyber infrastructure is used and managed. The on-demand access to seemingly infinite resources provided by this paradigm has enabled technical innovation and indeed innovative business models and practices. This rosy picture is threatened, however, by increasing nefarious interest in cloud platforms. Specifically, the shared tenant, shared resource nature of cloud platforms, as well as the natural accrual of valuable information in cloud platforms, provide both the incentive and the possible means of exploitation. To address these concerns we are developing a self-defending, self-evolving, and self-accounting trustworthy cloud platform, the TCloud. Our approach in realizing TCloud holds to the following five tenets. First, defense-in-depth through innate containment, separation and diversification at the architectural level. Second, least authority by clear separation of functionality and associated privilege within the architecture. Third, explicit orchestration of security functions based on cloud-derived and external intelligence. Fourth, moving-target-defense through deception and dynamic evolution of the platform. Fifth, verifiable accountability through lightweight validation and auditable monitoring, record keeping and analysis. Our approach to fundamentally refactor the cloud architecture to explicitly enable security-related functionality lays the foundation for truly trustworthy cloud computing. Given the unrelenting push towards the use of cloud technologies, our work has broad applicability across industry, healthcare, government and academia. All software we develop will be released to the community in open source form.
CSR: Medium: Energy-Efficient Architectures for Emerging Big-Data Workloads: In modern server architectures, the processor socket and the memory system are implemented as separate modules. Data exchange between these modules is expensive -- it is slow, it consumes a large amount of energy, and there are long wait times for narrow data links. Emerging big-data workloads will require especially large amounts of data movement between the processor and memory. To reduce the cost of data movement for big-data workloads, the project attempts to design new server architectures that can leverage 3D stacking technology. The proposed approach, referred to as Near Data Computing (NDC), reduces the distance between a subset of computational units and a subset of memory, and can yield high efficiency for workloads that exhibit locality. The project will also develop new big-data algorithms and runtime systems that can exploit the properties of the
The project will lead to technologies that can boost performance and reduce the energy demands of big-data workloads. Several reports have cited the importance of these workloads to national, industrial, and scientific computing infrastructures. The project outcomes will be integrated into University of Utah curricula and will play a significant role in a new degree program on datacenter design and operation. The PIs will broaden their impact by publicly distributing parts of their software infrastructure and by engaging in outreach programs that involve minorities and K-12 students.
Scholarships for Service at Florida State University: Focused on Information Assurance education, the Scholarship For Service program gives students scholarship funds in exchange for service in the federal government for a period equivalent to the length of their scholarship, typically two years. As a result of the SFS program, federal agencies are able to select from a highly qualified pool of student applicants for internships and permanent positions, and students get a guaranteed job. The SFS program is offered by the National Science Foundation (NSF) and co-sponsored by the Department of Homeland Security (DHS).
COFRS Award: Efficient Aggregate Similarity Search: For many geospatial and multimedia database applications, a fundamental challenge is to answer similarity search on large amounts of data, also known as the nearest neighbor (NN) query. Despite extensive studies, little is known about aggregate similarity search, which has witnessed an increasing number of applications. Given a data set P, an aggregate similarity query is specified by a group of query objects Q, an aggregator σ, and a similarity function f involving at least the spatial distance between objects. The goal is to find an object p* ∈ P such that the aggregate similarity between p* and objects from Q is minimized. For example, when σ is sum over spatial distances, it is to find p* ∈ P that minimizes the sum of distances between p* and every object from Q. The aggregate similarity search presents interesting and challenging research problems, in terms of both query semantics and query processing techniques. We will also explore parallel and relational (those that can be directly implemented by standard SQL statements) methods in this project. Due to the fundamental importance of the similarity search and the emerging applications for various aggregate similarity search, our project will significantly improve the scientific work in geospatial intelligence analytics, including spatial intelligence, complex spatial data analysis, GIS, and location-based services. End-users may design customized aggregate similarity search to identify normal patterns and discover anomalies in complex geospatial and multimedia data.
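The query definition above can be written down directly as a brute-force baseline: evaluate the aggregate distance from every candidate in P to the query group Q and keep the minimizer. This is only a sketch of the query semantics, assuming Euclidean distance as the similarity function and Python's built-in `sum` and `max` as example aggregators; the function name and sample points are hypothetical, and the project's actual query processing techniques are not shown here.

```python
import math

def aggregate_nearest_neighbor(P, Q, aggregator=sum):
    """Return the object in P minimizing the aggregated distance to all of Q."""
    def aggregate_distance(p):
        return aggregator(math.dist(p, q) for q in Q)
    return min(P, key=aggregate_distance)

# Hypothetical 2D points: candidate locations P and a group of query objects Q.
P = [(0, 0), (5, 0), (10, 0)]
Q = [(0, 0), (0, 1), (10, 0)]

print(aggregate_nearest_neighbor(P, Q, aggregator=sum))  # prints (0, 0)
print(aggregate_nearest_neighbor(P, Q, aggregator=max))  # prints (5, 0)
```

The two calls show how the choice of aggregator changes the answer: sum favors the point closest to most of the group, while max favors the point whose farthest query object is nearest. The scan costs on the order of |P| times |Q| distance computations, which is the cost that index-based or approximate methods aim to avoid.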
Efficient Ranking and Aggregate Query Processing for Probabilistic Data: When dealing with massive quantities of data, ranking and aggregate queries are powerful techniques for focusing attention on the most important answers. Many applications that produce such massive quantities of data inherently introduce uncertainty at the same time, for example, probabilistic matches in data integration, imprecise measurements from sensors, fuzzy duplicates in data cleaning, and inconsistency in scientific data. Hence, the importance of these queries is even greater in probabilistic data, where a relation can encode exponentially many possible worlds. Uncertainty opens the gate to many possible definitions for ranking and aggregate queries. With the wide presence of probabilistic data, processing ranking and aggregate queries efficiently with the right semantics is of key importance for the successful deployment of probabilistic databases.
Towards Trustworthy Database Systems: Answers to database queries often form the basis for critical decision-making. To improve efficiency and reliability, answers to these queries can be provided by distributed servers close to the querying clients. However, because of the servers' ubiquity, the logistics associated with fully securing them may be prohibitive; moreover, when the servers are run by third parties, the clients may not trust them as much as they trust the original data owners. Thus, the authenticity of the answers provided by servers in response to clients' queries must be verifiable by the clients. More generally, database responses are more useful if they contain the evidence of their own correctness. For example, this enables a consumer to provide her own credit report to a creditor without having the creditor request it from the reporting agency to establish the validity of the report. This project is developing methods for verifying the validity and authenticity of answers to a variety of database queries, including general relational, data cube, and spatio-temporal queries. Furthermore, techniques that use powerful cryptographic primitives are being developed for providing authentication and confidentiality. This research will enable utilization of this infrastructure in applications where users must rely on the authenticity of the answer, such as in financial systems, network monitoring, traffic control, or applications yet to be imagined. The results of this project will be disseminated through publications in journals and conferences. Furthermore, source code of these methods, in the form of libraries, will be made available over the web. This is a collaborative project with the Database Lab at Boston University, with Prof. George Kollios and Prof. Leonid Reyzin.
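The idea of a response that carries evidence of its own correctness can be sketched with a Merkle hash tree, a standard building block for this kind of verification. The source does not state which authentication structures the project uses, so this is an illustrative assumption: the data owner publishes one trusted root digest, an untrusted server returns a record plus a short proof, and the client recomputes the root. All function names and the sample records are hypothetical.

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(records):
    """Return all tree levels: level 0 holds leaf hashes, the last holds the root."""
    level = [_h(b"\x00" + r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Server side: collect the sibling hashes on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(root, record, proof):
    """Client side: recompute the root from the record and the proof."""
    digest = _h(b"\x00" + record)
    for sibling, sibling_is_left in proof:
        pair = sibling + digest if sibling_is_left else digest + sibling
        digest = _h(b"\x01" + pair)
    return digest == root

# The data owner builds the tree and publishes only the root digest.
records = [b"alice:720", b"bob:680", b"carol:790"]
levels = build_tree(records)
root = levels[-1][0]

proof = prove(levels, 1)                   # untrusted server answers for record 1
print(verify(root, b"bob:680", proof))     # prints True
print(verify(root, b"bob:800", proof))     # prints False (tampered record)
```

The proof contains one hash per tree level, so its size grows logarithmically with the number of records, and the client needs nothing from the owner beyond the root digest.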
Query Verification in Database Systems: Completed.
Building Trustworthy Database Systems: Completed.
CAREER: Novel Query Processing Techniques for Probabilistic Data: Data are increasingly generated, stored, and processed in a distributed fashion. Meanwhile, when large amounts of data are generated, ambiguity, uncertainty, and errors are inherently introduced, especially in a distributed setup. It is best to represent such data in a distributed probabilistic database. In distributed data management, summary queries are useful tools for obtaining the most important answers from massive quantities of data effectively and efficiently, e.g., top-k queries, heavy hitters (aka frequent items), histograms and wavelets, and threshold monitoring queries. This project investigates novel query processing techniques for various important summary queries in distributed probabilistic data. Broadly classified, this project examines both snapshot summary queries in static (i.e., no updates) distributed probabilistic databases, and continuous summary queries in dynamic (i.e., with updates) distributed probabilistic databases. A number of techniques are explored to design novel, communication- and computation-efficient algorithms for processing these queries. A distributed probabilistic data management system (DPDMS) prototype is implemented based on the query processing techniques developed in this project. This DPDMS is released to and used in practice by scientists and engineers from other science disciplines as well as industry. Graduate and undergraduate students, including those from minority groups, are actively involved in this project. Findings from the project have been integrated into different courses, demos, and educational projects.