System event logs have been frequently used as a valuable resource in data-driven approaches to enhance system health and stability. A typical procedure in system log analytics is to first parse unstructured logs into structured data, and then apply data mining and machine learning techniques and/or build workflow models on the resulting structured data. Previous work on parsing system event logs focused on offline, batch processing of raw log files. But increasingly, applications demand online monitoring and processing, so a streaming method to parse unstructured logs is needed. We propose Spell, an online streaming method that utilizes a longest common subsequence (LCS) based approach to parse system event logs. We show how to dynamically extract log patterns from incoming logs and how to maintain a set of discovered message types in a streaming fashion. An enhancement to find more accurate message types is also proposed. We also propose and evaluate a method to automatically discover semantic meanings for the parameter fields identified by Spell. We compare Spell against state-of-the-art methods for extracting patterns from system event logs on large real data. The results demonstrate that, compared with other log parsing alternatives, Spell is superior in terms of both efficiency and effectiveness.
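To make the mechanism concrete, here is a minimal Python sketch of LCS-based streaming log parsing in the spirit of Spell; the class name, the membership-based template refinement, and the half-length merge threshold are our simplifying assumptions, not the paper's exact rules.

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (classic DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    out, i, j = [], m, n            # backtrack to recover one LCS
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

class StreamingParser:
    def __init__(self):
        self.templates = []          # token lists; '*' marks a parameter slot

    def parse(self, line):
        tokens = line.split()
        for idx, tpl in enumerate(self.templates):
            common = lcs([t for t in tpl if t != '*'], tokens)
            if len(common) >= len(tokens) / 2:   # assumed merge threshold
                # Refine: keep LCS tokens, mark the rest as parameters.
                self.templates[idx] = [t if t in common else '*' for t in tpl]
                return idx
        self.templates.append(tokens)            # new message type discovered
        return len(self.templates) - 1

parser = StreamingParser()
for log in ["Connection from 10.0.0.1 closed",
            "Connection from 10.0.0.2 closed"]:
    print(parser.parse(log), parser.templates)
```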
Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, offering users a flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even requires unrealistic assumptions (e.g., tuples in a table are stored in random order). This article proposes a new approach to the online aggregation problem, the wander join algorithm, which performs random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Compared with ripple join, wander join is particularly efficient for equality joins involving multiple tables, but it also supports θ-joins. Selection predicates and group-by clauses can be handled as well. To demonstrate the usefulness of wander join, we have designed and implemented XDB (approXimate DB) by integrating wander join into various systems, including PostgreSQL, Spark, and a stand-alone plug-in version using PL/SQL. The design and implementation of XDB demonstrate wander join's practicality in a full-fledged database system. Extensive experiments using the TPC-H benchmark have demonstrated the superior performance of wander join over ripple join.
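As a rough illustration of the random-walk idea (our sketch, not the paper's implementation), the code below estimates a SUM over a two-table equality join R1 ⋈ R2 by sampling join paths through an index and weighting each sampled path by the inverse of its sampling probability; the toy tables and index layout are assumptions.

```python
import random
from collections import defaultdict

R1 = [("a", 10), ("b", 20), ("b", 30)]   # (join key, value)
R2 = [("a", 1), ("b", 2), ("b", 3)]

idx2 = defaultdict(list)                 # index on R2's join key
for t in R2:
    idx2[t[0]].append(t)

def one_walk():
    """One random walk: sample a join path, return value / P(path)."""
    t1 = random.choice(R1)               # picked with probability 1/|R1|
    matches = idx2[t1[0]]
    if not matches:
        return 0.0                       # a failed walk contributes zero
    t2 = random.choice(matches)          # probability 1/|matches|
    prob = (1 / len(R1)) * (1 / len(matches))
    return (t1[1] * t2[1]) / prob        # unbiased for SUM(value1 * value2)

n = 100_000
est = sum(one_walk() for _ in range(n)) / n
exact = sum(v1 * v2 for k1, v1 in R1 for k2, v2 in R2 if k1 == k2)
print(f"estimate ~ {est:.1f}, exact = {exact}")
```

Averaging many independent walks drives the estimate toward the exact answer, which is what enables the continuous accuracy/time tradeoff.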
Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Though modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions must be executed. This paper presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include: 1) a shared-everything architecture based on a two-layer log-structured merge-tree; 2) a new concurrency control algorithm that works with the log-structured storage and ensures efficient, non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and 3) fine-grained data access that effectively minimizes and balances network communication within the cluster. According to our empirical evaluations on TPC-C, SmallBank, and a real-world workload, SolarDB outperforms existing shared-nothing systems by up to 50x when close to or more than 5% of the transactions are distributed.
Neighborhood characteristics are increasingly connected with health outcomes. Social processes affect health through the maintenance of social norms, stimulation of new interests, and dispersal of knowledge. We created zip-code-level indicators of happiness, food, and physical activity culture from geolocated Twitter data to examine the relationship between these neighborhood characteristics and obesity and diabetes diagnoses (Type 1 and Type 2). We collected 422,094 tweets sent from Utah between April 2015 and March 2016. We leveraged administrative and clinical records on 1.86 million individuals aged 20 years and older in Utah in 2015. Individuals living in zip codes with the greatest percentage of happy and physically-active tweets had lower obesity prevalence, accounting for individual age, sex, nonwhite race, Hispanic ethnicity, education, and marital status, as well as zip code population characteristics. More happy tweets and lower caloric density of food tweets in a zip code were associated with lower individual prevalence of diabetes. Results were robust in sibling random effects models that account for family background characteristics shared between siblings. Findings suggest the possible influence of sociocultural factors on individual health. The study demonstrates the utility and cost-effectiveness of utilizing existing big data sources to conduct population health studies.
We leveraged geotagged Twitter data to create national indicators of the social environment, with small-area indicators of prevalent sentiment and social modeling of health behaviors, and tested associations with county-level health outcomes while controlling for demographic characteristics. We used Twitter's streaming application programming interface to continuously collect a random 1% subset of publicly available geolocated tweets in the contiguous United States. We collected approximately 80 million geotagged tweets from 603,363 unique Twitter users over a 12-month period (April 2015 to March 2016). Across 3135 US counties, Twitter indicators of happiness, food, and physical activity were associated with lower premature mortality, obesity, and physical inactivity. Alcohol-use tweets predicted higher alcohol-use-related mortality. Social media represents a new type of real-time data that may enable public health officials to examine the movement of norms, sentiment, and behaviors that may portend emerging issues or outbreaks, thus providing a way to intervene to prevent adverse health events and to measure the impact of health interventions.
Joins are expensive, and online aggregation is an effective approach to explore the tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and needs strong assumptions (e.g., the tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph. We also design an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori. Selection predicates and group-by clauses can be handled as well. We have developed an online engine called XDB by integrating wander join in the latest version of PostgreSQL. Extensive experiments using the TPC-H benchmark have shown the superior performance of wander join. The XDB implementation has demonstrated its practicality in a full-fledged database system.
The emergence of the Infrastructure as a Service (IaaS) framework brings new opportunities, which are accompanied by new challenges in auto scaling, resource allocation, and security. A fundamental challenge underpinning these problems is the continuous tracking and monitoring of resource usage in the system. In this paper, we present ATOM, an efficient and effective framework to automatically track, monitor, and orchestrate resource usage in an IaaS system, which is widely used in cloud infrastructure. We use a novel tracking method to continuously track important system usage metrics with low overhead, and develop a Principal Component Analysis (PCA) based approach to continuously monitor and automatically find anomalies based on the approximated tracking results. We show how to dynamically set the tracking threshold based on the detection results and, further, how to adjust the tracking algorithm to ensure its optimality under dynamic workloads. Lastly, when potential anomalies are identified, we use introspection tools to perform memory forensics on virtual machines (VMs), guided by the analyzed results from tracking and monitoring, to identify malicious behavior inside a VM. We demonstrate the extensibility of ATOM through VM clustering. The performance of our framework is evaluated in an open-source IaaS system.
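The monitoring step can be illustrated with a small PCA sketch: fit a principal subspace on a window of normal metric vectors, then flag vectors whose residual (squared prediction error) exceeds a threshold. The synthetic metric matrix, the choice of two components, and the mean-plus-three-sigma threshold below are illustrative assumptions, not ATOM's tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)
# Training window of "normal" metric vectors: rows are time steps, columns
# are tracked metrics lying near a 2-dimensional subspace plus small noise.
W = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.1 * rng.normal(size=(200, 5))
mu = W.mean(axis=0)
_, _, Vt = np.linalg.svd(W - mu, full_matrices=False)
P = Vt[:2].T                                   # top-2 principal directions

def spe(x):
    """Squared prediction error: residual energy outside the PCA subspace."""
    r = (x - mu) - P @ (P.T @ (x - mu))
    return float(r @ r)

train = np.array([spe(x) for x in W])
thresh = train.mean() + 3 * train.std()        # assumed alarm threshold

anomaly = mu + rng.normal(size=5) * 5          # injected abnormal vector
print(spe(W[0]) <= thresh, spe(anomaly) > thresh)   # typically: True True
```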
Using publicly available, geotagged Twitter data, we created neighborhood indicators of happiness, food, and physical activity for three large counties: Salt Lake, San Francisco, and New York. We utilize 2.8 million tweets collected between February and August 2015 in our analysis. The geocoordinates of where tweets were sent allow us to spatially join them to 2010 census tract locations. We implemented quality control checks and tested associations between Twitter-derived variables and sociodemographic characteristics.
Aggregate similarity search, also known as the aggregate nearest neighbor (ANN) query, finds many useful applications in spatial and multimedia databases. Given a group Q of M query objects, it retrieves from a database the objects most similar to Q, where the similarity is an aggregation (e.g., sum, max) of the distances between each retrieved object p and all the objects in Q. In this paper, we propose an added flexibility to the query definition, where the similarity is an aggregation over the distances between p and any subset of φM objects in Q for some support 0 < φ ≤ 1. We call this new definition flexible aggregate similarity search, and accordingly refer to such a query as a flexible aggregate nearest neighbor (FANN) query. We present algorithms for answering FANN queries exactly and approximately. Our approximation algorithms are especially appealing: they are simple, highly efficient, and work well in both low and high dimensions. They also return near-optimal answers with guaranteed constant-factor approximations in any dimension. Extensive experiments on large real and synthetic datasets from 2 to 74 dimensions have demonstrated their superior efficiency and high quality.
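For concreteness, here is a brute-force reference implementation of the FANN definition: for each candidate p, aggregate only the distances to the closest ⌈φM⌉ query objects. This naive O(nM log M) baseline is what the paper's exact and approximate algorithms outperform; the data points and φ below are illustrative.

```python
import math

def fann(P, Q, phi, agg=sum):
    """Brute-force FANN: best p by aggregate over its closest ceil(phi*M)
    query objects."""
    m = math.ceil(phi * len(Q))
    best, best_cost = None, float("inf")
    for p in P:
        d = sorted(math.dist(p, q) for q in Q)[:m]  # closest phi*M of Q
        cost = agg(d)
        if cost < best_cost:
            best, best_cost = p, cost
    return best, best_cost

P = [(0, 0), (5, 5), (9, 1)]
Q = [(4, 4), (6, 6), (0, 1)]
print(fann(P, Q, phi=2/3))           # SUM aggregate over the best 2 of 3
print(fann(P, Q, phi=2/3, agg=max))  # MAX aggregate variant
```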
Studies suggest that where people live, play, and work can influence health and well-being. However, the dearth of neighborhood data, especially data that is timely and consistent across geographies, hinders understanding of the effects of neighborhoods on health. Social media data represents a possible new data resource for neighborhood research. The aim of this study was to build, from geotagged Twitter data, a national neighborhood database with area-level indicators of well-being and health behaviors.
Optimal location (OL) queries are a type of spatial query that is particularly useful for the strategic planning of resources. Given a set of existing facilities and a set of clients, an OL query asks for a location to build a new facility that optimizes a certain cost metric (defined based on the distances between the clients and the facilities). Several techniques have been proposed to address OL queries, assuming that all clients and facilities reside in an Lp space. In practice, however, movements between spatial locations are usually confined by the underlying road network, and hence the actual distance between two locations can differ significantly from their Lp distance. Motivated by the deficiency of the existing techniques, this paper presents a comprehensive study of OL queries in road networks. We propose a unified framework that addresses three variants of OL queries that find important applications in practice, and we instantiate the framework with several novel query processing algorithms. We further extend our framework to efficiently monitor the OLs when the locations of facilities and/or clients are updated. Our dynamic update methods lead to efficient answering of continuous optimal location queries. We demonstrate the efficiency of our solutions through extensive experiments with large real data.
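A brute-force reference for one OL variant on a road network (minimize the maximum client-to-nearest-facility distance after placing one new facility) is sketched below; the tiny graph and candidate set are illustrative, and the per-candidate Dijkstra scan is exactly the cost the paper's framework is designed to avoid.

```python
import heapq

def dijkstra(graph, src):
    """Shortest network distances from src to every node."""
    dist, pq = {src: 0}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

graph = {                                  # undirected toy road network
    "a": [("b", 2), ("c", 5)],
    "b": [("a", 2), ("c", 1), ("d", 4)],
    "c": [("a", 5), ("b", 1), ("d", 2)],
    "d": [("b", 4), ("c", 2)],
}
clients, facilities, candidates = ["a", "d"], ["b"], ["c", "d"]

def cost_with(new_fac):
    """Max distance from any client to its nearest facility."""
    facs = facilities + [new_fac]
    worst = 0
    for cl in clients:
        d = dijkstra(graph, cl)
        worst = max(worst, min(d[f] for f in facs))
    return worst

best = min(candidates, key=cost_with)
print(best, cost_with(best))
```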
Keyword search is a useful tool for exploring large RDF datasets. Existing techniques either rely on constructing a distance matrix for pruning the search space or on building summaries from the RDF graphs for query processing. In this work, we show that existing techniques have serious limitations in dealing with realistic, large RDF data with tens of millions of triples. Furthermore, the existing summarization techniques may lead to incorrect or incomplete results. To address these issues, we propose an effective summarization algorithm to summarize the RDF data. Given a keyword query, the summaries lend significant pruning power to exploratory keyword search and result in much better efficiency compared to previous work. Unlike existing techniques, our search algorithms always return correct results. Moreover, the summaries we build can be updated incrementally and efficiently. Experiments on both benchmark and large real RDF data sets show that our techniques are scalable and efficient.
Moving computation near memory has become more practical because of 3D stacking technology. This article discusses in-memory MapReduce in the context of near-data computing (NDC). The authors consider two NDC architectures: one that exploits Hybrid Memory Cube devices and one that does not. They examine the benefits of different NDC approaches and quantify the potential for improvement for an important emerging big-data workload.
This work deals with approximate string search in large spatial databases. Specifically, we investigate range queries augmented with a string similarity search predicate in both Euclidean space and road networks. We dub this query the spatial approximate string (SAS) query. In Euclidean space, we propose an approximate solution, the MHR-tree, which embeds min-wise signatures into an R-tree. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the subtree of u. We analyze the pruning functionality of such signatures based on the set resemblance between the query string and the q-grams from the subtrees of index nodes. We also discuss how to estimate the selectivity of a SAS query in Euclidean space, for which we present a novel adaptive algorithm to find balanced partitions using both the spatial and string information stored in the tree. For queries on road networks, we propose a novel exact method, RSASSOL, which significantly outperforms the baseline algorithm in practice. RSASSOL combines q-gram-based inverted lists with reference-node-based pruning. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approaches.
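The pruning device can be sketched in a few lines: each index node stores a min-wise (minhash) signature of the union of q-grams beneath it, the union signature is the element-wise minimum of the children's signatures, and comparing it against the query's signature estimates set resemblance. The hash construction, q = 2, and the pruning threshold below are our illustrative choices, not the MHR-tree's exact parameters.

```python
import hashlib

def qgrams(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def signature(grams, k=32):
    """k min-wise hash values of a q-gram set."""
    def h(g, i):
        return int(hashlib.md5(f"{i}:{g}".encode()).hexdigest(), 16)
    return [min(h(g, i) for g in grams) for i in range(k)]

def resemblance(sig1, sig2):
    """Estimate of the Jaccard similarity from two signatures."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# An index node keeps the signature of the UNION of q-grams beneath it;
# the minhash of a union is the element-wise min of children's signatures.
child1 = signature(qgrams("florida"))
child2 = signature(qgrams("florence"))
node_sig = [min(a, b) for a, b in zip(child1, child2)]

query_sig = signature(qgrams("florida"))
if resemblance(query_sig, node_sig) < 0.2:   # assumed pruning threshold
    print("prune subtree")
else:
    print("visit subtree")
```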
Query execution assurance is an important concept in defeating lazy servers in the database-as-a-service model. We show that extending query execution assurance to outsourced databases with multiple data owners is highly inefficient. To cope with lazy servers in the distributed setting, we propose query access assurance (QAA), which focuses on IO-bound queries. The goal in QAA is to enable clients to verify that the server has honestly accessed all records that are necessary to compute the correct query answer, thus eliminating the incentives for the server to be lazy if the query cost is dominated by the IO cost of accessing these records. We formalize this concept for distributed databases and present two efficient schemes that achieve QAA with high success probabilities. The first scheme is simple to implement and deploy, but may incur excessive server-to-client communication cost and verification cost at the client side when the query selectivity or the database size increases. The second scheme is more involved, but successfully addresses the limitations of the first scheme. Our design employs several number-theoretic techniques. Extensive experiments demonstrate the efficiency, effectiveness, and usefulness of our schemes.
Given a set of points P and a query set Q, a group enclosing query (GEQ) fetches the point p such that the maximum distance of p to all points in Q is minimized. This problem is equivalent to the min-max case (minimizing the maximum distance) of aggregate nearest neighbor queries for spatial databases. This work first designs a new exact solution by exploring new geometric insights, such as the minimum enclosing ball, the convex hull, and the furthest Voronoi diagram of the query group. To further reduce the query cost, especially as the dimensionality increases, we turn to approximation algorithms. Our main approximation algorithm has a worst-case √2-approximation ratio if one can find the exact nearest neighbor of a point. In practice, its approximation ratio never exceeds 1.05 on a large number of data sets of up to six dimensions. We also discuss how to extend it to higher dimensions (up to 74 in our experiments) and show that it still maintains a very good approximation quality (still close to 1) and low query cost. In fixed dimensions, we extend the √2-approximation algorithm to get a (1 + ε)-approximate solution for the GEQ problem. Both approximation algorithms have O(log N + M) query cost in any fixed dimension, where N and M are the sizes of the data set P and the query group Q. Extensive experiments on both synthetic and real data sets, up to 10 million points and 74 dimensions, confirm the efficiency, effectiveness, and scalability of the proposed algorithms, especially their significant improvement over the state-of-the-art method.
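A minimal sketch of the approximation idea follows, under our simplifying assumptions: take an approximate center of the minimum enclosing ball of Q (here a simple Badoiu-Clarkson-style refinement toward the farthest point) and return that center's nearest neighbor in P, evaluated here by brute force rather than a spatial index.

```python
import math

def approx_meb_center(Q, iters=100):
    """Iteratively move toward the farthest point of Q; converges to an
    approximate minimum-enclosing-ball center."""
    c = list(Q[0])
    for i in range(1, iters + 1):
        far = max(Q, key=lambda q: math.dist(c, q))
        step = 1.0 / (i + 1)
        c = [ci + step * (fi - ci) for ci, fi in zip(c, far)]
    return tuple(c)

def geq(P, Q):
    c = approx_meb_center(Q)
    p = min(P, key=lambda p: math.dist(p, c))   # NN of the (approx) center
    return p, max(math.dist(p, q) for q in Q)   # its min-max cost

P = [(0, 0), (5, 5), (9, 1)]
Q = [(4, 3), (6, 6), (5, 2)]
print(geq(P, Q))
```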
With the advance of wireless communication technology, it is quite common for people to view maps or get related services from handheld devices, such as mobile phones and PDAs. Range queries, one of the most commonly used tools, are often posed by users to retrieve needed information from a spatial database. However, due to the limits of communication bandwidth and hardware power of handheld devices, displaying all the results of a range query on a handheld device is neither communication efficient nor informative to the users, simply because there are often too many results returned from a range query. In view of this problem, we present a novel idea: a concise representation of a specified size for the range query results, one that incurs minimal information loss, shall be computed and returned to the user. Such a concise range query not only reduces communication costs, but also offers better usability to the users, providing an opportunity for interactive exploration. The usefulness of concise range queries is confirmed by comparing them with other possible alternatives, such as sampling and clustering. Unfortunately, we prove that finding the optimal representation with minimum information loss is an NP-hard problem. Therefore, we propose several effective and non-trivial algorithms to find a good approximate result. Extensive experiments on real-world data have demonstrated the effectiveness and efficiency of the proposed techniques.
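A toy illustration of the idea in one dimension: represent the result set by k intervals plus counts, greedily splitting at the k-1 widest gaps. The information-loss proxy (interval extent) and the greedy splitting rule are our illustrative stand-ins, not the paper's algorithms.

```python
def concise(points, k):
    """Summarize a 1-D result set by k (lo, hi, count) intervals,
    splitting at the k-1 widest gaps between consecutive sorted points."""
    pts = sorted(points)
    by_gap = sorted(range(1, len(pts)), key=lambda i: pts[i] - pts[i - 1])
    bounds = sorted(by_gap[len(by_gap) - (k - 1):])   # k-1 split positions
    pieces, start = [], 0
    for b in bounds + [len(pts)]:
        chunk = pts[start:b]
        pieces.append((chunk[0], chunk[-1], len(chunk)))
        start = b
    return pieces

print(concise([1, 2, 3, 10, 11, 30], k=3))
# [(1, 3, 3), (10, 11, 2), (30, 30, 1)]: six matches sent as three boxes
```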
Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k query over deterministic data. We define several fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue that these properties should also be carefully studied when defining ranking queries on probabilistic data, and should be fulfilled by any ranking definition for uncertain data in most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We study ranking definitions based on the expectation, the median, and other statistics of this rank distribution for a tuple, and derive the expected rank, median rank, and quantile rank correspondingly. We prove that the expected rank, median rank, and quantile rank satisfy all of these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
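The rank-distribution view can be illustrated by brute force over possible worlds for tuple-level uncertainty with independent tuples. The convention that a tuple absent from world W is ranked after every present tuple (rank |W|) is the one used in the expected-rank line of work; the toy data are our assumption.

```python
from itertools import product

# (id, score, existence probability), tuples assumed independent
tuples = [("t1", 100, 0.6), ("t2", 90, 0.9), ("t3", 80, 0.5)]

exp_rank = {tid: 0.0 for tid, _, _ in tuples}
for mask in product([0, 1], repeat=len(tuples)):
    w_prob = 1.0
    for bit, (_, _, p) in zip(mask, tuples):
        w_prob *= p if bit else (1 - p)
    present = [t for bit, t in zip(mask, tuples) if bit]
    for tid, score, _ in tuples:
        if any(tid == t[0] for t in present):
            rank = sum(1 for t in present if t[1] > score)
        else:
            rank = len(present)      # absent: ranked after all present
        exp_rank[tid] += w_prob * rank

print(exp_rank)   # report tuples in increasing expected rank
```

The paper's contribution is computing such statistics efficiently rather than by this exponential enumeration.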
The database community has devoted extensive effort to indexing and querying temporal data in the past decades. However, insufficient attention has been paid to temporal ranking queries. More precisely, given any time instance t, such a query asks for the top-k objects at time t with respect to some score attribute. Some generic indexing structures based on R-trees do support ranking queries on temporal data, but as they are not tailored for such queries, their performance is far from satisfactory. We present the Seb-tree, a simple indexing scheme that supports temporal ranking queries much more efficiently. The Seb-tree answers a top-k query for any time instance t in the optimal number of I/Os in expectation, namely, O(log_B(N/B) + k/B) I/Os, where N is the size of the data set and B is the disk block size. The index has near-linear size (for constant and reasonable kmax values, where kmax is the maximum possible value of the query parameter k), can be constructed in near-linear time, and also supports insertions and deletions without affecting its query performance guarantee. Most of all, the Seb-tree is especially appealing in practice due to its simplicity, as it uses the B-tree as its only building block. Extensive experiments on a number of large data sets show that the Seb-tree is more than an order of magnitude faster than R-tree-based indexes for temporal ranking queries.
Query authentication is an essential component in outsourced database (ODB) systems. This article introduces efficient index structures for authenticating aggregation queries over large data sets. First, we design an index that features good performance characteristics for static environments. Then, we propose more involved structures for the dynamic case. Our structures feature excellent performance for authenticating queries with multiple aggregate attributes and multiple selection predicates. Furthermore, our techniques cover a large number of aggregate types, including distributive aggregates (such as SUM, COUNT, MIN, and MAX), algebraic aggregates (such as AVG), and holistic aggregates (such as MEDIAN and QUANTILE). We also address the issue of efficiently authenticating aggregation queries when the database is encrypted to protect data confidentiality. Finally, we implemented a working prototype of the proposed techniques and experimentally validated the effectiveness and efficiency of our methods.
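For background, the sketch below shows a textbook Merkle hash tree membership proof, the basic authentication primitive that structures of this kind build on: the owner publishes the root digest, the server returns a value plus the sibling hashes on its path, and the client recomputes the root. This is the generic primitive, not the article's aggregate-specific indexes.

```python
import hashlib

H = lambda b: hashlib.sha256(b).digest()

def build(leaves):
    """All levels of the tree, leaves first (len(leaves) a power of two)."""
    levels = [[H(x) for x in leaves]]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        levels.append([H(lvl[i] + lvl[i + 1]) for i in range(0, len(lvl), 2)])
    return levels

def prove(levels, i):
    """Sibling hashes on the path from leaf i to the root."""
    path = []
    for lvl in levels[:-1]:
        path.append(lvl[i ^ 1])      # sibling index flips the low bit
        i //= 2
    return path

def verify(root, leaf, i, path):
    h = H(leaf)
    for sib in path:
        h = H(h + sib) if i % 2 == 0 else H(sib + h)
        i //= 2
    return h == root

data = [b"r1", b"r2", b"r3", b"r4"]
levels = build(data)
root = levels[-1][0]
print(verify(root, b"r3", 2, prove(levels, 2)))   # True: record authentic
print(verify(root, b"rX", 2, prove(levels, 2)))   # False: tampered record
```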
Due to the overwhelming flow of information in many data stream applications, data outsourcing is a natural and effective paradigm for individual businesses to address the issue of scale. In the standard data outsourcing model, the data owner outsources streaming data to one or more third-party servers, which answer queries posed by a potentially large number of clients on the data owner's behalf. Data outsourcing intrinsically raises issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Existing solutions proposed in this model all build upon cryptographic primitives such as signatures and collision-resistant hash functions, which only work for certain types of queries, for example, simple selection/aggregation queries. In this article, we consider another common type of query, namely "GROUP BY, SUM" queries, which previous techniques fail to support. Our new solutions are not based on cryptographic primitives, but instead use algebraic and probabilistic techniques to compute a small synopsis on the true query result, which is then communicated to the client so as to verify the correctness of the query result returned by the server. The synopsis uses a constant amount of space irrespective of the result size, has an extremely small probability of failure, and can be maintained using no extra space when the query result changes as elements stream by. We then generalize our synopsis to allow some tolerance on the number of erroneous groups, in order to support semantic load shedding on the server. When the number of erroneous groups is indeed tolerable, the synopsis can be strengthened so that we can locate and even correct these errors. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
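The flavor of such an algebraic synopsis can be conveyed with a toy fingerprint (our construction for illustration, not the article's exact scheme): fold each (group, sum) pair into a product of linear factors evaluated at a secret random point, so the synopsis is a single field element, is incrementally maintainable, and mismatches a wrong result with high probability.

```python
import random

P = (1 << 61) - 1                  # a Mersenne prime field
alpha = random.randrange(1, P)     # client's secret evaluation point

def encode(group, total):
    # Pack (group, sum) into one field element; assumes sums < 2**32.
    return (group * (1 << 32) + total) % P

def synopsis(pairs):
    x = 1
    for g, v in pairs:
        x = x * (alpha - encode(g, v)) % P
    return x

def update(x, g, old, new):
    """Maintain the synopsis when group g's running sum changes:
    divide out the old factor (modular inverse), multiply in the new."""
    x = x * pow(alpha - encode(g, old), P - 2, P) % P
    return x * (alpha - encode(g, new)) % P

truth = [(1, 10), (2, 25)]
x = synopsis(truth)
x = update(x, 1, 10, 17)                   # an element streams by
print(x == synopsis([(1, 17), (2, 25)]))   # True: server's result verifies
print(x == synopsis([(1, 17), (2, 26)]))   # False with high probability
```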
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, which allow users to perform aggregation queries such as MIN, COUNT, and AVG on the readings of a sensor network. In addition, more advanced queries such as frequency counting and quantile estimation can be supported. Due to energy limitations in sensor-based networks, centralized data collection is generally impractical, so most systems use in-network aggregation to reduce network traffic. However, even these aggregation strategies remain bandwidth-intensive when combined with the fault-tolerant, multipath routing methods often used in these environments. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches. We present duplicate-insensitive sketching techniques that can be implemented efficiently on small sensor devices with limited hardware support, and we analyze both their performance and accuracy. Finally, we present an experimental evaluation that validates the effectiveness of our methods.
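A minimal duplicate-insensitive sketch in this spirit is the classic Flajolet-Martin bitmap, shown below: re-inserting the same reading, or OR-merging the same sketch twice, cannot change the estimate, which is what makes multipath delivery safe. The single-sketch variant is illustrative; practical deployments combine many sketches to reduce variance.

```python
import hashlib

def lsb(x):
    """Position of the lowest set bit of x."""
    return (x & -x).bit_length() - 1

class FMSketch:
    def __init__(self):
        self.bits = 0
    def add(self, item):
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        self.bits |= 1 << lsb(h)          # set bit j with probability 2^-(j+1)
    def merge(self, other):
        self.bits |= other.bits           # duplicate-insensitive union
    def estimate(self):
        r = 0                             # first unset bit position
        while self.bits >> r & 1:
            r += 1
        return 2 ** r / 0.77351           # classic FM correction constant

# Two sensors observe overlapping readings; multipath delivers both sketches.
a, b = FMSketch(), FMSketch()
for x in range(60): a.add(x)
for x in range(40, 100): b.add(x)
a.merge(b); a.merge(b)                    # duplicate delivery is harmless
print(round(a.estimate()))               # rough count of ~100 distinct readings
```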
This work introduces novel polynomial algorithms for processing top-k queries in uncertain databases under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both runtime and memory usage. In the single-alternative case, the new algorithms are 2 to 3 orders of magnitude faster than the previous algorithms. In the multi-alternative case, we introduce the first known polynomial algorithms, while the current best algorithms have exponential complexity in both time and space. Our algorithms run in near-linear or low polynomial time and cover both types of top-k queries in uncertain databases. We provide both a theoretical analysis and an extensive experimental evaluation to demonstrate the superiority of the new approaches over existing solutions.
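To see what the polynomial algorithms avoid, here is a brute-force possible-worlds baseline (illustrative data, our sketch): it enumerates every instantiation of the x-tuples and tallies the probability of each top-k set, which is exponential in the number of x-tuples.

```python
from itertools import product
from collections import Counter

# Each x-tuple: list of (tuple id, score, prob); probs in an x-tuple sum to <= 1,
# and the remaining mass is the probability that the x-tuple is absent.
xrel = [
    [("a1", 90, 0.5), ("a2", 80, 0.5)],   # two mutually exclusive alternatives
    [("b1", 85, 0.7)],                    # absent with probability 0.3
]

k = 1
dist = Counter()
for choice in product(*[xt + [None] for xt in xrel]):
    prob, present = 1.0, []
    for xt, c in zip(xrel, choice):
        if c is None:
            prob *= 1 - sum(p for _, _, p in xt)   # x-tuple absent
        else:
            prob *= c[2]
            present.append(c)
    if prob == 0:
        continue
    topk = tuple(sorted(t[0] for t in
                        sorted(present, key=lambda t: -t[1])[:k]))
    dist[topk] += prob

print(dist.most_common(1))   # most probable top-k set and its probability
```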
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor “database” systems, which allow users to perform aggregation queries on the readings of a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. However, these aggregation strategies become bandwidth intensive when combined with the fault-tolerant, multi-path routing methods often used in these environments. In order to avoid this expense, we investigate the use of approximate in-network aggregation using small sketches and we survey robust and scalable methods for computing duplicate-sensitive aggregates.