Adaptive Processing for Distributed Skyline Queries over Uncertain Data

Query processing over uncertain data has gained growing attention, because it is necessary to deal with uncertain data in many real-life applications. In this paper, we investigate skyline queries over uncertain data in distributed environments (DSUD query) whose research is only in an early stage. The state-of-the-art algorithm, called e-DSUD algorithm, is designed for processing this query. It has the desirable characteristics of progressiveness and minimum bandwidth consumption. However, it still needs to be perfected in three aspects. (1) Progressiveness. Each time it only returns one query result at most. (2) Efficiency. There are a significant amount of redundant I/O cost and numerous iterations which causes a long total query time. (3) Universality. It is restricted to the case where local skyline tuples are incomparability. To address these concerns, we first present a detailed analysis of the e-DSUD algorithm and then develop an improved framework for the DSUD query, namely IDSUD. Based on the new framework, we propose an adaptive algorithm, called ADSUD, for the DSUD query. In the algorithm, we redefine the approximate global skyline probability and choose local representative tuples due to minimum probabilistic bounding rectangle adaptively. Furthermore, we design a progressive pruning method and apply the reuse mechanism to improve its efficiency. The results of extensive experiments verify the better overall performance of our algorithm than the e-DSUD algorithm.

Introduction

Data’s are increasingly stored and processed distributively[1] [2], as a result of the wide deployment of computing infrastructures and the readily available network services. More and more applications collect data from distributed sites and derive results based on the collective view of the data from all sites. Examples include sensor networks, data integration from multiple data sources[7], and information retrieval from geographically separated data centers. In the aforementioned application domains, it is often very expensive to communicate the data set entirely from each site to the centralized server for processing, due to the large amounts of data available nowadays and the network delay incurred, as well as the economic cost associated with such communication. Fortunately, query semantics in many such applications rarely require reporting every piece of data in the system. Instead, only a fraction of data that are the most relevant to the user’s interest will appear in the query results. Skyline queries help users make intelligent decisions over complex data[3], where different and often conflicting criteria are considered. Such queries return a set of interesting data points that are not dominated by any other point on all dimensions. Distributed skyline computation[17], over uncertain data has many important applications. For instance, consider the stock market application where customers may want to select good deals (transactions) for a particular stock over all the distributed stock exchange centers. A deal is recorded by two attributes (price, volume) where price is the average price per share in the deal and volume is the number of shares. such scenario, before making trade decisions, a customer may want to know the top deals over all the distributed local sites. However, such a top deals are often difficult to find, especially since the user typically has no information on the database content. Moreover, complex decision making on real world data usually involves several dimensions of interest. As an alternative, the skyline query returns all offers which may be of interest. Therefore, a set of deals recorded in the database may be treated as a set of uncertain elements and some customers may only want to know “top” deals (skyline) among the entire deals over distributed sites; and thus we have to take the uncertainty of each deal into consideration. This is a scenario of distributed skyline queries over uncertain data. Consider, for instance, a database containing information about hotels. Assume a user is looking for hotels at a specific location that are as cheap as possible and as close as possible to the beach. In this case, it is not obvious whether the user would prefer a hotel that is very close to the beach but more expensive than others or rather a hotel that is very cheap but farther away from the beach. The skyline set contains all hotels that are not worse than any other hotel based on all criteria, without requiring a scoring function that defines the relative importance of the different criteria. Thus, the skyline set contains all tuples that represent the best trade-offs between the different criteria. Figure 1 shows an example, where each point represents a hotel with price per night and distance to the beach as coordinates; hotels p3, p1, p5 are in the skyline set.

Conclusion

The proposed algorithm focuses the subspace skyline queries for distributed data. The significant communication cost could be saved by analysing the difference between the global skyline and its approximate value. Also demonstrate how to alleviate the computation burden at each distributed node so that communication and computation efficiency are achieved simultaneously. Extensive experiments have been conducted to verify that the techniques can process skyline queries over distributed uncertain data both communication- and computation-effectively. In this paper an arbitrary horizontal partitioning is focused, where a local site has all the attributes but stores only a subset of the entire tuples. In future this work can be extended to process ad-hoc skyline query.

Adaptive Processing for Distributed Skyline Queries over Uncertain Data