Improving estimation accuracy of aggregate queries on data cubes

In this paper, we investigate the problem of estimation of a target database from summary databases derived from a base data cube. We show that such estimates can be derived by choosing a primary database which uses a proxy database to estimate the results. This technique is common in statistics, but an important issue we are addressing is the accuracy of these estimates. Specifically, given multiple primary and multiple proxy databases, that share the same summary measure, the problem is how to select the primary and proxy databases that will generate the most accurate target database estimation possible. We propose an algorithmic approach for determining the steps to select or compute the source databases from multiple summary databases, which makes use of the principles of information entropy. We show that the source databases with the largest number of cells in common provide the more accurate estimates. We prove that this is consistent with maximizing the entropy. We provide some experimental results on the accuracy of the target database estimation in order to verify our results.


INTRODUCTION
Providing exact answers to queries from large data cubes in OLAP applications can be too slow, and in some cases the user may prefer fast approximate answers. A more crucial case is when it is not possible to provide precise answers, such as in socio-economic applications, because only summarized data is available for reasons of privacy. In such cases, it is quite useful to generate estimates or approximate answers using approximate query processing techniques. A key issue is the accuracy of the estimates for aggregate queries (e.g., queries computing SUM or COUNT expressions), which has been the focus of recent research activity (e.g., (Palpanas, Koudas & Mendelson 2005), (Pourabbas & Shoshani 2007)).
* This work was supported by the Director, Office of Science, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
In (Pourabbas & Shoshani 2007), we discussed the estimation of summary queries evaluated over multiple source summary databases. Such a summary query requests a summary measure of interest (e.g., household income), called the target measure, over a set of category attributes, called target dimensions (e.g., State, Sex). In many cases, it may not be possible to evaluate such a query from a single source summary database, and two summary databases have to be used. For example, suppose that one database contains Income by (State, Age, Race) and the second contains Population by (State, Age, Sex, Education level). It is possible to estimate the target database Income by (State, Sex) by using the first database as the "primary" database (since it has the target measure Income), and using the second database as a "proxy" database (since it has the additional desired target dimension Sex). Here the population sizes are considered a proxy for the measure Income. The estimation method used to generate the target database is the linear indirect estimator (see Appendix-(A)), which takes advantage of the fact that the summary databases were derived from the same base data and are consequently correlated. The method proposed to estimate the target database efficiently was based on partitioning the dimensions of the source databases into three types: "target", "common", and "non-common" dimensions. We first determine the target dimensions, and classify the remaining dimensions as common and non-common. In the example above, State and Sex are target dimensions, Age is a common dimension, and Race and Education level are non-common dimensions.
In that previous paper we examined two obvious computational methods for computing such a target database, called the "Full cross product" (F) and the "Pre-aggregation" (P) methods. Essentially, the F method first calculates the target measure over the full cross product of the dimensions from both databases using proportional estimation, and then aggregates over all the non-target dimensions. Since this method requires generating the full cross product, its cost is high. In contrast, the P method first aggregates over all the non-target dimensions of both databases, and only then generates the cross product using proportional estimation to obtain the result. The pre-aggregation greatly reduces the size of the cross product and lowers the cost of generating the estimation. However, we showed that the P method, while computationally efficient, yields results that are not as accurate as those of the F method. We proposed a third method, called the "Partial Pre-aggregation" (PP) method, which consists of summarizing only the non-common dimensions first, and then applying the proportional estimation. Using a measure of accuracy called Average Relative Error (ARE; see Appendix-(B)), we proved that the PP method yields the same accuracy as the F method, but significantly reduces the computational and space complexity. The reduction in cost is by a factor proportional to the product of the cardinalities of the non-common dimensions.
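The relationship between the three methods can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cube sizes, the random values, and the function names are assumptions made only for illustration, and the proxy is simplified to Population(State, Age, Sex) so that Race is the only non-common dimension. Under these assumptions, the F and PP estimates of Income(State, Sex) coincide, while the P estimate generally differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy cubes (sizes and values are made up for illustration):
# Income(State, Age, Race) and Population(State, Age, Sex).
income = rng.uniform(1.0, 10.0, size=(3, 4, 2))
population = rng.uniform(1.0, 10.0, size=(3, 4, 2))


def full_cross_product(primary, proxy):
    """F: estimate over (State, Age, Race, Sex), then aggregate Age and Race."""
    sex_share = proxy / proxy.sum(axis=2, keepdims=True)       # Sex share within (State, Age)
    cells = primary[:, :, :, None] * sex_share[:, :, None, :]  # (State, Age, Race, Sex)
    return cells.sum(axis=(1, 2))                              # (State, Sex)


def partial_preaggregation(primary, proxy):
    """PP: aggregate only the non-common dimension (Race), keep the common one (Age)."""
    primary_sa = primary.sum(axis=2)                           # Income(State, Age)
    sex_share = proxy / proxy.sum(axis=2, keepdims=True)       # Sex share within (State, Age)
    return (primary_sa[:, :, None] * sex_share).sum(axis=1)    # (State, Sex)


def preaggregation(primary, proxy):
    """P: aggregate every non-target dimension (Age and Race) before estimating."""
    primary_s = primary.sum(axis=(1, 2))                       # Income(State)
    proxy_ss = proxy.sum(axis=1)                               # Population(State, Sex)
    return primary_s[:, None] * proxy_ss / proxy_ss.sum(axis=1, keepdims=True)


est_f = full_cross_product(income, population)
est_pp = partial_preaggregation(income, population)
est_p = preaggregation(income, population)
```

In this sketch the PP variant never materializes the four-dimensional array that F builds, which is where the cost reduction by the cardinality of the non-common dimension comes from.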
In this paper, we consider an open question which was left as a future challenge in (Pourabbas & Shoshani 2007). The question is how to select a primary and a proxy database, given that there are multiple primary databases available with the same measure and multiple proxy databases with the desired target dimensions, in order to get the most accurate estimation results.

The Problem
To explain the idea, let us consider a set of multiple primary databases and a set of multiple proxy databases, where the cardinalities of the dimensions are: |State| = 52, |Age| = 4, |Labor status| = 2, and |Sex| = 2. Note that the two categories of Labor status are In Labor Force and Not in Labor Force, according to the U.S. Census Bureau. Let Income(State, Labor status, Age, Sex) be the target database, which should be estimated from these sets of summary databases. If we select the first primary database, i.e., Income(State, Age), then we can apply $DB_{PX2}$, $DB_{PX3}$, and $DB_{PX4}$ to estimate the target database, since only these proxy databases contain auxiliary data on the dimensions Labor status and Sex. Similarly, if we choose the second primary database, we can only apply $DB_{PX1}$, $DB_{PX3}$, and $DB_{PX4}$. The third primary database needs auxiliary data on the dimensions State and Sex, which are provided by $DB_{PX1}$, $DB_{PX2}$, and $DB_{PX4}$. For the last primary database, in contrast, all four proxy databases can be applied. This situation is labeled Case 1 in Table 1, where we assume that all four primary databases and all four proxy databases exist. We also include in Table 1 three additional cases where only some of the primary or proxy databases are shown. These cases will be used later to illustrate situations that require special attention. In all four cases, as mentioned before, the main goal is to obtain more accurate estimated results for the target database. To achieve this goal we have to select two source databases. The problem is which databases to choose from a given set of primary and proxy databases so as to obtain the most accurate estimation results.
The solution of the problem mentioned above is based on two conjectures. The first one is that the more cells of common dimensions the primary database shares with the target database, the more accurate the estimated results are. A cell is defined as the smallest element formed by the cross product of the dimensions. Referring to the primary databases shown in Case 1, $DB_{PR4}$ not only shares the largest number of cells of common dimensions with the target database but also includes all the dimensions of the first three primary databases. Note that in this case all common dimensions are target dimensions. Now, let us consider Case 2 and Case 4. The problem is which primary database to choose. In the next section, we will show that basing this decision on the estimate of the maximum entropy provides more accurate results.
The second conjecture is that the proxy database that shares the largest number of cells of the common dimensions with the primary database provides more accurate results. In Case 1 and Case 2, $DB_{PX4}$ is such a proxy database. A similar problem arises when selecting the proxy database in Case 3 and Case 4: which approach should be applied in order to select the proxy database for the estimation of the target database? We discuss this problem in the next section as well.
The problem addressed in this paper is the general problem labeled i in Table 2. In (Pourabbas & Shoshani 2007), we studied case iv. In this paper, we examine the first, general case. Problems ii, iii, and iv are special cases of problem i.

Related Work
There has been a significant amount of work in the literature on approximate query processing. In (Malvestuto 1993), for instance, the definition of a universal statistical database containing several summary tables which share the same summary measure is examined. Given a query, a system of linear equations over the universal database is constructed whose solutions satisfy the query. In (Malvestuto & Pourabbas 2004) and (Malvestuto & Pourabbas 2005), the problem of evaluating a summary query from a set of summary tables sharing the same variable and an auxiliary table is discussed. These works propose algorithms which make use of techniques developed in the theory of acyclic database schemas.
In contrast, we focus here on the problem of the accuracy of the query estimation. In our work, we consider a set of proxy (or auxiliary) databases, which share the same summary measures.
In (Hellerstein, Haas & Wang 1997), the authors propose a framework for approximate answers to aggregation queries, called online aggregation, in which the base data is scanned in random order at query time and the approximate answer is continuously updated as the scan proceeds. The Approximate Query Answering (AQUA) system (Gibbons & Matias 1998) provides approximate answers using small precomputed synopses of the underlying base data. In (Palpanas et al. 2005), the authors consider the problem of approximately deriving the original data from the aggregates. They propose a framework for estimating the original values based on the notion of information entropy. In our work, we use a different approach, estimating the values of the target database by using additional information from proxy databases. We apply the principles of entropy over the multiple source databases in order to identify the two databases which achieve more accurate results. We prove formally that the source databases with the largest number of cells in common provide more accurate estimated results. Based on these results, we propose an algorithmic approach for determining the steps to select or compute the source databases from multiple summary databases.
The paper is structured as follows. The next section provides the principles of entropy used in this paper. In that section we also introduce the formal model which provides the basis for a formal analysis of the results in this paper. Section 3 discusses the problem of selecting two source summary databases from multiple primary and multiple proxy databases in order to achieve maximum accuracy for the target database. In Section 4, we develop an algorithmic approach for determining the steps to achieve maximum accuracy, and we prove theorems which show that the source databases with the largest number of cells in common provide the more accurate estimates. Section 5 illustrates some experimental results on the accuracy of the target database estimation. Section 6 contains the conclusions.

PRINCIPLES AND FORMAL MODEL

Principles of Entropy
In this section, we recall the principles of maximum entropy and minimum cross-entropy, which will be used in the next sections. The (Shannon) entropy H of a discrete probability distribution p(x) is the non-negative function
$$H(p) = -\sum_{x \in X} p(x) \log p(x),$$
where X represents the set of tuples. H reaches its maximum value, log|X|, at the uniform distribution over X.
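As a quick numerical check of the statement that the entropy peaks at the uniform distribution, the following sketch computes H for two small distributions. The function name and the example probabilities are illustrative assumptions, and natural logarithms are used.

```python
import math


def shannon_entropy(p):
    """Entropy of a discrete distribution, treating 0 * log(0) as 0."""
    return -sum(px * math.log(px) for px in p if px > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # uniform distribution over |X| = 4 tuples
skewed = [0.7, 0.1, 0.1, 0.1]        # hypothetical non-uniform distribution

h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)
```

Here `h_uniform` equals log 4 up to floating-point error, and `h_skewed` is strictly smaller.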
In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is at least as great as that of all other members of a specified class of distributions.
Let $\hat{P}(X_1, \ldots, X_n)$ be an n-dimensional discrete probability distribution to be estimated from $P(X_1, \ldots, X_n)$ and the set of all marginal distributions $P_i(X_i)$ with $i = 1, \ldots, n$. ("Marginals" is a commonly used term in statistics that refers to the summaries of rows and columns in the "margins" of a table.) If $X = \{X_1, \ldots, X_n\}$, we may find the $\hat{P}$ that maximizes the entropy $H(\hat{P})$ over all probability distributions that satisfy the following constraints:
$$\hat{P}(X) \ge 0, \qquad \sum_{X} \hat{P}(X) = 1, \qquad \sum_{X \setminus X_i} \hat{P}(X) = P_i(X_i), \quad i = 1, \ldots, n.$$
Note that in this paper, we will refer to the constraints mentioned above as the consistency conditions. Let $\hat{P}(X)$ be the maximum entropy approximation to $P(X)$. The cross-entropy (or relative entropy, or Kullback-Leibler distance) between $\hat{P}(X)$ and $P(X)$ measures the similarity of the two distributions and is defined as follows:
$$D(\hat{P}, P) = \sum_{X} \hat{P}(X) \log \frac{\hat{P}(X)}{P(X)}.$$
Minimizing $D(\hat{P}, P)$ is the same as maximizing the entropy of $\hat{P}$. The technique used to compute the maximum entropy estimate is the Iterative Proportional Fitting Procedure (IPFP) (Deming & Stephan 1940), which starts with the zero approximation $P^{[0]}(X) = P(X)$ and determines the higher-order approximations to $\hat{P}(X)$ according to the following computation scheme:
$$P^{[0]} \to P^{[1]} \to \cdots \to P^{[n]} \to P^{[n+1]} \to \cdots$$
where the approximation $P^{[hn+i]}(X)$ in the $(h+1)$-th iteration cycle, $1 \le i \le n$, is obtained by fitting the approximation $P^{[hn+i-1]}(X)$ to the marginal distribution $P_i(X_i)$ as follows:
$$P^{[hn+i]}(X) = P^{[hn+i-1]}(X) \, \frac{P_i(X_i)}{P^{[hn+i-1]}(X_i)}.$$
This procedure converges monotonically to the maximum entropy estimate. The iterations stop when the estimates at two consecutive steps are the same or their difference is less than a pre-defined value.
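The fitting step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `ipfp`, its parameters, and the toy data are hypothetical, each marginal is passed as a pair (axes kept in ascending order, array of marginal totals), and the stopping rule is the maximum absolute change over one full cycle.

```python
import numpy as np


def ipfp(initial, marginals, tol=1e-12, max_cycles=1000):
    """Fit `initial` (the zero approximation) to the given marginals.

    marginals: list of (kept_axes, totals) pairs; kept_axes must be sorted
    ascending and `totals` must have the shape of those axes.
    """
    est = np.asarray(initial, dtype=float).copy()
    all_axes = tuple(range(est.ndim))
    for _ in range(max_cycles):
        previous = est.copy()
        for kept, totals in marginals:
            summed = tuple(a for a in all_axes if a not in kept)
            current = est.sum(axis=summed, keepdims=True)       # current marginal
            shape = [est.shape[a] if a in kept else 1 for a in all_axes]
            target = np.asarray(totals, dtype=float).reshape(shape)
            # Scale every cell by target / current marginal (0 where current is 0).
            ratio = np.divide(target, current,
                              out=np.zeros_like(current), where=current > 0)
            est = est * ratio
        if np.max(np.abs(est - previous)) < tol:
            break
    return est


# Hypothetical 3-D cube and its three 2-D marginals.
rng = np.random.default_rng(0)
truth = rng.uniform(1.0, 10.0, size=(3, 2, 4))
marginals = [((0, 1), truth.sum(axis=2)),
             ((0, 2), truth.sum(axis=1)),
             ((1, 2), truth.sum(axis=0))]
fitted = ipfp(np.ones_like(truth), marginals)
```

The fitted array reproduces all three marginals (the consistency conditions) but is in general not equal to `truth`, since only the marginals and the starting point are used.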

Formal model
We use here the formal model defined in (Pourabbas & Shoshani 2007), which provides the basis for a formal analysis of the results. In the following sections, we assume two source summary databases, called $DB_P$ and $DB_Q$, that are used to produce a target database $DB_T$. The databases are defined as follows: $DB_P = M_P(A^1_P, \ldots, A^m_P)$, $DB_Q = M_Q(A^1_Q, \ldots, A^n_Q)$, and $DB_T = M_T(A^1_T, \ldots, A^t_T)$, where $M_P$, $M_Q$, and $M_T$ are the measures of the corresponding databases, $A^i_P$, $A^j_Q$, and $A^k_T$ are the corresponding dimensions, and m, n, and t are the cardinalities of the corresponding dimension sets. In defining a target database over the two source summary databases, one of the measures, either $M_P$ or $M_Q$, is selected. Without loss of generality, suppose that $M_P$ is selected. Thus, $M_P = M_T$. $DB_P$ is called the primary database, $M_Q$ is called the proxy measure, and $DB_Q$ is called the proxy database.
Given two source summary databases $DB_P$ and $DB_Q$ that are used to generate a target database $DB_T$, we can classify the source database dimensions as belonging to three disjoint groups: target dimensions, common dimensions, and non-common dimensions. First, we pick the dimensions in the source databases that are specified in the target database for the target group; then the remaining dimensions are considered common if they are in both source databases, and are considered non-common otherwise. Note that a target dimension can exist in both source databases. We use the following notation: $DB_P = M_P(A^{C}_{P}, A^{\bar{C}}_{P}, A^{TC}_{P}, A^{T\bar{C}}_{P})$, and $DB_Q = M_Q(A^{C}_{Q}, A^{\bar{C}}_{Q}, A^{TC}_{Q}, A^{T\bar{C}}_{Q})$.
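The classification rule can be written as a few set operations. The sketch below is illustrative only (the function name and the dictionary keys are assumptions); it is applied to the introductory example with Income(State, Age, Race), Population(State, Age, Sex, Education level), and the target Income(State, Sex).

```python
def classify_dimensions(primary_dims, proxy_dims, target_dims):
    """Split the source dimensions into target, common, and non-common groups."""
    primary, proxy, target = set(primary_dims), set(proxy_dims), set(target_dims)
    return {
        # Target dimensions found in at least one source database.
        "target": target & (primary | proxy),
        # Remaining dimensions present in both sources.
        "common": (primary & proxy) - target,
        # Remaining dimensions present in exactly one source.
        "non_common": (primary ^ proxy) - target,
    }


groups = classify_dimensions(
    primary_dims=["State", "Age", "Race"],
    proxy_dims=["State", "Age", "Sex", "Education level"],
    target_dims=["State", "Sex"],
)
```

For this input, State and Sex are target dimensions, Age is common, and Race and Education level are non-common, as stated in the introduction.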

DATABASE SELECTION
In this section, we investigate the problem of selecting two source summary databases from multiple primary and multiple proxy databases in order to achieve maximum accuracy for the target database. Only primary databases that have the same measure as that of the target database need be considered.
The proxy database is selected in order to provide the dimensions that are specified in the target database but missing in the primary database. For all four cases shown in Section 1.1, the Sex dimension in the multiple proxy databases is needed for the target database and is not provided by the primary databases. We recall the results discussed in (Pourabbas & Shoshani 2007) regarding the non-common dimensions, i.e., the dimensions which are not specified in the target database but exist in one of the source databases. According to the Partial Pre-aggregation (PP) method, when the source databases are pre-aggregated over the non-common dimensions, the estimation results are as accurate as the estimates obtained by first computing the full cross product of all dimensions of the source databases and then aggregating over the non-common dimensions. In this paper, we use this approach in considering which primary and proxy databases to choose to maximize accuracy.
In the previous section, we conjectured that the primary database which includes the largest number of cells of the desired target dimensions is the better choice. Let us recall the set of primary databases of Case 1, shown in Table 3 (where we use the symbols "I" and "P" to indicate Income and Population, respectively). By multiplying the cardinalities of the dimensions we obtain the number of cells for each choice. As can be seen in Table 3, $DB_{PR4}$ shares 416 cells for dimensions in common with the target database Income(State, Labor status, Age, Sex). It includes more cells than the other three primary databases. An important idea associated with the number of cells is that of entropy. According to the principles discussed in Subsection 2.1, given a set of primary databases we have to choose the one with the largest number of cells to achieve the largest entropy (Jaynes 1979). In Section 4 we prove, in the first theorem, that the more accurate estimate is achieved when the primary database with the largest number of cells in common with the target database is selected.
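The cell counts follow directly from the cardinalities given in Section 1.1. The sketch below reproduces the arithmetic; the helper name is hypothetical, and the candidate dimension sets are the Income marginals named in the text, which is an assumption about the exact contents of Table 3.

```python
from math import prod

# Cardinalities stated in the paper.
CARDINALITY = {"State": 52, "Age": 4, "Labor status": 2, "Sex": 2}


def shared_cells(source_dims, target_dims):
    """Number of cells spanned by the dimensions a source shares with the target."""
    return prod(CARDINALITY[d] for d in source_dims if d in target_dims)


target = ("State", "Labor status", "Age", "Sex")
candidates = [
    ("State", "Age"),
    ("State", "Labor status"),
    ("Labor status", "Age"),
    ("State", "Labor status", "Age"),
]
counts = {dims: shared_cells(dims, target) for dims in candidates}
best = max(counts, key=counts.get)
```

With these cardinalities, Income(State, Labor status, Age) yields 52 × 2 × 4 = 416 cells and Income(State, Age) yields 52 × 4 = 208 cells, consistent with the values discussed in this section.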
For the databases shown in Table 3, the largest entropy is achieved by $DB_{PR4}$. This primary database also satisfies the three constraints of the consistency conditions listed in Subsection 2.1. Concerning the proxy databases (see Table 4), if there are common dimensions, we conjecture that the proxy database with the largest number of cells of the common dimensions with the primary database achieves the more accurate result. In this case, it is $DB_{PX4}$. This conjecture is also proven in Section 4, where we show in the second theorem that the more accurate estimate is achieved when the proxy database with the largest number of cells in common with the primary database is selected. The relative entropy (or loss of information) of the estimates obtained by applying each primary database to $DB_{PX4}$ is shown in Table 3, fourth column. When $DB_{PR4}$ is applied, the amount of information lost is smaller than for the others. This indicates that the estimate obtained by $DB_{PR4}$ is more similar to the real distribution of Income than the estimates obtained by the other primary databases. Thus, the combination of $DB_{PR4}$ and $DB_{PX4}$ provides the more accurate estimate. The accuracy results are given in Section 5.
Suppose, in Table 3, that only the first three databases are given (i.e., Case 2). In this case, the maximum number of cells is provided by $DB_{PR1}$, but none of the databases satisfies the consistency conditions (see Subsection 2.1). Thus, Income(State, Labor status, Age) needs to be estimated. For this reason, we have to consider all three primary databases by applying IPFP to estimate Încome(State, Labor status, Age). This estimate satisfies the above-mentioned conditions because, for instance, aggregating it over "Age" we obtain Income(State, Labor status), over "Labor status" we obtain Income(State, Age), and over "State" we obtain Income(Labor status, Age). This estimate provides maximum entropy and contains the largest number of cells in common with the target database (this is expressed in the Procedure in Section 4). In (Malvestuto & Pourabbas 2005), it is discussed that this estimate is uniquely determined by the information-theoretic principle of minimum cross-entropy and its distribution is defined as follows. (For the sake of brevity, the symbols "S", "L", and "A" indicate "State", "Labor status", and "Age", respectively.) . . . Note that the zero approximation (or initial distribution) is set to the proxy database with the same dimensions as the estimate of Income. In this example, this proxy is $DB_{PX4}$, where $Pop(S, A, L) = \sum_{Sex} Pop(S, A, L, Sex)$.
Case 4 differs from Case 2 in the proxy database computation. In order to apply IPFP to the primary databases, the zero approximation should be set to P(S, L, A), but this proxy is not provided. Our solution is to estimate $\hat{P}(S, L, A, Sex)$ from the proxy databases. We return to this point in Section 5. The estimate of the primary database is obtained by IPFP, where the zero approximation is defined by the aggregation over Sex of $\hat{P}$(State, Labor status, Age, Sex) given below:
$$\hat{P}(\text{State}, \text{Labor status}, \text{Age}, \text{Sex}) = \frac{Pop(\text{State}, \text{Age}, \text{Sex}) \cdot Pop(\text{State}, \text{Labor status}, \text{Sex})}{Pop(\text{State}, \text{Sex})}$$
As a final remark, we emphasize that in each set of databases there can be summary databases which are marginals of a database in the same set. They are not considered in the database selection because they are redundant.
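The proportional estimate of the missing proxy database in Case 4 can be sketched with NumPy broadcasting. The base cube below is random, hypothetical data with made-up sizes, used only to derive two consistent proxy marginals; the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical base population cube with axes (State, Labor status, Age, Sex).
base = rng.integers(1, 50, size=(3, 2, 4, 2)).astype(float)

pop_s_a_sex = base.sum(axis=1)        # Pop(State, Age, Sex)
pop_s_l_sex = base.sum(axis=2)        # Pop(State, Labor status, Sex)
pop_s_sex = base.sum(axis=(1, 2))     # Pop(State, Sex)

# Estimated proxy: Pop(S, A, Sex) * Pop(S, L, Sex) / Pop(S, Sex),
# arranged on axes (State, Labor status, Age, Sex).
p_hat = (pop_s_l_sex[:, :, None, :] * pop_s_a_sex[:, None, :, :]
         / pop_s_sex[:, None, None, :])

# Zero approximation for IPFP: aggregate the estimate over Sex.
zero_approximation = p_hat.sum(axis=3)   # axes (State, Labor status, Age)
```

By construction, aggregating `p_hat` over Labor status returns Pop(State, Age, Sex) and aggregating it over Age returns Pop(State, Labor status, Sex), so the estimate is consistent with both given proxy databases.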

ALGORITHMIC APPROACH
We propose the use of an algorithmic approach for determining the steps to achieve maximum accuracy. The procedure is essentially based on two theorems introduced below. Using the notation introduced in Subsection 2.2, we can formulate the following definition and theorems.
Theorem 1. Let $M_{P_k}(A^{C}_{P_k}, A^{\bar{C}}_{P_k}, A^{TC}_{P_k}, A^{T\bar{C}}_{P_k})$ and $M_{P_l}(A^{C}_{P_l}, A^{\bar{C}}_{P_l}, A^{TC}_{P_l}, A^{T\bar{C}}_{P_l})$ be primary summary databases, and let $M_Q(A^{C}_{Q}, A^{\bar{C}}_{Q}, A^{TC}_{Q}, A^{T\bar{C}}_{Q})$ be a proxy database. We define $\hat{M}_{P_k}$ to be the estimation result of the target database over the primary summary database $M_{P_k}$. Similarly, we define $\hat{M}_{P_l}$ to be the estimation result of the target database over the primary database $M_{P_l}$. The expressions of the estimators above are defined by applying the PP method, according to which the source databases are aggregated over the non-common dimensions first: . . . then, the linear indirect estimation method is applied: . . . and $C$ represents the common and common-target dimension groups. Let $\hat{M}_{P_k}$ and $\hat{M}_{P_l}$ be the estimates of the target database obtained by applying the primary databases $M_{P_k}$ and $M_{P_l}$ to $M_Q$, respectively. The primary database $M_{P_k}$ achieves better estimates than $M_{P_l}$.

Proof. Let the relative entropies of $\hat{M}_{P_k}$ and $\hat{M}_{P_l}$ be defined according to the expressions: . . . We show $D(\hat{M}_{P_k}, M_P) < D(\hat{M}_{P_l}, M_P)$, or $D(\hat{M}_{P_l}, M_P) - D(\hat{M}_{P_k}, M_P) > 0$, as follows: . . . ) and G = . . . , and according to Theorem 3.1 in Chapter 2 of (Kullback 1959), which states that the relative entropy obtained from distributions of the observations is positive, we reach the conclusion that F G log G > 0. Thus, $D(\hat{M}_{P_l}, M_P) - D(\hat{M}_{P_k}, M_P) > 0$, with equality if and only if: . . .

Theorem 2. Let $M_P(A^{C}_{P}, A^{TC}_{P}, A^{T\bar{C}}_{P})$ be a primary database, and let . . . be proxy databases. We define $\hat{M}_{P_k}$ to be the estimation result of the target database by applying the primary database to $M_{Q_k}$. Similarly, we define $\hat{M}_{P_l}$ to be the estimation result of the target database by applying the primary database to $M_{Q_l}$. The expressions of the estimators above are defined by applying the PP method as follows: . . .

Proof. Let the relative entropies of $\hat{M}_{P_k}$ and $\hat{M}_{P_l}$ be defined according to the following expressions: . . . We show $D(\hat{M}_{P_l}, M_P) - D(\hat{M}_{P_k}, M_P) > 0$ as follows: . . .

To summarize the discussion above, the steps to achieve maximum accuracy can be defined by the Procedure below. It is composed of three parts. Note that in step (3), the second part is called for the purpose of obtaining the proxy database which includes the maximum number of common dimensions with the primary databases.

Procedure
Input: Target database $DB_T$, multiple primary databases $DB_{PR_i}$ with 1 ≤ i ≤ n, and multiple proxy databases $DB_{PX_j}$ with 1 ≤ j ≤ m
Goal: Select two source databases to obtain maximum accuracy for the estimate of $DB_T$
Part 1 - Selection of the primary database
(1) Given that $M_T = M_{PR}$, start with selecting a primary database;
(2) Select the primary database whose dimensions cover the dimensions of all other primary databases (indicated by $A_{PR}$);
(3) If no such primary database exists, run Part 2 and then apply IPFP to the multiple primary databases with the zero approximation fixed to $DB_{PX}$ pre-aggregated over $A^{T\bar{C}}_{PX}$;
(4) Once $DB_{PR}$ has been chosen (step 2) or estimated (step 3), pre-aggregate the non-common dimensions;
Part 2 - Selection of the proxy database
(5) Consider only $DB_{PX}$ with dimensions $A_{PX} = A^{T\bar{C}}_{PX} \cup A_{PR}$;
(6) If there is no such proxy database, consider proxy databases that have $A_{PX} = A^{T\bar{C}}_{PX}$, with additional dimensions such that:
(a) if non-common, pre-aggregate the dimensions;
(b) if common, apply IPFP to the multiple proxy databases;
Part 3 - Estimation of the target database
(7) Apply the linear indirect estimation method to $DB_{PR}$ and $DB_{PX}$.
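The selection tests in steps (2) and (5) reduce to set comparisons over dimension names. The sketch below covers only these two tests, not the IPFP fallbacks of steps (3) and (6). The function names are hypothetical, and the dimension lists attached to the identifiers PR1 to PR4 and PX1 to PX4 are inferred from the discussion in Sections 1.1 and 3 rather than copied from Table 1, so they should be read as assumptions.

```python
def select_covering_primary(primaries):
    """Step (2): name of a primary database whose dimensions cover all others, else None."""
    all_dims = set().union(*(set(d) for d in primaries.values()))
    for name, dims in primaries.items():
        if set(dims) >= all_dims:
            return name
    return None   # step (3) would then estimate the primary database with IPFP


def select_proxy(proxies, primary_dims, missing_target_dims):
    """Step (5): name of a proxy whose dimensions equal primary dims plus missing target dims."""
    wanted = set(primary_dims) | set(missing_target_dims)
    for name, dims in proxies.items():
        if set(dims) == wanted:
            return name
    return None   # step (6) would then pre-aggregate or apply IPFP to the proxies


# Assumed dimension sets for Case 1 (inferred from the text).
primaries = {
    "PR1": ("State", "Age"),
    "PR2": ("State", "Labor status"),
    "PR3": ("Labor status", "Age"),
    "PR4": ("State", "Labor status", "Age"),
}
proxies = {
    "PX1": ("State", "Age", "Sex"),
    "PX2": ("State", "Labor status", "Sex"),
    "PX3": ("Labor status", "Age", "Sex"),
    "PX4": ("State", "Labor status", "Age", "Sex"),
}

case1_primary = select_covering_primary(primaries)
case1_proxy = select_proxy(proxies, primaries[case1_primary], {"Sex"})

# Case 2: only the first three primary databases are available.
case2_primary = select_covering_primary(
    {k: v for k, v in primaries.items() if k != "PR4"}
)
```

Under these assumptions, Case 1 selects PR4 and PX4 directly, while Case 2 finds no covering primary database and would fall through to step (3).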

EXPERIMENTAL RESULTS
We discuss the experimental results of applying our algorithmic approach to the four cases introduced in Subsection 1.1. For the experimental results, we use the values in the base data to evaluate the estimation errors. We start with Case 1. We note that $DB_{PR4}$ and $DB_{PX4}$ satisfy step (2) and step (5). In fact, they provide the most accurate results (see Table 5, first row). In Case 2, according to step (3), IPFP is applied to the given primary databases. As we mentioned in Section 3, the zero approximation is fixed to $DB_{PX4}$, which is pre-aggregated over the non-common target dimension. The convergence of the estimate of Income is achieved after five iteration cycles. Note that we could have fixed the zero approximation of IPFP to any of the primary databases in order to estimate the primary database, but these starting values affect the accuracy of the results. In fact, the average relative error of the target database is 0.1732, versus 0.1625 when applying step (3). Overall, we note that the accuracy of the results in Case 2 is close to that of Case 4. Similarly, the accuracy of the results in Case 1 is close to that of Case 3. With respect to Case 1, the accuracy of Case 3 is better than that of Case 2. It seems that the estimation of the proxy database does not significantly affect the accuracy of the results. However, this is not true of the estimation of the primary database (see the accuracy of Case 1 and Case 2). As expected, the accuracy of Case 4 is worse than that of the other cases.
In addition, we compare some accuracy results of the estimates. Specifically, in Table 6, we compare the accuracy results of the estimate of the target database obtained by applying each primary database to P(State, Labor status, Age, Sex) and the estimate of the primary database computed according to step (3) of the proposed procedure. Table 7 illustrates the accuracy results of the estimate of the target database obtained by applying to I(State, Labor status, Age) each given proxy database and the estimated proxy database computed according to step (6) of the proposed procedure.
Finally, Table 8 shows the accuracy results of the estimate of the target database obtained by applying the estimated primary database Î(State, Labor status, Age) to each given proxy database and to the estimated proxy database $\hat{P}$(State, Labor status, Age, Sex).

CONCLUSIONS
Given multiple primary and multiple proxy databases summarized over a large base cube database, we investigated the problem of selecting the source summary databases that provide the most precise estimate for a target database. The databases in each set share the same summary measure. We showed that the primary and proxy databases with the largest number of cells in common provide more accurate results. Our methodology is based on the principles of information entropy. Based on these results, we proposed an algorithmic approach for determining the steps to select or compute the source databases from multiple summary databases. To illustrate the proposed algorithm, we used some example databases and presented experimental results for them.
In the notation of Subsection 2.2, $C$, $\bar{C}$, and $T$ refer to the common, non-common, and target dimension groups, respectively. Note that $A^{C}_{P} = A^{C}_{Q}$, and $A^{TC}_{P} = A^{TC}_{Q}$. We use the notation $A_T$ for the group of target dimensions $\{A^k_T \mid 0 < k \le t\}$. Thus, $DB_T = M_T(A_T)$. Using the notation above, we have $A_T = A^{TC}_{P} \cup A^{T\bar{C}}_{P} \cup A^{T\bar{C}}_{Q}$. Note that $A^{T\bar{C}}_{Q}$ must always exist to make the proxy summarization meaningful. However, $A^{TC}_{P}$ and $A^{T\bar{C}}_{P}$ may or may not exist. Indeed, if $A^{T\bar{C}}_{Q}$ does not exist, then there is no need to use $DB_Q$, since the results can be obtained from $DB_P$ only. For instance, let us consider the source summary databases Income(Age, Labor status, Sex) and Population(State, Age, Race, Sex). Let us assume that the summary query expressed over them is Income(State). In this case, Income(State) is the target summary database, Population(State, Age, Race, Sex) is the proxy database, and Income(Age, Labor status, Sex) is the primary database. $A_T$ = {State} is the target dimension, where $A^{TC}_{Population} = A^{TC}_{Income} = \emptyset$; $A^{T\bar{C}}_{Population}$ = {State} and $A^{T\bar{C}}_{Income} = \emptyset$ are the non-common target dimensions; $A^{C}_{Population} = A^{C}_{Income}$ = {Age, Sex} are the common dimensions between the source summary databases; and $A^{\bar{C}}_{Population}$ = {Race} and $A^{\bar{C}}_{Income}$ = {Labor status} are the non-common dimensions. If the summary query expressed over the source databases is Income(State, Age), then $A_T$ = {State, Age} and accordingly $A^{TC}_{Population} = A^{TC}_{Income}$ = {Age}, $A^{T\bar{C}}_{Population}$ = {State}, $A^{T\bar{C}}_{Income} = \emptyset$, and $A^{C}_{Population} = A^{C}_{Income}$ = {Sex}.

Table 3: Primary databases

Table 4: Proxy databases

. . . ) to $M_{Q_k}$ and $M_{Q_l}$, respectively. The estimate $\hat{M}_{P_k}$ is more accurate than the estimate $\hat{M}_{P_l}$. Proof. Let the relative entropy of $\hat{M}_{P_k}$ . . .

Table 5: Accuracy results of selected primary and proxy databases in four cases

Table 8: ARE of Î(State, Labor status, Age, Sex) obtained by applying Î(State, Labor status, Age) to proxy databases
. . . Age, Labor status, Sex)   832   0.1625