5.2.1. The Univariate Case
Histograms are plotted to present the mixed and grouped distributions of EUIs for both cities. In the left panel of Figure 3, the mixed distribution shows two peaks, although there is no obvious separation in the overlapping part. The true distributions can, however, be seen quite clearly in the grouped (separated) distribution displayed in the right panel of Figure 3. One limitation of using a Gaussian distribution is that the value of EUI cannot be negative, so a truncated Gaussian may be a more appropriate representation. However, since all the truncated data points belonged to the Boston population, using a GMM will not affect the responsibility probabilities in the E-step. Thus, we still follow the Gaussian assumption. If the Gaussian property is severely violated, for example by the heat load variation among the bivariate variables, a Box-Cox transformation is required before implementing EM.
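As a minimal sketch of this pre-processing step, assuming the heat load series is held in a NumPy array (the synthetic data and the normality test used here are illustrative, not the paper's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Illustrative stand-in for the daily heat load variation (Ga);
# Box-Cox requires strictly positive values.
ga = rng.lognormal(mean=0.5, sigma=0.8, size=842)

# Check how strongly the Gaussian assumption is violated.
_, p_value = stats.normaltest(ga)

if p_value < 0.05:  # normality rejected: transform before running EM
    # boxcox picks lambda by maximizing the normal log-likelihood
    # and returns the transformed series together with that lambda.
    ga_transformed, lam = stats.boxcox(ga)
else:
    ga_transformed, lam = ga, 1.0  # lambda = 1 leaves the shape unchanged
```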
Another issue for the parameter estimation is determining the initial values of the parameters $(\mu_1, \mu_2, \sigma_1, \sigma_2)$. Comparing several combinations is not a difficult task in the univariate case, but it becomes a problem when the number of parameters is large. Thus, we tested a number of possible combinations by varying the initial choices of $\mu$ and $\sigma$, and applied the same scheme to the bivariate case. More details on this process can be found in previous work on the setting of initial values [58]. The choice was determined by considering whether the final estimation could represent both populations well. The empirical results showed that the choices of $\sigma_1$ and $\sigma_2$ did not seem to be dominant for the convergence, and we simply took both to be the sample standard deviation. On the other hand, close initial values for $\mu_1$ and $\mu_2$ failed to separate the populations: in most of those experiments, $\mu_1$ and $\mu_2$ converged to a single value. Thus, we initialized $\mu_1$ and $\mu_2$ by constraining them to the upper and lower 1/3 quantiles of the mixed population, respectively. Further, one hundred initial values for $\mu_1$ and one hundred for $\mu_2$ were randomly generated to obtain ten thousand combinations, from which we randomly selected ten for evaluating the performance.
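A sketch of this initialization scheme, with a synthetic stand-in for the mixed EUI sample; which component starts in the upper versus the lower third is an assumption here:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the mixed EUI sample of both cities (n = 842).
eui = np.concatenate([rng.normal(60, 15, 400), rng.normal(110, 20, 442)])

# Initial sigmas were not dominant for convergence, so both components
# simply start at the sample standard deviation of the mixed data.
sigma1_init = sigma2_init = eui.std(ddof=1)

# Initial mus are constrained to the upper / lower thirds of the mixed
# population so the two components start well separated.
mu1_candidates = rng.uniform(np.quantile(eui, 2 / 3), eui.max(), size=100)
mu2_candidates = rng.uniform(eui.min(), np.quantile(eui, 1 / 3), size=100)

# 100 x 100 = 10,000 combinations; ten are drawn at random for evaluation.
combos = [(m1, m2) for m1 in mu1_candidates for m2 in mu2_candidates]
picked = [combos[i] for i in rng.choice(len(combos), size=10, replace=False)]
```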
The ten sets of parameters are summarized in Table 1. For every parameter, there was no significant variation, and the mean values can be used to represent the estimates $\hat{\mu}$ and $\hat{\sigma}$. We also computed the absolute percentage errors between the mean and true values. Most of them were within 5%, while the overestimation of the standard deviation for Aarhus might be due to the slightly smaller estimate of its mean. To validate our argument about the initial $\sigma$s, we also replaced the fixed initial values with random variations of up to 25% in $\sigma_1$ and $\sigma_2$. As Table 2 shows, the results resembled Table 1, and the same conclusion could be drawn for the remaining parameters, which are not shown here. The log-likelihood trajectories in Figure 4 are also uniform: all of the experiments stopped within 15 updates and appeared to converge to the same point. In other words, it is sufficient to make inferences based on the current estimates.
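The EM recursion behind these experiments can be sketched as below. This is a generic two-component univariate implementation with an illustrative tolerance and iteration cap, not the authors' exact code:

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, mu, sigma, pi=0.5, max_iter=50, tol=1e-6):
    """EM for a two-component univariate GMM; mu and sigma are length-2."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    w = np.array([pi, 1.0 - pi])
    log_lik = []
    for _ in range(max_iter):
        # E-step: responsibility of each component for each point.
        dens = w * norm.pdf(x[:, None], mu, sigma)          # shape (n, 2)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates.
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        # Log-likelihood under the parameters used in this E-step.
        log_lik.append(np.log(dens.sum(axis=1)).sum())
        if len(log_lik) > 1 and abs(log_lik[-1] - log_lik[-2]) < tol:
            break
    return mu, sigma, w, log_lik
```

Tracking `log_lik` per update is what makes behavior like that in Figure 4 (convergence within 15 updates) directly observable.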
For classification, we created two scenarios and assigned the sample points to the population with the larger density. The two corresponding confusion matrices are presented in Table 3. We divided all the data points into four categories: true Boston, false Boston, true Aarhus, and false Aarhus. The accuracy, i.e., the share of both true classifications, is close to 90%.
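The density-based assignment and the four-category confusion matrix could look as follows; the component labeling (0 = Boston, 1 = Aarhus) is an assumption:

```python
import numpy as np
from scipy.stats import norm

def classify(x, mu, sigma, w):
    """Assign each point to the component with the larger weighted density."""
    dens = w * norm.pdf(x[:, None], mu, sigma)
    return dens.argmax(axis=1)          # 0 = Boston, 1 = Aarhus (assumed)

def confusion(y_true, y_pred):
    """Four categories: true/false Boston and true/false Aarhus."""
    tb = np.sum((y_true == 0) & (y_pred == 0))   # true Boston
    fb = np.sum((y_true == 1) & (y_pred == 0))   # false Boston
    ta = np.sum((y_true == 1) & (y_pred == 1))   # true Aarhus
    fa = np.sum((y_true == 0) & (y_pred == 1))   # false Aarhus
    accuracy = (tb + ta) / len(y_true)           # sum of true classifications
    return np.array([[tb, fb], [fa, ta]]), accuracy
```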
We then examined the fit between the theoretical and empirical proportions. Proportion here refers to the share for which Boston's EUI should theoretically and empirically account. The theoretical proportion is given by

$$P_t = \frac{\hat{\pi}_B \left[ \Phi_B(x_{q+1}) - \Phi_B(x_q) \right]}{\hat{\pi}_B \left[ \Phi_B(x_{q+1}) - \Phi_B(x_q) \right] + \hat{\pi}_A \left[ \Phi_A(x_{q+1}) - \Phi_A(x_q) \right]},$$

where $\Phi_B$ and $\Phi_A$ are the fitted Gaussian distribution functions for Boston and Aarhus, and $x_q$ corresponds to the probability quantile segmentation on the x-axis in Figure 5 for the mixed distribution. Similarly, the empirical proportion, $P_e$, counts the number of data points from Boston divided by all data points between two adjacent values of $x_q$. For example, if 100% of the observations in the first bin are taken from Boston, $P_e$ should be extremely close to 1 at the quantile 0. Both $P_t$ and $P_e$ are supposed to decrease because the mean EUI of Aarhus is greater than that of Boston. The results showed a good fit except for the quantiles close to 1 in the bottom right corner, which is explained by the long tail of Boston's EUI. However, there were only eight data points in total in this area, a minor deviation relative to the full sample of 842 points. The performance for Aarhus would look exactly the same with Figure 5 reversed vertically.
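Under the bin-based formulation given above, the two proportion curves might be computed as follows (names are illustrative; `labels` uses 0 for Boston as before):

```python
import numpy as np
from scipy.stats import norm

def proportions(x, labels, mu, sigma, w, step=0.05):
    """Theoretical vs. empirical Boston proportion between adjacent quantiles."""
    x_q = np.quantile(x, np.arange(0.0, 1.0 + step, step))
    # Theoretical: Boston's share of the fitted mixture mass in each bin.
    mass_b = w[0] * np.diff(norm.cdf(x_q, mu[0], sigma[0]))
    mass_a = w[1] * np.diff(norm.cdf(x_q, mu[1], sigma[1]))
    p_t = mass_b / (mass_b + mass_a)
    # Empirical: Boston points divided by all points in the same bin.
    p_e = np.array([
        (labels[(x >= lo) & (x < hi)] == 0).mean()
        for lo, hi in zip(x_q[:-1], x_q[1:])
    ])
    return p_t, p_e
```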
Given these results, we drew an additional two-step random sample from the distributions to examine the generated EUIs. In step 1, we created a vector storing a random sequence of 0s and 1s that indicate the population to which a generated point belongs. The probabilities are taken as $P(I = 1) = \hat{\pi}_B$ and $P(I = 0) = 1 - \hat{\pi}_B$, where $I$ is the indicator. The length of the vector equals that of the observed sample, namely 842. In step 2, Gaussian random samples were drawn for each population to constitute the generated data. The quantile–quantile plot is shown for both fixed and random initial $\sigma$s. As seen in Figure 6, the quantile values were taken every 5%. Unsurprisingly, the two settings for the initial $\sigma$s hardly differed, and almost all of the quantile values were located on the 45-degree line, meaning that the quantiles of the observed and generated samples matched each other. In this sense, the generated sample under the GMM is a reliable representation of the populations. Here we only used the true sample to validate a generated sample of the same size; when the method is used in an authentic context, the sample size will depend on the total number of buildings in the original cohort.
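The two-step sampling itself is straightforward to sketch; component 0 again stands for Boston, and the quantile grid mirrors the 5% spacing of Figure 6:

```python
import numpy as np

def generate_sample(n, mu, sigma, pi_b, rng=None):
    """Step 1: indicator draws with P(I = 1) = pi_b (Boston).
    Step 2: component-wise Gaussian draws selected by the indicator."""
    rng = np.random.default_rng() if rng is None else rng
    ind = rng.random(n) < pi_b                        # vector of 0s and 1s
    return np.where(ind,
                    rng.normal(mu[0], sigma[0], n),   # Boston draws
                    rng.normal(mu[1], sigma[1], n))   # Aarhus draws

# Q-Q comparison at every 5% quantile, e.g. for n = 842:
# qs = np.arange(0.05, 1.0, 0.05)
# plt.scatter(np.quantile(observed, qs),
#             np.quantile(generate_sample(842, mu, sigma, pi_b), qs))
```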
5.2.2. The Bivariate Case
Testing the bivariate case requires more parameters to be estimated. The selection of the initial values followed the same paradigm adopted for the univariate case. To preserve the Gaussian property, as mentioned in Section 5.2.1, the daily heat load variation (Ga) was treated with a Box-Cox transformation [59]. Since the estimation then became more complex, we also increased the number of experiments for determining the estimates. We picked 20 initial values for each of the population means $\mu_{\mathrm{EUI},1}$, $\mu_{\mathrm{Ga},1}$, $\mu_{\mathrm{EUI},2}$, and $\mu_{\mathrm{Ga},2}$, so the number of combinations became $20^4 = 160{,}000$. In all the experiments, we highlighted the combinations with a log-likelihood in the top 10%. The resulting pattern is presented in Figure 7, where the 400 mean combinations for the population '1951–1960' are arranged along the x-axis and the 400 for the population 'After 2015' along the y-axis, ordered from {minimum EUI, minimum Ga} through {minimum EUI, maximum Ga} to {maximum EUI, maximum Ga}. The figure shows slight periodic patterns among the larger $20 \times 20$ grids, and higher log-likelihoods were slightly denser in the top-left part. In almost every larger grid, however, high log-likelihoods could be found. Thus, the selection of initial values for the EM algorithm in the bivariate case appears largely insensitive to the starting point: estimates with high log-likelihood values are reached from almost anywhere in the grid.
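The scan over initial means could be reproduced along these lines, here using scikit-learn's GaussianMixture rather than the authors' own EM code; fitting all $20^4$ combinations is computationally heavy, so treat this as a sketch:

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

def scan_initial_means(data, eui1, ga1, eui2, ga2, keep=0.10):
    """Fit a 2-component bivariate GMM for every initial-mean combination.

    The four grid arguments hold the candidate values (20 each in the
    paper's setting, i.e. 20**4 = 160,000 combinations in total)."""
    records = []
    for m in itertools.product(eui1, ga1, eui2, ga2):
        init = np.array([[m[0], m[1]], [m[2], m[3]]])
        gm = GaussianMixture(n_components=2, means_init=init).fit(data)
        # score() is the mean per-sample log-likelihood of the fit.
        records.append((m, gm.score(data) * len(data)))
    loglik = np.array([ll for _, ll in records])
    top10 = loglik >= np.quantile(loglik, 1.0 - keep)   # flag the top 10%
    return records, top10
```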
A similar picture emerged when we summarized the results of the top-10% experiments in Table 4. In the bivariate case, the overall errors decreased significantly compared with the univariate case; the majority were now below 3%. The reason for the bivariate model's success might be that it can exploit more of the energy performance features to separate the populations. It should be noted that the estimated $\mu_{\mathrm{Ga}}$ was not equal to the transformed mean value of Ga because the Box-Cox transformation is nonlinear (see the short demonstration below); thus, we only show the results for the transformed values here and present both transformed and non-transformed Ga in the generative model evaluation later on. The classification accuracy was computed in concordance with the univariate case and is given in Table 5. Given the better parameter estimates, the overall accuracy was over 99%.
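A quick synthetic demonstration of the nonlinearity point made above: the mean of the Box-Cox-transformed series is not the transform of the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ga = rng.lognormal(0.5, 0.8, 1000)      # synthetic heat load variation

ga_t, lam = stats.boxcox(ga)
mean_of_transform = ga_t.mean()
transform_of_mean = stats.boxcox(np.array([ga.mean()]), lmbda=lam)[0]
print(mean_of_transform, transform_of_mean)   # the two values differ
```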
The estimates obtained in Table 4 were used to generate density contours, distinguishing dense and sparse areas on the two-dimensional surface. We compared these with the true distributions because the latter disclose the real scales of the densities. The contours for the transformed heat load variation are displayed in the left panel of Figure 8, while the result for the non-transformed distributions is shown in the right panel. The generative models are supposed to characterize the distributions in both dense and sparse areas. Both panels have clear, observable centers, and both centers coincide with the densest part of the real data. In other words, the generative models represent the real data to an acceptable degree. As discussed in the univariate case, the actual number of generated samples will depend on the building cohort for which a city needs to evaluate energy performance with insufficient data.
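A plotting sketch for panels like those in Figure 8, assuming fitted mean vectors and covariance matrices for each population; the axis labels are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def plot_density_contours(data, mean, cov, ax=None):
    """Overlay a fitted bivariate Gaussian's contours on the raw points."""
    if ax is None:
        ax = plt.gca()
    ax.scatter(data[:, 0], data[:, 1], s=5, alpha=0.3)
    x = np.linspace(data[:, 0].min(), data[:, 0].max(), 200)
    y = np.linspace(data[:, 1].min(), data[:, 1].max(), 200)
    xx, yy = np.meshgrid(x, y)
    # Dense vs. sparse areas are read off the contour levels.
    ax.contour(xx, yy, multivariate_normal(mean, cov).pdf(np.dstack([xx, yy])))
    ax.set_xlabel("EUI")
    ax.set_ylabel("Ga (Box-Cox transformed)")
```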