In this section we look at a number of statistical models having two parameters, and try to understand what can be learnt from looking at contour maps of the likelihood surface and how to calculate the MLEs. In principle the same methodology applies in higher dimensions, but exploring graphically in more than two dimensions becomes problematic. In each example it is assumed that $X_1, \ldots, X_n$ are independent random variables.
Example 3.5: Normal Data, ctd.
Let $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$, so that
\[ f(x_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - \mu)^2}{2\sigma^2} \right\}, \]
with parameter space $\{(\mu, \sigma): -\infty < \mu < \infty,\ \sigma > 0\}$.
Obtain the likelihood function for $(\mu, \sigma)$ and the MLE $(\hat{\mu}, \hat{\sigma})$.
Then,
\[ L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - \mu)^2}{2\sigma^2} \right\} \]
and so, ignoring additive constants,
\[ \ell(\mu, \sigma) = -n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2. \]
First we calculate the MLEs; to do this we differentiate with respect to $\mu$ and $\sigma$ and obtain:
\[ \frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu), \qquad \frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (x_i - \mu)^2. \]
For a maximum we need to obtain $(\hat{\mu}, \hat{\sigma})$ such that
\[ \frac{\partial \ell}{\partial \mu}\bigg|_{(\hat{\mu}, \hat{\sigma})} = 0 \quad \text{and} \quad \frac{\partial \ell}{\partial \sigma}\bigg|_{(\hat{\mu}, \hat{\sigma})} = 0. \]
Thus,
\[ \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} (x_i - \hat{\mu}) = 0, \]
and so $\hat{\mu} = \bar{x}$, and
\[ -\frac{n}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 = 0, \]
giving $\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$.
Thus $\hat{\mu}$ is the sample mean and $\hat{\sigma}$ is the sample standard deviation. This seems reasonable given that $\mu$ and $\sigma$ are the population mean and standard deviation respectively.
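As a quick numerical check (not part of the derivation; the seed and data below are illustrative choices of mine), the closed-form MLEs can be compared with a direct numerical maximisation of the log-likelihood:
# numerical check of the Normal MLEs (illustrative data; not from the notes)
set.seed(1)
x <- rnorm(25, mean = 0, sd = 1)
# closed-form MLEs: sample mean and root of the average squared deviation
muhat <- mean(x)
sigmahat <- sqrt(mean((x - muhat)^2))
# negative log-likelihood (constants omitted), parametrised by log(sigma) for stability
negll <- function(theta) {
  mu <- theta[1]; sigma <- exp(theta[2])
  length(x) * log(sigma) + sum((x - mu)^2) / (2 * sigma^2)
}
fit <- optim(c(0, 0), negll)
c(muhat, sigmahat)              # closed-form values
c(fit$par[1], exp(fit$par[2]))  # numerical values (should agree closely)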
We will now look at this example in a number of ways to try to understand how the likelihood is determined by the sample data.
Sample similarities and variation. Figure 3.2 shows histograms of 4 data sets, each of size 25, simulated from the $N(0,1)$ distribution. Thus, the true value of $(\mu, \sigma)$ is $(0, 1)$. For each sample, contours of the log-likelihood function for $(\mu, \sigma)$ have been plotted in steps of one unit down from the maximum, which is indicated by an asterisk.
There are similarities in all four cases:
The shape of the contours is much the same in each case: reasonably elliptical, with principal axes parallel to the coordinate axes.
There is some asymmetry in the $\sigma$ direction, which becomes more pronounced as we move further from the maximum.
The true parameter value has reasonably high likelihood, and the MLE seems to be a quite adequate estimate of the true value.
We can also see how sample characteristics affect the likelihood. For example:
The top-right sample seems to be rather more dispersed than the others; consequently, the likelihood gives greater values to large rather than small values of $\sigma$ compared to the other examples.
The bottom-left sample seems slightly shifted to the left (of zero); thus, the likelihood is greater for negative rather than positive values of $\mu$.
Although all of the samples are of equal size and drawn from
the same population, variation across the samples leads to
variation in some characteristics of the likelihood surface (the
location of the maximum, the gradient of the contours), while other
features (the overall shape) remain roughly the same.
R code to produce Figure 3.2 is given at the end of the example.
Densities corresponding to parameter pairs. Now consider another sample simulated from a Normal distribution. A histogram of the data is shown in the left panel of Figure LABEL:fignlikloc1, together with the probability density function of the distribution from which they were simulated superimposed in bold.
The right panel of Figure LABEL:fignlikloc1 plots contours of the log-likelihood function at steps of one unit down from the maximum.
Also identified on the contour plot are the maximum likelihood estimate (labelled ‘1’) and three other points (labelled ‘2’ to ‘4’). The probability density curves corresponding to these values have been superimposed on the histogram.
The curves representing the data have the following features:
Curve ‘1’ is the most consistent with the sample data, and is a quite reasonable estimate of the true model.
Curve ‘4’ is not much worse, as might be expected since the likelihood is not much below that of the maximum.
Curves ‘2’ and ‘3’ still have reasonably high likelihood, but we can see from the histogram why they have less support from the sample: curve ‘2’ has the mean about right, but its variance is greater than that of the data, while curve ‘3’ seems to have the mean too low.
All of these conclusions are consistent with the information provided by the likelihood contours.
Effect of sample size, $n$. Figure 3.5 shows simulated datasets from the $N(0,1)$ distribution, and the corresponding log-likelihood contour plots, for samples of increasing size $n$. The contour plots are drawn on the same scales.
The main point to notice is that as the sample size increases, the contour gradients become steeper. Thus, with more data, smaller regions around the maximum likelihood estimate are found to be ‘plausible’ values of $(\mu, \sigma)$, as judged in terms of likelihood.
This is what we would expect: with more data our inferences about $(\mu, \sigma)$ should be more precise.
# produce 2x4 graphics display
par(mfrow=c(2,4))
# function to calc. l(theta) for normal data
lliknorm<-function(x,mu,sigma){
  # sample size
  n<-length(x)
  # log-lhd equation (with additive consts removed)
  -n*log(sigma) - 1/(2*sigma^2) * sum((x-mu)^2)
} # end of function
# optional: will mean you get the same results as me:
set.seed(330)
# for loop to produce 4 variants
for(i in 1:4){
  # generate the data (25 normal deviates)
  x<-rnorm(25,mean=0,sd=1)
  # histogram of the data (breakpoints in hist forced)
  hist(x,xlim=c(-3,3),breaks=(-3):3,main=NULL)
  # grid of mu/sigma values to evaluate
  mu<-seq(from=-1.5,to=1.5,length=100)
  sigma<-seq(from=0.5,to=2.5,length=100)
  # evaluate llik for each (mu,sigma) pair
  llmat<-matrix(nrow=100,ncol=100)
  for(j in 1:100) for(k in 1:100){
    llmat[j,k]<-lliknorm(x,mu[j],sigma[k])
  }
  # maximum of the log-likelihood over the grid
  maxll<-max(llmat)
  # where to draw the contours (one unit apart, down from the maximum)
  levelsdrawn<-maxll-0:4
  # contour plot of log-lhd
  contour(mu,sigma,llmat,drawlabels=FALSE,levels=levelsdrawn,
          ylab="sigma",xlab="mu")
  # mark the (grid) maximum with an asterisk
  maxind<-which(llmat==maxll,arr.ind=TRUE)
  points(mu[maxind[1]],sigma[maxind[2]],pch=8)
} # end of for loop
Recall that in MATH235 we studied the one-parameter Uniform distribution and found that although we could use likelihood techniques, the standard asymptotic results and techniques did not apply because the support of the distribution depends on the parameter, so it does not satisfy regularity condition R1. This is also true in the two-parameter case — the problem is non-regular.
Example 3.6: Uniform Data.
Suppose $X_i \sim \text{Uniform}(a, b)$ for $i = 1, \ldots, n$. Hence,
\[ f(x_i; a, b) = \frac{1}{b - a}, \qquad a \le x_i \le b, \]
with parameter space $\{(a, b): a < b\}$, and $f(x_i; a, b) = 0$ otherwise.
So,
\[ L(a, b) = \prod_{i=1}^{n} \frac{1}{b - a} = (b - a)^{-n}, \qquad \text{provided } a \le \min_i x_i \text{ and } b \ge \max_i x_i \]
(and $L(a, b) = 0$ otherwise), and
\[ \ell(a, b) = -n \log(b - a). \]
This is a model for which the log-likelihood is not differentiable at the MLE. However, it still helps to consider the partial derivatives with respect to $a$ and $b$:
\[ \frac{\partial \ell}{\partial a} = \frac{n}{b - a}, \qquad \frac{\partial \ell}{\partial b} = -\frac{n}{b - a}. \]
Since $b > a$, these tell us that the log-likelihood is increasing as $a$ increases and decreasing as $b$ increases.
As the log-likelihood is $-\infty$ outside the region $\{a \le \min_i x_i,\ b \ge \max_i x_i\}$, we wish to choose $a$ to be its largest value in this region, and $b$ the smallest.
This gives $\hat{a} = \min_i x_i$ and $\hat{b} = \max_i x_i$.
Figure 3.6 shows the histogram of a sample of size 25 simulated from a Uniform distribution, and a contour plot of the associated log-likelihood function; as usual, contours are plotted in steps of one unit down from the maximum.
It is also clear from the figure that $L(a, b)$ (or $\ell(a, b)$) will be maximized by taking $b - a$ as small as possible, subject to the constraint that each of the $x_i$ lies in the interval $[a, b]$. This point is plotted in Figure 3.6.
The contours are highly non-elliptical: they are straight lines parallel to the line $b = a$. (Note, however, that the contours are truncated along the lines $a = \min_i x_i$ and $b = \max_i x_i$, which again correspond to the limits of the parameter space having observed the data.)
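A short R sketch of this behaviour (the true values $a = 0$, $b = 1$, the seed and the grid below are my illustrative choices, not the values used for Figure 3.6):
# illustrative sketch: Uniform(a, b) log-likelihood
llikunif <- function(x, a, b) {
  # outside the admissible region the likelihood is zero; return NA for plotting
  if (a >= b || a > min(x) || b < max(x)) return(NA)
  -length(x) * log(b - a)
}
set.seed(330)
x <- runif(25, min = 0, max = 1)
c(min(x), max(x))               # the MLEs: sample minimum and maximum
a <- seq(from = -0.5, to = 0.5, length = 100)
b <- seq(from = 0.5, to = 1.5, length = 100)
llmat <- outer(a, b, Vectorize(function(a, b) llikunif(x, a, b)))
contour(a, b, llmat, levels = max(llmat, na.rm = TRUE) - 0:4,
        drawlabels = FALSE, xlab = "a", ylab = "b")
points(min(x), max(x), pch = 8) # mark the MLE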
Example 3.7: Gamma Distribution.
Let $X_i \sim \text{Gamma}(\alpha, \beta)$, so that $E(X_i) = \alpha/\beta$ and $\text{Var}(X_i) = \alpha/\beta^2$. Then
\[ f(x_i; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x_i^{\alpha - 1} e^{-\beta x_i}, \qquad x_i > 0, \]
where $\Gamma(\cdot)$ is the gamma function. It follows that
\[ \ell(\alpha, \beta) = n\alpha \log\beta - n \log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^{n} \log x_i - \beta \sum_{i=1}^{n} x_i. \]
Thus,
\[ \frac{\partial \ell}{\partial \alpha} = n\log\beta - n\psi(\alpha) + \sum_{i=1}^{n} \log x_i. \]
Note that $\psi(\alpha) = \frac{d}{d\alpha}\log\Gamma(\alpha)$ (the digamma function) may look complicated, but you can think of it simply as a function of $\alpha$. Also,
\[ \frac{\partial \ell}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^{n} x_i, \]
so to obtain $(\hat{\alpha}, \hat{\beta})$ we need to solve
\[ n\log\hat{\beta} - n\psi(\hat{\alpha}) + \sum_{i=1}^{n} \log x_i = 0 \quad \text{and} \quad \frac{n\hat{\alpha}}{\hat{\beta}} - \sum_{i=1}^{n} x_i = 0. \]
Hence, from the second equation,
\[ \hat{\beta} = \frac{n\hat{\alpha}}{\sum_i x_i} = \frac{\hat{\alpha}}{\bar{x}}, \]
and so replacing $\hat{\beta}$ in terms of $\hat{\alpha}$ in the first equation gives
\[ n\log\left(\frac{\hat{\alpha}}{\bar{x}}\right) - n\psi(\hat{\alpha}) + \sum_{i=1}^{n} \log x_i = 0. \]
This latter equation can only be solved numerically.
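For illustration only (the true parameter values, sample size and seed below are assumptions of mine, not the values used in the notes), the equation in $\hat{\alpha}$ can be solved in R with uniroot, after which $\hat{\beta} = \hat{\alpha}/\bar{x}$:
# illustrative numerical solution of the Gamma MLE equation
set.seed(330)
x <- rgamma(50, shape = 2, rate = 1)   # assumed true values: alpha = 2, beta = 1
# equation in alpha alone: log(alpha) - digamma(alpha) = log(mean(x)) - mean(log(x))
g <- function(alpha) log(alpha) - digamma(alpha) - (log(mean(x)) - mean(log(x)))
alphahat <- uniroot(g, interval = c(0.01, 100))$root
betahat <- alphahat / mean(x)          # from the second likelihood equation
c(alphahat, betahat)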
For a particular choice of true parameter values, Figure 3.7 shows a simulated Gamma sample and the associated log-likelihood contour plot.
In this case the maximum likelihood estimate of $(\alpha, \beta)$ appears to be a good estimate of the true value, which itself lies inside the highest contour. The contours are near-elliptical, but the principal axes are not parallel to the coordinate axes.
This has implications for inference, suggesting that estimated values of $\alpha$ that are quite different from $\hat{\alpha}$ are plausible, provided the estimate of $\beta$ changes accordingly.
This is understandable if we think about the effect of the parameters $\alpha$ and $\beta$: $\alpha$ is a shape parameter, while $\beta$ (a rate parameter) controls the scale. Quite similar models can be obtained by, say, changing the scale of the distribution (and thus the dispersion of the data), provided the shape of the distribution is also distorted somewhat.
This aspect can be seen quite clearly from Figure LABEL:figgamlikloc1. This is a similar exercise to that carried out for the Normal distribution, but now based on the simulated Gamma data. The left panel shows a histogram of the data, with the true Gamma density shown in bold.
The right panel plots the contours of the log-likelihood function, with three points marked. Point ‘1’ is the maximum likelihood estimate; point ‘2’ is on the edge of the first likelihood contour; point ‘3’ is closer to the maximum likelihood estimate than point ‘2’, but has a considerably lower likelihood.
The corresponding model densities are also superimposed on the left panel. We see that:
Curves ‘1’ and ‘2’ are actually quite similar, as a consequence of the compensation between the parameters $\alpha$ and $\beta$;
however, the relatively small change from point ‘1’ to point ‘3’ has led to a substantially different curve, which seems to represent the data less well. This is a consequence of the fact that $\alpha$ and $\beta$ have not moved in a way that allows their effects to compensate for each other.
Note, in fact, that since the $\text{Gamma}(\alpha, \beta)$ distribution has expectation $\alpha/\beta$, it follows that all points on the line $\alpha/\beta = \hat{\alpha}/\hat{\beta}$ (equivalently $\alpha = \bar{x}\beta$) have constant mean. Thus, it might be anticipated that the contours should be almost parallel to this line.
Example 3.8: Simple Linear Regression.
We now look at a simple regression model:
\[ Y_i = \alpha + \beta x_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2) \text{ independently}, \]
with (known) explanatory variables $x_1, \ldots, x_n$ and, for this example, known variance $\sigma^2$. Thus,
$E(Y_i) = \alpha + \beta x_i$ and $\text{Var}(Y_i) = \sigma^2$.
The likelihood is
\[ L(\alpha, \beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \right\} \]
and, ignoring additive constants,
\[ \ell(\alpha, \beta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \]
To obtain $(\hat{\alpha}, \hat{\beta})$:
\[ \frac{\partial \ell}{\partial \alpha} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i), \qquad \frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i). \]
For a maximum we find $(\hat{\alpha}, \hat{\beta})$ such that both partial derivatives are zero.
Thus from the first equation we have
\[ \hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}. \]
From the second equation we have
\[ \hat{\beta} = \frac{\sum_i x_i (y_i - \bar{y})}{\sum_i x_i (x_i - \bar{x})}. \]
This expression for $\hat{\beta}$ is identical to
\[ \hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}. \]
(Note: these coincide with the least-squares estimates; see MATH235.)
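A quick check in R (with simulated data and true parameter values of my own choosing, purely for illustration) confirms that these formulas agree with the least-squares fit from lm():
# quick check that the MLE formulas agree with lm() (illustrative data)
set.seed(330)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 1)   # assumed true values: alpha = 2, beta = 0.5
betahat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alphahat <- mean(y) - betahat * mean(x)
c(alphahat, betahat)
coef(lm(y ~ x))                        # least-squares fit; should agree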
Example 3.9: Linear Regression with unknown variance.
Now consider $\sigma$ unknown. Let $Y_i = \alpha + \beta x_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$ independently and (known) explanatory variables $x_1, \ldots, x_n$. Thus, $E(Y_i) = \alpha + \beta x_i$ and $\text{Var}(Y_i) = \sigma^2$.
So,
\[ L(\alpha, \beta, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma^2} \right\} \]
and, ignoring additive constants,
\[ \ell(\alpha, \beta, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \]
Since this is a three-parameter problem, it is not possible to draw a single contour plot of the log-likelihood. Instead, we look at the three two-parameter likelihoods obtained by fixing each parameter in turn at its true value and contouring the log-likelihood function of the other two (a short R sketch of this idea is given after the observations below).
Of course, with real data this cannot be done, since the true value of each parameter would be unknown; one possibility is to fix each parameter in turn at its maximum likelihood estimate. This issue will be considered in more detail in Chapter 5.
The following observations can be made:
The $(\alpha, \beta)$ likelihood has near-elliptical contours, with principal axes not parallel to the coordinate axes. Again, the association between the individual parameter estimates is induced because the parameter values partially compensate for each other:
$E(Y_i) = \alpha + \beta x_i$, so that increasing $\alpha$ can lead to the same mean if $\beta$ is decreased accordingly.
(If you drew a regression line ‘by eye’, you could perhaps increase the intercept a little provided you decreased the gradient; the likelihood surface is doing precisely this.)
The likelihoods for $(\alpha, \sigma)$ and $(\beta, \sigma)$ are also near-elliptical, but with axes parallel to the coordinate axes.
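The sketch promised above, fixing $\sigma$ at its true value and contouring the $(\alpha, \beta)$ log-likelihood (the data, true values and grid limits are illustrative assumptions of mine):
# illustrative sketch: fix sigma at its true value, contour the (alpha, beta) log-likelihood
set.seed(330)
x <- 1:20
sigma <- 1                                # fixed at its (assumed) true value
y <- 2 + 0.5 * x + rnorm(20, sd = sigma)  # assumed true values: alpha = 2, beta = 0.5
llikreg <- function(alpha, beta) {
  -length(y) * log(sigma) - sum((y - alpha - beta * x)^2) / (2 * sigma^2)
}
alpha <- seq(from = 0, to = 4, length = 100)
beta  <- seq(from = 0.3, to = 0.7, length = 100)
llmat <- outer(alpha, beta, Vectorize(llikreg))
contour(alpha, beta, llmat, levels = max(llmat) - 0:4,
        drawlabels = FALSE, xlab = "alpha", ylab = "beta")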
Example 3.10: Multinomial Data.
Consider an experiment which has three possible outcomes, with respective probabilities $p_1, p_2, p_3$ (subject to $p_1 + p_2 + p_3 = 1$). A concrete example is a football match, with outcomes win, lose, draw. Then suppose that $n$ independent trials of the experiment are undertaken and that $(X_1, X_2, X_3)$ represents a count of how many times each outcome occurred (so that $X_1 + X_2 + X_3 = n$). The variable $(X_1, X_2, X_3)$ is said to follow a multinomial distribution.
Simple probability arguments, which generalise those leading to the binomial distribution, lead to the probability mass function
\[ P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} p_3^{x_3}, \]
where $x_1 + x_2 + x_3 = n$. Moreover, although $(p_1, p_2, p_3)$ is three-dimensional, the constraint that $p_1 + p_2 + p_3 = 1$ means there are only two independent parameters (the third is fixed once the other two are known).
Thus we may think of the likelihood as a function of just $p_1$ and $p_2$, say $L(p_1, p_2)$. Hence, the likelihood based on a single realisation $(x_1, x_2, x_3)$ is
\[ L(p_1, p_2) = \frac{n!}{x_1!\, x_2!\, x_3!}\, p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{x_3}, \]
and so, ignoring additive constants,
\[ \ell(p_1, p_2) = x_1 \log p_1 + x_2 \log p_2 + x_3 \log(1 - p_1 - p_2), \]
with $p_1, p_2 > 0$ and $p_1 + p_2 < 1$.
A simulated data set, for particular values of $n$ and $(p_1, p_2, p_3)$, gives the log-likelihood whose contours are plotted in Figure LABEL:figmulti.ps. The contours in this case are near-elliptical close to the maximum, but have a distorted shape away from the maximum due to the constraints on the parameters. The MLE also seems to give a quite reasonable estimate of $(p_1, p_2)$.
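A minimal R sketch of this calculation (sample size, true probabilities and grid are my own illustrative choices, not the values used for the figure):
# multinomial log-likelihood in (p1, p2), with p3 = 1 - p1 - p2 (illustrative)
set.seed(330)
xobs <- as.vector(rmultinom(1, size = 50, prob = c(0.5, 0.3, 0.2)))
llikmult <- function(p1, p2) {
  if (p1 <= 0 || p2 <= 0 || p1 + p2 >= 1) return(NA)
  xobs[1] * log(p1) + xobs[2] * log(p2) + xobs[3] * log(1 - p1 - p2)
}
p1 <- seq(from = 0.01, to = 0.99, length = 150)
p2 <- seq(from = 0.01, to = 0.99, length = 150)
llmat <- outer(p1, p2, Vectorize(llikmult))
contour(p1, p2, llmat, levels = max(llmat, na.rm = TRUE) - 0:4,
        drawlabels = FALSE, xlab = "p1", ylab = "p2")
points(xobs[1] / 50, xobs[2] / 50, pch = 8)   # MLE: the sample proportions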
Example 3.11: Reparameterised Regression.
The feature observed in the regression examples, of an association between the parameters $\alpha$ and $\beta$ in the (log-)likelihood surface, is rather problematic, because it means that any statements made about estimates of $\beta$ will always depend on the associated estimate of $\alpha$.
It is much more desirable to have inferences about two different parameters disentangled, so that what we say about one is independent of what we say about the other.
This requires that the contours of the (log-)likelihood surface have principal axes parallel to the coordinate axes, as was the case for each of the other parameter pairs.
In many situations, careful re-parametrisation of a model can overcome this problem. For example, we anticipated that the problem in the regression example was due to a compensation between $\alpha$ and $\beta$ as the model attempted to match the mean of the data.
This suggests the following reformulation of the problem: $Y_i = \alpha^* + \beta(x_i - \bar{x}) + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$. Thus, the model is identical to that of the earlier regression example, with
$E(Y_i) = \alpha^* + \beta(x_i - \bar{x})$ and $\alpha^* = \alpha + \beta\bar{x}$. Now,
\[ E(Y_i) = \alpha^* \quad \text{when } x_i = \bar{x}, \]
so that $\alpha^*$ represents the value of the regression mean when $x = \bar{x}$.
Now, to see why this improves things, think again about drawing a regression line ‘by eye’, but this time, instead of choosing the intercept and then the gradient, you choose a value for $\alpha^*$ (i.e. the height of the line at the centre of the cloud of points), and then the gradient. What you should find is that you can (more or less) make a decision about the gradient which is unaffected by your decision about the value of $\alpha^*$. Likelihood-based techniques benefit from the re-parametrisation for the same reasons.
Working with the new parameters, and again ignoring additive constants,
\[ \ell(\alpha^*, \beta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \alpha^* - \beta(x_i - \bar{x}) \right)^2. \]
To calculate the MLEs for the new parameters, we can proceed as before. Thus
\[ \frac{\partial \ell}{\partial \alpha^*} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( y_i - \alpha^* - \beta(x_i - \bar{x}) \right), \qquad \frac{\partial \ell}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \bar{x}) \left( y_i - \alpha^* - \beta(x_i - \bar{x}) \right). \]
Now
\[ \sum_{i=1}^{n} \left( y_i - \alpha^* - \beta(x_i - \bar{x}) \right) = \sum_{i=1}^{n} y_i - n\alpha^*, \]
as $\sum_i (x_i - \bar{x}) = 0$. Thus setting this partial derivative to $0$ gives $\hat{\alpha}^* = \bar{y}$. Setting the other partial derivative to $0$, and substituting $\bar{y}$ for $\alpha^*$, gives
\[ \sum_{i=1}^{n} (x_i - \bar{x}) \left( y_i - \bar{y} - \hat{\beta}(x_i - \bar{x}) \right) = 0, \]
so
\[ \hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \]
as before.
Note that these MLEs could also be obtained from the MLEs of the earlier example using the multiparameter extension of the invariance principle: $\alpha^* = \alpha + \beta\bar{x}$, so $\hat{\alpha}^* = \hat{\alpha} + \hat{\beta}\bar{x}$; and $\beta$ is unchanged by the re-parametrisation, so its MLE is unchanged too. But as $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, then $\hat{\alpha}^* = \bar{y}$.
The corresponding contour plot of the log-likelihood is shown in Figure LABEL:figreparreg.ps. Notice now that the contours have principal axes seemingly parallel to the coordinate axes.
Comment: the information in the $(\alpha^*, \beta)$ likelihood is exactly that of the $(\alpha, \beta)$ likelihood, but viewed from a different angle. The important thing, however, is that from the new angle, statements about each of the two parameters can be made regardless of the value of the other parameter.
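To see the effect of the re-parametrisation in R (simulated data, true values and grid ranges are my own illustrative choices), one can contour the log-likelihood in both parametrisations side by side:
# illustrative comparison of the (alpha, beta) and (alpha*, beta) contour plots
set.seed(330)
x <- 1:20; sigma <- 1
y <- 2 + 0.5 * x + rnorm(20, sd = sigma)  # assumed true values: alpha = 2, beta = 0.5
xc <- x - mean(x)                         # centred covariate for the re-parametrisation
llik <- function(a, b, xx) -sum((y - a - b * xx)^2) / (2 * sigma^2)
beta <- seq(from = 0.3, to = 0.7, length = 100)
par(mfrow = c(1, 2))
# original parametrisation: intercept alpha and slope beta
alpha <- seq(from = 0, to = 4, length = 100)
ll1 <- outer(alpha, beta, Vectorize(function(a, b) llik(a, b, x)))
contour(alpha, beta, ll1, levels = max(ll1) - 0:4, drawlabels = FALSE,
        xlab = "alpha", ylab = "beta")
# re-parametrisation: alpha* = alpha + beta * mean(x), covariate centred at its mean
alphastar <- seq(from = mean(y) - 1, to = mean(y) + 1, length = 100)
ll2 <- outer(alphastar, beta, Vectorize(function(a, b) llik(a, b, xc)))
contour(alphastar, beta, ll2, levels = max(ll2) - 0:4, drawlabels = FALSE,
        xlab = "alpha*", ylab = "beta")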
Example 3.12: Logistic Regression.
In a binomial regression experiment we have $Y_i \sim \text{Bernoulli}(p_i)$, where $p_i$ is some function of a covariate $x_i$ associated with individual $i$. To make things more concrete (if overly simplistic), suppose $Y_i$ is 1 or 0 according to whether individual $i$ passes an exam or not, and $x_i$ is the amount of time individual $i$ studies for the exam.
We might anticipate that the greater the value of $x_i$, the greater the value of $p_i$, the probability of passing. But using a relationship like
\[ p_i = \alpha + \beta x_i \]
is problematic, since the values of $p_i$ have to lie between 0 and 1. Thus, it is usual to suppose that $p_i$ is related to $x_i$ non-linearly, in such a way that the constraints on $p_i$ are respected.
One relationship often used is the logistic relationship:
\[ p_i = \frac{\exp(\alpha + \beta x_i)}{1 + \exp(\alpha + \beta x_i)}. \]
Thus, if $\beta$ is positive, $p_i$ increases with $x_i$; if $\beta = 0$, then $p_i$ is constant (equal to $e^{\alpha}/(1 + e^{\alpha})$) for all individuals; if $\beta$ is negative, then $p_i$ decreases with $x_i$. In each case, though, $p_i$ lies between 0 and 1.
The likelihood contribution of individual $i$ is
\[ L_i(\alpha, \beta) = p_i^{y_i} (1 - p_i)^{1 - y_i}. \]
Therefore, the overall likelihood for $(\alpha, \beta)$ is
\[ L(\alpha, \beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \]
so
\[ \ell(\alpha, \beta) = \sum_{1} \log p_i + \sum_{0} \log(1 - p_i), \]
where $\sum_{1}$ is a sum over the individuals for whom $y_i = 1$ and $\sum_{0}$ is a sum over the individuals for whom $y_i = 0$.
A simulated set of data is shown in Figure 3.12; notice that, since the true value of $\beta$ is positive, the chance of $y_i$ being 1 rather than 0 increases with $x_i$.
A contour plot of the log-likelihood for these data is shown in Figure LABEL:figlog.ps (left).
As usual, the maximum likelihood estimate provides a reasonable estimate of the true value of $(\alpha, \beta)$.
The contours are near-elliptical, but with substantial association between $\alpha$ and $\beta$.
As in the linear regression example, this can be partly overcome by the re-parametrisation
\[ p_i = \frac{\exp\left(\alpha^* + \beta(x_i - \bar{x})\right)}{1 + \exp\left(\alpha^* + \beta(x_i - \bar{x})\right)}, \qquad \text{where } \alpha^* = \alpha + \beta\bar{x}. \]
This leads to the likelihood surface shown in Figure LABEL:figlog (right). There is indeed some improvement, but unlike the linear regression example, some association between the two parameters remains.
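A sketch in R of the whole exercise (the covariate values, true parameters, seed and grid below are illustrative assumptions of mine, not the values behind the figures): simulate logistic data and contour the $(\alpha, \beta)$ log-likelihood, marking the MLE obtained from glm().
# simulate logistic-regression data and contour the (alpha, beta) log-likelihood
set.seed(330)
x <- seq(from = 0, to = 4, length = 50)   # illustrative covariate values
alpha0 <- -3; beta0 <- 1.5                # assumed true values
p <- exp(alpha0 + beta0 * x) / (1 + exp(alpha0 + beta0 * x))
y <- rbinom(length(x), size = 1, prob = p)
lliklogit <- function(a, b) {
  eta <- a + b * x
  sum(y * eta - log(1 + exp(eta)))        # Bernoulli log-likelihood
}
a <- seq(from = -7, to = 1, length = 100)
b <- seq(from = 0, to = 3.5, length = 100)
llmat <- outer(a, b, Vectorize(lliklogit))
contour(a, b, llmat, levels = max(llmat) - 0:4, drawlabels = FALSE,
        xlab = "alpha", ylab = "beta")
fit <- glm(y ~ x, family = binomial)      # MLE via glm() for comparison
points(coef(fit)[1], coef(fit)[2], pch = 8)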