Applications of Optimization - Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is a statistical technique that estimates the parameters $\theta$ of a probability distribution by maximizing a likelihood function. In the language of optimization, it is simply the maximization of a function whose decision variables are the probability distribution parameters. I will focus on probability distributions with continuous variables and continuous real support.

To flesh out the concept a bit more: given a parameterized probability density function $p(y_i, \theta)$ and a sequence of data $\{y_i\}$, the likelihood of the sequence is defined by (1). The raw product is rarely used in practice because multiplying many small densities quickly underflows numerically, so it is usually transformed into the log-likelihood defined by (2).

\[\text{(1) }\mathcal{L}(\{y_i\}, \theta) = \prod^N_{i=1} p\left(y_i, \theta \right)\] \[\text{(2) }\ln{\mathcal{L}}(\{y_i\}, \theta) = \sum^N_{i=1} \ln p(y_i, \theta)\]
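As a minimal sketch of equation (2) in code (the function and variable names here are my own, not taken from the text or any particular library beyond NumPy):

```python
import numpy as np

def log_likelihood(logpdf, y, theta):
    """Evaluate equation (2): the log-likelihood of the sample y.

    `logpdf` is assumed to be a vectorized function returning ln p(y_i, theta);
    summing log-densities avoids the numerical underflow of multiplying
    many small probabilities as in equation (1).
    """
    return np.sum(logpdf(y, theta))
```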

In effect, we want to maximize the log-likelihood by tuning $\theta$. If there are no constraints on the parameters $\theta$, the first-order (local) optimality condition serves as the governing equation of the solution. When the log-likelihood is strictly concave, this stationary point is not just a local maximum but the global maximum; that is to say, the regressed parameters are the optimal parameters.

\[\hat{\theta} = \arg \max_{\theta} \ln{\mathcal{L}}(\{y_i\}, \theta)\] \[\nabla \ln{\mathcal{L}}(\{y_i\}, \hat{\theta}) = 0\]

This optimization problem can be solved in many different ways, and in some cases it is possible to derive analytical expressions for $\hat{\theta}$; I will do this below for a single-variable normal distribution. Numerical methods instead form the stationarity conditions $\nabla \ln{\mathcal{L}}(\{y_i\}, \hat{\theta}) = 0$ and solve them with an iterative optimization technique. For unconstrained problems, i.e., where $\theta$ is not restricted by any additional expression, we can use techniques such as gradient descent, conjugate gradient, BFGS, or Newton iterations. For constrained problems, we can use ADMM, sequential quadratic programming, or a projected gradient algorithm. A sketch of the numerical route follows.
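As one hedged sketch of that numerical route, applied to the normal-distribution case studied below: run BFGS on the negative log-likelihood with SciPy. The synthetic data, the $(\mu, \ln\sigma)$ parameterization, and all names are illustrative choices of mine, not part of the derivation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=500)   # synthetic sample, for illustration only

def neg_log_likelihood(params, y):
    # Optimize over (mu, ln sigma) so that sigma > 0 without explicit constraints.
    mu, log_sigma = params
    return -np.sum(norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]),
                  args=(y,), method="BFGS")
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should land near the sample mean and standard deviation
```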

Case Study: Single Variable Normal Distribution

The normal distribution has the following pdf and log-pdf. From those, we can write the log-likelihood function.

\[p(y_i, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)\] \[\ln{p(y_i, \mu, \sigma^2)} = -\frac{1}{2}\ln{2\pi\sigma^2} - \frac{1}{2\sigma^2}(y_i - \mu)^2\] \[\ln{\mathcal{L}}(\{y_i\}, \mu, \sigma^2) = -\frac{N}{2}\ln{2\pi\sigma^2} - \frac{1}{2\sigma^2}\sum^N_{i=1}\left(y_i - \mu \right)^2\]
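As a quick sanity check that the log-likelihood formula above is written down correctly, we can compare it against SciPy's normal log-pdf; the sample and parameter values here are arbitrary choices of mine.

```python
import numpy as np
from scipy.stats import norm

y = np.array([1.2, 0.7, 2.4, 1.9, 0.3])    # arbitrary sample, for checking only
mu, sigma2 = 1.0, 0.8

# Log-likelihood written exactly as in the formula above.
manual = -len(y) / 2 * np.log(2 * np.pi * sigma2) \
         - np.sum((y - mu) ** 2) / (2 * sigma2)

# The same quantity via SciPy's normal log-pdf.
reference = np.sum(norm.logpdf(y, loc=mu, scale=np.sqrt(sigma2)))

assert np.isclose(manual, reference)
```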

We can apply the optimality conditions to this log-likelihood to get expressions for the optimal parameters.

\[\frac{\partial}{\partial \mu} \ln{\mathcal{L}} = \frac{1}{\sigma^2}\sum^N_{i=1}\left(y_i - \mu \right) = 0\]

From here, we can do some algebraic manipulations to solve for $\mu$.

\[\sum^N_{i=1}\left(y_i - \mu \right) = 0\] \[\sum^N_{i=1}y_i = \sum^N_{i=1}\mu = N\mu \rightarrow \mu^* = \frac{1}{N}\sum^N_{i=1}y_i = \bar{y}\]
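A small numerical check of this result, with an arbitrary sample and a fixed variance of my own choosing: nudging $\mu$ away from the sample mean can only decrease the log-likelihood.

```python
import numpy as np

y = np.array([1.2, 0.7, 2.4, 1.9, 0.3])   # arbitrary sample
sigma2 = 0.8                               # any fixed positive variance

def loglik_mu(mu):
    # Normal log-likelihood as a function of mu alone, with sigma2 held fixed.
    return -len(y) / 2 * np.log(2 * np.pi * sigma2) \
           - np.sum((y - mu) ** 2) / (2 * sigma2)

mu_star = y.mean()
assert loglik_mu(mu_star) >= loglik_mu(mu_star + 0.1)
assert loglik_mu(mu_star) >= loglik_mu(mu_star - 0.1)
```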

And for $\sigma^2$ we get the following.

\[\frac{\partial}{\partial \sigma} \ln{\mathcal{L}} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum^N_{i=1}\left(y_i - \mu \right)^2 = 0\]

By multiplying through by $\sigma^3$ and rearranging, we get the following expression for the optimal $\sigma^{* 2}$.

\[\sigma^{* 2} = \frac{1}{N}\sum^N_{i=1}\left( y_i - \mu^* \right)^2\]
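Putting the two closed forms together: the maximum likelihood estimates for a normal sample are the sample mean and the $1/N$ (not $1/(N-1)$) sample variance. A minimal sketch, with synthetic data and function names of my own:

```python
import numpy as np

def normal_mle(y):
    """Closed-form maximum likelihood estimates for a univariate normal sample."""
    mu_star = np.mean(y)
    sigma2_star = np.mean((y - mu_star) ** 2)   # 1/N, not the 1/(N-1) sample variance
    return mu_star, sigma2_star

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=500)
mu_star, sigma2_star = normal_mle(y)
print(mu_star, sigma2_star)   # close to (2.0, 1.5**2) for a sample this size
```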

We now have the ‘most likely’ parameters for our normal distribution! This is NOT to say that the data is, in fact, normally distributed, only that, if we model it with a normal distribution, the most likely parameters are $\mu^*$ and $\sigma^{* 2}$.