
Maximum Likelihood Estimation

https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability#2647


Maximum likelihood is like regression:
given the data, we try to find the probability distribution that best fits it, i.e. the best parameters of the distribution.
Likelihood (theta | Data) = P (Data | theta)

[ http://complx.me/2017-01-22-mle-linear-regression/ ]

Statistics:
Statistics is not a statement about individuals - it is a statement about the parameters of the distribution that was used to model the data.

Statistics is summarizing the data.

Maximum Likelihood:
MLE is finding the 'model' / 'distribution' that is 'most consistent' with your data.

- Say you have data and you are willing to assume it is Gaussian - or a combination / mixture of Gaussians. But there are infinitely many Gaussian distributions, given by different means and variances.

More Consistent:
- How is consistency determined?
The mean and variance under which the observed data 'y' has a higher probability than under any other choice of parameters are the most consistent ones.
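A minimal sketch of this idea (the data points and the two candidate parameter settings are made up for illustration): compute the log-probability of the same data under two candidate Gaussians and keep the more consistent one.

```python
import math

# Toy data, assumed to come from some Gaussian
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0]

def gaussian_log_likelihood(data, mean, var):
    """Sum of log N(x | mean, var) over the data points."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in data
    )

# Two candidate Gaussians: which is 'more consistent' with the data?
ll_a = gaussian_log_likelihood(data, mean=5.0, var=0.02)
ll_b = gaussian_log_likelihood(data, mean=3.0, var=0.02)
print(ll_a > ll_b)  # True: mean 5.0 assigns far higher probability to this data
```

In MLE we would search over all means and variances rather than just two candidates; the comparison rule stays the same.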

Likelihood:
Likelihood is the probability of observing the data given the parameters, P( D | theta ).

In maximum likelihood, the 'theta' which gives the maximum likelihood is selected. This means we must assume the form of the distribution (which parameters it has) a priori.

In a sense, maximum likelihood is like finding the best regressor: normally we find the best regressor by minimizing the least-squares error; here we find it by maximizing the likelihood, i.e. the probability.
In regression, we need an a priori function we try to fit to the data, for example a line y = mx + c, and we find m and c. In MLE, we need an a priori distribution too, like a Gaussian G(mean, variance), and we find the mean and variance.

If we revisit Linear Regression from a probabilistic perspective, we get Maximum Likelihood estimation.
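A sketch of this connection, with made-up toy data and assuming Gaussian noise with a known, fixed sigma: the Gaussian negative log-likelihood of a line is (up to constants) the sum of squared residuals, so minimizing it recovers essentially the least-squares slope and intercept. The grid search here is just for illustration.

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=x.size)

# Ordinary least-squares fit (degree-1 polynomial: slope, intercept)
m_ls, c_ls = np.polyfit(x, y, 1)

def nll(m, c, sigma=0.1):
    """Gaussian negative log-likelihood of the line y = m*x + c.

    NLL = n/2 * log(2*pi*sigma^2) + sum((y - (m*x + c))^2) / (2*sigma^2),
    so over (m, c) it is minimized exactly where the squared error is.
    """
    r = y - (m * x + c)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + np.sum(r**2) / (2 * sigma**2)

# Brute-force grid search for the maximum-likelihood line
ms = np.linspace(1.5, 2.5, 101)
cs = np.linspace(0.5, 1.5, 101)
_, m_mle, c_mle = min((nll(m, c), m, c) for m in ms for c in cs)

print(m_ls, c_ls)    # least-squares slope / intercept
print(m_mle, c_mle)  # ML slope / intercept: agrees up to the grid resolution
```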

Wikipedia: [https://en.wikipedia.org/wiki/Prior_probability]

In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account.

Parameters of prior distributions are a kind of hyperparameter.

[https://towardsdatascience.com/a-gentle-introduction-to-maximum-likelihood-estimation-9fbff27ea12f]:
MLE is a special case of MAP (Maximum a posteriori estimation):
the MAP estimate of 'theta' coincides with the ML estimate when the prior is uniform (that is, a constant function).
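A small sketch of this coincidence, using the 7-heads / 3-tails coin data from the example later in these notes: since the posterior is proportional to likelihood x prior, a constant prior cannot move the argmax, so the MAP and ML estimates agree.

```python
# Coin data: 7 heads, 3 tails (see the coin-flip example in these notes)
heads, tails = 7, 3

# Grid of candidate values for p, the probability of heads
grid = [i / 1000 for i in range(1, 1000)]

def likelihood(p):
    return p ** heads * (1 - p) ** tails

def uniform_prior(p):
    return 1.0  # constant over [0, 1]

# Posterior is proportional to likelihood * prior (the evidence is a constant)
p_mle = max(grid, key=likelihood)
p_map = max(grid, key=lambda p: likelihood(p) * uniform_prior(p))
print(p_mle == p_map)  # True: multiplying by a constant cannot change the argmax
```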

Disadvantages of MLE:
Requires a strong assumption about the structure of the data - for example, it requires us to assume the data is Gaussian so that we can estimate a mean and variance.

Simple MLE Example:
Suppose we flip a coin 10 times and get: HHTHHHTTHH
We want to find a distribution which will represent this data.

First we need some a priori distribution.

Since we have only two outcomes here, Heads or Tails, it is reasonable to select the Bernoulli distribution, which has only one parameter 'p': say, the probability of getting H; for T it will automatically be 1 - p.

So now let's find a good estimate of p.

Given p, what is the probability that we get the given data?

L( D={HHTHHHTTHH} | p ) = p . p . (1-p) . p . p . p . (1-p) . (1-p) . p . p = p^7.(1-p)^3

That L is the likelihood; we need to maximize it, i.e. we need to find the 'p' which maximizes L.

One way to find p:
Since we know p lies within [0,1], we can sweep it at intervals of, say, 0.1 and keep the best value.
Or use other methods: set the derivative to zero and solve, and so on.
The value of p should be 0.7 (it can also be calculated from the data itself: number of heads / total number of flips).

log-likelihood = loge(L) -> it is easier to find the value of p from here; since log is monotonic, whatever maximizes the log-likelihood also maximizes the likelihood.
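The sweep described above can be sketched directly (the grid step of 0.001 is an arbitrary choice for illustration); working in log-likelihood avoids tiny products:

```python
import math

flips = "HHTHHHTTHH"
heads = flips.count("H")  # 7
tails = flips.count("T")  # 3

def log_likelihood(p):
    # log of p^heads * (1-p)^tails
    return heads * math.log(p) + tails * math.log(1 - p)

# Sweep p over (0, 1) on a fine grid and keep the maximizer
candidates = [i / 1000 for i in range(1, 1000)]
p_hat = max(candidates, key=log_likelihood)
print(p_hat)  # 0.7, matching heads / total flips = 7 / 10
```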



////////////////////////////   DO  NOT  READ  THIS  NEED  MORE  STUDY  /////////////////////////////////////////
MLE is a special case of MAP where we use a naive (flat) prior and never bother to update it.

P(β∣y) = P(y∣β) x P(β) / P(y)

posterior = likelihood x prior / evidence

We can effectively ignore the prior and the evidence: with a uniform prior, all coefficient values are equally likely, and the evidence P(y) does not depend on β, so maximizing the posterior reduces to maximizing the likelihood.

The probability of some specific coefficients given the observed results relates to framing the question the exact opposite way (the likelihood) - which helps, because that question is much easier to solve.

///////////////////////////////////////////////////////////////////////////////////////////////////////////////


Importance of MLE in Neural Network:
For any given neural network architecture, the objective function can be derived based on the principle of Maximum Likelihood.
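As one concrete instance of this principle (a sketch, not any particular library's API; the predictions and labels below are made up): for a network that outputs a probability p for a binary target t, assuming a Bernoulli likelihood, maximizing the likelihood over the data set is the same as minimizing the familiar binary cross-entropy loss.

```python
import math

def bernoulli_nll(p, t):
    # -log of the Bernoulli pmf p**t * (1 - p)**(1 - t)
    return -math.log(p ** t * (1 - p) ** (1 - t))

def binary_cross_entropy(p, t):
    # Cross-entropy as usually written for deep-learning losses
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

preds  = [0.9, 0.2, 0.7]  # hypothetical network outputs (probabilities)
labels = [1,   0,   1]    # binary targets

nll = sum(bernoulli_nll(p, t) for p, t in zip(preds, labels))
bce = sum(binary_cross_entropy(p, t) for p, t in zip(preds, labels))
print(math.isclose(nll, bce))  # True: the two losses are the same quantity
```

The same reasoning with a Gaussian likelihood recovers the mean-squared-error loss for regression networks.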



Summary:
MLE is finding the maximum of the likelihood P(Data | theta): the theta which gives the maximum probability is selected.

We need a rough idea about the form of the distribution beforehand (e.g. Gaussian, Bernoulli); note that this assumed model family is not the same thing as a Bayesian prior distribution over the parameters.