Cross Entropy Vs. Squared Error | KL-Divergence
Generative Adversarial Network
Generative model is about finding the underlying distribution or model from which the samples of images/data are supposed to come from.
Say we have a million images of cats, so we can assume, all the images come from one distribution, Pcat
So a sample from Pcat is an image of cat.
So we want: to model Pcat.
So we model the data / find the model of the data. Model = Distribution.
[yt: hQv8FNaJHEA]
Explicit model:
Learns the actual parameter of the distribution. X_1 = P (X) --> Finding P
Implicit model:
Don't bother to learn the distribution of the data, instead focus on the stochastic [random] procedure that directly generate the data - on some input.
X_1 = F_W(Z), Z = Latent space/input, W = Learned weights
GAN are implicit models.
Overall View:
minG maxD { x~Pdata(X) E[ logD(X) ] + z~Pz(Z) E[ log ( 1 - D(G(Z)) )]
X is a real data, Z is the noise, and G(Z) is the generated/fake data.
log(0) --> Has minimum value (negative)
log(1) --> Has maximum value (zero)
So our discriminator should try to,
D(X) ---> 1 and D(G(z)) ---> 0
Generator View,
We want to make D(G(z)) ---> 1
so,
minG z~Pz(Z) E[ log ( 1 - D(G(z)) ) ]
[https://medium.com/deep-math-machine-learning-ai/ch-14-general-adversarial-networks-gans-with-math-1318faf46b43]
https://drive.google.com/open?id=1dUUVPfwpjpnt84VSHKuCZlnI8xcrnXxs
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#kullbackleibler-and-jensenshannon-divergence
Cross-entropy KL divergence are closely related.
Lets explore cross-entropy:
Entropy is highest when all the all the outputs have equal probability.
Let's say we can ask yes/no questions only. Number of questions required to guess something (an outcome) with probability of p is log2(1/p).
These number of questions we ask is also known as information (in bit since we know yes or no questions only). So if we know the outcome which had a probability of p, we got log2(1/p) information (in bits / in yes-no questions)
Example: We flipped a fair coin twice and obatined HH, so we know probability of HH = 1/4, thus information = log2(1/4) = 2
Thus we have 2 bits info, or we know answers of two yes-no questions: that the first flip was head and second flip was head.
So amount of info = amount of uncertainity = amount of questions that will answer if the outcome is known
Hence, information I(x) = log2(1/P(x)) = -log2(P(x))
Shanon Entropy:
Say, a probability distribution takes fixed number of (bounded) inputs. See PDF vs Normal Function.
So average number of questions or information it (the probability distribution) has is its entropy [It's average, not total because each information is also multiplied by its probability]
Entropy, H(x) = Ex~P[I(x)] = - Ex~P[ log(P(x)) ]
H(x) = - sumxi ( pxi log pxi) ; the multiplication by pi makes it an average and not total
Entropy gives a lower bound on the number of bits needed on average to encode symbols drawn from probability distribution P
Note: 0*log0 is considered 0
Deterministic distribution : Zero entropy
Uniform distribution: Maximum Entropy
When x is continuous, Shannon Entropy is known as differential entropy.
Relative Entropy / Relative Shanon Entropy / KL-Divergence (Kullback–Leibler)
Measure of how one probability distribution is different from a second. Defined on the same probability space.
If we have two separate probability distributions over same random variable x, say P(x) and Q(x).
But a RV can have only one distribution, we assume two distributions and then compare to see which one is good, which one has more information.
For discrete probability distributions P and Q defined on the same probability space, the Kullback–Leibler divergence between P and Q is to be:
How different these two distributions are can be calculated using KL Divergence.
DKL ( P||Q ) = - Sumx~X [ P(x) log ( Q(x) / P(x) ) ] or - Ex~P [ log(Q(x)) - log(P(x)) ]
Given Q, what is the extra information P has
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
Very often in Probability and Statistics we'll replace observed data or a complex distributions with a simpler, approximating distribution. KL Divergence helps us to measure just how much information we lose when we choose an approximation.
[https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained]
Suppose we are on a different planet and find it is filled with certain worms.
We see a worm (perfect) has 10 teeth.
But not all worm has 10 teeth.
Due to certain reasons some teeth are lost. So number of possible teeth is [ 0 1 2 3 4 5 6 7 8 9 10 ] a total of 11 possibilities.
So we collect many worms (say 10000) and count their teeth and record.
We now get a distribution.
We now want to send this information to earth but we would like to transfer as low info as possible.
--> One way to send is say that probability of finding x teeth is 1/11, so all number of teeth have equal probabilities.
So we can send only 1 (article says 2, p and n, but later says only n is enough yes true) information and we get certain info this is very inaccurate -> In can only accurately tell the number of total teeth a worm can have.
--> Another way is to represent it as Bionomial Distribution.
First let us understand what is bionomial distribution.
Bionomial Distribution:
Say we have a event whose probability we know. 1 event we know = heads on a fair coin flip --> the probability p = 0.5
Say we flip the coin 10 times.
We want to know how many times heads occurs.
Can be obtained from combination formula.
no of possible 1 heads in 10 flips = 10 [it can be H--------- or -H-------- or --H------- and so on until ---------H], where - is tails
similarly for 2 heads, [HH-------- or H-H------- and so on..]
from that formula we can construct a distribution, that is the bionomial distribution. [well this is a very rough explanation as p can not be exactly 0.5 in which case the formula is a bit expanded]
Now, for our case, we can compare probability of one tooth as p, and number of flips = 10, [1 2 .. 10]
But we don't know p.
So lets approximate,
- In the coin case if we didn't know p, we could flip coin 1000 times and count heads and then p = #heads / 1000
- Here too we can do same thing, we count all the teeth of the worms we found, say we found 10000 worms, so #of total teeth / (10000*10) | *10 because each worm has 10 teeth
- Or we calculate the mean teeths, say #no of teeth / total no of worms = 5.7 say
and then divide 5.7 by 10 beacuse 10 teeh in a worm to get p = 0.57
Now say we construct a bionomial probability with these info, n = 10 and p = 0.57
--> not only this we have many possible ways to have many distributions, so which distribution to choose to represent the data?
Our concern here is not just to minimize the amount of information we need to send but also find the distribution which preserves most of the information from the original data source.
There comes KL Divergence.
First we need to understand entropy [explained some above]
Entropy, if log2 is used, gives the the minimum number of bits it would take us to encode our information ['empirical' distribution].
Although it tells the lower bound of number of bits, on avg, needed to encode our information [no of teeth observed in single case], it doesn't tell which is the optimal encoding scheme [finding it is a separate topic].
So entropy gives how much information actually is in our data, and thus we can compute how much info is lost if we chose a parameterized approximate distribution.
That information loss is measured by KL Divergence, and thus we want to minimize it as much as possible.
DKL(p∣∣q)= [N ∑ i=1] [ p(xi)⋅(log p(xi) − log q(xi)) ]
Generative Adversarial Network
Generative model is about finding the underlying distribution or model from which the samples of images/data are supposed to come from.
Say we have a million images of cats, so we can assume, all the images come from one distribution, Pcat
So a sample from Pcat is an image of cat.
So we want: to model Pcat.
So we model the data / find the model of the data. Model = Distribution.
[yt: hQv8FNaJHEA]
Explicit model:
Learns the actual parameter of the distribution. X_1 = P (X) --> Finding P
Implicit model:
Don't bother to learn the distribution of the data, instead focus on the stochastic [random] procedure that directly generate the data - on some input.
X_1 = F_W(Z), Z = Latent space/input, W = Learned weights
GAN are implicit models.
Overall View:
minG maxD { x~Pdata(X) E[ logD(X) ] + z~Pz(Z) E[ log ( 1 - D(G(Z)) )]
X is a real data, Z is the noise, and G(Z) is the generated/fake data.
log(0) --> Has minimum value (negative)
log(1) --> Has maximum value (zero)
So our discriminator should try to,
D(X) ---> 1 and D(G(z)) ---> 0
Generator View,
We want to make D(G(z)) ---> 1
so,
minG z~Pz(Z) E[ log ( 1 - D(G(z)) ) ]
[https://medium.com/deep-math-machine-learning-ai/ch-14-general-adversarial-networks-gans-with-math-1318faf46b43]
https://drive.google.com/open?id=1dUUVPfwpjpnt84VSHKuCZlnI8xcrnXxs
https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html#kullbackleibler-and-jensenshannon-divergence
Cross-entropy KL divergence are closely related.
Lets explore cross-entropy:
Entropy is highest when all the all the outputs have equal probability.
Let's say we can ask yes/no questions only. Number of questions required to guess something (an outcome) with probability of p is log2(1/p).
These number of questions we ask is also known as information (in bit since we know yes or no questions only). So if we know the outcome which had a probability of p, we got log2(1/p) information (in bits / in yes-no questions)
Example: We flipped a fair coin twice and obatined HH, so we know probability of HH = 1/4, thus information = log2(1/4) = 2
Thus we have 2 bits info, or we know answers of two yes-no questions: that the first flip was head and second flip was head.
So amount of info = amount of uncertainity = amount of questions that will answer if the outcome is known
Hence, information I(x) = log2(1/P(x)) = -log2(P(x))
Shanon Entropy:
Say, a probability distribution takes fixed number of (bounded) inputs. See PDF vs Normal Function.
So average number of questions or information it (the probability distribution) has is its entropy [It's average, not total because each information is also multiplied by its probability]
Entropy, H(x) = Ex~P[I(x)] = - Ex~P[ log(P(x)) ]
H(x) = - sumxi ( pxi log pxi) ; the multiplication by pi makes it an average and not total
Entropy gives a lower bound on the number of bits needed on average to encode symbols drawn from probability distribution P
Note: 0*log0 is considered 0
Deterministic distribution : Zero entropy
Uniform distribution: Maximum Entropy
When x is continuous, Shannon Entropy is known as differential entropy.
Relative Entropy / Relative Shanon Entropy / KL-Divergence (Kullback–Leibler)
Measure of how one probability distribution is different from a second. Defined on the same probability space.
If we have two separate probability distributions over same random variable x, say P(x) and Q(x).
But a RV can have only one distribution, we assume two distributions and then compare to see which one is good, which one has more information.
For discrete probability distributions P and Q defined on the same probability space, the Kullback–Leibler divergence between P and Q is to be:
How different these two distributions are can be calculated using KL Divergence.
DKL ( P||Q ) = - Sumx~X [ P(x) log ( Q(x) / P(x) ) ] or - Ex~P [ log(Q(x)) - log(P(x)) ]
Given Q, what is the extra information P has
https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained
Very often in Probability and Statistics we'll replace observed data or a complex distributions with a simpler, approximating distribution. KL Divergence helps us to measure just how much information we lose when we choose an approximation.
[https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained]
Suppose we are on a different planet and find it is filled with certain worms.
We see a worm (perfect) has 10 teeth.
But not all worm has 10 teeth.
Due to certain reasons some teeth are lost. So number of possible teeth is [ 0 1 2 3 4 5 6 7 8 9 10 ] a total of 11 possibilities.
So we collect many worms (say 10000) and count their teeth and record.
We now get a distribution.
We now want to send this information to earth but we would like to transfer as low info as possible.
--> One way to send is say that probability of finding x teeth is 1/11, so all number of teeth have equal probabilities.
So we can send only 1 (article says 2, p and n, but later says only n is enough yes true) information and we get certain info this is very inaccurate -> In can only accurately tell the number of total teeth a worm can have.
--> Another way is to represent it as Bionomial Distribution.
First let us understand what is bionomial distribution.
Bionomial Distribution:
Say we have a event whose probability we know. 1 event we know = heads on a fair coin flip --> the probability p = 0.5
Say we flip the coin 10 times.
We want to know how many times heads occurs.
Can be obtained from combination formula.
no of possible 1 heads in 10 flips = 10 [it can be H--------- or -H-------- or --H------- and so on until ---------H], where - is tails
similarly for 2 heads, [HH-------- or H-H------- and so on..]
from that formula we can construct a distribution, that is the bionomial distribution. [well this is a very rough explanation as p can not be exactly 0.5 in which case the formula is a bit expanded]
Now, for our case, we can compare probability of one tooth as p, and number of flips = 10, [1 2 .. 10]
But we don't know p.
So lets approximate,
- In the coin case if we didn't know p, we could flip coin 1000 times and count heads and then p = #heads / 1000
- Here too we can do same thing, we count all the teeth of the worms we found, say we found 10000 worms, so #of total teeth / (10000*10) | *10 because each worm has 10 teeth
- Or we calculate the mean teeths, say #no of teeth / total no of worms = 5.7 say
and then divide 5.7 by 10 beacuse 10 teeh in a worm to get p = 0.57
Now say we construct a bionomial probability with these info, n = 10 and p = 0.57
--> not only this we have many possible ways to have many distributions, so which distribution to choose to represent the data?
Our concern here is not just to minimize the amount of information we need to send but also find the distribution which preserves most of the information from the original data source.
There comes KL Divergence.
First we need to understand entropy [explained some above]
Entropy, if log2 is used, gives the the minimum number of bits it would take us to encode our information ['empirical' distribution].
Although it tells the lower bound of number of bits, on avg, needed to encode our information [no of teeth observed in single case], it doesn't tell which is the optimal encoding scheme [finding it is a separate topic].
So entropy gives how much information actually is in our data, and thus we can compute how much info is lost if we chose a parameterized approximate distribution.
That information loss is measured by KL Divergence, and thus we want to minimize it as much as possible.
DKL(p∣∣q)= [N ∑ i=1] [ p(xi)⋅(log p(xi) − log q(xi)) ]