In this chapter we discuss general concepts and equations. When these are applied to different distributions they give rise to other equations that are used in practice, as discussed in the chapters focusing on the specific distributions.
A random experiment can yield different results despite being performed under the same conditions. For example, each tossing of a coin can have two possible results: heads or tails. Another example is the measurement of the absorbance of a solution, which, if done in triplicate under the same conditions, can have different values albeit close to some expected average.
A sample space is the set of all possible outcomes of a random experiment. For example, if we toss a die just once, the sample space has 6 possible sample points: {1, 2, 3, 4, 5, 6}. But if we toss it twice, the sample space has 36 points: {(1,1), (1,2), (1,3), ..., (6,6)}.
A random variable, usually denoted by a capital X or a capital Y, is the numerical outcome of a random experiment. Random variables can be discrete or continuous. For example, if we count the number of heads while tossing a coin 10 times and we get 4 heads, then X = 4, and this is a discrete random variable because X can only take the integer values 0 through 10. If we measure the absorbance of a solution and obtain 1.384, then X = 1.384, and that is a continuous random variable because its sample space consists of a continuous range of real numbers.
A function is a probability function, or probability distribution, if it satisfies the following criteria: it must be non-negative for every value of the random variable, and its total probability must equal 1, as detailed below.
For a discrete random variable \( X \), the probability function is called a probability mass function (PMF). It gives the probability that \( X \) takes on the value \( x \): \( P(X = x) = p(x) \).
For a continuous random variable, the probability function is called a probability density function (PDF). In this case, \( P(X = x) = 0 \) for any real number \( x \), because the variable can take infinitely many values even within a very small interval (e.g., from 1.000 to 1.001). Therefore, the PDF does not give the probability at a point, but rather a density. To compute the probability over an interval \( [a, b] \), we integrate the PDF: \( P(a \leq X \leq b) = \int_a^b f(x) \, dx \)
The total probability for a discrete variable is: \( \sum_x p(x) = 1 \). For a continuous variable, it is: \( \int_{-\infty}^{+\infty} f(x) \, dx = 1 \). This represents the total area under the curve. If the range of a continuous variable \( X \) is restricted to \( a \le x \le b \), then:
\( \int_{-\infty}^{a} f(x) \, dx = 0, \quad \int_a^b f(x) \, dx = 1, \quad \int_b^{+\infty} f(x) \, dx = 0 \)
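For illustration, the snippet below checks these properties numerically for a normal PDF with \(\mu = 50\) and \(\sigma = 5\) (an arbitrary choice, matching Figure G1 below), using scipy.integrate.quad:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 50, 5  # arbitrary example parameters

# Total area under the PDF over the whole real line is 1
total, _ = quad(lambda x: norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(total)        # ~1.0

# P(45 <= X <= 55): the area under the PDF between a = 45 and b = 55
p_ab, _ = quad(lambda x: norm.pdf(x, mu, sigma), 45, 55)
print(p_ab)         # ~0.6827, the familiar "68% within one sigma"
```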
A Cumulative Distribution Function (CDF) gives the probability that a random variable \( X \) is less than or equal to a given value \( x \).
For a discrete variable: \( CDF(x) = \sum\limits_{x_i \le x} p(x_i) \)
For a continuous variable: \( CDF(x) = \int_{-\infty}^x f(t) \, dt \)
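As a minimal sketch (with arbitrary parameters), the CDF can be obtained either from scipy's built-in cdf functions or, for a discrete variable, by accumulating the PMF:

```python
import numpy as np
from scipy.stats import binom, norm

# Discrete: binomial with N = 10 trials, p = 0.5
N, p = 10, 0.5
k = np.arange(N + 1)
print(np.cumsum(binom.pmf(k, N, p))[4])   # P(X <= 4) as a running sum of the PMF
print(binom.cdf(4, N, p))                 # same value from the built-in CDF

# Continuous: normal with mu = 50, sigma = 5
print(norm.cdf(50, loc=50, scale=5))      # 0.5, since half the area lies below the mean
```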
Python code for Figure G1:
```python
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy.stats import norm
from scipy.stats import binom
# DISTRIBUTION PARAMETERS
N = 100
p = 1/2 # binomial probability (e.g. coin toss h vs t)
mu, v, sk, k = binom.stats(N, p, moments='mvsk')  # mean, variance, skewness, kurtosis
s = math.sqrt(v) # standard deviation
R_ = list(range(N+1)) # [0, 1, 2, ... 100] 101 possible values for binomial
# X AND Y VALUES AND RANGES TO BE PLOTTED
P_ = binom.pmf(R_,N,p)
xmin = 30
xmax = 70
x_ = np.linspace(xmin,xmax,300)
y_ = norm.pdf(x_,mu,s)
Pof50 = binom.pmf(50,N,p)
fOf50 = norm.pdf(50,mu,s)
cdfAround50 = norm.cdf(50.5,mu,s) - norm.cdf(49.5,mu,s)
x_fill = np.linspace(49.5,50.5,100)
y_fill = norm.pdf(x_fill,mu,s)
#FIGURE PARAMETERS
fig, ax = plt.subplots(figsize=(12,5))
binMrkrColors = ['teal']*len(R_)
binMrkrColors[50] = 'red'
ax.tick_params(axis='both', which='major', labelsize=18)
ax.set_title(r'Discrete Binomial (N = 100, p = 0.5) versus Continuous Normal $(\mu = 50 \ \sigma = 5)$ Distributions',
fontsize=14)
#ANNOTATIONS WITH ARROWS
ax.annotate(
f'Discrete X:\nP(x = 50) = {Pof50:.4f}',
xy=(50,Pof50),
xytext=(57,Pof50),
arrowprops=dict(facecolor='black', arrowstyle='->'),
fontsize=14
)
ax.annotate(
f'Continuous X:\nDensity at x = 50\n = {fOf50:.4f}',
xy=(50,Pof50),
xytext=(37,Pof50),
arrowprops=dict(facecolor='black', arrowstyle='->'),
fontsize=14
)
ax.annotate(
f'Continuous X:\ncdf(x = 50.5) - cdf(x = 49.5)\n = Area under curve = {cdfAround50:.4f}',
xy=(50,0.025),
xytext=(57,0.05),
arrowprops=dict(facecolor='black', arrowstyle='->'),
fontsize=14
)
# AXES
ax.set_xlabel('x', fontsize=14)
ax.set_ylabel(r'PMF (discrete) or PDF (continuous)',fontsize=14)
ax.set_xlim(left=30, right=70)
ax.set_ylim(bottom=0, top=0.15)
# PLOT
ax.scatter(R_, P_, marker="o", s = 20, color=binMrkrColors,
label=r'binom.pmf(R,p=0.5,N=100)')
ax.plot(x_, y_, linestyle="-", marker="none", color='pink',
label=r'norm.pdf($x,\mu=50,\sigma=5$)')
ax.fill_between(x_fill,y_fill,color='red', alpha=0.2)
ax.legend(loc="upper left", frameon=False,fontsize=14)
#ADJUST LAYOUT AND SAVE FIGURE THEN SHOW IT
plt.tight_layout()
plt.savefig("AAA.jpeg", dpi=300, bbox_inches='tight')
plt.show()
```
The mathematical expectation is an operator, written as \(\mathbb{E}\)[ ] or as E[ ], which is used to define several "statistics" such as the mean, the variance, the skewness, and the kurtosis, as well as functions such as the moment-generating function.
The expectation of X is the mean of its distribution.
For a discrete random variable with a probability mass function f(x), it is the weighted average of all possible values of X, where each value is weighted by its probability: $$\mathbb{E}[X] = \sum\limits_{i=1}^{n}x_if(x_i) = \mu$$ For a continuous random variable with a probability density function f(x): $$\mathbb{E}[X] = \int_{-\infty}^{+\infty}xf(x)\,dx = \mu$$
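As a quick numerical illustration (a fair die for the discrete case, and a normal distribution with arbitrary parameters for the continuous case):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Discrete: E[X] of a fair die is the probability-weighted average of its faces
x = np.arange(1, 7)
p = np.full(6, 1/6)
print(np.sum(x * p))      # 3.5

# Continuous: E[X] = integral of x*f(x); for a normal(mu=50, sigma=5) it returns mu
mean, _ = quad(lambda x: x * norm.pdf(x, 50, 5), -np.inf, np.inf)
print(mean)               # ~50.0
```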
Properties of \(\mathbb{E}\)[ ]: for constants a and b and random variables X and Y, \(\mathbb{E}[a] = a\), \(\mathbb{E}[aX + b] = a\mathbb{E}[X] + b\), and \(\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]\); if X and Y are independent, \(\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]\).
The mode is the value of the random variable that has the highest occurrence probability. For a continuous probability function (PDF) it is the value of x corresponding to the maximum density.
The median, the percentiles, and the quartiles are calculated differently depending on whether we are referring to datasets or to probability distributions. This is best explained with examples. Take the following dataset with 9 values. Quartiles are either individual data points or the average of two adjacent points.
Dataset = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.50, 0.22]
|  | 1st | 2nd = Median | 3rd |
|---|---|---|---|
| Quartiles | \(\frac{0.02+0.03}{2}=0.025\) | 0.05 | \(\frac{0.07+0.50}{2}=0.285\) |
Of course, there is little or nothing to reveal by dividing a dataset of 9 data points into quartiles. Quartiles or percentiles are especially useful for large datasets, such as the weights of newborns in a certain year and region of the country. But this small dataset is useful to help understand the calculation procedure, which is as follows: with n = 9 values, the median is the 5th value, and the first and third quartiles fall at positions 2.5 and 7.5, so each is the average of the two data points adjacent to that position (the 2nd and 3rd, and the 7th and 8th, respectively).
Let us now calculate the median and the quartiles for a discrete probability distribution. Let the following array be the complete set of probabilities of a discrete random variable. The probabilities are all non-negative and add up to 1, as required.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9], P(X) = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.50, 0.22]
This means that, for example, P(X = 8) = 0.5.
|  | 1st | 2nd = Median | 3rd |
|---|---|---|---|
| Quartiles | \(x_j \mid \sum\limits_{i=1}^{j}P(x_i) \ge 0.25 \rightarrow x = 7\) | \(x_j \mid \sum\limits_{i=1}^{j}P(x_i) \ge 0.50 \rightarrow x = 8\) | \(x_j \mid \sum\limits_{i=1}^{j}P(x_i) \ge 0.75 \rightarrow x = 8\) |
| PMF(\(x_j\)) | 0.07 | 0.50 | 0.50 |
| CDF(\(x_j\)) | 0.28 | 0.78 | 0.78 |
In this case, the values of X must be sorted in ascending order, and the corresponding probabilities must remain aligned. Then, the CDF is computed cumulatively until each quartile threshold is met.
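The following numpy sketch reproduces the table above: it accumulates the probabilities into a CDF and finds the first value of X at which each quartile threshold is reached.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
P = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.50, 0.22])

cdf = np.cumsum(P)                      # running total of the probabilities
for q in (0.25, 0.50, 0.75):
    j = np.searchsorted(cdf, q)         # first index where the CDF reaches q
    print(q, X[j], P[j], round(cdf[j], 2))   # -> x = 7, 8, 8 with CDF 0.28, 0.78, 0.78
```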
Finally, in the case of a continuous random variable the median is simply the value of x for which the cumulative distribution function equals 0.5. The same principle applies to quartiles, deciles, or percentiles. $$\int_{-\infty}^{x}f(t)\,dt = 0.5$$ For example, $$f(x) = x^2 \quad \text{for} \quad 0 \le x \le \sqrt[3]{3}$$ is a PDF because its total probability within its boundaries is 1. $$\int_0^{\sqrt[3]{3}}x^2\,dx = \frac{1}{3}x^3 \Big|_0^{\sqrt[3]{3}} = 1$$ Its median is: $$\sqrt[3]{\frac{3}{2}} \approx 1.14 \quad \text{because} \quad \int_0^{\sqrt[3]{\frac{3}{2}}} x^2 \, dx = \frac{1}{3} x^3 \Big|_0^{\sqrt[3]{3/2}} = 0.5$$ Its mode is \(\sqrt[3]{3}\), the value of x where the PDF \(f(x) = x^2\) reaches its maximum within the support. (Note: in statistics, the support is the set of values that the random variable can take and for which the PDF or PMF is positive.)
Its mean is: $$\int_0^{\sqrt[3]{3}} x \cdot x^2 \, dx = \frac{1}{4}x^4 \Big|_0^{\sqrt[3]{3}} = \frac{1}{4}\,3^{\frac{4}{3}} \approx 1.08$$
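These three results (total probability, median, and mean) can be checked numerically, for example with scipy's quad for the integrals and brentq to solve CDF(x) = 0.5 for the median:

```python
from scipy.integrate import quad
from scipy.optimize import brentq

f = lambda x: x**2              # the PDF, with support [0, 3**(1/3)]
b = 3 ** (1/3)                  # upper limit of the support

print(quad(f, 0, b)[0])         # total probability: 1.0
cdf = lambda x: quad(f, 0, x)[0]
print(brentq(lambda x: cdf(x) - 0.5, 0, b))   # median: (3/2)**(1/3) ~ 1.14
print(quad(lambda x: x * f(x), 0, b)[0])      # mean: 3**(4/3)/4 ~ 1.08
```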
Figure G2 (Interactive): Superimposed plots of the binomial and normal distributions. Play with the sliders to change the binomial N and p (probability of success). Refreshing the page resets these parameters to N = 22 and p = 0.20. The statistical parameters mean (\(\mu\)) and standard deviation (\(\sigma\)) are the same for both distributions. The mode, skewness, and kurtosis apply only to the binomial distribution because in the normal distribution the mode, the median, and the mean are all equal to \(\mu\), and the skewness and kurtosis are always 0 and 3, respectively. The first, second (median), and third quartiles are indicated by colored (green, red, green) bars for the binomial distribution and by red dots for the normal distribution.
The variance is a measure of the spread of the distribution and of the
variability of the data. The standard deviation
is the square root of the variance. In a normal distribution, the standard
deviation corresponds to the distance from the mean to the inflection
point of the curve.
$$Var(X) = \sigma^2 = \mathbb{E}[(X - \mu)^2] = \mathbb{E}[(X - \mathbb{E}[X])^2]$$
The standard deviation is \(\sigma\).
For a continuous distribution such as the normal distribution,
$$\sigma^2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x) dx$$
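For illustration, this integral can be evaluated numerically for a normal distribution with \(\mu = 50\) and \(\sigma = 5\) (the same parameters used in Figure G1):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 50, 5
var, _ = quad(lambda x: (x - mu)**2 * norm.pdf(x, mu, sigma), -np.inf, np.inf)
print(var)      # ~25.0, i.e. sigma**2
```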
Which is a better estimator of the true variance:
\(\frac{\sum(x_i-\bar{x})^2}{n} \quad \text{or} \quad \frac{\sum(x_i-\bar{x})^2}{n-1}\)?
Let A be a variable that could take either of the values: A = n or A = n - 1.
At this point note the following equality (see Chapter 1 /Variance ...): \(\mathbb{E}\left[\sum(x_i-\bar{x})^2\right] = (n-1)\sigma^2\), so that \(\mathbb{E}\left[\frac{\sum(x_i-\bar{x})^2}{n}\right] = \frac{n-1}{n}\sigma^2 < \sigma^2\).
Therefore: \(\mathbb{E}[\frac{\sum(x_i-\bar{x})^2}{A}]=\sigma^2\) only if A = n - 1. Thus, using n - 1 in the denominator makes the sample variance an unbiased estimator of the population variance.
The population variance equation, with \(N\) in the denominator, is valid only if we know the entire population and its true mean. However, we usually work with samples, for which we do not know the true mean and must use the sample mean instead. In this case, the denominator is n - 1 rather than n. This corrects for the fact that dividing by n tends to underestimate the variance: deviations are measured from the sample mean, which lies closer to the sample values than the true mean does, so the sample spread is smaller than the true spread. Dividing by n - 1 increases the estimate of the variance accordingly, as the derivation above justifies.
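A small simulation makes the bias visible (the population parameters, sample size, and number of trials below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
samples = rng.normal(loc=50, scale=5, size=(trials, n))   # true variance = 25

# Sum of squared deviations from each sample's own mean
ss = np.sum((samples - samples.mean(axis=1, keepdims=True))**2, axis=1)

print(np.mean(ss / n))         # ~20: dividing by n underestimates the true variance
print(np.mean(ss / (n - 1)))   # ~25: dividing by n - 1 is unbiased on average
```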
Random variables X and Y, with a joint probability distribution \(f^{}_{X,Y}(x,y)\), can be correlated, meaning
that a change in X is associated with a change in Y and vice-versa.
We can then define their covariance as:
\(Cov(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]\)
A positive covariance means that the random variables tend to increase together;
conversely, a negative covariance indicates that the tendency of one variable to increase
correlates with the tendency of the other to decrease.
The correlation coefficient r, or "\(\color{red}{\rho}\)", or Pearson's correlation coefficient, is a statistical quantification of the strength and direction of the linear relationship between two variables. $$\rho = \frac{Cov(X,Y)}{\sigma^{}_{X}\sigma^{}_{Y}}$$ $$\rho_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}$$ If there is no correlation, the covariance is zero: \(Cov(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=0\). In contrast, if there is a perfect correlation, the product of the standard deviations equals the absolute value of the covariance: \(|Cov(X,Y)| = \sigma^{}_{X}\sigma^{}_{Y}\). The closer r is to -1 or to 1, the stronger the negative or positive correlation, and the closer it is to 0, the weaker the correlation.
If X and Y are perfectly correlated (i.e., \(Y = aX + b\)), then \(|Cov(X,Y)| = \sigma^{}_{X}\sigma^{}_{Y}\). Therefore, \(|r| = 1\) when \(Y = aX + b\), and the sign of a determines the direction (positive or negative correlation).
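A minimal numpy sketch with simulated, approximately linear data (the coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.5, size=1000)   # strong positive linear relationship

print(np.cov(x, y)[0, 1])        # sample covariance of X and Y (~2)
print(np.corrcoef(x, y)[0, 1])   # Pearson r, close to +1 here
```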
We are very familiar with the bell curve. As a rule, we visualize it as being symmetrical, with equal tails on both sides of the mean. If a distribution is not symmetrical, one of its tails is longer than the other. This asymmetry is quantified by the skewness, which is defined as: $$\alpha_3 = \frac{\mathbb{E}[(X - \mu)^3]}{\sigma^3}$$
Expanding the cube and using \(\mathbb{E}[X^2] = \sigma^2 + \mu^2\) gives \(\alpha_3 = \frac{\mathbb{E}[X^3] - 3\mu\sigma^2 - \mu^3}{\sigma^3}\). Corollary: for a centered distribution (i.e., one where the mean \( \mu = 0 \)):
$$\alpha_3 = \frac{\mathbb{E}[X^3]}{\sigma^3}$$
Kurtosis is a measure of how the tails of a probability distribution
compare to those of a normal distribution. While often associated with
the height of the peak, it primarily captures the presence of extreme
values: high kurtosis suggests a distribution with heavy tails, whereas
low kurtosis indicates a distribution with lighter tails.
$$\alpha_4 = \frac{\mathbb{E}[(X-\mu)^4]}{\sigma^4}$$
Corollary: for a centered distribution (i.e., one where the mean \( \mu = 0 \)):
$$\alpha_4 = \frac{\mathbb{E}[X^4]}{\sigma^4}$$
You may run into the term excess kurtosis, which derives from a comparison with the kurtosis of the normal distribution, \(\alpha_4 = 3\). Excess kurtosis is thus defined as:
$$\gamma_2 = \alpha_4 - 3$$
so that the normal distribution has \(\gamma_2 = 0\).
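For illustration, scipy provides sample estimates of both quantities; note that scipy.stats.kurtosis returns the excess kurtosis \(\gamma_2\) by default, so fisher=False is needed to obtain \(\alpha_4\):

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(2)
normal_sample = rng.normal(size=100_000)
expo_sample = rng.exponential(size=100_000)   # right-skewed, heavier upper tail

print(skew(normal_sample), kurtosis(normal_sample, fisher=False))  # ~0 and ~3
print(skew(expo_sample), kurtosis(expo_sample, fisher=False))      # ~2 and ~9
```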
See the next section to understand how these general equations for skewness and kurtosis, which contain terms with the third and fourth moments (\(\mathbb{E}[X^3]\) and \(\mathbb{E}[X^4]\)), are used to derive the practical equations for calculating the skewness and kurtosis of the various probability distributions used in statistics.
In mathematics, moments are quantitative values that describe the shape of a curve defined by a function. In statistics, moments characterize probability distributions. Given a random variable \(X\) with probability density function \(f(x)\), its moments are given by the expected values: the n-th raw moment is \(\mathbb{E}[X^n]\) and the n-th central moment is \(\mathbb{E}[(X-\mu)^n]\). The first raw moment is the mean, the second central moment is the variance, and the standardized third and fourth central moments are the skewness and kurtosis defined above.
The moment-generating function (MGF) provides a convenient way to derive these moments: $$M_{X}(t) = \mathbb{E}[e^{tX}]$$ Differentiating \(M_X(t)\) n times and evaluating the derivative at \(t = 0\) yields the n-th raw moment: \(\mathbb{E}[X^n] = M_X^{(n)}(0)\).
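As a sketch (assuming sympy is available), differentiating the MGF of the standard normal distribution, \(M_X(t) = e^{t^2/2}\), recovers its first four raw moments:

```python
import sympy as sp

t = sp.symbols('t')
M = sp.exp(t**2 / 2)    # MGF of the standard normal distribution

# The n-th raw moment is the n-th derivative of M(t) evaluated at t = 0
moments = [sp.diff(M, t, n).subs(t, 0) for n in range(1, 5)]
print(moments)          # [0, 1, 0, 3] = E[X], E[X^2], E[X^3], E[X^4]
```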