import warnings
"ignore") warnings.filterwarnings(
1. Theory
One of the central concepts of Data Science and related disciplines is that of loss function: a function \(f(\theta)\), whose minimization provides the “best value” for the parameters \(\theta\) of the model. Arguably, the simplest example is that of mean square error, commonly used for regression problems: \[ f(\theta) = \dfrac{1}{N} \sum_i \left[y_i - M_\theta(x_i)\right]^2 \] where the sum is over all \(N\) samples in the dataset, \(x_i\) and \(y_i\) are the feature vector and the target scalar of sample \(i\) respectively, and \(M_\theta: x \to y\) is the model.
But where do these loss functions come from? Enter the concept of likelihood!
Let’s start by noting that the dataset \(D := \{(x_i, y_i)\}_i\) can be seen as \(N\) independent observations of the random variable \(Y|X\), whose probability density we denote by \(p_{Y|X}\). By definition, the model \(M_\theta\) should produce the expected value of this variable: \[ M_\theta(x) = \mathbb E[Y|X=x] \]
A common (and reasonable) assumption for a variable \(Y\) whose possible values are in \(\mathbb R\), is that \[ Y|X \sim \mathcal N(\mu(X), \sigma(X)^2) \] In this case, \(M_\theta(x) = \mathbb E[Y|X=x] = \mu(x)\), and if we assume that \(\sigma(x) = \sigma\) is a constant, then \[ p_{Y|X}(y|x, \theta) = \dfrac{1}{\sqrt{2\pi \sigma^2}} \exp\left[-\dfrac{1}{2\sigma^2} (y - M_\theta(x))^2\right] \] where we made explicit the dependence on \(\theta\).
It follows that, given the model with parameters \(\theta\), the probability of observing the dataset \(D\) is \[ P(D|\theta) = \prod_i p_{Y|X}(y_i|x_i, \theta) = \left(\dfrac{1}{2\pi \sigma^2}\right)^{\frac{N}{2}} \exp\left[-\dfrac{1}{2\sigma^2} \sum_i (y_i - M_\theta(x_i))^2\right] \] Seen as a function of \(\theta\), this quantity represents the “likelihood” that the data \(D\) would be generated by model \(M_\theta\). Hence, the best values for parameters \(\theta\) will be those for which the probability \(P(D|\theta)\) is maximal: we thus talk about maximum likelihood estimation (MLE).
In fact, maximizing the likelihood is equivalent to maximizing the log-likelihood, \(\ell(\theta)\), which reads \[ \ell(\theta) := \ln(P(D|\theta)) = -\frac{N}{2} \ln(\sigma^2) - \dfrac{1}{2\sigma^2} \sum_i [y_i - M_\theta(x_i)]^2 \] up to a constant. It is now evident that \(\max_\theta \ell(\theta)\) coincides with \(\min_\theta f(\theta)\), with \(f\) the mean square error loss function defined above.
1.1 Binary Classification Case
We saw how the mean square error loss function typical of regression problems arises as the (negative) log-likelihood of data \(D := \{(x_i, y_i)\}_i\) when we think of \(y_1, ..., y_N\) as independent observations of random variable \(Y\) which, conditional to \(X\), follows a normal distribution.
Another common Data Science area is that of classification problems. For a binary classification, \(Y\) can only take values \(0\) or \(1\), and so we must have \[ Y|X \sim \text{Bernoulli}(\phi(X)) \] In this case, \(M_\theta(x) = \mathbb E[Y|X=x] = \phi(x)\), so \[ p_{Y|X}(y|x, \theta) = M_\theta(x)^y \ (1 - M_\theta(x))^{1 - y} \] It follows that the log-likelihood is \[ \ell(\theta) = \ln(P(D|\theta)) = \sum_i \ln(p_{Y|X}(y_i|x_i, \theta)) = \sum_i \left[y_i \ln(M_\theta(x_i)) + (1 - y_i) \ln(1 - M_\theta(x_i))\right] \] which we can recognise as the cross-entropy loss function (up to a sign), thus explaining why this is the usual loss function for binary classifiers.
2. Example
Let’s stay on binary classification, and see how maximum likelihood estimation (MLE) works in an example.
First, let’s generate some data based on Bernoulli distribution with \(M_\theta\) a sigmoid model with a single feature: \[ M_\theta(x) = \dfrac{1}{1 + e^{-\theta x}} \] To achieve this, we first generate \(N\) observations \(x_1, ..., x_N\) of the input variable \(X\) (here, without loss of generality, we take it to be normally distributed). We then determine the probabilities \(\phi_1, ..., \phi_N\) by the sigmoid \(\phi_i = \phi(x_i) = M_\theta(x_i) = 1/(1 + \exp(-\theta x_i))\). Finally, we generate \(N\) observations \(y_1, ..., y_N\) of the target variable \(Y\) via \(y_i \sim \text{Bernoulli}(\phi_i)\).
import numpy as np
import pandas as pd
import plotly.express as px
= np.random.RandomState(seed=42)
ran
= 10000
num_samples = -1.2 theta_true
def sigmoid(theta, x):
return 1/(1 + np.exp(-theta * x))
# generate input data
= ran.normal(0, 1, num_samples)
x_data
# compute probabilities phi's (phi_i denotes the probability that y_i = 1)
= sigmoid(theta_true, x_data)
phi_s
# generate output data
= np.array([1 if ran.uniform(0, 1) < phi else 0 for phi in phi_s])
y_data
# combine input and output in a dataframe
= pd.DataFrame([x_data, y_data]).T
data = ['x', 'y'] data.columns
Let’s make a scatter plot of \(x\) vs \(y\):
= px.scatter(data, x='x', y='y')
fig fig.show()
Suppose we knew that this data comes from a sigmoid model, but we did not know the true value of parameter \(\theta\). We can see by naked eye that the true value must be negative, since samples with higher \(x\) tend to have decreased probability that \(y=1\). But to find the correct value of \(\theta\) we must ask: What is the value of \(\theta\) for which the data above is most likely? In other words, we look for the \(\theta\) that maximizes the (log) likelihood.
Let’s therefore plot this function for a range of values of \(\theta\).
def log_likelihood(theta):
= sigmoid(theta, x_data)
y_pred return np.sum(y_data * np.log(y_pred) + (1 - y_data) * np.log(1 - y_pred))
= np.linspace(-6, 6, 1000)
theta_vals = [log_likelihood(theta) for theta in theta_vals]
ll_vals
= pd.DataFrame([theta_vals, ll_vals]).T
output = ['theta', 'log-likelihood']
output.columns
= px.line(output, x='theta', y='log-likelihood')
fig fig.show()
We see that the log-likelihood has a maximum around 1. To identify the exaclt value, we could implement a gradient ascent algorithm and use it to maximize the log-likelihood – but python already thought about that, and it provides a library to minimize a given function: this is the reason why, in the code, we always use the negative log-likelihood.
from scipy.optimize import minimize
def neg_log_likelihood(theta):
return -log_likelihood(theta)
= minimize(neg_log_likelihood, x0=0.0) # x0 is the initial guess
result
= result.x[0]
theta_estimated print(f"Estimated theta: {theta_estimated:.2f}")
print(f"True theta: {theta_true:.2f}")
Estimated theta: -1.19
True theta: -1.20
Our estimation is very close to the true value we generated the data with: the MLE approach worked!
3. Application: Simple Price Simulations
Let’s now investigate some more realistic application of MLE. We will consider the daily price time series of a stock (in this case, Nvidia). Our purpose is to define a simple model for this time series, a model based on a few parameters. We will then use MLE to estimate such parameters, and finally generate some future possible paths according to the model.
First, load the data and visualize it.
# load data
= pd.read_csv('data/likelihood-1.csv', parse_dates=['date']).set_index('date')
df = df['price']
price
# visualize
= px.line(price, y='price')
fig
fig.show()
# compute auto-correlation and partial auto-correlation funcitons
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
= plt.subplots(1, 2, figsize=(14, 5))
fig, axes
=30, ax=axes[0])
plot_acf(price, lags0].set_title('Autocorrelation Function')
axes[0].set_xlabel('Lag')
axes[
=40, ax=axes[1])
plot_pacf(price, lags1].set_title('Partial Autocorrelation Function')
axes[1].set_xlabel('Lag')
axes[
plt.tight_layout() plt.show()
As it is usual for prices, the series presents a strong 1-lag autocorrelation. This suggests modelling the return: \[ R_t := \nabla P_t = P_t - P_{t-1} \]
= df['price'].diff(1)
R
= px.line(R)
fig ='return', showlegend=False)
fig.update_layout(yaxis_title fig.show()
The series \(R_t\) suffers from heteroscedasticity, meaning that the standard deviation is not constant in time. A simple solution is to consider the rate of return: \[ r_t := \dfrac{P_t - P_{t-1}}{P_{t-1}} \approx \nabla_t \ln(P_t) \]
= np.log(df['price']).diff(1)
r
= px.line(r)
fig ='rate of return', showlegend=False)
fig.update_layout(yaxis_title fig.show()
Not only does this makes the series homoscedastic, but it also means that, modelling \(r_t\), we ensure our forecasted prices will never be negative, since \[ \hat P_{t+1} := P_t \exp(\hat r_{t+1}) \]
The simplest model for \(r_t\) is that of a normal random variable: \[ r_t \sim \mathcal N(\mu_o, \sigma_o^2) \] This is equivalent to say that \(P_t\) follows as a geometric Brownian process with drift, whose stochastic differential equation reads \[ dP_t = P_t (\mu_o dt + \sigma_o dB_t) \] where \(B_t\) is a Brownian process, and drift \(\mu_o\) and volatility \(\sigma_o\) are constants.
Given that \(r_t \sim \mathcal N(\mu_o, \sigma_o^2)\), we could find parameters \(\mu_o\) and \(\sigma_o\) by simply computing the mean and the standard deviation of the time series \(r_t\). However, to see MLE in its glory, we follow a different route.
First, we discretize the stochastic differential equation: \[ P_{t+\Delta t} - P_t = P_t (\mu_o \Delta t + \sigma_o \epsilon_{t+\Delta t}) \] where we used the Brownian process property that \(B_{t+\Delta t} - B_{t} \sim \mathcal N(0, \Delta t)\) to replace the discretization of \(dB_t\) with \(\epsilon_{t+\Delta t} \sim \mathcal N(0, \Delta t)\). If we set \(\Delta t = 1\) (meaning that we’ll have to remember that the parameters are relative to 1-day time step), then the equation simplifies to \[ P_{t+1} = (1 + \mu_o) P_t + \sigma_o P_t Z_{t+1} \] where \(Z_t \sim \mathcal N(0, 1)\). This is equivalent to \[ P_t|P_{t-1} \sim \mathcal N(\mu(P_{t-1}), \sigma(P_{t-1})^2) \] where \[ \mu(P_{t-1}) := (1 + \mu_o) P_{t-1}, \ \ \ \ \ \sigma(P_{t-1}) = \sigma_o P_{t-1} \] If we think of \(P_t\) as the observation \(y_t\) of target variable \(Y\) and as \(P_{t-1}\) as the observation \(x_t\) of input feature(s) \(X\), we see that this system is a special case of the formula from above: \[ Y|X \sim \mathcal N(\mu(X), \sigma(X)^2) \] with the difference that now \(\sigma(X)\) is not constant, as it clearly depends on \(X_t = P_{t-1}\).
The likelihood of observations \(P_1, ..., P_T\) is \[\begin{align} P(D|\mu_o, \sigma_o) & = \prod_{t=1}^T p(Y_t|X_t, \mu_o, \sigma_o) = \prod_{t=1}^T \dfrac{1}{\sqrt{2\pi \sigma(X_t)^2}} \exp\left(-\dfrac{(Y_t - \mu(X_t))^2}{2\sigma(X_t)^2}\right) \\ \\ & = \prod_{t=1}^T \dfrac{1}{\sqrt{2\pi \sigma(P_{t-1})^2}} \exp\left(-\dfrac{(P_t - \mu(P_{t-1}))^2}{2\sigma(P_{t-1})^2}\right) \end{align}\]
Let’s write code to compute the value of the negative log-likelihood given as inputs the parameters \(\mu_o\) and \(\sigma_o\).
def mu_model(mu0, x):
return (1 + mu0) * x
def sigma_model(sigma0, x):
return sigma0 * x
def neg_log_likelihood(params, price):
= params
mu0, sigma0 if sigma0 <= 0: # ensure sigma0 > 0
return np.inf
# build the data, consisting of target y = P and of input X (P at t-1)
= pd.DataFrame()
data 'P_t'] = price
data['P_t-1'] = price.shift(1)
data[= data.dropna()
data
# separate target from input
= data['P_t-1']
x = data['P_t']
y
# compute mu(X) and sigma(X)
= mu_model(mu0, x)
mu = sigma_model(sigma0, x)
sigma
# negative log-likelihood for normal distribution
= 0.5 * (np.log(sigma**2) + ((y - mu) ** 2) / sigma**2).sum()
nll
return nll
Now we can estimate the parameters by minimizing the negative log-likelihood.
def estimate_params(df):
= df['price']
price
# initial guess
= 0
mu0_init = 0.01
sigma0_init
# minimize negative log-likelihood
= minimize(neg_log_likelihood, x0=(mu0_init, sigma0_init), args=(price))
result
= result.x
mu0, sigma0 return mu0, sigma0
= estimate_params(df) mu0_estim, sigma0_estim
print(f"NLL with initial guess: {neg_log_likelihood((0, 0.02), price):.2f}")
print(f"NLL with estimated parameters: {neg_log_likelihood((mu0_estim, sigma0_estim), price):.2f}")
print(f"\nParameters estimated via MLE:")
print(f"mu0 = {mu0_estim:.4f}")
print(f"sigma0 = {sigma0_estim:.4f}")
NLL with initial guess: 1330.40
NLL with estimated parameters: 709.95
Parameters estimated via MLE:
mu0 = 0.0029
sigma0 = 0.0345
We see that the negative log-likelihood is indeed lower when evaluated on the estimated parameters than when evaluated on the initial guess: a good indication that our minimization algorithm was successful.
Having identified the best parameters, we can use them to simulate several possible future paths that the price might take.
def simulate_paths(x0, mu0, sigma0, T=100, N=100, seed=None):
= np.random.RandomState(seed=seed)
ran
= np.zeros((N, T + 1))
paths 0] = x0
paths[:,
for t in range(1, T + 1):
= paths[:, t-1]
x
= mu_model(mu0, x)
mu = sigma_model(sigma0, x)
sigma
= ran.normal(mu, sigma)
paths[:, t]
= pd.DataFrame(paths[:, 1:]).T
paths return paths
= 250
periods = 10
simulations
= df['price']
historical_price = historical_price.iloc[-1]
last_price = simulate_paths(last_price, mu0_estim, sigma0_estim, T=periods, N=simulations, seed=123)
paths
= df.index[-1] + pd.Timedelta(days=1)
start = pd.date_range(start=start, periods=periods, freq='D')
index = index
paths.index
= pd.concat([historical_price] * paths.shape[1], axis=1)
historical_price_df = paths.columns
historical_price_df.columns = pd.concat((historical_price_df, paths), axis=0)
paths = 'date' paths.index.name
= px.line(paths)
fig ='price', legend_title_text='simulations')
fig.update_layout(yaxis_title fig.show()
Our model seems to be quite optimistic on the future of Nvidia… Perhaps it’s still a bit too simple!
To conclude, it is interesting to compare the parameters we estimated via MLE with those we would find by simply computing the mean and standard deviation of the rate of return \(r_t\).
print(f"\nParameters computed from r:")
print(f"mu0 = {r.mean():.4f}")
print(f"sigma0 = {r.std():.4f}")
Parameters computed from r:
mu0 = 0.0023
sigma0 = 0.0343
Given that this simpler method produces essentially the same results, one might wonder why we went through the trouble of performing MLE.
The answer is that the MLE method allows much more general models. For example, the input \(X_t\) of functions \(\mu(X_t)\) and \(\sigma(X_t)\) could be a vector (not just the scalar \(P_{t-1}\)), consisting of many autoregressive and seasonal components, as well as calendar features and other fundamental drivers. Moreover, the functions \(\mu\) and \(\sigma\) themsleves need not be linear in the parameters: we might want to consider more complex relations. In fact, we might even keep them as “black boxes” by modelling them via neural networks, letting the algorithm “learn” what the drift and the volatility should be in terms of inputs \(X_t\)… But this is a topic for a future post!