Understanding Linear Regression [2/3]
Ordinary Least Squares with Simple Linear Regression
Since you have already been introduced to simple linear regression, we won't discuss the details here.
The simple linear regression model is:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where we need to estimate the parameters, the intercept ($\beta_0$) and the slope ($\beta_1$).
Let's recall the Advertising dataset and the simple linear regression we performed on the scatter plot of sales vs. TV. With the help of Scikit-Learn, we were able to fit the best regression line among all the possibilities. Here is a snapshot:
The blue line is the simple linear regression line, with sales as the output $y$ and TV as the input $x$. The residual, or error, $\varepsilon_i$ is the difference between the observed value $y_i$ and the predicted value $\hat{y}_i$. The observed values are the actual output data points, shown as red dots in the figure, and the predicted values are the points on the blue regression line. The error for each data point is the vertical distance from the actual data point to the predicted point on the regression line.
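The predicted output value is:
$$\hat{y}_i = \beta_0 + \beta_1 x_i$$
The observed (actual) output value is:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where $\varepsilon_i$ is a random error, not a parameter. The error $\varepsilon_i = y_i - \hat{y}_i$ can be positive, negative, or occasionally 0. As we can see in the figure, the vertical lines lie on both sides of the regression line. To keep positive and negative errors from cancelling when we add them up, we square each error and sum the squares. The result is called the Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE):
$$\mathrm{SSE} = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$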
The summation is indexed from 1 to $n$, since we have $n$ samples. The Sum of Squared Errors (SSE) is a function of $\beta_0$ and $\beta_1$, and we can treat it as a loss function. The main principle of least squares is to choose the intercept ($\beta_0$) and the slope ($\beta_1$) so that this sum is as small as possible.
Thus, to estimate the parameters, we minimize the sum of squared errors. The SSE can also be written as:
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Here, $\hat{y}_i$ is replaced with the simple linear regression model equation. Since we minimize the SSE, it is also called an objective function. Because the SSE is a sum of squared terms, it can never be negative. If we plot the objective function against $\beta_0$ and $\beta_1$, we get a convex, bowl-shaped surface that opens upward.
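To make the idea of SSE as a function of the two parameters concrete, here is a minimal sketch. The helper name `sse` and the toy data are assumptions made for illustration; they are not taken from the Advertising dataset.

```python
def sse(beta0, beta1, xs, ys):
    """Sum of squared errors of the candidate line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data lying exactly on the line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

print(sse(1.0, 2.0, xs, ys))  # the line that generated the data
print(sse(0.0, 2.0, xs, ys))  # intercept off by 1
print(sse(1.0, 1.0, xs, ys))  # slope off by 1
```

The first call returns 0 because the line passes through every point. Moving either parameter away from that line increases the SSE, which is the behaviour we expect from a bowl-shaped objective.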
The parameters at the minimum point are obtained from calculus by setting the first derivative of the objective function to 0, since the gradient (slope) is always 0 at the minimum. This idea is developed in detail in the upcoming chapter on Gradient Descent. We have two unknown parameters, the intercept ($\beta_0$) and the slope ($\beta_1$), so we take the partial derivative of SSE with respect to each of them, set both partial derivatives to 0, and solve for $\beta_0$ and $\beta_1$.
Note: To reduce clutter in the derivation below, we omit the summation index. As said earlier, the summation always runs from 1 to $n$, where $n$ is the number of samples.
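Taking the partial derivative with respect to $\beta_0$:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_0} = \frac{\partial}{\partial \beta_0} \sum (y_i - \beta_0 - \beta_1 x_i)^2$$
Note that the derivative of a sum is the sum of the derivatives, so we can take the derivative inside the summation:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_0} = \sum \frac{\partial}{\partial \beta_0} (y_i - \beta_0 - \beta_1 x_i)^2$$
Now, applying the power rule and the chain rule, we get:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_0} = \sum 2\,(y_i - \beta_0 - \beta_1 x_i)(-1) = -2 \sum (y_i - \beta_0 - \beta_1 x_i) \qquad (1)$$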
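Now, with respect to $\beta_1$:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_1} = \frac{\partial}{\partial \beta_1} \sum (y_i - \beta_0 - \beta_1 x_i)^2$$
Again, the derivative of a sum is the sum of the derivatives, so we take the derivative inside the summation:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_1} = \sum \frac{\partial}{\partial \beta_1} (y_i - \beta_0 - \beta_1 x_i)^2$$
Applying the power rule, the 2 comes out front and the exponent becomes 1. We also apply the chain rule to account for the coefficient of $\beta_1$, which is $-x_i$:
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_1} = \sum 2\,(y_i - \beta_0 - \beta_1 x_i)(-x_i)$$
Cleaning up a bit,
$$\frac{\partial\, \mathrm{SSE}}{\partial \beta_1} = -2 \sum x_i\,(y_i - \beta_0 - \beta_1 x_i) \qquad (2)$$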
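As a quick sanity check of the two partial derivatives, the sketch below compares the analytic expressions from equations (1) and (2) with central finite differences of the SSE. The toy data and the function names (`sse`, `grad_sse`, `numeric_grad`) are made up for illustration and are not part of the original material.

```python
def sse(beta0, beta1, xs, ys):
    """Sum of squared errors of the line y = beta0 + beta1 * x."""
    return sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))

def grad_sse(beta0, beta1, xs, ys):
    """Analytic partial derivatives:
    d/d(beta0) = -2 * sum(residual), d/d(beta1) = -2 * sum(x * residual)."""
    residuals = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]
    d_beta0 = -2 * sum(residuals)
    d_beta1 = -2 * sum(x * r for x, r in zip(xs, residuals))
    return d_beta0, d_beta1

def numeric_grad(beta0, beta1, xs, ys, h=1e-6):
    """Central finite differences, used only as an independent check."""
    d_beta0 = (sse(beta0 + h, beta1, xs, ys) - sse(beta0 - h, beta1, xs, ys)) / (2 * h)
    d_beta1 = (sse(beta0, beta1 + h, xs, ys) - sse(beta0, beta1 - h, xs, ys)) / (2 * h)
    return d_beta0, d_beta1

# Hypothetical toy data and an arbitrary point (beta0, beta1) = (0.5, 1.5)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]

analytic = grad_sse(0.5, 1.5, xs, ys)
numeric = numeric_grad(0.5, 1.5, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two pairs of numbers should agree to several decimal places. A mismatch would point to a sign or chain-rule slip in the derivation.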
Now, we set the partial derivatives in equations (1) and (2) equal to 0:
$$-2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0$$
$$-2 \sum x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0$$
Here, we have two equations and two unknowns, and we are going to solve them to find our parameters. But how do we do that?
First, we will get an expression for $\beta_0$ from the first equation. That expression will involve $\beta_1$, so we will substitute it into the second equation and solve for $\beta_1$. Let's solve the first equation.
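Solving for $\beta_0$ by setting equation (1) to 0,
$$-2 \sum (y_i - \beta_0 - \beta_1 x_i) = 0$$
We can divide both sides by $-2$ so that we get,
$$\sum (y_i - \beta_0 - \beta_1 x_i) = 0$$
If we carry the summation through each term inside the bracket, we get:
$$\sum y_i - \sum \beta_0 - \sum \beta_1 x_i = 0$$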
Note that, with respect to the summation, $\beta_0$ and $\beta_1$ are constants. Their estimated values may differ from one dataset to another, but within a given dataset they are the same for every sample. Since they do not depend on the summation index, they can come outside the summation:
$$\sum y_i - n\beta_0 - \beta_1 \sum x_i = 0$$
The sum of $\beta_0$ from 1 to $n$ becomes $n\beta_0$, and $\beta_1$ comes out of the summation.
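Now, isolating the $n\beta_0$ term, we get:
$$n\beta_0 = \sum y_i - \beta_1 \sum x_i$$
Dividing both sides by $n$, we get:
$$\beta_0 = \frac{\sum y_i}{n} - \beta_1 \frac{\sum x_i}{n}$$
The sum of all the $y_i$ divided by $n$ gives their mean $\bar{y}$, and the same holds for the $x_i$ and $\bar{x}$. So, we end up with:
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$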
But this expression cannot be evaluated without knowing the value of $\beta_1$. So, we substitute this expression for $\beta_0$ into the equation where the partial derivative with respect to $\beta_1$ is set to 0.
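Hence, solving for $\beta_1$,
$$-2 \sum x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0$$
We can divide both sides by $-2$ so that we get,
$$\sum x_i\,(y_i - \beta_0 - \beta_1 x_i) = 0$$
Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$, we get:
$$\sum x_i\,(y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i) = 0$$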
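Now we are getting somewhere, since the only unknown in the above expression is $\beta_1$. Next, we will isolate $\beta_1$. Let's first gather similar terms together:
$$\sum x_i\,(y_i - \bar{y}) - \beta_1 \sum x_i\,(x_i - \bar{x}) = 0$$
$$\beta_1 = \frac{\sum x_i\,(y_i - \bar{y})}{\sum x_i\,(x_i - \bar{x})}$$
This is equivalent to the more familiar centered form
$$\beta_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
because $\sum (y_i - \bar{y}) = 0$ and $\sum (x_i - \bar{x}) = 0$. Expanding the centered numerator gives $\sum x_i (y_i - \bar{y}) - \bar{x} \sum (y_i - \bar{y}) = \sum x_i (y_i - \bar{y})$, and expanding the centered denominator gives $\sum x_i (x_i - \bar{x}) - \bar{x} \sum (x_i - \bar{x}) = \sum x_i (x_i - \bar{x})$. So the numerators and the denominators of the two expressions for $\beta_1$ are equal.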
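The equality of the two forms can also be checked numerically. The following is a small sketch on made-up numbers (the lists `xs` and `ys` are hypothetical), computing the numerator and denominator both ways:

```python
# Hypothetical toy data
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 3.0, 7.0, 10.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Form obtained directly after substituting beta0 into the second equation
num_a = sum(x * (y - y_bar) for x, y in zip(xs, ys))
den_a = sum(x * (x - x_bar) for x in xs)

# Centered form
num_b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den_b = sum((x - x_bar) ** 2 for x in xs)

print("numerators:  ", num_a, num_b)
print("denominators:", den_a, den_b)
print("slopes:      ", num_a / den_a, num_b / den_b)
```

Both forms give the same slope. The centered form is the one usually quoted, and it is the one implemented later in this section.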
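Since the parameters are estimates, we usually put hats on them. The key equations of the estimated parameters for simple linear regression are:
$$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
From the samples provided, we first compute $\hat{\beta}_1$ from the first expression and then substitute its value into the second expression to obtain $\hat{\beta}_0$.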
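The two-step recipe above can be sketched in a few lines of NumPy. The function name `ols_fit` and the toy data are assumptions made for illustration; the data is generated from $y = 1 + 2x$ so that the expected estimates are known in advance.

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form OLS estimates (beta0_hat, beta1_hat) for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Step 1: slope from the centered sums
    beta1_hat = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Step 2: intercept from the means and the slope
    beta0_hat = y_mean - beta1_hat * x_mean
    return beta0_hat, beta1_hat

# Hypothetical toy data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
beta0_hat, beta1_hat = ols_fit(x, y)
print(beta0_hat, beta1_hat)
```

On this data the function recovers an intercept of 1 and a slope of 2. Note that the denominator is 0 when all the inputs are identical, so the sketch assumes the inputs take at least two distinct values.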
Implementation on Real World Dataset
For the implementation, we will use the same Advertising dataset.
A popular introductory statistics book, An Introduction to Statistical Learning, provides this dataset on its website. The dataset can be downloaded from the following address:
http://faculty.marshall.usc.edu/gareth-james/ISL/Advertising.csv
The dataset has three inputs, one for each advertising medium: TV, radio, and newspaper. The output variable is sales. The task is to predict sales from the amount invested in each advertising medium.
# Imports
import numpy as np
import pandas as pd
import matplotlib as mpl
from matplotlib import pyplot as plt
data_path = "https://storage.googleapis.com/codehub-data/1-lv2-2-2-Advertisement.csv"
# Read the CSV data from the link
data_df = pd.read_csv(data_path,index_col=0)
# Print out first 5 samples from the DataFrame
print(data_df.head())
fig = plt.figure(figsize=(15,4))
gs = mpl.gridspec.GridSpec(1,3)
# Plot of sales vs TV
ax = fig.add_subplot(gs[0])
ax.scatter(data_df["TV"], data_df["sales"], color="red", marker=".")
# Plot of sales vs radio
ax = fig.add_subplot(gs[1])
ax.scatter(data_df["radio"], data_df["sales"], color="green", marker=".")
# Plot of sales vs newspaper
ax = fig.add_subplot(gs[2])
ax.scatter(data_df["newspaper"], data_df["sales"], color="blue", marker=".")
The first plot shows a sharp upward trend in the number of units sold as TV advertising increases. A similar trend is also found as radio advertising increases. However, in the last plot, there does not appear to be a relationship between newspaper advertising and the number of units sold.
Simple Linear Regression with Ordinary Least Squares
Earlier we used Scikit-Learn's LinearRegression predictor object to estimate $\beta_0$ and $\beta_1$. Now, we will implement the formulas derived from OLS to estimate the parameters.
fig = plt.figure(figsize=(15,4))
gs = mpl.gridspec.GridSpec(1,3)
# function for training model and plotting
def train_plot(data_df, feature, ax, c):
    # initializing our inputs and outputs as 1-D arrays
    X = data_df[feature].values
    Y = data_df["sales"].values

    # mean of our inputs and outputs
    x_mean = np.mean(X)
    y_mean = np.mean(Y)

    # total number of samples
    n = len(X)

    # using the OLS formula to calculate the b1 and b0
    numerator = 0
    denominator = 0
    for i in range(n):
        numerator += (X[i] - x_mean) * (Y[i] - y_mean)
        denominator += (X[i] - x_mean) ** 2

    b1 = numerator / denominator
    b0 = y_mean - (b1 * x_mean)

    # predicted values on the fitted line
    y_hat = b0 + b1 * X

    # plot the data points and the regression line
    ax.scatter(X, Y, color=c, marker=".")
    ax.plot(X, y_hat, color="black")
    ax.set_title("$y$ = %.3f + %.3f$x$" % (b0, b1))
# Train model using TV data to predict sales
ax0 = fig.add_subplot(gs[0])
train_plot(data_df, "TV", ax0, "red")
# Train model using radio data to predict sales
ax1 = fig.add_subplot(gs[1])
train_plot(data_df, "radio", ax1, "green")
# Train model using newspaper data to predict sales
ax2 = fig.add_subplot(gs[2])
train_plot(data_df, "newspaper", ax2, "blue")
Here, we performed a simple linear regression in each of the scatter plots.
TV vs. sales
TV is the input variable, one of the advertising mediums, and sales is the output variable. The parameters estimated by OLS fit the data points well. The intercept ($\beta_0$) is estimated as approximately 7.03 and the slope ($\beta_1$) as approximately 0.048. These values are the same as those obtained through Scikit-Learn. The first plot depicts the simple linear regression with TV as input and sales as output.
radio vs. sales
radio is the input variable, one of the advertising mediums, and sales is the output variable. The parameters estimated by OLS fit the data points reasonably well. The intercept ($\beta_0$) is estimated as approximately 9.31 and the slope ($\beta_1$) as approximately 0.20. These values are the same as those obtained through Scikit-Learn. The second plot depicts the simple linear regression with radio as input and sales as output.
newspaper vs. sales
newspaper is the input variable, one of the advertising mediums, and sales is the output variable. OLS still returns the least-squares line, although the points are widely scattered around it, in line with the weak relationship seen in the scatter plot earlier. The intercept ($\beta_0$) is estimated as approximately 12.35 and the slope ($\beta_1$) as approximately 0.05. These values are the same as those obtained through Scikit-Learn. The third plot depicts the simple linear regression with newspaper as input and sales as output.
You can check the previous reading material to confirm that the parameter values match.
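If the earlier material is not at hand, an independent cross-check is possible with NumPy alone. The sketch below is an assumption-laden illustration: it uses synthetic data rather than the Advertising dataset (so no download is needed), and it compares the OLS formulas with `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Advertising dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Closed-form OLS estimates from the formulas derived in this section
x_mean, y_mean = x.mean(), y.mean()
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b0 = y_mean - b1 * x_mean

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)

print("OLS formulas:", b0, b1)
print("np.polyfit:  ", intercept, slope)
```

The two lines of output should agree up to floating-point rounding, since both approaches minimize the same sum of squared errors.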