Linear Regression: Model Representation and Cost Functions

Liesl Anggijono
Nov 7, 2020


Supervised learning is one of the most commonly used branches of machine learning today. In this method, the data is labeled to tell the machine exactly what patterns it should look for. It is a bit like a sniffer dog that hunts down a target once its handler has shown it which scent to follow. The method is pivotal in many fields, from modeling the relationship between a company’s advertising budget and its sales all the way to the relationship between radiation therapy and tumor size. Supervised learning maps an input to an output based on example pairs, inferring a function from labelled training data.

Haven’t fully grasped the idea yet? Let’s try to visualize it. Say you want to help a friend sell a 1200-square-foot house in the city. You refer back to data on houses sold in the area, which is our training data: the size (x) and the price (y) of each house. This is supervised learning because we are given the right answer for every example in the data and learn from those answers. In this particular problem, we are told the actual price each house sold for. It is also a regression problem, because we are predicting a real-valued output (the price). Here x denotes the input feature, and y denotes the output or target variable that we are going to predict.

(Figure: house size on the x axis, house price on the y axis)

A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the list of all such pairs is the training set that we’ll learn from. The superscript (i) is just an index into the training set, not an exponent. Our goal is to learn a function h: X → Y so that h(x) is a good predictor for the corresponding value of y. As noted above, this is a regression problem because the target variable we’re predicting is continuous.

Cost Function

A cost function lets us figure out how to fit the best straight line to our data by measuring the accuracy of our hypothesis function. In other words, how accurate is h? It takes an average over the training set of the differences between the predictions h(x) and the actual outputs y. Here $\theta_0$ and $\theta_1$ are the parameters of the model. For a straight line, $h(x) = \theta_0 + \theta_1 x$, so $\theta_0$ is the intercept and $\theta_1$ is the slope. What we’re trying to do is come up with values for $\theta_0$ and $\theta_1$ so that the resulting straight line fits the data well.
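As a minimal sketch, a straight-line hypothesis can be written as a small Python function. The names `hypothesis`, `theta0` and `theta1`, as well as the parameter values, are made up for illustration and do not come from real housing data.

```python
def hypothesis(x, theta0, theta1):
    """Predict y for input x using the straight line theta0 + theta1 * x."""
    return theta0 + theta1 * x


# Hypothetical parameters: a base price of 50 plus 0.25 per square foot
# (arbitrary units, chosen only to show how a prediction is computed).
predicted_price = hypothesis(1200, theta0=50.0, theta1=0.25)
print(predicted_price)
```

Changing `theta0` shifts the line up or down, while changing `theta1` changes how steeply the predicted price grows with size.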

Note: m denotes the number of training examples.

Idea: choose $\theta_0$ and $\theta_1$ so that h(x) is close to y for our training examples (x, y).

Squared Error Cost Function

How do we do that? We choose the parameters $\theta_0$ and $\theta_1$ so that h(x), the value we predict for input x, is as close as possible to the actual value y for each training example. To measure closeness, we take the difference between h(x) and y and square it. We then sum over the training set, from i = 1 to m, the squared difference between the predicted price of house number i and the price that house actually sold for. The cost is $\frac{1}{2}\bar{x}$, where $\bar{x}$ is the mean of the squares of $h(x^{(i)}) - y^{(i)}$, that is, of the differences between the predicted and the actual values. This function is called the squared error function, or mean squared error. The mean is halved as a convenience for computing gradient descent, because the derivative of the square term cancels the ½.
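To see why the one-half is convenient, it may help to differentiate the halved squared error for a single training example with respect to the slope parameter. This is a sketch that assumes a straight-line hypothesis $h(x) = \theta_0 + \theta_1 x$:

```latex
\frac{\partial}{\partial \theta_1} \, \frac{1}{2}\bigl(h(x) - y\bigr)^2
  = \frac{1}{2} \cdot 2\,\bigl(h(x) - y\bigr) \cdot \frac{\partial h(x)}{\partial \theta_1}
  = \bigl(h(x) - y\bigr)\, x
```

The factor of 2 produced by differentiating the square cancels the one-half, which leaves a cleaner expression.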

Let’s go over the equation again. We minimize our average:

$$\min_{\theta_0, \theta_1} \; \frac{1}{2m} \sum_{i=1}^{m} \bigl(\hat{y}^{(i)} - y^{(i)}\bigr)^2 = \min_{\theta_0, \theta_1} \; \frac{1}{2m} \sum_{i=1}^{m} \bigl(h(x^{(i)}) - y^{(i)}\bigr)^2$$

where $\hat{y}^{(i)} = h(x^{(i)})$ is the prediction for example i. The constant one-half in front does not change the answer: minimizing one-half of something gives the same parameter values as minimizing the thing itself. The notation $\min_{\theta_0, \theta_1}$ means we are finding the values of $\theta_0$ and $\theta_1$ that cause the expression to be minimized, and the expression depends on $\theta_0$ and $\theta_1$ through h. So we’re basically finding the values of theta zero and theta one so that the average, 1 over 2m times the sum of squared errors between the predictions on the training set and the actual prices of the houses, is minimized.
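The expression described above can be sketched in a few lines of plain Python. The function names (`hypothesis`, `squared_error_cost`) and the toy data are hypothetical, chosen only for illustration. They are not real house sales.

```python
def hypothesis(x, theta0, theta1):
    """Straight-line prediction theta0 + theta1 * x."""
    return theta0 + theta1 * x


def squared_error_cost(xs, ys, theta0, theta1):
    """Return 1/(2m) times the sum of squared prediction errors."""
    m = len(xs)  # number of training examples
    total = 0.0
    for x, y in zip(xs, ys):
        error = hypothesis(x, theta0, theta1) - y  # prediction minus actual value
        total += error ** 2
    return total / (2 * m)


# Made-up toy data in arbitrary units (assumption, not from the article).
sizes = [1.0, 2.0, 3.0]
prices = [1.0, 2.0, 3.0]

# A line that passes through every point versus a line that is too flat.
perfect_fit = squared_error_cost(sizes, prices, theta0=0.0, theta1=1.0)
poor_fit = squared_error_cost(sizes, prices, theta0=0.0, theta1=0.5)
print(perfect_fit, poor_fit)
```

The line that passes through every point has zero cost, while the flatter line has a larger cost. Minimizing the cost means searching for the parameter pair with the smallest such value.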

Our training data is scattered on the graph. We are trying to draw the best possible straight line through these data points, so that the average squared vertical distance of the points from the line is at its minimum.

This squared error cost function works well for most regression problems, although there are other cost functions that are reasonable too. It is a sensible default for linear regression and is one of the most commonly used.

A short summary of our function
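One way to collect the pieces in a single place is sketched below. It assumes the straight-line hypothesis with parameters theta zero and theta one. The symbol J for the cost is an assumed name, since the text above does not give the cost function one.

```latex
\begin{aligned}
\text{Hypothesis:}    \quad & h(x) = \theta_0 + \theta_1 x \\
\text{Parameters:}    \quad & \theta_0,\ \theta_1 \\
\text{Cost function:} \quad & J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \bigl(h(x^{(i)}) - y^{(i)}\bigr)^2 \\
\text{Goal:}          \quad & \min_{\theta_0,\, \theta_1} \; J(\theta_0, \theta_1)
\end{aligned}
```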