# Linear Regression

*This tutorial was written by, and is reproduced with permission from, Peter Ponzo*

We assume that some set of variables, **y**_{1}, **y**_{2}, … **y**_{K}, is dependent upon variables **x**_{k1}, **x**_{k2}, … **x**_{kn} (for k = 1 to K). We assume the relationship between the **y**s and **x**s is “almost” linear, like so:

[1]

y_{1}=ß_{0}+ß_{1}x_{11}+ß_{2}x_{12}+ … +ß_{n}x_{1n}+e_{1}

y_{2}=ß_{0}+ß_{1}x_{21}+ß_{2}x_{22}+ … +ß_{n}x_{2n}+e_{2}

……..

y_{K}=ß_{0}+ß_{1}x_{K1}+ß_{2}x_{K2}+ … +ß_{n}x_{Kn}+e_{K}

#### Why so many variables?

Well, suppose we note that, when the xs have the values x_{11}, x_{12}, … x_{1n}, the y-value is y_{1}. We suspect an almost linear relationship, so we try again, noting that x-values x_{21}, x_{22}, … x_{2n} result in a y-value of y_{2}. We continue, for K observations, where, for x-values x_{K1}, x_{K2}, … x_{Kn} the result is a y-value of y_{K}. Then in an attempt to identify the “almost” linear relationship, we assume the relationship [1].

We can write this in matrix format, like so:

[2]

    [ y_{1} ]   [ 1  x_{11}  x_{12}  …  x_{1n} ] [ ß_{0} ]   [ e_{1} ]
    [ y_{2} ] = [ 1  x_{21}  x_{22}  …  x_{2n} ] [ ß_{1} ] + [ e_{2} ]
    [   …   ]   [ …                            ] [   …   ]   [   …   ]
    [ y_{K} ]   [ 1  x_{K1}  x_{K2}  …  x_{Kn} ] [ ß_{n} ]   [ e_{K} ]

or, more elegantly:

[3]

**y = Xß + e**

where y, ß and e are column vectors and X is a K x (n+1) matrix.
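In code, the matrix X simply stacks the observations row by row, with a leading column of 1s carrying ß_{0}. A minimal sketch with NumPy (the x- and y-values here are invented purely for illustration):

```python
import numpy as np

# K = 3 observations of n = 2 explanatory variables (made-up numbers).
x_obs = np.array([[1.0, 2.0],
                  [2.0, 3.0],
                  [3.0, 5.0]])     # K x n observed x-values
y = np.array([6.0, 9.0, 14.0])    # K-vector of observed y-values

# Prepend a column of 1s so that ß_0 plays the role of an intercept:
X = np.column_stack([np.ones(len(x_obs)), x_obs])  # K x (n+1) design matrix
print(X)
```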

We attempt to minimize the sum of the squares of the errors (also called “residuals”) by clever choice of the parameters ß_{0}, ß_{1}, … ß_{n}.

E = e_{1}^{2} + e_{2}^{2} + … + e_{K}^{2} = e^{T}e where the row vector e^{T} denotes the transpose of e.

(Note that this sum of squares is just the square of the magnitude of the vector e … so we’re making the vector as small as possible.) We set all the derivatives to zero, to locate the minimum.
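The objective E is just the squared length of the residual vector, which is a one-liner to check numerically (the residual values below are hypothetical):

```python
import numpy as np

e = np.array([0.5, -1.0, 0.25])   # a hypothetical residual vector
E = e @ e                         # e^T e = sum of squared residuals

# Same thing as the squared magnitude |e|^2:
assert np.isclose(E, np.linalg.norm(e) ** 2)
print(E)
```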

For each j = 0 to n we have:

*∂E/∂ß_{j} = 2 e_{1} ∂e_{1}/∂ß_{j} + 2 e_{2} ∂e_{2}/∂ß_{j} + … + 2 e_{K} ∂e_{K}/∂ß_{j} = 2 Σ e_{k} ∂e_{k}/∂ß_{j} = 0*

the summation being from k = 1 to K.

Since e_{k} = y_{k} – ß_{0} – ß_{1}x_{k1} – ß_{2}x_{k2} – … – ß_{n}x_{kn} (from [1]) then:

*e_{k} ∂e_{k}/∂ß_{j} = [y_{k} – ß_{0} – ß_{1}x_{k1} – ß_{2}x_{k2} – … – ß_{n}x_{kn}] (–x_{kj}) = –(x_{kj}) [y_{k} – ß_{0} – ß_{1}x_{k1} – ß_{2}x_{k2} – … – ß_{n}x_{kn}]*

(here we take x_{k0} = 1, the first column of X, so the formula covers j = 0 as well)

= –(the kj^{th} component of X) * [the k^{th} component of y – Xß]

= –(the jk^{th} component of X^{T}) * [the k^{th} component of y – Xß]

Then we have n+1 equations (for j = 0 to n) like:

[4]

Σ e_{k} ∂e_{k}/∂ß_{j} = – Σ (the jk^{th} component of X^{T}) * [the k^{th} component of y – Xß] … summed over k.

But [4] defines the j^{th} component of the (n+1)-component column vector: X^{T} [ y – Xß ]. Setting them to zero gives us n+1 such linear equations to solve for the n+1 parameters ß_{0}, ß_{1}, ß_{2} … ß_{n}, namely:

[5] X^{T} [ y – Xß ] = 0.
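Equation [5] says the residual vector y – Xß is orthogonal to every column of X. With invented data, and using NumPy's least-squares routine to fit ß, that orthogonality can be checked directly:

```python
import numpy as np

# Made-up observations for illustration only.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # K x (n+1), first column all 1s
y = np.array([1.1, 1.9, 3.2])

# Fit ß with NumPy's least-squares solver:
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# [5]: X^T (y - X ß) should be the zero vector (up to rounding).
normal_eq = X.T @ (y - X @ beta)
print(normal_eq)
```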

#### What about the xs and ys?

We know them. They’re our observations, and our goal is to determine the “almost” linear relationship between them. That means finding the (n+1) ß-values, which we do by solving [5] for:

[6] ß = (X^{T}X)^{-1}X^{T}y where (X^{T}X)^{-1} denotes the inverse of X^{T}X.
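Equation [6] translates almost verbatim into code. A sketch with invented numbers (forming the explicit inverse is fine for exposition, but numerically one would prefer np.linalg.solve or np.linalg.lstsq):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # K x (n+1) design matrix (made-up)
y = np.array([2.0, 3.0, 5.0])

# [6]: ß = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

# Sanity check against NumPy's least-squares solver:
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta, beta_lstsq)
print(beta)
```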

#### Don’t you find that … uh, a little confusing?

It’ll look better if we elevate its position, like so:

If we wish to find the “best” linear relationship between the values of y_{1}, y_{2}, … y_{K} and x_{k1}, x_{k2}, … x_{kn} (for k = 1 to K) according to:

**y**_{1} = **ß**_{0} + **ß**_{1}**x**_{11} + **ß**_{2}**x**_{12} + … +**ß**_{n}**x**_{1n} + **e**_{1}

**y**_{2} = **ß**_{0} + **ß**_{1}**x**_{21} + **ß**_{2}**x**_{22} + … +**ß**_{n}**x**_{2n} + **e**_{2}

……..

**y**_{K} = **ß**_{0} + **ß**_{1}**x**_{K1} + **ß**_{2}**x**_{K2} + … +**ß**_{n}**x**_{Kn} + **e**_{K}

or

    [ y_{1} ]   [ 1  x_{11}  x_{12}  …  x_{1n} ] [ ß_{0} ]   [ e_{1} ]
    [ y_{2} ] = [ 1  x_{21}  x_{22}  …  x_{2n} ] [ ß_{1} ] + [ e_{2} ]
    [   …   ]   [ …                            ] [   …   ]   [   …   ]
    [ y_{K} ]   [ 1  x_{K1}  x_{K2}  …  x_{Kn} ] [ ß_{n} ]   [ e_{K} ]

or

**y = Xß + e**

where the K-vector e denotes the errors (or residuals) in the linear approximation, y is a K-vector, ß an (n+1)-vector and X a K × (n+1) matrix, then we can minimize the size of the residuals by selecting the ß-values according to: ß = (X^{T}X)^{-1}X^{T}y

#### Well it doesn’t look better to me!

Here’s an example where K = n = 3 and ß_{0} = 0 (so we’re looking for ß_{1}, ß_{2} and ß_{3}, and we ignore that first column of 1s in X):

Note the assumed values for the X matrix and the column vector y.

We run thru’ the ritual, calculating X^{T} and X^{T}X etc. etc. … and finally the ß parameters.

The resultant (almost linear) relationship has errors (as expected!), denoted by e_{1}, e_{2} and e_{3} … but they’re pretty small.

If, instead of those “best” choices for the parameters, we had chosen a different set, say ß’, the errors would be significantly greater.
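The original example’s X and y values were shown as a coloured image and aren’t reproduced here; with made-up numbers in their place, the whole ritual (K = n = 3, ß_{0} = 0, so no column of 1s) looks like this, including the check that a different choice ß’ gives larger errors:

```python
import numpy as np

# Made-up values standing in for the tutorial's coloured X and y:
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 3.0],
              [1.0, 3.0, 2.0]])   # K = n = 3, no column of 1s (ß_0 = 0)
y = np.array([4.1, 6.9, 6.1])

beta = np.linalg.inv(X.T @ X) @ X.T @ y   # [6]
e = y - X @ beta                          # residuals e_1, e_2, e_3
E_best = e @ e                            # sum of squared residuals

# A different choice ß' gives a strictly larger sum of squares:
beta_prime = beta + np.array([0.1, -0.1, 0.1])
e_prime = y - X @ beta_prime
E_prime = e_prime @ e_prime
print(E_best, E_prime)
```

One caveat about this sketch: here X is square and invertible, so the fit is exact and E is zero up to floating-point rounding; with more observations than parameters (K > n) the residuals would generally be small but nonzero.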