AI is built on the premise that for every event or object in the universe, some values are more probable than others. There are places where it is more probable to find something rather than nothing, atoms that are more probable in a given molecule than others, species more probable in an environment than others, and things that a person with a specific personality in a given context is more probable to say than others. That probability can sometimes be expressed with a traditional equation, like the Schrödinger equation that describes the probability of finding an electron in a specific place in an atom. However, that equation took Schrödinger years and earned him a Nobel prize.

As cool as it would be, it is impractical to expect people to go around building Nobel-prize-winning equations for every probability function, on every problem, on any given day. However, machine learning provides a way to brute-force an approximation to the unknown probabilities underlying a large enough sample of the relationship or object of interest, by modifying the parameters of a more general equation. The simplest ML algorithm is linear regression, which works with a few key ideas.

Every example in a dataset with the same variables in the same order can be expressed as a point in a vector space with as many dimensions as there are variables. For example, any place on Earth can be mapped as a point in 2D space, where the first dimension represents its distance from an arbitrary origin from south to north, and the second from west to east. We could add a third dimension representing the altitude of the place with respect to sea level. However, if we let go of physical space and embrace abstract mathematical space, the same can be done with houses. The first dimension can represent the number of rooms, the second the number of bathrooms, and the third the zip code. And because this is no longer physical space, there can be a fourth dimension for the size of the land plot, a fifth for whether or not there is a garden, and as many dimensions as the variables require, even if they can't be easily pictured as actual physical dimensions. The regular 2D or 3D maths work the same in 700D or 168289D.
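For example, a house could be sketched as a NumPy vector with one dimension per variable (the features and values below are made up for illustration):

```python
import numpy as np

# Hypothetical house: [rooms, bathrooms, zip code, plot size in m², has garden]
house = np.array([3, 2, 28014, 120, 1])
another_house = np.array([4, 1, 28014, 90, 0])

# It is just a point in 5-dimensional space; regular vector maths applies,
# e.g. the difference vector is the "movement" from one house to the other.
difference = another_house - house

print(house.shape)   # (5,) → one dimension per variable
print(difference)    # [  1  -1   0 -30  -1]
```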

Addition represents movement and multiplication represents scaling or rotation. 2 + 5 = 7 can be thought of as a 1D vector of size 2 moving in the direction of a 1D vector of size 5. The position this new vector lands on is the same as where a 1D vector of size 7 would be, so the result is 7. 2 x 5 = 10 can be thought of as a 1D vector of size 2 getting 5 times bigger, or as moving the vector in its own direction by its own size 5 times. The position it lands on is the same as where a 1D vector of size 10 would be, so the result is 10. Multiplying by negative numbers inverts the direction of the movement but keeps the same size: multiplying 2 by -5 still gets you to size 10, but on the other side of the space you're moving in, that is, -10.
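The same picture in code, using 2D vectors (chosen arbitrarily) so there is more than one direction to move in:

```python
import numpy as np

a = np.array([2.0, 1.0])
b = np.array([5.0, 3.0])

moved = a + b          # addition: move a in the direction of b
scaled = a * 5         # multiplication: make a 5 times bigger
flipped = a * -5       # negative scaling: same size, opposite direction

print(moved)    # [7. 4.]
print(scaled)   # [10.  5.]
print(flipped)  # [-10.  -5.]
```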

Vector coordinates are relative to arbitrary axes called basis vectors. Traditionally, x, y and z are used for 3D space, but they could just as easily be called a, b and c, or pineapple, orange and banana; the names are completely arbitrary. The only requirement for a set of vectors to form a basis is that every vector in the space can be formed as a unique combination of additions and multiplications of these basis vectors, while none of them can be formed from the others. In 3D space, x is [1, 0, 0], y is [0, 1, 0] and z is [0, 0, 1]. Any vector can be formed by multiplying and adding these; for example, vector [7, 2, 4] is actually 7x, 2y and 4z summed together. Vector [2, 2, 2] cannot be added to this basis because it can already be formed from the others (it is 2x + 2y + 2z), while y cannot be formed by transforming x and z.
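The claim that [7, 2, 4] is 7x, 2y and 4z summed together can be checked directly:

```python
import numpy as np

x = np.array([1, 0, 0])
y = np.array([0, 1, 0])
z = np.array([0, 0, 1])

v = 7 * x + 2 * y + 4 * z         # scale each basis vector, then sum
print(v)                          # [7 2 4]

# [2, 2, 2] is already reachable from the basis, so it can't join it
redundant = 2 * x + 2 * y + 2 * z
print(redundant)                  # [2 2 2]
```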

The same vector can be expressed in different bases, and vectors in one basis can be transformed into another by a transformation matrix that contains the basis vectors of basis B expressed in the coordinates of basis A. For example, basis A uses x = [1, 0] and y = [0, 1] as basis vectors. There is a vector [7, 11] in basis A that is expressed as [1, 1] in basis B. It represents the same actual point in space, seen from two different coordinate systems. For vector [1, 1] in basis B to be vector [7, 11] in basis A, B's first basis vector has to be [7, 0] and its second [0, 11] in basis A coordinates. If we arrange both in matrix notation, we get a transformation matrix that converts any point from basis B into basis A coordinates; the matrix that represents the reverse operation is called its inverse.

[1, 1] * [[7, 0], [0, 11]] = [7, 11]

[7, 11] * [[1/7, 0], [0, 1/11]] = [1, 1]
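The same two equations can be reproduced with NumPy, letting `np.linalg.inv` compute the inverse instead of writing it by hand:

```python
import numpy as np

# B's basis vectors written in A's coordinates, one per row
B_to_A = np.array([[7.0, 0.0],
                   [0.0, 11.0]])

v_in_B = np.array([1.0, 1.0])
v_in_A = v_in_B @ B_to_A            # [7. 11.]

A_to_B = np.linalg.inv(B_to_A)      # the inverse undoes the change of basis
back = v_in_A @ A_to_B              # [1. 1.]

print(v_in_A)
print(back)
```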

The information in an n-dimensional vector can be compressed into fewer dimensions by applying the right matrix. Therefore a 5-dimensional house vector can be compressed into a 1-dimensional price vector by applying the right matrix. We could find that matrix by trial and error, but that can take a really long time, and it only gets longer the more dimensions there are. Linear regression is all about educated trial and error by computational brute force 😜
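As a sketch, a 5x1 matrix (one weight per variable) compresses a 5-dimensional house vector into a single price. The weights below are invented; finding good ones is exactly what linear regression is for:

```python
import numpy as np

# Hypothetical house: [rooms, bathrooms, zip code, plot m², has garden]
house = np.array([3.0, 2.0, 28014.0, 120.0, 1.0])

# Made-up 5x1 "compression" matrix: one weight per variable
to_price = np.array([[40_000.0],
                     [25_000.0],
                     [0.5],
                     [1_200.0],
                     [15_000.0]])

price = house @ to_price   # 5 dimensions in, 1 dimension out
print(price)               # [343007.]
```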

Regular programming can be applied when the right equation for converting x into y is already known, like converting degrees Celsius into Fahrenheit. Machine learning works the other way around: we get a lot of examples of degrees in Celsius and represent them as a vector, do the same with their Fahrenheit equivalents, and then start from a very general equation:

input vector c = [25 16 -20]

true values f = [77 60.8 -4]

f = transformation matrix * c

We start by filling a matrix, sized according to the number of variables in the input vector, with random values; in this case just one:

f = [0.72] x c

At this point, we know that 25 degrees Celsius are 77 degrees Fahrenheit, yet we don't know the actual function that expresses that relationship. By what we said before, what we're trying to do is find the matrix that transforms a temperature vector from the Celsius basis to the Fahrenheit basis. We test our random values:

[0.72] * [25 16 -20] = [18 11.52 -14.4]

This is definitely not right, but how wrong are we? Enter loss functions. The simplest error, loss or cost function (the terms are used interchangeably) is the sum of squared errors (SSE). That is, for all the samples in the dataset, assuming there are only 3 in this example:

0.72 x 25 = 18 → 77 - 18 = 59 → (59)^2 = 3481

0.72 x 16 = 11.52 → 60.8 - 11.52 = 49.28 → (49.28)^2 = 2428.52

0.72 x -20 = -14.4 → -4 - (-14.4) = 10.4 → (10.4)^2 = 108.16

SSE = 3481 + 2428.52 + 108.16 = 6017.68

We are 6017.68 wrong! This is useful because it is a metric that can be improved to better fit the data overall. The error is just true value - prediction. There are two key reasons to square it: 1) without the squaring, negative errors would partially neutralise positive ones, which would look like a lower total error and mislead the improvements. 2) SSE has a very simple derivative, proportional to error * input vector. What would we need a derivative for?
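The whole SSE calculation above fits in a few lines of NumPy:

```python
import numpy as np

c = np.array([25.0, 16.0, -20.0])   # input vector (Celsius)
f = np.array([77.0, 60.8, -4.0])    # true values (Fahrenheit)

w = 0.72                            # the random starting "matrix"
predictions = w * c                 # [18.0, 11.52, -14.4]
errors = f - predictions            # [59.0, 49.28, 10.4]
sse = np.sum(errors ** 2)

print(sse)                          # ≈ 6017.68
```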

The derivative of the loss function gives us a direction of improvement and tells us how far we are from a maximum or minimum. If we wanted to make the error bigger, we could add the derivative of the loss to our matrix. However, we want to make it lower, so we subtract it from the matrix. This method of minimising the error function by brute force guided by its derivative is called gradient descent, because a gradient is a vector of the partial derivatives of a function.

In real life we also use two more parameters: a bias that moves the vectors around by addition, and a learning rate that scales the gradient down so that we are not subtracting a huge number from the matrix. Every pass in which we improve the error by calculating predictions and subtracting the gradient for all the examples in the dataset is called an epoch. Training on large datasets with dozens of variables can take hundreds, if not thousands, of epochs. Also, some data like videos or images is very highly dimensional, and multiplying such large matrices on a CPU is excruciatingly slow. In those cases, GPUs are used: although they were originally designed for video games and rendering animations, those tasks also involve multiplying huge vectors and matrices, so they're well suited to the job.
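Putting it all together, here is a minimal sketch of the whole process for the Celsius-to-Fahrenheit example: one weight, one bias, and plain gradient descent. The learning rate and number of epochs are arbitrary choices that happen to work for this tiny dataset:

```python
import numpy as np

c = np.array([25.0, 16.0, -20.0])   # inputs (Celsius)
f = np.array([77.0, 60.8, -4.0])    # true values (Fahrenheit)

w, b = 0.72, 0.0                    # starting weight and bias
lr = 1e-3                           # learning rate: scales the gradient down

for epoch in range(50_000):         # one epoch = one pass over all examples
    predictions = w * c + b
    errors = f - predictions
    # Gradient of the squared error (in mean form, so dataset size
    # doesn't change the scale) with respect to w and b
    grad_w = -2 * np.mean(errors * c)
    grad_b = -2 * np.mean(errors)
    # Subtract the scaled gradient to make the error smaller
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)  # approaches 1.8 and 32, i.e. F = 1.8C + 32
```

Note how the learned weight and bias recover the textbook conversion formula without it ever being written into the program.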

Got any suggestions?

Let us know!