How to implement a linear regression algorithm?

12

I need to implement a linear regression algorithm. Preferably, it gives results that are the same as or close to the TREND (or TREND ) function of Excel.

I have already found a lot of material that attempts to explain the whole concept of linear regression, but that is not what I want. I would like a pseudo code that is directly translatable for an imperative programming language.

For those who did not use the TENDENCY function, it works like this:

Given two vectors representing x and y, in an ordered pair (x, y), for known values, for example, [1,10], (2,20), (3,30), (4.40)] , and an x value for which you want to find the corresponding y value, the TREND function fits the known values in a function of the form y = mx + b gives you the value of y. In this case if I pass the value 5 as the last argument, the function returns me 50.

Is anyone able to give a pseudo code an algorithm that does this?

    
asked by anonymous 05.02.2014 / 17:26

7 answers

8

I made a small example in Java based on some examples that I found on the Web.

I used the data you provided ...

public class Main {

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {10, 20, 30, 40};
        System.out.println(trend(y, x, 5));
    }

    public static double trend(double[] known_y, double[] known_x, double new_x)
    {
        double[] values = LeastSquaresFitLinear(known_y, known_x);
        return (values[0] * new_x) + values[1];
    }

    public static double[] LeastSquaresFitLinear(double[] known_y, double[] known_x)
    {
        double M, B;
        if (known_y.length != known_x.length)
        {
            return new double[]{0,0};
        }

        int numPoints = known_y.length;

        double x1, y1, xy, x2, J;

        x1 = y1 = xy = x2 = 0.0;
        for (int i = 0; i < numPoints; i++)
        {
            x1 = x1 + known_x[i];
            y1 = y1 + known_y[i];
            xy = xy + known_x[i] * known_y[i];
            x2 = x2 + known_x[i] * known_x[i];
        }

        M = B = 0;
        J = ((double)numPoints * x2) - (x1 * x1);

        if (J != 0.0)
        {
            M = (((double)numPoints * xy) - (x1 * y1)) / J;
            B = ((y1 * x2) - (x1 * xy)) / J;
        }
        return new double[]{M,B};
    }

}
    
05.02.2014 / 18:35
9

The @Marcos Banik answer can be translated for example to Python with minimal effort:

>>> x = [1,2,3,4]
>>> y = [10,20,30,40]
>>> m = sum(a*b for (a,b) in zip(x,y)) - sum(x) * sum(y) / len(x)
>>> m /= sum(a**2 for a in x) - (sum(x)**2)/len(x)
>>> b = sum(y)/len(y) - m * sum(x)/len(x)
>>> m*5 + b
50

However, not every imperative language has the same level of expressiveness, so I'm going to present a more elaborate version (in JavaScript):

var x = [1,2,3,4];
var y = [10,20,30,40];

function produto(x, y) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * y[i]);
    return ret;
}

function quadrados(x) {
    var ret = [];
    for ( var i = 0 ; i < x.length ; i++ )
        ret.push(x[i] * x[i]);
    return ret;
}

function somatorio(x) {
    var ret = 0;
    for ( var i = 0 ; i < x.length ; i++ )
        ret += x[i];
    return ret;
}

function media(x) { return somatorio(x) / x.length; }

var m = somatorio(produto(x,y)) - somatorio(x) * somatorio(y) / x.length;
m /= somatorio(quadrados(x)) - somatorio(x)*somatorio(x) / x.length;

var b = media(y) - m * media(x);

console.log(m*5 + b);

Example in jsFiddle .

    
05.02.2014 / 18:50
5

I do not know what algorithm the TENDENCIA function uses, but if it is just an input variable x and an output y , the response to m and b by methods of least squares is: (code in R)

m <- (sum(x*y) - sum(x) * sum(y) / n) / ( sum(x^2) - ( sum(x) )^2 ) / n )
b <- mean(y) - m * mean(x)

n is the number of elements of the x and y vectors, sum is the function that returns the sum of the vectors, x * y is the vector with the product of elements of the same index. x^2 is the vector where all its elements are squared

You can find this solution at wikipedia

    
05.02.2014 / 17:55
4

Using the same algorithm for the @Marcos Banik answer , I wrote a small C algorithm based on the method of least squares:

void lms(double *x, double *y, int n, double *m, double *b)
{
    int i;
    double sumYX = 0.;
    double sumX = 0.;
    double sumY = 0.;
    double sumX2 = 0.;
    double sum2X = 0.;
    for(i = 0; i < n; i++) {
        sumYX += x[i] * y[i];
        sumX += x[i];
        sumY += y[i];
        sumX2 += x[i] * x[i];
    }
    sum2X = sumX * sumX;

    *m = (sumYX - (sumX * sumY) / (double)n) / (sumX2 - sum2X / (double)n);
    *b = sumY / (double)n - *m * sumX / (double)n;
}

The algorithm is exactly the same as that used in its response. The for initial loop is made to calculate all sums and then its results are used to find m and b . I believe this is easily translated into your language of preference. :)

With m and b you have the equation of the line. To calculate the value of y, simply do a function:

double trend(double m, double b, double x)
{
    return m*x + b;
}

To call functions:

int main(void) 
{
    double x[] = {1., 2., 3., 4.};
    double y[] = {10., 20., 30., 40.};
    double m, b;
    double nx = 5.;
    double ny;

    lms(x, y, 4, &m, &b);
    ny = trend(m, b, nx);

    printf("m: %lf \nb: %lf \nnx: %lf \nny: %lf\n", m, b, nx, ny);

    return 0;
}
    
05.02.2014 / 18:58
2

For those of you interested here is the implementation I was looking for.

It is written in the Clipper (compiler x / Harbor) language and works the same as TREND for several values I tested.

Function Main()

   LOCAL x := { 1,2,3,4 }
   LOCAL y := { 10,20,30,40 }

   ? Trend(y, x, 5)

   Return NIL

Function Trend( known_y, known_x, new_x )

   LOCAL v := LSFL(known_y, known_x)

   Return (v[1] * new_x) + v[2]

Function LSFL(known_y, known_x)

   LOCAL M,B
   LOCAL n
   LOCAL x1, y1, xy, x2, J
   LOCAL i

   IF Len( known_y ) != Len( known_x )
      Return { 0, 0 }
   ENDIF

   n := Len( known_y )

   x1 := 0; y1 := 0; xy := 0; x2 := 0

   For i := 1 To n
      x1 := x1 + known_x[i]
      y1 := y1 + known_y[i]
      xy := xy + known_x[i] * known_y[i]
      x2 := x2 + known_x[i] * known_x[i]
   Next

   M := 0
   B := 0

   J := ( n * x2 ) - ( x1 * x1 )

   IF !Empty( J )
      M := (( n * xy ) - ( x1 * y1 )) / J
      B := ( (y1 * x2) - ( x1 * xy )) / J
   ENDIF

   Return { M, B }
    
05.02.2014 / 19:17
2

You need y=mx+b and for this you only need x data. y would only serve to discover other variables, but not to define the regression equation, correlation coefficient, mean, median and fashion of the data.

Generic scope: "what is the year (x) in which absolute population growth will be zero?"

First we need to find the growth rate (%) and whether it is positive or negative, which defines TREND or the slope of the graph generated by ER.

Based on the existing data ( anos/x ) we define the correlation coefficient or precision of the answers for the new y .

In the middle of the path we define fashion, median, and so on to support the solution of the problem that allows to work any array of x:y depending on TREND, the sign of bx or mx + or - .

y = a + bx + c ou y=mx+b

The complete code that generates the regression equation and correlation coefficient is in The State of Art by James Holmes in the Chapter 8 sources .

    
11.07.2014 / 06:59
1

Here is an algorithm that I did to calculate the MMQ based on a video lesson I watched on Youtube:

#include<stdio.h>
#include<stdlib.h>
#include<math.h>


void main(){

    float x[] = {-1,0,1,2};
    float y[] = {-1,1,3,5};
    float sx;
    float xy;
    float totalx = 0;
    float totaly = 0;
    float totalsx = 0;
    float totalxy = 0;
    int n = sizeof(x)/sizeof(float);

    int cont;

    for(cont = 0;cont<n;cont++){
        totalx = totalx + x[cont];
        totaly = totaly + y[cont];
        totalsx = totalsx + pow(x[cont],2);
        totalxy = totalxy + (x[cont]*y[cont]);
    }

    // Passo1
    float vbb = n;
    float vba = totalx;
    float somvba = totaly;
    float b2p1 = totalx;
    float a2p1 = totalsx;
    float soma2p1 = totalxy;

    //Passo 2
    float b1p2 = vbb / vbb;
    float a1p2 = vba / vbb;
    float soma1p2 = somvba / vbb;
    float b2p2 = b2p1-(b1p2*totalx);
    float a2p2 = a2p1-(a1p2*totalx);
    float soma2p2 = soma2p1-(soma1p2*totalx);

    // Passo 3
    float b1p3 = b1p2;
    float a1p3 = a1p2;
    float soma1p3 = soma1p2;
    float a2p3 = a2p2 / a2p2;
    float soma2p3 = soma2p2 / a2p2;

    float afinal = soma2p3 / a2p3;
    float bfinal = soma1p3 - a1p3 * afinal;

    printf("\nValor final de A= %.5f e valor de B= %.5f\n\n", afinal, bfinal);
    system("pause >> log");
}
    
13.10.2015 / 01:36