12Answers
  • 6
name

A PHP Error was encountered

Severity: Notice

Message: Undefined index: userid

Filename: views/question.php

Line Number: 191

Backtrace:

File: /home/prodcxja/public_html/questions/application/views/question.php
Line: 191
Function: _error_handler

File: /home/prodcxja/public_html/questions/application/controllers/Questions.php
Line: 433
Function: view

File: /home/prodcxja/public_html/questions/index.php
Line: 315
Function: require_once

name Punditsdkoslkdosdkoskdo

Multiple linear regression in Python

I can't seem to find any python libraries that do multiple regression. The only things I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.).

For example, with this data:

print 'y        x1      x2       x3       x4      x5     x6       x7'
for t in texts:
    print "{:>7.1f}{:>10.2f}{:>9.2f}{:>9.2f}{:>10.2f}{:>7.2f}{:>7.2f}{:>9.2f}" /
   .format(t.y,t.x1,t.x2,t.x3,t.x4,t.x5,t.x6,t.x7)

(output for above:)

      y        x1       x2       x3        x4     x5     x6       x7
   -6.0     -4.95    -5.87    -0.76     14.73   4.02   0.20     0.45
   -5.0     -4.55    -4.52    -0.71     13.74   4.47   0.16     0.50
  -10.0    -10.96   -11.64    -0.98     15.49   4.18   0.19     0.53
   -5.0     -1.08    -3.36     0.75     24.72   4.96   0.16     0.60
   -8.0     -6.52    -7.45    -0.86     16.59   4.29   0.10     0.48
   -3.0     -0.81    -2.36    -0.50     22.44   4.81   0.15     0.53
   -6.0     -7.01    -7.33    -0.33     13.93   4.32   0.21     0.50
   -8.0     -4.46    -7.65    -0.94     11.40   4.43   0.16     0.49
   -8.0    -11.54   -10.03    -1.03     18.18   4.28   0.21     0.55

How would I regress these in python, to get the linear regression formula:

Y = a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6 + +a7x7 + c

    • not an expert, but if the variables are independent, can't you just run simple regression against each and sum the result?
      • 1
    • @HughBothwell You can't assume that the variables are independent though. In fact, if you're assuming that the variables are independent, you may potentially be modeling your data incorrectly. In other words, the responses Y may be correlated with each other, but assuming independence does not accurately model the dataset.
      • 1
    • @HughBothwell sorry if this a dum question, but why does it matter if the raw feature variables x_i are independent or not? How does that affect the predictor (=model)?

sklearn.linear_model.LinearRegression will do it:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit([[getattr(t, 'x%d' % i) for i in range(1, 8)] for t in texts],
        [t.y for t in texts])

Then clf.coef_ will have the regression coefficients.

sklearn.linear_model also has similar interfaces to do various kinds of regularizations on the regression.

  • 100
Reply Report
      • 1
    • @HuanianZhang what do you mean by confidence level? If you want the coefficient of determination, the score method will do it; sklearn.metrics has some other model evaluation criteria. If you want the stuff like in Akavall's answer, statsmodels has some more R-like diagnostics.

Here is a little work around that I created. I checked it with R and it works correct.

import numpy as np
import statsmodels.api as sm

y = [1,2,3,4,3,4,5,4,5,5,4,5,4,5,4,5,6,5,4,5,4,3,4]

x = [
     [4,2,3,4,5,4,5,6,7,4,8,9,8,8,6,6,5,5,5,5,5,5,5],
     [4,1,2,3,4,5,6,7,5,8,7,8,7,8,7,8,7,7,7,7,7,6,5],
     [4,1,2,5,6,7,8,9,7,8,7,8,7,7,7,7,7,7,6,6,4,4,4]
     ]

def reg_m(y, x):
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results

Result:

print reg_m(y, x).summary()

Output:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.535
Model:                            OLS   Adj. R-squared:                  0.461
Method:                 Least Squares   F-statistic:                     7.281
Date:                Tue, 19 Feb 2013   Prob (F-statistic):            0.00191
Time:                        21:51:28   Log-Likelihood:                -26.025
No. Observations:                  23   AIC:                             60.05
Df Residuals:                      19   BIC:                             64.59
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.2424      0.139      1.739      0.098        -0.049     0.534
x2             0.2360      0.149      1.587      0.129        -0.075     0.547
x3            -0.0618      0.145     -0.427      0.674        -0.365     0.241
const          1.5704      0.633      2.481      0.023         0.245     2.895

==============================================================================
Omnibus:                        6.904   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.032   Jarque-Bera (JB):                4.708
Skew:                          -0.849   Prob(JB):                       0.0950
Kurtosis:                       4.426   Cond. No.                         38.6

pandas provides a convenient way to run OLS as given in this answer:

Run an OLS regression with Pandas Data Frame

  • 60
Reply Report
      • 1
    • The reg_m function is unnecessarily complicated. x = np.array(x).T, x = sm.add_constant(x) and results = sm.OLS(endog=y, exog=x).fit() is enough.
    • This is a nice tool. Just ask one question: in this case, the t value is outside the 95.5% confidence interval, so it means this fitting is not accurate at all, or how do you explain this?
      • 2
    • Just noticed that your x1, x2, x3 are in reverse order in your original predictor list, i.e., x = [x3, x2, x1]?
      • 2
    • @HuanianZhang "t value" is just how many standard deviations the coefficient is away from zero, while 95%CI is approximately coef +- 2 * std err (actually the Student-t distribution parameterized by degrees of freedom in the residuals). i.e. larger absolute t values imply CIs further from zero, but they shouldn't be directly compared. clarification is a bit late, but hope it's useful to somebody

Just to clarify, the example you gave is multiple linear regression, not multivariate linear regression refer. Difference:

The very simplest case of a single scalar predictor variable x and a single scalar response variable y is known as simple linear regression. The extension to multiple and/or vector-valued predictor variables (denoted with a capital X) is known as multiple linear regression, also known as multivariable linear regression. Nearly all real-world regression models involve multiple predictors, and basic descriptions of linear regression are often phrased in terms of the multiple regression model. Note, however, that in these cases the response variable y is still a scalar. Another term multivariate linear regression refers to cases where y is a vector, i.e., the same as general linear regression. The difference between multivariate linear regression and multivariable linear regression should be emphasized as it causes much confusion and misunderstanding in the literature.

In short:

  • multiple linear regression: the response y is a scalar.
  • multivariate linear regression: the response y is a vector.

(Another source.)

  • 47
Reply Report

You can use numpy.linalg.lstsq:

import numpy as np
y = np.array([-6,-5,-10,-5,-8,-3,-6,-8,-8])
X = np.array([[-4.95,-4.55,-10.96,-1.08,-6.52,-0.81,-7.01,-4.46,-11.54],[-5.87,-4.52,-11.64,-3.36,-7.45,-2.36,-7.33,-7.65,-10.03],[-0.76,-0.71,-0.98,0.75,-0.86,-0.50,-0.33,-0.94,-1.03],[14.73,13.74,15.49,24.72,16.59,22.44,13.93,11.40,18.18],[4.02,4.47,4.18,4.96,4.29,4.81,4.32,4.43,4.28],[0.20,0.16,0.19,0.16,0.10,0.15,0.21,0.16,0.21],[0.45,0.50,0.53,0.60,0.48,0.53,0.50,0.49,0.55]])
X = X.T # transpose so input vectors are along the rows
X = np.c_[X, np.ones(X.shape[0])] # add bias term
beta_hat = np.linalg.lstsq(X,y)[0]
print beta_hat

Result:

[ -0.49104607   0.83271938   0.0860167    0.1326091    6.85681762  22.98163883 -41.08437805 -19.08085066]

You can see the estimated output with:

print np.dot(X,beta_hat)

Result:

[ -5.97751163,  -5.06465759, -10.16873217,  -4.96959788,  -7.96356915,  -3.06176313,  -6.01818435,  -7.90878145,  -7.86720264]
  • 28
Reply Report
      • 1
    • may i know what is difference between print np.dot(X,beta_hat)... and mod_wls = sm.WLS(y, X, weights=weights) res = mod_wls.fit() predsY=res.predict() they all return the Y result

Use scipy.optimize.curve_fit. And not only for linear fit.

from scipy.optimize import curve_fit
import scipy

def fn(x, a, b, c):
    return a + b*x[0] + c*x[1]

# y(x0,x1) data:
#    x0=0 1 2
# ___________
# x1=0 |0 1 2
# x1=1 |1 2 3
# x1=2 |2 3 4

x = scipy.array([[0,1,2,0,1,2,0,1,2,],[0,0,0,1,1,1,2,2,2]])
y = scipy.array([0,1,2,1,2,3,2,3,4])
popt, pcov = curve_fit(fn, x, y)
print popt
  • 13
Reply Report

Once you convert your data to a pandas dataframe (df),

import statsmodels.formula.api as smf
lm = smf.ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df).fit()
print(lm.params)

The intercept term is included by default.

See this notebook for more examples.

  • 8
Reply Report
    • This notebook is awesome. it shows how to regress multiple independent variables (x1,x2,x3...) on Y with just 3 lines of code and using scikit learn.
      • 1
    • @canary_in_the_data_mine thanks for the notebook. how can i plot linear regression which has multiple features? I couldn't find in the notebook. any pointers will be greatly appreciated. -- Thanks

I think this may the most easy way to finish this work:

from random import random
from pandas import DataFrame
from statsmodels.api import OLS
lr = lambda : [random() for i in range(100)]
x = DataFrame({'x1': lr(), 'x2':lr(), 'x3':lr()})
x['b'] = 1
y = x.x1 + x.x2 * 2 + x.x3 * 3 + 4

print x.head()

         x1        x2        x3  b
0  0.433681  0.946723  0.103422  1
1  0.400423  0.527179  0.131674  1
2  0.992441  0.900678  0.360140  1
3  0.413757  0.099319  0.825181  1
4  0.796491  0.862593  0.193554  1

print y.head()

0    6.637392
1    5.849802
2    7.874218
3    7.087938
4    7.102337
dtype: float64

model = OLS(y, x)
result = model.fit()
print result.summary()

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.859e+30
Date:                Wed, 09 Dec 2015   Prob (F-statistic):               0.00
Time:                        15:17:32   Log-Likelihood:                 3224.9
No. Observations:                 100   AIC:                            -6442.
Df Residuals:                      96   BIC:                            -6431.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             1.0000   8.98e-16   1.11e+15      0.000         1.000     1.000
x2             2.0000   8.28e-16   2.41e+15      0.000         2.000     2.000
x3             3.0000   8.34e-16    3.6e+15      0.000         3.000     3.000
b              4.0000   8.51e-16    4.7e+15      0.000         4.000     4.000
==============================================================================
Omnibus:                        7.675   Durbin-Watson:                   1.614
Prob(Omnibus):                  0.022   Jarque-Bera (JB):                3.118
Skew:                           0.045   Prob(JB):                        0.210
Kurtosis:                       2.140   Cond. No.                         6.89
==============================================================================
  • 4
Reply Report

Multiple Linear Regression can be handled using the sklearn library as referenced above. I'm using the Anaconda install of Python 3.6.

Create your model as follows:

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X, y)

# display coefficients
print(regressor.coef_)
  • 4
Reply Report

You can use the function below and pass it a DataFrame:

def linear(x, y=None, show=True):
    """
    @param x: pd.DataFrame
    @param y: pd.DataFrame or pd.Series or None
              if None, then use last column of x as y
    @param show: if show regression summary
    """
    import statsmodels.api as sm

    xy = sm.add_constant(x if y is None else pd.concat([x, y], axis=1))
    res = sm.OLS(xy.ix[:, -1], xy.ix[:, :-1], missing='drop').fit()

    if show: print res.summary()
    return res
  • 1
Reply Report

Scikit-learn is a machine learning library for Python which can do this job for you. Just import sklearn.linear_model module into your script.

Find the code template for Multiple Linear Regression using sklearn in Python:

import numpy as np
import matplotlib.pyplot as plt #to plot visualizations
import pandas as pd

# Importing the dataset
df = pd.read_csv(<Your-dataset-path>)
# Assigning feature and target variables
X = df.iloc[:,:-1]
y = df.iloc[:,-1]

# Use label encoders, if you have any categorical variable
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X['<column-name>'] = labelencoder.fit_transform(X['<column-name>'])

from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = ['<index-value>'])
X = onehotencoder.fit_transform(X).toarray()

# Avoiding the dummy variable trap
X = X[:,1:] # Usually done by the algorithm itself

#Spliting the data into test and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 0, test_size = 0.2)

# Fitting the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the test set results
y_pred = regressor.predict(X_test)

That's it. You can use this code as a template for implementing Multiple Linear Regression in any dataset. For a better understanding with an example, Visit: Linear Regression with an example

  • 1
Reply Report

Here is an alternative and basic method:

from patsy import dmatrices
import statsmodels.api as sm

y,x = dmatrices("y_data ~ x_1 + x_2 ", data = my_data)
### y_data is the name of the dependent variable in your data ### 
model_fit = sm.OLS(y,x)
results = model_fit.fit()
print(results.summary())

Instead of sm.OLS you can also use sm.Logit or sm.Probit and etc.

  • 0
Reply Report

Trending Tags