
Python mathematical modeling notes - sklearn (4) linear regression

youcans 2021-05-13 14:58:34

1、What is linear regression?

Regression analysis is a statistical method for studying the quantitative relationship between independent and dependent variables. It involves not only establishing a mathematical model and estimating its parameters, but also testing the credibility of the model, and using the fitted model and estimated parameters for prediction or control. According to the type of relationship between the input and output variables, regression analysis is divided into linear regression and nonlinear regression.

Linear regression assumes that the output variable (y) in the sample data set has a linear relationship with the input variables (X), that is, the output variable is a linear combination of the input variables. The linear model is the simplest model, and it is also a very important and widely used one.

If the model has only one input variable and one output variable, it is called a univariate linear model. A straight line describes the relationship between output and input, and its expression is a linear equation in one variable:

y = w0 + w1*x1 + e

If the model includes two or more input variables, it is called a multivariate linear model. A plane or hyperplane describes the relationship between output and inputs, and its expression is a multivariate linear equation:

y = w0 + w1*x1 + w2*x2 + ... + wm*xm + e

The least squares method estimates the parameters of the regression model from sample data by minimizing the sum of squared errors between the model output and the sample observations.
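
As a minimal sketch of this idea (using NumPy only; the data and variable names below are illustrative assumptions, not from the original article), the OLS estimate can be obtained by solving the least squares problem directly:

# Minimal OLS sketch with NumPy (illustrative data, not from the article)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.36 + 1.58 * x + rng.normal(size=50)   # y = w0 + w1*x + e

X = np.column_stack((np.ones_like(x), x))   # design matrix [1, x]
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares solution
print(w)                                    # [w0, w1], close to [2.36, 1.58]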

Regression analysis then needs to examine whether a linear regression model is applicable: is the assumption of a linear relationship reasonable, and is the linear model stable? This requires statistical significance testing, to check whether the linear relationship between the input and output variables is significant and whether a linear model is appropriate for describing their relationship.


2、Linear regression methods in SKlearn (sklearn.linear_model)

From the perspective of machine learning, regression is a widely used predictive modeling method, and linear regression is an important basic algorithm in machine learning. The SKlearn machine learning toolkit provides a rich set of linear model learning methods. The most important and most widely used is undoubtedly ordinary least squares (OLS); polynomial regression, logistic regression and ridge regression are also commonly used, and will be introduced in this and the following articles. The other methods are relatively specialized; they are briefly introduced below following the official documentation, and the casual reader can skip them.

  1. Ordinary least squares (Ordinary least squares):
    The optimization objective is to minimize the sum of squared residuals between the model's predicted values and the observed sample values.
  2. Ridge regression (Ridge regression)
    Adds a penalty term (L2 regularization) to ordinary least squares to reduce the influence of collinearity; the optimization objective is to minimize the penalized sum of squared residuals. The penalty trades goodness of fit against coefficient size, to avoid overfitting and poor generalization (a usage sketch of these penalized methods follows this list).
  3. Lasso regression (Least absolute shrinkage and selection operator)
    Adds an absolute-deviation penalty term (L1 regularization) to ordinary least squares to reduce the effect of collinearity. It performs variable selection and complexity control while fitting the generalized linear model, and yields sparse coefficient models.
  4. Multi-task Lasso (Multi-task Lasso)
    A linear model for jointly estimating the sparse coefficients of multiple regressions. Note that this does not refer to multithreading or multitasking: it means that the same feature variables are filtered out for all output variables (that is, a whole column of regression coefficients is set to 0, so the corresponding input variable can be deleted).
  5. Elastic-Net regression (Elastic-Net)
    Introduces both L1 and L2 norm regularization, giving a model with two penalty terms; it is effectively a combination of ridge regression and Lasso regression.
  6. Multi-task Elastic-Net
    The Elastic-Net regression method for estimating linear models with sparse coefficients over multiple regressions.
  7. Least angle regression (Least Angle Regression)
    Combines the forward stagewise and forward selection algorithms, simplifying the iterative process while preserving the accuracy of the forward stagewise algorithm. Each step adds the independent variable with the highest correlation, so the solution is reached in at most m steps. It is especially suitable when the feature dimension is much higher than the number of samples.
  8. LARS Lasso
    Solves the Lasso model with the least angle regression algorithm.
  9. Orthogonal matching pursuit (Orthogonal Matching Pursuit)
    Approximates a linear model under a constraint on the number of non-zero coefficients. Each decomposition step performs orthogonalization and selects the column most correlated with the current residual, iterating until the required sparsity is reached.
  10. Bayesian regression (Bayesian Regression)
    A linear regression model solved by Bayesian inference. It has the basic properties of a Bayesian statistical model, and the probability density function of the weight coefficients can be obtained. It suits problems with few observations where a posterior distribution is required, such as accurate estimation of physical constants; it can also be used for variable filtering and dimension reduction.
  11. Logistic regression (Logistic Regression)
    Logistic regression is a generalized linear model for ordinal or categorical outputs; it is actually a classification method. A sigmoid mapping function is applied on top of a linear model, transforming the linear model's continuous output into discrete values. It is often used to estimate the likelihood of an event, such as finding risk factors, predicting the probability of a disease, or judging the probability of illness; it is the most common analysis method in epidemiology and medicine.
  12. Generalized linear regression (Generalized Linear Regression)
    Generalized linear regression is a generalization of the linear regression model; it is actually a nonlinear model. A monotone differentiable link function establishes a linear relationship between the output and input variables, simply and directly transforming the problem into a linear model.
  13. Stochastic gradient descent (Stochastic Gradient Descent)
    Gradient descent is a search-based optimization method that finds the parameter estimates minimizing the loss function; it suits cases where the number of samples (and features) is very large. Stochastic gradient descent computes the descent direction from one randomly chosen data point instead of scanning the entire training data set, so iterations are faster.
  14. Perceptron (Perceptron)
    The perceptron is a simple classification algorithm suitable for large-scale learning. Training is faster than SGD, and the resulting model is sparser.
  15. Passive-aggressive algorithms (Passive Aggressive Algorithms)
    Passive-aggressive algorithms are a family of algorithms for large-scale learning.
  16. Robust regression (Robustness regression)
    Robust regression aims to fit a regression model in the presence of corrupted data, such as outliers or errors.
  17. Polynomial regression (Polynomial regression)
    Polynomial regression extends the simple linear regression model by constructing polynomials of the feature variables. For example, combining feature variables into second-order polynomials allows a paraboloid to be fitted to the data, giving more flexibility and adaptability.
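
As a brief usage sketch of the penalized variants in items 2, 3 and 5 above (the synthetic data and the alpha values here are illustrative assumptions, not from the original article):

# Hedged sketch comparing Ridge, Lasso and Elastic-Net on synthetic data;
# the data and regularization strengths are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

for model in (Ridge(alpha=1.0),                      # L2 penalty
              Lasso(alpha=0.1),                      # L1 penalty, sparse coefficients
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # mixed L1/L2 penalty
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))

With the L1 penalty, Lasso and Elastic-Net drive the coefficients of the irrelevant features toward exactly 0, while Ridge only shrinks them.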

3、Least squares linear regression in SKlearn

3.1 The least squares linear regression class (LinearRegression)

The LinearRegression() method in the SKlearn package should not be understood literally as covering all linear regression methods: LinearRegression() is only the linear regression method based on ordinary least squares (OLS).

The sklearn.linear_model.LinearRegression class is the concrete implementation of the OLS linear regression algorithm; see: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression

sklearn.linear_model.LinearRegression()

class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False)

The LinearRegression() class does not have many parameters, and they usually need almost no setting.

  • fit_intercept: bool, default=True. Whether to calculate the intercept. The default True calculates the intercept.
  • normalize: bool, default=False. Whether to normalize the data; this parameter only takes effect when fit_intercept=True.
  • n_jobs: int, default=None. The number of jobs used in the computation; it provides speed-up for multiple targets (n_targets > 1) and large-scale problems. The default None means 1 job.

Main attributes of the LinearRegression() class:

  • coef_: the linear coefficients, i.e., the estimates of the model parameters w1...wm
  • intercept_: the intercept, i.e., the estimate of the model parameter w0

Main methods of the LinearRegression() class:

  • fit(X, y[, sample_weight]): trains the model on the sample set (X, y). sample_weight weights each sample; default None.
  • get_params([deep]): gets the model parameters. Note these are not the regression coefficients, but the constructor parameters such as fit_intercept and normalize.
  • predict(X): uses the trained model to predict the output for data set X. That is, the model can return its output for the training samples, or give predictions for test samples.
  • score(X, y[, sample_weight]): returns the coefficient of determination R2, a commonly used model evaluation metric.
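
A minimal illustration of these attributes and methods (the toy data below is an assumption for demonstration only):

# Tiny demonstration of LinearRegression attributes and methods (toy data)
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # n*m input (m=1 here)
y = np.array([3.1, 4.9, 7.2, 8.8])          # n*1 output

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)   # w0 and [w1]
print(model.get_params())              # constructor parameters, not coefficients
print(model.predict([[5.0]]))          # prediction for a new sample
print(model.score(X, y))               # R2 on the training set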

3.2 Univariate linear regression

LinearRegression usage example:

# -*- coding: utf-8 -*-
# skl_LinearR_v1a.py
# Demo of linear regression by scikit-learn
# Copyright 2021 YouCans, XUPT
# Created: 2021-05-12
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, median_absolute_error
# Generate test data:
nSample = 100
x = np.linspace(0, 10, nSample)    # nSample points evenly spaced over [0, 10]
e = np.random.normal(size=len(x))  # normally distributed random noise
y = 2.36 + 1.58 * x + e            # y = b0 + b1*x1 + e
# Convert the data to the format required by the model: input is an array-type n*m matrix, output an array-type n*1 array
x = x.reshape(-1, 1)  # reshape the input into a 2-D array of n rows and 1 column (multiple columns for multiple regression)
y = y.reshape(-1, 1)  # reshape the output into a 2-D array of n rows and 1 column
# print(x.shape,y.shape)
# Univariate linear regression: ordinary least squares (OLS)
modelRegL = LinearRegression()  # create a linear regression model
modelRegL.fit(x, y)             # model training: fit the data
yFit = modelRegL.predict(x)     # predict the output with the regression model
# Output the regression results  # XUPT
print('Regression intercept: w0={}'.format(modelRegL.intercept_))  # w0: intercept
print('Regression coefficients: w1={}'.format(modelRegL.coef_))    # w1,..wm: regression coefficients
# Evaluation metrics of the regression model  # YouCans
print('R2 coefficient of determination: {:.4f}'.format(modelRegL.score(x, y)))  # R2
print('Mean squared error: {:.4f}'.format(mean_squared_error(y, yFit)))         # MSE
print('Mean absolute error: {:.4f}'.format(mean_absolute_error(y, yFit)))       # MAE
print('Median absolute error: {:.4f}'.format(median_absolute_error(y, yFit)))   # MedAE
# Plot: raw data points and the fitted line
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y, 'o', label="data")     # raw data
ax.plot(x, yFit, 'r-', label="OLS")  # fitted data
ax.legend(loc='best')                # show the legend
plt.title('Linear regression by SKlearn (Youcans)')
plt.show()  # YouCans, XUPT

Program description:

  1. When the linear regression model class LinearRegression() is trained with modelRegL.fit(x, y), the input x and output y are required to be array-type n*m matrices. For a univariate regression model m=1, so they must be converted into n*1 arrays:
x = x.reshape(-1, 1)  # reshape the input into a 2-D array of n rows and 1 column (multiple columns for multiple regression)
y = y.reshape(-1, 1)  # reshape the output into a 2-D array of n rows and 1 column
  2. The LinearRegression() class only provides the R2 metric, but the sklearn.metrics package provides the mean squared error, mean absolute error and median absolute error; their usage is shown in the routine.

Program run results:

Regression intercept: w0=[2.45152704]
Regression coefficients: w1=[[1.57077698]]
R2 coefficient of determination: 0.9562
Mean squared error: 0.9620
Mean absolute error: 0.7905
Median absolute error: 0.6732

3.3 Multiple linear regression

Solving a multiple linear regression problem with LinearRegression() uses the same steps, parameters and attributes as univariate linear regression; just pay attention to the format requirements of the sample data: the input X is an array-type n*m two-dimensional array, and the output y is an array-type n*1 array (an n*k array can also be used, representing multivariable output, as the sketch below shows).
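
The n*k multi-output case can be sketched as follows (the data here is an illustrative assumption): LinearRegression fits one set of coefficients per output column.

# Sketch of multi-output regression: y has shape n*k (here k=2); illustrative data
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))                  # n*m input
Y = X @ np.array([[1.0, -1.0],
                  [2.0,  0.5],
                  [0.0,  3.0]]) + rng.normal(scale=0.1, size=(50, 2))

model = LinearRegression().fit(X, Y)
print(model.coef_.shape)   # (2, 3): one row of coefficients per output
print(model.intercept_)    # one intercept per output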

Problem description:
The data file toothpaste.csv records, for 30 sales periods, the company's toothpaste sales volume, price, advertising expenditure, and the average market price of comparable products in the same period.
(1) Analyze the relationship between toothpaste sales volume, price and advertising investment, and build a mathematical model;
(2) Estimate the parameters of the established mathematical model and perform statistical analysis;
(3) Use the fitted model to forecast toothpaste sales at different prices and advertising expenditures.
Note that the routine in this article is not the best solution to this problem; the problem and data are only used to demonstrate how to read data files and process data.

LinearRegression usage example:

# -*- coding: utf-8 -*-
# skl_LinearR_v1b.py
# Demo of linear regression by scikit-learn
# v1.0d: linear regression model solved with SKlearn
# Copyright 2021 YouCans, XUPT
# Created: 2021-05-12
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, median_absolute_error

def main():  # main program
    # Read the data file
    readPath = "../data/toothpaste.csv"  # address and file name of the data file
    dfOpenFile = pd.read_csv(readPath, header=0, sep=",")  # comma separator, first row is the header
    dfData = dfOpenFile.dropna()  # drop rows with missing values
    print(dfData.head())

    # Model 1: Y = b0 + b1*X1 + b2*X2 + e
    # Linear regression: analyze the relationship between the dependent variable Y (sales)
    # and the independent variables x1 (difference), x2 (advertise)
    # Convert the data to the format required by the model: input is an array-type n*m matrix, output an array-type n*1 array
    feature_cols = ['difference', 'advertise']  # create the feature list
    X = dfData[feature_cols]  # select the feature subset of the sample data with the list
    y = dfData['sales']       # select the output variable of the sample data
    # print(type(X), type(y))
    # print(X.shape, y.shape)

    # Multiple linear regression: ordinary least squares (OLS)
    modelRegL = LinearRegression()  # create a linear regression model
    modelRegL.fit(X, y)             # model training: fit the data
    yFit = modelRegL.predict(X)     # predict the output with the regression model
    # Output the regression results  # YouCans, XUPT
    print("\nModel1: Y = b0 + b1*x1 + b2*x2")
    print('Regression intercept: w0={}'.format(modelRegL.intercept_))  # w0: intercept
    print('Regression coefficients: w1={}'.format(modelRegL.coef_))    # w1,..wm: regression coefficients
    # Evaluation metrics of the regression model
    print('R2 coefficient of determination: {:.4f}'.format(modelRegL.score(X, y)))  # R2
    print('Mean squared error: {:.4f}'.format(mean_squared_error(y, yFit)))         # MSE
    print('Mean absolute error: {:.4f}'.format(mean_absolute_error(y, yFit)))       # MAE
    print('Median absolute error: {:.4f}'.format(median_absolute_error(y, yFit)))   # MedAE

    # Model 3: Y = b0 + b1*X1 + b2*X2 + b3*X2**2 + e
    # Linear regression: analyze the relationship between the dependent variable Y (sales)
    # and the independent variables x1, x2 and the square of x2
    x1 = dfData['difference']  # price difference (company price minus market average price)
    x2 = dfData['advertise']   # advertising expenditure
    x5 = x2**2                 # square of the advertising expenditure
    X = np.column_stack((x1, x2, x5))  # [x1, x2, x2**2]

    # Multiple linear regression: ordinary least squares (OLS)
    modelRegM = LinearRegression()  # create a linear regression model
    modelRegM.fit(X, y)             # model training: fit the data
    yFit = modelRegM.predict(X)     # predict the output with the regression model
    # Output the regression results  # YouCans, XUPT
    print("\nModel3: Y = b0 + b1*x1 + b2*x2 + b3*x2**2")
    print('Regression intercept: w0={}'.format(modelRegM.intercept_))  # w0: intercept, YouCans
    print('Regression coefficients: w1={}'.format(modelRegM.coef_))    # w1,..wm: regression coefficients, XUPT
    # Evaluation metrics of the regression model
    print('R2 coefficient of determination: {:.4f}'.format(modelRegM.score(X, y)))  # R2
    print('Mean squared error: {:.4f}'.format(mean_squared_error(y, yFit)))         # MSE
    print('Mean absolute error: {:.4f}'.format(mean_absolute_error(y, yFit)))       # MAE
    print('Median absolute error: {:.4f}'.format(median_absolute_error(y, yFit)))   # MedAE

    # Compute the F statistic and the p value of the F-test
    m = X.shape[1]
    n = X.shape[0]
    yMean = np.mean(y)
    SST = sum((y-yMean)**2)   # SST: total sum of squares
    SSR = sum((yFit-yMean)**2)  # SSR: regression sum of squares
    SSE = sum((y-yFit)**2)    # SSE: residual sum of squares
    Fstats = (SSR/m) / (SSE/(n-m-1))           # F statistic
    probFstats = stats.f.sf(Fstats, m, n-m-1)  # p value of the F-test
    print('F statistic: {:.4f}'.format(Fstats))
    print('p value of the F-test: {:.4e}'.format(probFstats))

    # Plot: raw data points and the fitted curve
    fig, ax = plt.subplots(figsize=(8, 6))  # YouCans, XUPT
    ax.plot(range(len(y)), y, 'b-.', label='Sample')     # sample data
    ax.plot(range(len(y)), yFit, 'r-', label='Fitting')  # fitted data
    ax.legend(loc='best')  # show the legend
    plt.title('Regression analysis with sales of toothpaste by SKlearn')
    plt.xlabel('period')
    plt.ylabel('sales')
    plt.show()
    return

if __name__ == '__main__':
    main()

Program run results:

Model1: Y = b0 + b1*x1 + b2*x2
Regression intercept: w0=4.4074933246887875
Regression coefficients: w1=[1.58828573 0.56348229]
R2 coefficient of determination: 0.8860
Mean squared error: 0.0511
Mean absolute error: 0.1676
Median absolute error: 0.1187

Model3: Y = b0 + b1*x1 + b2*x2 + b3*x2**2
Regression intercept: w0=17.324368548878198
Regression coefficients: w1=[ 1.30698873 -3.69558671 0.34861167]
R2 coefficient of determination: 0.9054
Mean squared error: 0.0424
Mean absolute error: 0.1733
Median absolute error: 0.1570
F statistic: 82.9409
p value of the F-test: 1.9438e-13

Program description:

  1. When the LinearRegression() class handles a multiple linear regression problem, the training sample data must be formatted as: input X, an array-type n*m two-dimensional array, and output y, an array-type n*1 array (an n*k array can also be used for multivariable output). The routine shows two ways of converting the data: Model 1 obtains the required array-type two-dimensional array from a Pandas dataframe, which is very convenient when reading data files with Pandas; Model 3 uses Numpy's np.column_stack to stack arrays into the array-type two-dimensional array.
  2. The problem and data of this routine are the same as in "Python learning notes - StatsModels statistical regression (3): preparing model data", and come from: Jiang Qiyuan, Xie Jinxing, Mathematical Models (3rd edition), Higher Education Press.
  3. For easy comparison with the StatsModels statistical regression results, the models used in this routine are also consistent with that article: Model 1 builds a linear regression model with the feature variables 'difference' and 'advertise'; Model 3 builds a linear regression model with the feature variables 'difference', 'advertise' and the second-order term of 'advertise' (x2**2). The parameter estimates, predictions and R2 coefficients of determination obtained by SKlearn and StatsModels for these two models are exactly the same, showing that both toolkits can implement linear regression.
  4. The model-checking metrics provided by the StatsModels toolkit are very comprehensive and detailed, which matters for model checking and statistical analysis. The SKlearn package, by contrast, provides few statistical tests: there are no significance-test metrics such as the F-test, t-test or tests of correlation coefficients. The root cause is that SKlearn is a machine learning library, not a statistics toolbox; its focus is model accuracy and predictive performance, not model significance.
  5. To make up for the lack of significance-test metrics, a passage computing the F statistic and the p value of the F-test has been added to the routine for reference.
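
In the same spirit, t statistics for the individual coefficients can also be computed by hand. The following is a hedged sketch, not part of the original routine; it assumes the variables X, y, yFit and modelRegM from the routine above are in scope:

# Hedged sketch: t statistics for individual coefficients, supplementing the F-test.
# Assumes X, y, yFit and modelRegM exist as in the routine above.
import numpy as np
from scipy import stats

n, m = X.shape
Z = np.column_stack((np.ones(n), X))                  # design matrix with intercept column
sigma2 = np.sum((y - yFit)**2) / (n - m - 1)          # residual variance estimate
covB = sigma2 * np.linalg.inv(Z.T @ Z)                # covariance matrix of [w0, w1..wm]
se = np.sqrt(np.diag(covB))                           # standard errors of the coefficients
b = np.append(modelRegM.intercept_, modelRegM.coef_)  # all estimated coefficients
tStats = b / se
pVals = 2 * stats.t.sf(np.abs(tStats), n - m - 1)     # two-sided p values
print('t statistics:', np.round(tStats, 4))
print('p values:', np.round(pVals, 4))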

Copyright:
The content and routines of this article are original by the author, not reprinted from books or Internet content.
YouCans original work
Copyright 2021 YouCans, XUPT
Created: 2021-05-12


The journey of programming, like the road of life, is about more than programming; there are also poetry and distant places.
Read code for its principles, study frameworks for knowledge, learn from enterprise practice;
Enjoy poetry, read diaries, walk the road of life, and watch the world go by.