Getting started with Regression

Savan Nahar
9 min read · Feb 24, 2019


Linear regression is the first algorithm that people learn when they start with machine learning. Due to its widespread use, many people end up thinking it is the only form of regression.

The truth is that there are multiple types of regression, each with its own importance and the specific conditions where it is best suited. In this article, I’ll walk through 8 types of regression, each with code written using the sklearn library.

What is Regression?

Regression is a statistical technique used to predict the value of a target quantity when that quantity is continuous. Let’s say you have price vs. area data for the town of Branalle, as depicted in the figure below.

The goal of regression is to predict the price of a house given its area. You start by assuming that a polynomial of degree two captures the true relationship between area and price well enough. That is,

price = w0 + w1*area + w2*area²

where w0, w1 and w2 are constants.

By trying multiple values of w0, w1 and w2, you find the constants that best fit the data points, as in the sketch below.
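
Here is a minimal sketch of what that fitting could look like, using made-up area/price numbers (the Branalle data is not reproduced here) and NumPy’s least-squares polynomial fit:

>>> import numpy as np
>>> area = np.array([500, 750, 1000, 1250, 1500])    # hypothetical areas
>>> price = np.array([60, 95, 140, 195, 260])        # hypothetical prices (in thousands)
>>> w2, w1, w0 = np.polyfit(area, price, deg=2)      # least-squares fit of a degree-2 polynomial
>>> predicted = w0 + w1 * 1100 + w2 * 1100 ** 2      # predicted price for a house of area 1100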

Regression analysis estimates the relationship between two or more variables. There are multiple benefits of using regression analysis. They are as follows:

1. It indicates the significant relationships between the dependent variable and the independent variables.

2. It indicates the strength of the impact of multiple independent variables on the dependent variable.

Types of Regression

There are many regression techniques, and the choice among them is mainly driven by three factors: the number of independent variables, the type of dependent variable, and the shape of the regression line.

1. Linear Regression

In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the regression line is linear. It establishes the relationship between these variables using a best-fit straight line.

It is represented by the equation Y = a + b*X + c, where a is the intercept, b is the slope of the line and c is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s).

How do we obtain the best-fit line (the values of a and b)?

This task can be accomplished with the least squares method, the most common method for fitting a regression line. We can evaluate model performance using the R-square metric.

The Crux of Linear Regression

  • For a good fit, there must be a relationship between the independent and dependent variables
  • Linear regression is very sensitive to outliers, which can severely affect the regression line and, in turn, the forecasted values.
  • Multiple regression can suffer from multicollinearity, autocorrelation and heteroskedasticity.
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> reg = LinearRegression().fit(X, y)
>>> reg.predict(np.array([[3, 5]]))
array([16.])
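
Continuing the example above, the R-square mentioned earlier can be read off with the score method; since y was generated exactly from a linear function of X, the fit here is perfect:

>>> reg.score(X, y)    # R-square of the fitted model
1.0
>>> reg.coef_
array([1., 2.])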

2. Logistic Regression

Logistic regression is used to estimate the probability of an event being a success versus a failure. We should use logistic regression when the dependent variable is binary (0/1, True/False, Yes/No) in nature. Here, the predicted value of Y ranges from 0 to 1.

Since the dependent variable follows a binomial distribution, we need to choose a link function that is best suited to this distribution, and that is the logit function. In the equation below, the parameters are chosen to maximize the likelihood of observing the sample values rather than to minimize the sum of squared errors (as in ordinary regression).
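
For reference, the logit link expresses the log-odds of the event as a linear combination of the predictors, where p is the probability of success:

ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + … + bk*Xk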

The Crux of Logistic Regression

  • Despite its name, it is a classification algorithm rather than a regression algorithm
  • Logistic regression doesn’t require a linear relationship between the dependent and independent variables. It handles various types of relationships because it applies a non-linear log transformation to the predicted odds
  • To avoid overfitting and underfitting, we should include all the significant variables. A good way to ensure this is to use a stepwise method to estimate the logistic regression
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> X, y = load_iris(return_X_y=True)
>>> clf = LogisticRegression(random_state=0, solver='lbfgs',
... multi_class='multinomial').fit(X, y)
>>> clf.predict(X[:2, :])
array([0, 0])
>>> clf.predict_proba(X[:2, :])
array([[9.8...e-01, 1.8...e-02, 1.4...e-08],
[9.7...e-01, 2.8...e-02, ...e-08]])
>>> clf.score(X, y)
0.97...

3. Polynomial Regression

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modelled as an nth degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of X and the corresponding Y.

The Crux of Polynomial Regression

  • While there might be a temptation to fit a higher-degree polynomial to get a lower error, this can result in overfitting. Always plot the relationship to see the fit, and focus on making sure that the curve matches the nature of the problem.
  • The fitted model is more reliable when it is built on a large number of observations.
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
[2, 3],
[4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0., 0., 1.],
[ 1., 2., 3., 4., 6., 9.],
[ 1., 4., 5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1., 0., 1., 0.],
[ 1., 2., 3., 6.],
[ 1., 4., 5., 20.]])
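
The snippet above only generates the polynomial features; a minimal sketch of a full polynomial regression (on toy quadratic data, assuming degree 2) chains the transformer with a linear model:

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.linear_model import LinearRegression
>>> x = np.arange(10).reshape(-1, 1)                  # single toy feature
>>> y = 3 + 2 * x.ravel() + x.ravel() ** 2            # toy quadratic target
>>> model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
>>> model = model.fit(x, y)                           # learns w0 + w1*x + w2*x^2
>>> pred = model.predict([[10]])                      # should be close to 123 (= 3 + 2*10 + 10**2)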

4. Ridge Regression

Ridge regression is a variation of linear regression used mainly when the data suffers from multicollinearity (independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, so the estimates can fall far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces these standard errors.

The penalty term (lambda) regularizes the coefficients: if the coefficients take large values, the optimization objective is penalized. Ridge regression therefore shrinks the coefficients, which helps reduce model complexity and the impact of multicollinearity.

The Crux of Ridge Regression

  • The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed
  • It shrinks the coefficients towards zero but never exactly to zero, so it does not perform feature selection
  • This is a regularization method and uses L2 regularization.
>>> from sklearn.linear_model import Ridge
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = Ridge(alpha=1.0)
>>> clf.fit(X, y)
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
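
To illustrate the shrinkage described above, here is a rough sketch (on separate toy data) comparing the coefficient norm for a small and a large penalty; the exact numbers depend on the random data, but the larger alpha yields a smaller coefficient norm:

>>> import numpy as np
>>> from sklearn.linear_model import Ridge
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(50, 5)
>>> y = X @ np.array([3., -2., 0.5, 1., 4.]) + rng.randn(50)
>>> small = Ridge(alpha=0.01).fit(X, y)
>>> large = Ridge(alpha=100.0).fit(X, y)
>>> print(np.linalg.norm(large.coef_) < np.linalg.norm(small.coef_))   # larger penalty, smaller coefficients
True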

5. Lasso Regression

Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it is capable of reducing variability and improving the accuracy of linear regression models.

Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This penalizes (or, equivalently, constrains) the sum of the absolute values of the estimates, which causes some of the parameter estimates to turn out exactly zero.

The Crux of Lasso Regression

  • The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed
  • It can shrink coefficients to exactly zero, which helps with feature selection
  • This is a regularization method and uses L1 regularization
  • If a group of predictors is highly correlated, lasso tends to pick only one of them and shrink the rest to zero
>>> from sklearn import linear_model
>>> clf = linear_model.Lasso(alpha=0.1)
>>> clf.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
>>> print(clf.coef_)
[0.85 0. ]
>>> print(clf.intercept_)
0.15...

6. ElasticNet Regression

ElasticNet is a hybrid of the lasso and ridge regression techniques: it is trained with both L1 and L2 penalties as the regularizer. Elastic-net is useful when there are multiple correlated features. Lasso is likely to pick one of these at random, while elastic-net is likely to keep both (see the sketch after the code below).


A practical advantage of trading off between lasso and ridge is that it allows elastic-net to inherit some of ridge’s stability under rotation.

The Crux of ElasticNet Regression

  • It encourages a grouping effect in the case of highly correlated variables
  • It can suffer from double shrinkage
>>> from sklearn.linear_model import ElasticNet
>>> from sklearn.datasets import make_regression
>>> X, y = make_regression(n_features=2, random_state=0)
>>> regr = ElasticNet(random_state=0)
>>> regr.fit(X, y)
ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=0, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.predict([[0, 0]]))
[1.451...]
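
Here is a rough sketch of that grouping effect, using two identical (hence perfectly correlated) copies of the same toy predictor; the exact coefficients depend on the solver, but lasso typically zeroes out one copy while elastic-net spreads the weight across both:

>>> import numpy as np
>>> from sklearn.linear_model import Lasso, ElasticNet
>>> rng = np.random.RandomState(0)
>>> x = rng.randn(100)
>>> X = np.column_stack([x, x])                   # two identical predictors
>>> y = 3 * x + 0.1 * rng.randn(100)
>>> print(Lasso(alpha=0.1).fit(X, y).coef_)       # typically only one non-zero coefficient
>>> print(ElasticNet(alpha=0.1).fit(X, y).coef_)  # typically both non-zero, weight shared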

7. Support Vector Regression

SVM can be used both as a classification and a regression algorithm. In SVM classification, the hyperplane is the boundary that separates the data classes, whereas in SVR it is the line (or surface) used to predict the continuous target value.

The objective in SVR is to fit a line such that as many points as possible fall within a margin (the epsilon tube) around it; the best-fit hyperplane is the one that contains the maximum number of points inside this margin.

The Crux of Support Vector Regression

  • SVR can be preferable to ordinary least squares minimization because its epsilon-insensitive loss ignores small errors and penalizes large ones only linearly, making it less sensitive to outliers
  • SVR can be applied to linear or non-linear data and offers a choice of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’ or ‘precomputed’ kernels.
>>> from sklearn.svm import SVR
>>> import numpy as np
>>> n_samples, n_features = 10, 5
>>> np.random.seed(0)
>>> y = np.random.randn(n_samples)
>>> X = np.random.randn(n_samples, n_features)
>>> clf = SVR(gamma='scale', C=1.0, epsilon=0.2)
>>> clf.fit(X, y)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.2, gamma='scale',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
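
Continuing the example, epsilon sets the width of the tube within which errors are ignored; a rough sketch of how it affects the number of support vectors (exact counts depend on the random data, but a narrower tube typically leaves more points outside and therefore keeps more support vectors):

>>> wide = SVR(gamma='scale', C=1.0, epsilon=1.0).fit(X, y)
>>> narrow = SVR(gamma='scale', C=1.0, epsilon=0.01).fit(X, y)
>>> print(len(narrow.support_), len(wide.support_))   # the narrow tube usually keeps more support vectors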

8. Principal Component Regression

Principal component regression (PCR) is a technique for handling near-collinearities among the regression variables in linear regression: the predictors are first replaced by a small number of their principal components, and the regression is fit on those components. The principal components that are dropped give insight into which linear combinations of variables are responsible for the collinearities.

The Crux of Principal Component Regression

  • PCR can perform regression when the explanatory variables are highly correlated or even collinear.
  • You can run PCR when there are more variables than observations (wide data).
  • Principal component regression does not consider the response variable when deciding which principal components to drop
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> pca = PCA(n_components=2)
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_ratio_)
[0.9924... 0.0075...]
>>> print(pca.singular_values_)
[6.30061... 0.54980...]
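
PCA on its own only extracts the components; a minimal sketch of a full principal component regression (toy data with a nearly collinear column, keeping 2 of 5 components) chains it with a linear model:

>>> import numpy as np
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.decomposition import PCA
>>> from sklearn.linear_model import LinearRegression
>>> rng = np.random.RandomState(0)
>>> X = rng.randn(100, 5)
>>> X[:, 3] = X[:, 0] + 0.01 * rng.randn(100)        # a nearly collinear column
>>> y = X @ rng.randn(5) + 0.1 * rng.randn(100)
>>> pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
>>> pcr = pcr.fit(X, y)                              # regression on the top 2 components only
>>> preds = pcr.predict(X[:2])                       # predictions from the reduced representation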

Finding the Best Regression Model

Choosing the correct regression model is as much a science as it is an art.

  • Research what others have done and incorporate those findings into constructing your model. Before beginning the regression analysis, develop an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes. Building on the results of others makes it easier both to collect the correct data and to specify the best regression model
  • Not all complex problems require complex models; often simpler models produce more accurate predictions. Start simple, and only make the model more complex as needed.
  • As you evaluate models, check the residual plots because they can help you avoid inadequate models and help you adjust your model for better results

This blog was written under the guidance of the DSAI Club; follow us for more exciting blogs and news about data science.


All the DSAI’s Lecture Series Notebooks can also be found at — adityak6798.github.io

Get connected with me on LinkedIn. Thank you!
