This is a follow-up to Know Your Methods: EFA vs. PCA, and just like that summary, this one is also based on Field’s Discovering Statistics Using R (2012), because I’m studying for an exam at the moment and it’s a bloody awesome book to be studying with.
Regression is a type of analysis used to predict an outcome from a predictor – which is, in essence, what we’re trying to do in psychology most of the time.
Linear Regression is what we use when we have one predictor variable and one continuous outcome variable. Let’s say I had the hypothesis that GPA in high school predicts future income – linear regression would be the method of choice. Linear regression is, together with a simple scatterplot, generally the first step in analysing bivariate data (i.e. data for two variables).
Multiple Regression is to be used when we have several predictor variables for one continuous outcome variable. To give an example for that, I could extend my previous hypothesis by including number of semesters spent at university, height or age. The predictors can be any level of measurement we like and we assume the outcome to be a random variable.
Logistic Regression is a special type of multiple regression in which there are several continuous or categorical predictor variables and one categorical outcome variable with two categories. For example, say I knew that certain personality traits are significantly different in men and women, and say I had done a personality test with lots of people, but had forgotten to ask for their sex; I could then use the personality test scales which I know to differentiate between men and women (each a continuous predictor) to predict whether the person who filled in the questionnaire was a man or woman (categorical outcome). Sadly, I can’t use it to find a more realistic example right now.
There’s also multinomial (or polychotomous) logistic regression, in case our outcome has more than two categories.
How does it do that?
The general equation for any type of regression is outcome = model + error. For simple linear regression, the model can be further broken down into b0 + b1*X, where b0 is the intercept (the point where the regression line crosses the Y-axis, i.e. the location of the model in geometric space) and b1 is the regression coefficient of predictor X, which tells us the gradient, i.e. the shape of the model. This model can be extended for multiple regression by adding further regression coefficients and predictors (b2*X2, b3*X3, and so on), but basically, that’s it: our outcome is predicted by our model, i.e. its intercept and slope(s), plus an error term that’s always there.
Regression analysis not only gives us an equation describing our model, which we can then apply to other data, but also tells us how much of the total variance our model explains.
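Here’s a minimal Python sketch of that equation in action – the data are made up, and I’m using the textbook least-squares formulas rather than any particular library:

```python
# Minimal sketch of outcome = model + error for one predictor,
# fitted with the textbook least-squares formulas (toy data).
from statistics import mean

x = [1.0, 2.0, 3.0, 4.0, 5.0]   # predictor (e.g. high-school GPA)
y = [1.2, 1.9, 3.2, 3.8, 5.1]   # outcome (e.g. later income, rescaled)

x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)      # slope (gradient)
b0 = y_bar - b1 * x_bar                        # intercept

predicted = [b0 + b1 * xi for xi in x]                  # the model part
residuals = [yi - pi for yi, pi in zip(y, predicted)]   # the error part

# R^2: the proportion of total variance the model explains
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_resid / ss_total
```

With this toy data the slope comes out at 0.97 and the model explains almost all of the variance – real data are, of course, messier.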
We generally – with one exception – make the same assumptions for all types of regression.
First of all, we assume that there will always be a portion of variance that our model won’t be able to explain. Other assumptions include:
- homoscedasticity (basically means that at any point along the levels of any predictor variable, we expect the spread of residuals to be fairly constant)
- normally distributed residuals
- constant error variance
The latter two can be tested using the residual plots.
About the error term in that equation in particular, we assume that:
- the error term is a statistically independent random variable
- the error variance is the same for every value of the predictor
- it’s normally distributed
- its average = 0
Finally, there is the assumption of linearity: we assume the outcome variable to be a random variable with a mean that is a linear function of the predictors. This assumption, by definition, is not made in logistic regression.
Not so much an assumption as a problem for all types of regression is multicollinearity. This is the case when correlations between the predictors are too high, and this is problematic because these predictors will then explain the same fraction of variance – so the sum of explainable variance is restricted and interpretation of the predictors is complicated by this. Another effect of this problem is that the model’s predictions become less reliable.
So how do we find out whether there’s multicollinearity in our data? For starters, we simply take a look at the correlation matrix. There’s also the variance inflation factor (VIF), which can be computed to figure this out. As a rule of thumb, if the largest VIF is > 10, we have cause for concern – as we have when the average VIF is substantially > 1 (which may indicate that the regression is biased). If we do find individual variables that are particularly highly correlated with each other (as indicated by the VIF), we should see whether we can drop one of them, or perhaps summarize them; with all of this, we have to bear in mind that our actions also need to make sense from a theoretical point of view. In this case, we should run a PCA before the multiple regression.
Additionally, we could take a look at the tolerance statistic, which is computed as 1/VIF. A tolerance < 0.1 points to a serious problem; a value < 0.2 just means there might be a problem.
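For the special case of exactly two predictors, the VIF has a nice closed form – it’s 1/(1 − r²), where r is the predictors’ correlation – so the whole check fits in a few lines of Python (toy data, invented for illustration):

```python
# Hedged sketch: with exactly two predictors, the VIF of each one is
# 1 / (1 - r^2), where r is their Pearson correlation; tolerance = 1/VIF.
from statistics import mean

x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 2.1, 2.9, 4.2, 4.8]   # deliberately almost collinear with x1

m1, m2 = mean(x1), mean(x2)
num = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
den = (sum((a - m1) ** 2 for a in x1)
       * sum((b - m2) ** 2 for b in x2)) ** 0.5
r = num / den                      # Pearson correlation

vif = 1 / (1 - r ** 2)
tolerance = 1 / vif

if vif > 10 or tolerance < 0.1:
    print("serious multicollinearity problem")
elif tolerance < 0.2:
    print("potential multicollinearity problem")
```

With more than two predictors, the R² in the VIF formula comes from regressing each predictor on all the others, which is what statistics packages do for you.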
Another problem with regression (and really, with any data) is outliers and unduly influential cases. Outliers can be identified by looking at the standardized residuals: anything with a standardized residual < -2 or > 2 is a potential outlier and should be investigated further. Specifically, we should get Cook’s distances, leverages and covariance ratios for each of these outlier cases to determine whether they exert undue influence on the regression model. The most important of these indices is Cook’s distance: if Cook’s distance is < 1, the case does not unduly influence the model; so, even if the case is an outlier, it’s not a problem for our regression analysis. The boundary for problematic leverage values is either twice or three times the average leverage, depending on the statistician one relies on. The CVR (covariance ratio) statistic works the same way: two boundaries can be computed (using a somewhat complicated formula), and as long as values are within those boundaries, everything’s fine.
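The first screening step – flagging cases whose standardized residuals fall outside ±2 – is easy to sketch in Python. This is a rough version (residuals invented, and I’m simply standardizing against the sample standard deviation rather than using the exact formula a stats package would):

```python
# Rough sketch of outlier screening: standardize the residuals and flag
# any case outside +/-2 for closer inspection (Cook's distance, leverage
# and covariance ratio would then be computed for the flagged cases).
from statistics import mean, stdev

residuals = [0.3, -0.5, 0.1, 2.8, -0.2, 0.4, -0.6, 0.2]  # made-up values

m, s = mean(residuals), stdev(residuals)
standardized = [(e - m) / s for e in residuals]

flagged = [i for i, z in enumerate(standardized) if abs(z) > 2]
# flagged cases are *potential* outliers, not automatically problems:
# only a Cook's distance >= 1 would suggest undue influence on the model
```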
Testing these assumptions
The assumption of independent errors is assessed using the Durbin-Watson test, the results of which should be between 1 and 3; anything above or below these boundaries might indicate that the errors are related in some way. The test also produces a p-value which should not be significant.
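The Durbin–Watson statistic itself is simple enough to compute by hand – it’s the sum of squared differences between successive residuals divided by the sum of squared residuals. A quick sketch with invented residuals:

```python
# Minimal sketch of the Durbin-Watson statistic on a residual series
# (illustrative residuals; values near 2 suggest independent errors).
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, -0.1]

numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                for t in range(1, len(residuals)))
denominator = sum(e ** 2 for e in residuals)
dw = numerator / denominator     # ranges from 0 to 4; 2 = no autocorrelation

assume_independent = 1 < dw < 3  # the rule of thumb from the text
```

The p-value, on the other hand, has to come from the test’s sampling distribution, which is what the software computes for you.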
The data should also be inspected visually, as this is a good way to check assumptions related to the residuals (homoscedasticity, normality, linearity). A histogram of studentized residuals is a good way to check the assumption of normal distribution (though it’s worth noting that a visual check isn’t the be-all and end-all of normal distribution, because in small samples, distributions can look very non-normal even when the underlying population is normal).
If assumptions have been violated, the model can still be good for the data at hand; however, the findings can then not be generalized beyond the sample. There are some solutions for this, depending on which assumption it is that has been violated.
- heteroscedasticity or non-normally distributed residuals: transforming the raw data may (or may not) help
- linearity: try a logistic regression instead
- generally: robust regression (bootstrapping)
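The bootstrap behind that last option is conceptually simple: resample the cases with replacement many times, refit the model each time, and use the spread of the refitted coefficients instead of the usual standard errors. A Python sketch with hypothetical (x, y) pairs:

```python
# Hedged sketch of bootstrapping a regression slope: resample cases with
# replacement, refit, and take the percentile interval of the slopes.
import random
from statistics import mean

random.seed(42)                       # fixed seed so the sketch is reproducible
data = [(1, 1.1), (2, 2.3), (3, 2.8), (4, 4.2), (5, 4.9),
        (6, 6.1), (7, 6.8), (8, 8.3)]   # hypothetical (x, y) pairs

def slope(pairs):
    xs = [p[0] for p in pairs]
    xb, yb = mean(xs), mean(p[1] for p in pairs)
    return (sum((x - xb) * (y - yb) for x, y in pairs)
            / sum((x - xb) ** 2 for x in xs))

boot_slopes = sorted(
    slope([random.choice(data) for _ in data]) for _ in range(1000)
)
ci_low, ci_high = boot_slopes[24], boot_slopes[974]  # ~95% percentile CI
```

The appeal is that this interval makes no normality assumption about the errors – it’s built from the data alone.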
As a ground rule, we are looking for the simplest model that fits our data well, meaning a model that uses as many predictors as necessary, yet as few predictors as possible. There are three different types of methods to try and figure out which predictors should be included in a multiple regression:
- Backward Elimination: starts by including all predictor variables in the model and then removes them one by one until removing any further variable would significantly worsen the model.
- Forward Selection: does the exact opposite (i.e. starts with zero predictors and keeps adding new ones until the model no longer explains significantly more variance).
- Stepwise Regression: combines the former two iteratively.
I just said that these methods do what they do until R^2 no longer significantly increases; this is only half-true, in that this is the most popular criterion, but you can also apply other criteria here, e.g. the reduction in the residual sum of squares (RSS).
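The forward variant boils down to a simple loop, whatever criterion you plug in. Here’s a schematic Python version – `score` is a stand-in for your criterion of choice, and the `fake_r2` values at the bottom are pure invention just to make the skeleton runnable:

```python
# Schematic forward selection: `score` stands in for whatever criterion
# you use (R^2 increase, RSS reduction, AIC, ...).
def forward_selection(candidates, score, min_improvement=0.01):
    """Add the best predictor at each step until improvement stalls."""
    selected, best = [], score([])
    while candidates:
        gains = {p: score(selected + [p]) - best for p in candidates}
        winner = max(gains, key=gains.get)
        if gains[winner] < min_improvement:
            break                     # no predictor improves the model enough
        selected.append(winner)
        candidates = [p for p in candidates if p != winner]
        best += gains[winner]
    return selected

# toy criterion: pretend each predictor adds a fixed amount of R^2
fake_r2 = {"gpa": 0.30, "semesters": 0.10, "age": 0.005, "height": 0.001}
model = forward_selection(list(fake_r2),
                          lambda ps: sum(fake_r2[p] for p in ps))
```

With these invented numbers, only "gpa" and "semesters" make it in before the improvement drops below the threshold.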
A general problem for all of these methods is yet again caused by multicollinearity, in that it leads to unstable models; so be careful when using automated model-selection methods!
For logistic regression, there are two types of methods:
- Forced Entry Method: this is the default method; it adds all the predictors to the model in one block and estimates the parameters.
- Stepwise Methods: just like with multiple regression, there are forward, backward, and a combined method. Again, the combined methods (forward/backward or backward/forward) are to be preferred.
The selection of methods should ultimately depend on whether you are trying to carry out exploratory or confirmatory work.
Indices for model parsimony you want to look at are the AIC (Akaike’s Information Criterion; the smaller the value, the better the model fit) and the BIC (the Bayesian Information Criterion, which is a slightly stricter version of the AIC and to which the same rule applies).
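Both criteria follow directly from their textbook definitions, with k being the number of estimated parameters and n the sample size (the log-likelihood below is an invented value for illustration):

```python
# Sketch of the information criteria from their standard definitions.
import math

log_likelihood = -54.0   # made-up value; comes from the fitted model
k, n = 3, 100            # number of parameters, sample size

aic = 2 * k - 2 * log_likelihood            # smaller = better
bic = k * math.log(n) - 2 * log_likelihood  # penalizes extra predictors harder

# BIC's penalty exceeds AIC's whenever log(n) > 2, i.e. for n of about 8+
```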
Diagnosis of the regression model
Closely tied in with the subject of model parsimony is the general question: does our model fit the data well? If not, we may find ourselves needing a more complex model. To find this out, we should take a look at some fit indices as well as the residuals. The residuals may suggest a more complex model, a transformation, or even a different trend (quadratic, cubic, … rather than linear).
The standardised residuals before and after deletion of a variable should be looked at, as well as Cook’s distances, which indicate how strongly each individual observation influences the fitted model.
Locally weighted regression
There’s also a specific way to do regression in an exploratory manner. We use this when we believe there to be a complex relationship between variables, but are not sure how that relationship might look, and so we want a model that is shaped by the data. This type of regression analysis simply summarizes the data into a function. There are different types of locally weighted regression: Lowess fits or piecewise defined functions. However, I won’t go into further detail about these right now (maybe one day, when I understand them better); for now, suffice it to say they are also a thing.
Specifics for logistic regression
Logistic regression is essentially a multiple regression except the outcome variable is a categorical variable with two categories; consequently, the assumption of linearity is not made here, or not in the same sense as in multiple regression. Instead, here we assume a linear relationship between any continuous predictors and the logit of the outcome variable. The test for this assumption lies in the interaction term between the predictor and its log transformation. Independence of errors is assumed just like in standard regression (i.e. if we want to measure people multiple times and thus cause the data to be related, we need to use a multilevel model instead). Also, multicollinearity is as much a problem here as it is in any other type of regression.
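To make the “linear in the logit” idea concrete, here’s a minimal Python sketch of the logistic model (coefficients invented): the linear predictor is squashed into a probability, and applying the logit to that probability recovers the linear part exactly.

```python
# Minimal sketch of the logistic model: the *logit* of the outcome
# probability, not the probability itself, is linear in the predictor.
import math

b0, b1 = -2.0, 0.5   # made-up coefficients

def predicted_probability(x):
    linear_part = b0 + b1 * x            # linear in the predictor
    return 1 / (1 + math.exp(-linear_part))  # squashed into (0, 1)

def logit(p):
    return math.log(p / (1 - p))

# round-trip: recovering the linear predictor from the probability
x = 3.0
p = predicted_probability(x)
assert abs(logit(p) - (b0 + b1 * x)) < 1e-12
```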
When assessing the logistic regression model, there are some specific statistics that need to be looked at:
- log-likelihood statistic: analogous to the residual sum of squares in multiple regression, i.e. indicates the amount of unexplained variance after the model has been fitted; the higher, the worse
- deviance statistic: = -2LL (because it’s calculated as -2*log-likelihood); subtracting the new model’s deviance from the baseline model’s deviance gives the likelihood ratio; chi-square distributed and therefore more convenient than the log-likelihood statistic
- R and R^2: R can’t simply be squared and interpreted here as it could with linear regression; use Hosmer and Lemeshow’s R2L instead (between 0 and 1, with 0 being bad); or Cox and Snell’s R2CS or Nagelkerke’s R2N – they’re conceptually similar
- information criteria: AIC and BIC => use to judge model fit; they take into account explained variance, but penalize for more predictor variables
- z-statistic: analogous to t-statistic in linear regression; significant test means tested b-coefficient contributes significantly to the prediction; the z-statistic is underestimated when b is large; likelihood ratio is more accurate than z-statistic, and therefore preferable
- odds ratio: odds of an event occurring = probability(event occurs) / probability(event doesn’t occur) => odds ratio = odds after unit change in predictor / original odds; OR > 1 stands for a positive linear relationship between predictor and odds; OR < 1 stands for a negative relationship
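Two of these statistics are easy to compute by hand once the models are fitted. A quick sketch with invented numbers (in practice the log-likelihoods and probabilities come from the fitted logistic models):

```python
# Sketch of the deviance / likelihood-ratio comparison and the odds ratio,
# using invented numbers for illustration.

# deviance = -2 * log-likelihood
baseline_log_likelihood = -70.0   # e.g. intercept-only model
model_log_likelihood = -55.0      # model including the predictors

baseline_deviance = -2 * baseline_log_likelihood
model_deviance = -2 * model_log_likelihood

# chi-square distributed, df = difference in number of parameters
likelihood_ratio = baseline_deviance - model_deviance

# odds ratio from two invented probabilities
p_before = 0.20                           # P(event) at some predictor value
p_after = 0.50                            # P(event) after a one-unit increase
odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)
odds_ratio = odds_after / odds_before     # > 1: positive relationship
```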
Finally, there are two specific problems that one needs to be aware of when doing logistic regression:
- incomplete information from the predictors:
There need to be data for all possible combinations of variables – if this is not the case, R will have problems computing the coefficients, which is signalled by the coefficients’ standard errors being unusually large.
- complete separation:
This is the case when the outcome can be perfectly predicted from one variable or a combination of variables and often arises when too many variables are fitted to too few cases. This is problematic because, when the two outcome categories are perfectly predicted but there are no cases in between, R is unsure about how steep it should make the curve connecting the two extremes. Again, large standard errors hint at this problem.
Let me conclude this rather complicated topic with another relatable stats comic: