community project 
encouraging academics to share statistics support resources 
All stcp resources are released under a Creative Commons licence 
 
© Ellen Marshall and Sofia Maria Karadimitriou, University of Sheffield
Reviewer: Jean Russell, University of Sheffield
 
 stcp-marshall-furtherRegressionS 
 
 
 
Outliers, Durbin-Watson and interactions for regression in SPSS 
Dependent variable: Continuous (scale)  
Independent variables: Continuous/ binary 
Data: The data set ‘Birthweight reduced.sav’ contains details of 42 babies and their parents at birth.  
The dependent variable is birthweight in pounds (lbs) and the two independent variables are the 
gestational age of the baby at birth (in weeks) and whether or not the mother smokes 
(0 = non-smoker, 1 = smoker). 
Investigating outliers and influential observations 
An assumption of regression is that there are no influential observations.  These are extreme 
values which pull the regression line towards them and therefore have a large impact on the 
coefficients of the model.   
Outliers: Outliers are observations where the observed dependent value does not follow the general trend given the independent value (unusual y given x).  In this situation, the residual for that observation is likely to be large unless it is also influential and has pulled the line towards it.  A residual is the difference between the observed and predicted values, and standardised residuals (with a mean of 0 and SD of 1) can be requested in SPSS.  If the residuals are normally distributed, approximately 5% of standardised residuals will be outside ±1.96 and 0.3% will be outside ±3; values outside ±3 are classified as extreme outliers.  Large samples are more likely to contain extreme outliers just by chance.   
Deleted residuals are the residuals obtained if the regression was repeated without the individual observation. 
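As a rough illustration of the ±1.96 and ±3 rules above, the sketch below fits a simple regression line and divides each residual by the residual standard deviation. The data are hypothetical (they are not the birthweight data set), the function names are made up for this example, and SPSS may differ in detail from this hand calculation.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def standardised_residuals(x, y):
    """Residuals divided by the residual standard deviation (one predictor)."""
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_sd = sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [e / residual_sd for e in residuals]

# Hypothetical data, NOT the birthweight data set:
# the sixth baby is much heavier than the trend suggests.
gestation = [34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
weight = [5.2, 5.6, 6.0, 6.3, 6.7, 9.5, 7.4, 7.6, 8.0, 8.4]

z = standardised_residuals(gestation, weight)
outside_196 = [i for i, zi in enumerate(z) if abs(zi) > 1.96]
extreme = [i for i, zi in enumerate(z) if abs(zi) > 3]
print(outside_196, extreme)
```

Positions are counted from 0, so the sixth baby appears as index 5.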
Leverage: Leverage relates to subjects with unusual values of the 
independent variable which have the potential to influence the slope 
greatly.  An observation with high leverage will pull the regression 
line towards it.  Calculations compare the independent values with 
their mean.  The average leverage score is calculated as (k + 1)/ n 
where k is the number of independent variables in the model and n 
is the number of observations.  Observations with high leverage will 
have leverage scores 2 or 3 times this value. 
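The leverage calculation can be sketched for simple regression (k = 1), where the leverage of an observation is 1/n plus its squared distance from the mean of x divided by the total squared distance. The gestational ages below are hypothetical and the function name is an assumption for this example. Note that SPSS saves centred leverage values, which are understood to omit the 1/n term, so they will be slightly smaller than the values computed here.

```python
def leverage_values(x):
    """Leverage for simple linear regression: 1/n + (x - mean)^2 / Sxx."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

# Hypothetical gestational ages in weeks; the last value is unusually low.
gestation = [38, 39, 39, 40, 40, 40, 41, 41, 42, 30]

k = 1                      # number of independent variables
n = len(gestation)
average_leverage = (k + 1) / n

h = leverage_values(gestation)
high_leverage = [i for i, hi in enumerate(h) if hi > 3 * average_leverage]
print(average_leverage, high_leverage)
```

The mean of the leverage values equals (k + 1)/n, which is why multiples of that quantity are used as the cut off.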
Figure: Outlier without leverage changes intercept only, y = -6.05 + 0.33x; leverage (unusual x) but not an outlier, y = -5.86 + 0.34x 
The following resources are associated: 
Simple and Multiple linear regression in SPSS and the SPSS dataset ‘Birthweight_reduced.sav’ 
Further regression in SPSS 
 
                                statstutor community project                                 www.statstutor.ac.uk                        
Influence: An influential observation is one which is an outlier with leverage and affects the 
intercept and slope of a model significantly.  Calculations 
are based on how the predictions would differ if the 
observation was not included. 
Cook’s distance: This is calculated for each individual and is based on the squared differences between the predicted values from regression with and without an individual observation.  A large Cook’s distance indicates an influential observation.  Compare the Cook’s distance for each observation with 4/n where n is the number of observations.  Values above this indicate observations which could be a problem. 
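The description above can be sketched directly: refit the line without each observation in turn, sum the squared changes in the predicted values, and scale by the number of coefficients times the residual mean square. This is a minimal hand calculation on hypothetical data (not the birthweight data set) with made-up function names; it is not SPSS's own routine.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cooks_distances(x, y):
    """Cook's distance for each observation via leave-one-out refits."""
    n = len(x)
    p = 2  # coefficients estimated: intercept and slope
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit without observation i, then predict for all n observations
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a_i + b_i * xi)) ** 2 for xi, fi in zip(x, fitted))
        distances.append(shift / (p * mse))
    return distances

# Hypothetical data: the last baby has an unusual x and does not follow the trend.
gestation = [34, 35, 36, 37, 38, 39, 40, 41, 42, 46]
weight = [5.2, 5.6, 6.0, 6.3, 6.7, 7.0, 7.4, 7.6, 8.0, 5.0]

d = cooks_distances(gestation, weight)
cutoff = 4 / len(gestation)
flagged = [i for i, di in enumerate(d) if di > cutoff]
print(flagged)
```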
 
Carry out simple linear regression through Analyze → Regression → Linear with Birthweight as the Dependent 
variable and Gestation as the Independent.  
In the Save menu, select Standardised residuals, Cook’s 
and Leverage values.  The values for each individual will 
be added to the data set. 
 
To produce a bar chart of Cook’s distance for each observation, go to Graphs → Legacy Dialogs → Bar, 
choose Other statistic (e.g. mean) and move Cook’s distance to the Variable box and id to the 
Category Axis box.  There is only one observation for each baby so the mean is the value itself. 
 
The cut off for Cook’s is 4/n so here it is 4/42 = 0.095 which can be added to the chart as a 
reference line to make it easier to see.  All of the Cook’s Distances are below this line.   
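The cut off arithmetic can be checked with a couple of lines. The sample size of 42 comes from the data description; the list of Cook's distances is hypothetical and only shows how values would be compared with the reference line.

```python
def cooks_cutoff(n):
    """Rule-of-thumb cut off for Cook's distance: 4 divided by the sample size."""
    return 4 / n

n_babies = 42  # sample size stated for the birthweight data
cutoff = cooks_cutoff(n_babies)
print(round(cutoff, 3))  # 0.095

# Hypothetical Cook's distances, for illustration only
example_distances = [0.01, 0.08, 0.12]
above_line = [i for i, d in enumerate(example_distances) if d > cutoff]
```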
 
To check for outliers and leverage, produce a scatterplot of the Centred Leverage Values and the 
standardised residuals.  There are two observations with standardised residuals outside ±1.96 
but there are no extreme outliers with standardised residuals outside ±3.  Leverage values more 
than 3 times (k + 1)/n are large, where k = number of independent variables.  The cut off here 
is 3*(1+1)/42 = 0.14.  No observations have leverage values above 0.14. 
If an observation has a very large leverage score, try running the model with and without the value to 
see how much the coefficients in the model change. 
Figure: Influential observation is an outlier with leverage, y = -10.87 + 0.47x 
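Running the model with and without a high-leverage value amounts to fitting the line twice and comparing the coefficients. The sketch below also reproduces the leverage cut off of 3(k + 1)/n for n = 42. The data and function names are hypothetical and chosen only to show a large change in slope.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def leverage_cutoff(k, n, multiple=3):
    """Cut off for large leverage: a multiple of the average leverage (k + 1)/n."""
    return multiple * (k + 1) / n

print(round(leverage_cutoff(k=1, n=42), 2))  # 0.14

# Hypothetical data: the last observation has an unusual x and is off the trend.
gestation = [34, 35, 36, 37, 38, 39, 40, 41, 42, 46]
weight = [5.2, 5.6, 6.0, 6.3, 6.7, 7.0, 7.4, 7.6, 8.0, 5.0]

with_value = fit_line(gestation, weight)
without_value = fit_line(gestation[:-1], weight[:-1])
print("with:   ", with_value)
print("without:", without_value)
```

A large difference between the two sets of coefficients suggests the observation is influential.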
The Durbin-Watson test 
One of the assumptions of regression is that the observations are independent.  If observations are 
made over time, it is likely that successive observations are related (autocorrelation).  If there is 
no autocorrelation, the Durbin-Watson statistic should be between 1.5 and 2.5.  Carry out simple 
linear regression through Analyze → Regression → Linear with Birthweight as the Dependent variable 
and Gestation as the Independent.  The Durbin-Watson statistic is found in the Statistics menu.  
 
The Durbin-Watson statistic is 2.39, which is between 1.5 and 2.5, so there is no evidence that 
the data are autocorrelated. 
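For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below applies that standard formula to two hypothetical sets of residuals and checks the 1.5 to 2.5 rule of thumb used above; the function names are assumptions for this example.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def within_rule_of_thumb(d, lower=1.5, upper=2.5):
    """True if the statistic lies in the range taken to indicate no autocorrelation."""
    return lower <= d <= upper

# Hypothetical residuals in time order
trending = [-3, -2, -1, 1, 2, 3]       # neighbours are similar: statistic near 0
alternating = [1, -1, 1, -1, 1, -1]    # neighbours flip sign: statistic near 4

d_trending = durbin_watson(trending)
d_alternating = durbin_watson(alternating)
print(d_trending, within_rule_of_thumb(d_trending))
print(d_alternating, within_rule_of_thumb(d_alternating))
print(within_rule_of_thumb(2.39))  # the value reported for the birthweight model
```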
 
Interactions in regression 
An interaction is the combined effect of two independent variables on one dependent variable.  
In SPSS, interaction terms must be calculated before they are included in a model.  The following example uses 
the birthweight data with birthweight as the dependent variable and gestation and whether or not 
the mother smokes (0 = no, 1 = yes) as the independent variables.   
 
The scatterplot shows the regression lines for birthweight (y) for smokers and non-smokers 
without an interaction between the two independents in the model.  
The continuous x variable ‘Gestational age’ 
contributes to the slope of the line.  For both lines, the 
slope is 0.34 so a baby increases in weight by 0.34 
lbs for each extra week of gestation.  The binary 
variable ‘Smoking status of mother’ changes the 
intercept so smokers/ non-smokers have a different 
intercept. 
The lines are parallel but smokers tend to have lighter 
babies at each gestational age (intercept is 0.6 lbs 
lower). 
 
If there is an interaction between gestational age and smoking status, the slopes of the two lines 
would be different.  This means that the effect of gestational age (x) on birthweight (y) is different 
depending on whether or not the mother smokes. 
Including interaction terms in regression 
For standard multiple regression, an interaction variable has to be added to the dataset by 
multiplying the two independents using Transform → Compute variable. 
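Outside SPSS, the same interaction variable is simply the row-by-row product of the two independents. The sketch below uses four hypothetical babies; the name int_gest_smoker mirrors the intGESTsmoker variable used in this example.

```python
# Hypothetical rows, NOT taken from the birthweight data set
gestation = [38, 40, 36, 41]   # gestational age in weeks
smoker = [0, 1, 1, 0]          # 0 = non-smoker, 1 = smoker

# Interaction variable: gestation multiplied by smoker for each baby
int_gest_smoker = [g * s for g, s in zip(gestation, smoker)]
print(int_gest_smoker)  # [0, 40, 36, 0]
```

The interaction variable is 0 for every non-smoker and equal to gestational age for every smoker, which is what allows the slope to differ between the two groups.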
Figure: Regression lines without an interaction, y = -5.9 + 0.34x (non-smokers) and y = -6.5 + 0.34x (smokers) 
To run a regression model: Analyze → Regression → Linear 
Run the regression model with ‘Birth weight’ as the Dependent and gestational age, smoker and 
the new interaction variable intGESTsmoker as Independent(s). 
 
The Coefficients table contains the coefficients for the model (regression equation) and p-values 
for each independent variable.  The output shows that the interaction is not significant so the main 
effects can be interpreted.  Only gestation is significant (p < 0.001) whilst the interaction term is in the model.  The 
regression analysis can be repeated without the interaction term if it is not significant. 
Calculations for the equations of the lines with an interaction term 
The regression model uses the Unstandardized Coefficients. 

Birth weight y = -3.431 - 5.734*(Smoker) + 0.282*(Gest) + 0.13*(Smoker*Gest) 
For non-smokers, Smoker = 0 so the model becomes y = -3.431 + 0.282*(Gest) 
For smokers, Smoker = 1: y = -3.431 - 5.734*(1) + 0.282*(Gest) + 0.13*(1*Gest) 
 = -9.165 + 0.412*(Gest) 
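The same calculation can be written as a small function that substitutes Smoker = 0 or 1 into the model. The coefficients are the unstandardized coefficients quoted above; the function and argument names are made up for this sketch.

```python
def group_line(b0, b_smoker, b_gest, b_interaction, smoker):
    """Intercept and slope of the birthweight line for smoker = 0 or 1."""
    intercept = b0 + b_smoker * smoker
    slope = b_gest + b_interaction * smoker
    return intercept, slope

# Unstandardized coefficients from the model with the interaction term
coefficients = dict(b0=-3.431, b_smoker=-5.734, b_gest=0.282, b_interaction=0.13)

non_smoker_line = group_line(**coefficients, smoker=0)
smoker_line = group_line(**coefficients, smoker=1)
print("non-smokers:", non_smoker_line)
print("smokers:    ", smoker_line)
```

The smoker coefficient shifts the intercept and the interaction coefficient shifts the slope, so the two lines are no longer parallel.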
 
 
 
Note: Where there are interactions between two scale variables, the coefficient of the interaction 
can be quite small and more difficult to interpret. 
Figure: Compute Variable dialog, with callouts ‘Give the new variable a name’ and ‘Use * to multiply the two independent variables together’