Sunday, December 22, 2013

2.9 Multiple or simple linear regression analysis with categorical variables


In the previous two sections the regression analysis was performed with numeric variables, that is, variables with numerical values. A dataset may also contain non-numerical variables. These are called categorical variables because they contain categories or character values instead of numbers.


The example dataset

In the last section of this tutorial the file Projects2.csv is used in the examples. If you open the file you can see that it contains the same data as the file Projects.csv. However, in this file three new categorical variables with character values are added: WorkingGroup (A, B or C), Month (januari through december) and TypeProject (Alfa, Beta, Gamma, Delta). The data is imported into a data frame, and its columns are made available by name, with the commands: Projects2<-read.csv('Projects2.csv') and attach(Projects2).
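The import step can be tried out without the real file. The sketch below is only an illustration: it assumes the column names mentioned above, and the four rows are made-up values, not the contents of Projects2.csv.

```r
# Hypothetical miniature stand-in for Projects2.csv (all values are made up)
csv_text <- "Profit,WorkingGroup,Month,TypeProject
1200,A,januari,Alfa
3400,B,februari,Beta
5600,C,maart,Gamma
2100,A,april,Delta"

# read.csv() returns a data frame; text= reads from a string instead of a file
demo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect which columns are numeric and which contain character values
str(demo)
sapply(demo, class)
```

Checking the column classes right after importing shows at a glance which variables are categorical.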

Figure 26: Importing the new dataset into the R console

The command

The command that is used to perform a regression analysis with categorical variables is nearly the same as for a regression analysis with numerical variables. The difference is that R has to be told to treat the variable as categorical, so that each category is converted into a numerical indicator (0 or 1) in the model. This can be done with the command factor(*name of the categorical variable*), which is placed directly in the regression formula. An example makes this clearer.
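To see what factor() does behind the scenes, the following small sketch may help. The vector group is a made-up example and not part of the dataset; model.matrix() is used here only to display the 0/1 indicator columns that lm() builds internally.

```r
# Made-up character vector standing in for a categorical variable
group <- c("A", "B", "C", "B", "A")

# factor() stores the distinct categories as levels (sorted alphabetically)
f <- factor(group)
levels(f)

# model.matrix() shows the indicator (dummy) columns used in a regression:
# the first level (A) is the reference and gets no column of its own
mm <- model.matrix(~ f)
mm
```

The first level has no column of its own, which is why it later shows up in the prediction model with a value of 0.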

Simple linear regression with categorical variables in R

In the example in Figure 27 the dependent variable Profit (Winst) is predicted by the independent variable WorkingGroup. Because WorkingGroup is a categorical variable, it has to be wrapped in factor() so that it can be used in the regression analysis. In the example this is done by the command: Regression<-lm(Profit~factor(WorkingGroup),Projects2). By typing in the command Regression the prediction model is presented: Profit = 1479.1 + (0 A) + (1987 B) + (4097 C). In this case only the term of the working group in question is included in the calculation. Working group A is the reference category, so its term is 0. For example, if the profit has to be estimated when working group B executes the project, the following calculation has to be performed: 1479.1 + 1987 = 3466.1.
By typing in the command summary(Regression) the full regression analysis is presented. You can see that factor() appears in the regression formula, wrapped around the categorical variable.
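The calculation can be written out for every working group at once. This sketch only uses the coefficients quoted from the prediction model in the text. The variable names intercept, effect and predicted are chosen here for illustration.

```r
# Coefficients taken from the prediction model quoted in the text
intercept <- 1479.1
effect <- c(A = 0, B = 1987, C = 4097)  # A is the reference category

# Predicted profit per working group: intercept plus the group's own term
predicted <- intercept + effect
predicted
```

Each working group gets the intercept plus only its own coefficient. The other coefficients are left out.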

Figure 27: Simple linear regression with categorical variables in R

Multiple linear regression with categorical variables in R

To execute a regression analysis with multiple categorical variables you use the following command: *name of regression*<-lm(*dependent variable*~factor(*independent variable 1*) + factor(*independent variable 2*) + factor(*independent variable.....etc*), *variable of the dataset*). In the example of Figure 28 you see that the dependent variable Profit (Winst) is predicted by the independent variables WorkingGroup and TypeProject. For that the following command is used:
Regression<-lm(Profit~factor(WorkingGroup)+factor(ProjectType), Projects2).

By typing in the command Regression, the prediction model appears: Profit = 2706 + (0 A) + (1123 B) + (3274 C) + (-1121 Beta) + (-2563 Delta) + (-140 Gamma). Working group A and project type Alfa are the reference categories, so they add 0. Only the terms of the categories that apply should be included when calculating the formula. For example, if working group C is going to execute a project of type Gamma, the predicted profit is 2706 + 3274 - 140 = 5840.
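The same calculation can be done for all combinations of working group and project type. The sketch below uses only the coefficients quoted in the text. Alfa is assumed to be the reference category of the project type because it has no coefficient of its own in the model, and the variable names are chosen here for illustration.

```r
# Coefficients taken from the prediction model quoted in the text
intercept <- 2706
group_effect <- c(A = 0, B = 1123, C = 3274)                          # A = reference
type_effect <- c(Alfa = 0, Beta = -1121, Delta = -2563, Gamma = -140)  # Alfa = reference

# outer() adds every group term to every type term, giving a 3 x 4 table
predicted <- intercept + outer(group_effect, type_effect, "+")
predicted

# Working group C with project type Gamma, as in the example in the text
predicted["C", "Gamma"]
```

Every cell of the table is the intercept plus one working group term plus one project type term.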

By entering the command summary(Regression) the regression analysis is presented. By looking at the rows in the analysis you can see which terms are not significant. The example in Figure 28 shows that the row for the factor level Gamma is not significant. This means that the difference in profit between project type Gamma and the reference category Alfa cannot be reliably distinguished from zero, so predictions that rely on the Gamma term are not very reliable.
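The significance of each row can also be read out of the summary with code instead of by eye. The sketch below is an illustration on simulated data: the data frame toy and all its values are made up and have nothing to do with Projects2.csv, and the 0.05 threshold is a common convention, not something prescribed by this tutorial.

```r
# Simulated stand-in data (made up, NOT the Projects2 dataset)
set.seed(1)
toy <- data.frame(
  WorkingGroup = rep(c("A", "B", "C"), each = 8),
  ProjectType  = rep(c("Alfa", "Beta", "Gamma", "Delta"), times = 6)
)
toy$Profit <- 2000 + 1500 * (toy$WorkingGroup == "C") -
  1000 * (toy$ProjectType == "Delta") + rnorm(24, sd = 500)

fit <- lm(Profit ~ factor(WorkingGroup) + factor(ProjectType), data = toy)

# coef(summary()) returns the coefficient table as a matrix
coefs <- coef(summary(fit))
coefs

# TRUE marks rows whose p-value is below the conventional 0.05 threshold
coefs[, "Pr(>|t|)"] < 0.05
```

Rows marked FALSE correspond to terms like Gamma in Figure 28: their coefficients are not significantly different from zero.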

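The Adjusted R-squared of 0.5759 shows that the model explains roughly 58% of the variation in Profit, after correcting for the number of terms in the model. The quality of the prediction model is therefore reasonably high.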
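The Adjusted R-squared can be extracted from the summary and checked by hand. Because Projects2.csv is not available here, this sketch uses the warpbreaks dataset that ships with R as a stand-in, with two categorical predictors (wool and tension). The numbers it prints are therefore not the ones in Figure 28.

```r
# warpbreaks is a built-in R dataset; wool and tension are already factors
fit <- lm(breaks ~ wool + tension, data = warpbreaks)
s <- summary(fit)

# Adjusted R-squared as reported by summary()
s$adj.r.squared

# The same value by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of estimated terms
# excluding the intercept
n <- nrow(warpbreaks)
p <- length(coef(fit)) - 1
manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
manual
```

The correction matters for categorical variables, because every extra category adds a term to the model and the plain R-squared can only go up when terms are added.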



Figure 28: Interpreting the multiple regression analysis with categorical variables in R

- End of Part 2 -
