Saturday, December 21, 2013

1.4 The basic functions of R for analyzing data

The previous chapter explained how to create a variable for a data file or data set. It also explained how the command attach() makes the columns of that data set available by name.

Basic R function: Making a summary of the data set.

The command summary(*name of the variable*) prints a summary of the data set that is stored in that variable. Figure 7 shows an example of a summary of the dataset Bloemenverkoop.csv (the Dutch version of Flowersales.csv).

Figure 7: Summary of the dataset in R

Because the dataset is stored in a variable, the summary can be requested with the command summary(Bloemenverkoop). After pressing Enter, the summary of the dataset appears in the R console. For the category Month, R simply counts how often each month appears in the dataset. This is because the category Month does not contain numbers, only the names of the months. For this reason Month is not a numeric variable but a categorical variable.
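As a minimal sketch of what summary() reports, the example below builds a small data frame by hand instead of reading Bloemenverkoop.csv. The name flowers and all values are made up for illustration. Month is stored as a factor on purpose: recent versions of R (4.0 and later) read text columns as plain character strings by default, and summary() only counts the occurrences of each month when the column is a factor.

```r
# Hypothetical stand-in for the data set; all values are invented.
flowers <- data.frame(
  Month  = factor(c("january", "february", "march", "april")),
  Roses  = c(10, 40, 20, 30),
  Tulips = c(5, 15, 25, 35)
)

# Month is categorical, so summary() counts how often each month occurs.
# Roses and Tulips are numeric, so summary() reports Min., the quartiles,
# Mean and Max. for them.
summary(flowers)

# The summary of a single column can be requested as well.
roses_summary <- summary(flowers$Roses)
roses_summary
```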

Notice in the example in Figure 7 that the variable Bloemenverkoop (Flowersales) itself contains a set of variables, namely the columns of the dataset: Maand (Month), Rozen (Roses), Tulpen (Tulips) and Viooltjes (Violets).

For the categories Roses, Tulips and Violets the following information is displayed:
  • Min. (Minimum)
  • Max. (Maximum)
  • 1st Qu., 3rd Qu. and Median
  • Mean

Min. and Max.

Min. shows the smallest value of the specific variable. The example in Figure 7 shows that in the weakest month 12 roses (Rozen) were sold. There is no month in which fewer roses were sold, since 12 is the minimum. The same holds for Max., which shows the highest sales in a single month.

1st Qu., 3rd Qu. and Median:

This part gives the first and third quartile of the category. The first quartile is shown by 1st Qu. It is the value below which the lowest 25% of the values fall. The example in Figure 7 shows that the 1st Qu. for roses (Rozen) is 29. The third quartile, 3rd Qu., is the value above which the highest 25% of the sales numbers fall; the example in Figure 7 shows that this is 58.25.
Median shows the median of the variable, which is the second quartile. If the sales numbers are sorted from low to high, the median is the middle value.
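The quartiles can also be requested one at a time with quantile() and median(). The sketch below uses invented sales figures, not the values from the data set. When a quartile falls between two observations, R interpolates between them by default, which is why a value such as 58.25 can appear even though only whole flowers are sold.

```r
# Invented monthly sales figures.
sales <- c(50, 10, 40, 20, 30)

sort(sales)            # 10 20 30 40 50
quantile(sales, 0.25)  # first quartile: 20
median(sales)          # second quartile (median): 30
quantile(sales, 0.75)  # third quartile: 40
```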

Mean

Mean shows the average of the category. In the example in Figure 7 we can see that the average number of roses sold per month is 57.42.
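For roses the mean (57.42) lies close to the third quartile (58.25), which can happen when one month is much stronger than the others. The sketch below illustrates that effect with invented numbers: a single peak pulls the mean upward while the median stays where it is.

```r
# Invented sales figures with one exceptionally strong month.
sales <- c(10, 20, 30, 40, 200)

mean(sales)    # 60: pulled upward by the value 200
median(sales)  # 30: not affected by how large the peak is
```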

Showing and analyzing descriptive statistics with R:

You can use R to analyze a data set with different kinds of statistical functions in an easy way. Thanks to the earlier steps in the R console, such as creating a variable for the data set and attaching it, these statistical functions can be applied with simple commands. This section gives a few examples of functions that can be used to analyze the file Flowersales.csv (Bloemenverkoop.csv).

Plotting a histogram in R : hist()

The command hist(*variable*) draws a histogram of the variable you choose. Figure 8 shows a histogram of the category Violets (Viooltjes). It was made with the following command in the R console: hist(Viooltjes).
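A minimal sketch of the same idea with invented numbers is given below. The vector violets and its values are assumptions for illustration only. Besides drawing the chart, hist() returns the details of the bins, which can be useful to check what the bars represent.

```r
# Invented monthly sales of violets.
violets <- c(12, 13, 21, 25, 30, 32, 35, 38, 41, 44, 52, 60)

# hist() draws the histogram and (invisibly) returns the bin details.
h <- hist(violets, main = "Violets sold per month", xlab = "Number sold")

h$breaks  # the bin boundaries that R chose
h$counts  # how many months fall into each bin
```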

Figure 8: Showing a histogram about a variable in R

Plotting a graph in R : plot()

The command plot(*variable x*, *variable y*) lets R draw a graph of two variables of your own choice. Figure 9 shows a graph of the sales of Tulips (Tulpen) for each month of the year. It was made with the following command in the R console: plot(Maanden, Tulpen).
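The sketch below shows one possible way to draw such a graph with invented numbers. The names month_number and tulips are hypothetical. Month numbers are used on the x-axis because plot() draws box plots instead of points when the x variable is a factor of month names.

```r
# Invented tulip sales for the first six months of a year.
month_number <- 1:6
tulips <- c(20, 54, 110, 240, 180, 60)

# type = "b" draws both the points and the lines that connect them.
plot(month_number, tulips, type = "b",
     xlab = "Month", ylab = "Tulips sold")
```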

Figure 9: Plotting a graph in R

Plotting a pie chart in R : pie()

The command pie(*variable*) makes R draw a pie chart of the category or variable you choose. Figure 10 presents a pie chart of the variable Roses (Rozen). Notice that most roses are sold in month 2, February, probably because of Valentine's Day.
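As a small sketch with invented values, the example below gives each slice a month name so the chart is easier to read. The vector roses and its numbers are assumptions, chosen so that February is the largest slice.

```r
# Invented rose sales for four months, with a peak in February.
roses <- c(10, 40, 20, 30)
names(roses) <- c("january", "february", "march", "april")

# pie() labels the slices with the names of the vector.
pie(roses, main = "Roses sold per month")

# The month with the largest slice.
names(which.max(roses))
```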

Figure 10: A pie chart in R

Analyzing data with other simple statistical functions in R

Because the data set is stored in a variable and attached, short commands are enough to make R perform different statistical functions on the data. After each command you only add the variable on which you want to perform the calculation, and the result is shown after pressing Enter. The Appendix of this manual lists the codes for the specific statistical functions. This part of the tutorial explains the most commonly used functions in statistics.
Notice that R is case sensitive, so pay attention to upper and lower case when typing statistical functions. For example, Figure 11 shows the error message that appears when the command max is typed with an uppercase letter.
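The case sensitivity can be tried out safely with the sketch below. The vector values is invented, and tryCatch() is used only so that the error message is captured instead of stopping the script; in the console you would simply see the error.

```r
values <- c(3, 9, 5)

max(values)  # works: 9

# Max() with a capital M is not a base R function, so calling it fails.
message_text <- tryCatch(Max(values), error = function(e) conditionMessage(e))
message_text
```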

Figure 11: Different statistical functions for analyzing data in R

Total : sum()


The sum of all values of a variable can easily be calculated in R with the command sum(*variable*). Figure 11 shows that the total sales of violets (Viooltjes) are found by entering the code sum(Viooltjes). The result is 403, which means that over the whole year (the sum of all months) a total of 403 violets were sold.
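A short sketch with invented values follows. It also shows what happens when a value is missing (NA), which can occur when a cell in the csv file is empty: sum() then returns NA unless na.rm = TRUE is added.

```r
violets <- c(2, 4, 6, 8)  # invented values
sum(violets)              # 20

# With a missing value the total is NA, unless missing values are removed.
with_gap <- c(2, NA, 6)
sum(with_gap)                # NA
sum(with_gap, na.rm = TRUE)  # 8
```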

Range : range()


The command range(*variable*) presents the range, that is the minimum and maximum, of the specific variable. Figure 11 shows this for the variable Roses (Rozen), which gives the result 12 216. The minimum and maximum number of roses sold in a month of that year are 12 and 216 respectively.
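The sketch below, again with invented values, shows that range() returns two numbers and that diff() can turn them into the distance between the lowest and highest value.

```r
roses <- c(10, 40, 20, 30)  # invented values

r <- range(roses)
r        # 10 40
diff(r)  # 30: the distance between minimum and maximum
```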

Standard deviation and variance : sd() and var()


The standard deviation of a variable can be calculated in R with the command sd(*variable*) and the variance with the command var(*variable*). In Figure 11 the standard deviation of the variable Tulips (Tulpen) is shown after entering the command sd(Tulpen). This gives the result 93.08008. The variance of the category Tulpen is found with the code var(Tulpen) and gives 8663.902 as result. Notice that the standard deviation is the square root of the variance.
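The relation between the two can be checked directly. The sketch below uses an invented vector x and then repeats the check on the rounded numbers quoted above. Note that var() and sd() in R compute the sample versions, which divide by n - 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # invented values

var(x)  # sample variance
sd(x)   # sample standard deviation

# The standard deviation equals the square root of the variance.
all.equal(sd(x), sqrt(var(x)))  # TRUE

# The same check on the rounded figures from the text.
sqrt(8663.902)  # approximately 93.08008
```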

Correlation : cor()


With the correlation function you can measure the strength of the linear relationship between two variables, in this case the sales of two types of flowers. A correlation of 1 stands for a perfect positive relationship and a correlation of -1 for a perfect negative relationship. A correlation of 0 means there is no linear relationship. The value of a correlation therefore always lies between -1 and 1.

To show the correlation between two variables you enter the following command in the R console: cor(*variable 1*,*variable 2*). Figure 11 shows the correlation between the sales of roses (Rozen) and violets (Viooltjes), which gives the result 0.5384868. This suggests a moderate positive relationship between the sales of roses and violets: months with higher rose sales tend to have higher violet sales as well.
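The three reference values 1, -1 and 0 can be reproduced with a small sketch. All four vectors below are invented and chosen so that the outcomes are easy to verify by hand.

```r
roses   <- c(10, 20, 30, 40)  # invented values
rising  <- c(5, 10, 15, 20)   # rises exactly in step with roses
falling <- c(40, 30, 20, 10)  # falls exactly as roses rises
mixed   <- c(30, 10, 10, 30)  # no linear relation with roses

cor(roses, rising)   # 1
cor(roses, falling)  # -1
cor(roses, mixed)    # 0
```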

Finding and displaying values and data from the matrix of the data set in R

If you want to show only part of the data of a category, there are several commands you can use in the R console. First you type the name of the variable. Directly after it you type the number of the row and/or column that you want to show [ between square brackets ]. The following examples show how this works.

Interval : *name of the variable of the data set*[interval]


The figure shows that the command Rozen[2:5] is used to show the sales amounts of the months February through May. These sales were 216, 23, 31 and 41 respectively.
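A few variations on this kind of selection are sketched below with an invented vector. The colon creates a sequence of positions, c() picks separate positions, and a negative number leaves a position out.

```r
roses <- c(10, 40, 20, 30, 50, 60)  # invented values for six months

roses[2:5]      # months 2 through 5: 40 20 30 50
roses[c(1, 6)]  # only months 1 and 6: 10 60
roses[-1]       # everything except month 1: 40 20 30 50 60
```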

Rows and columns 


If you want to find information about a certain row, you can use the code *name of variable of the data set*[*number of the row*,]. Figure 11 shows the data for the month February (row 2) with the code Bloemenverkoop[2,] ( or Flowersales[2, ] ). The result shows that in February 216 roses, 54 tulips and 13 violets were sold. If the rows are named after the months, the same result can be obtained by name, with the name between quotation marks: Bloemenverkoop["februari", ] ( or Flowersales["february", ] ).
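Selecting a row by month name only works under certain conditions, so the sketch below shows three ways to reach the same row. The data frame flowers and its values are invented. A month name typed without quotation marks is treated by R as the name of an object, which usually does not exist.

```r
# Hypothetical stand-in for the data set; all values are invented.
flowers <- data.frame(
  Month  = c("january", "february", "march"),
  Roses  = c(10, 40, 20),
  Tulips = c(5, 15, 25)
)

# 1. By row number.
flowers[2, ]

# 2. By a condition on the Month column.
flowers[flowers$Month == "february", ]

# 3. By row name, after naming the rows after the months.
rownames(flowers) <- flowers$Month
flowers["february", ]
```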

You can also find information about columns, in this case the categories of flowers. Figure 11 shows that the code Bloemenverkoop[,3] ( or Flowersales[,3] ) shows the third column, in this case tulips (Tulpen). The same can be done with the command Bloemenverkoop[,"Tulpen"] ( or Flowersales[,"Tulips"] ).

With this method you can easily look up specific information in a data set. For example, if you want to know how many roses were sold in November and the rows are named after the months, you use the command Flowersales["november","Roses"].
Please notice that you always have to use the name of the variable that is connected to the data set. The name of the Matrix, or m, does not work and will produce an error.
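Column and single-value lookups can be sketched as follows. The data frame flowers and its values are invented; the lookup by condition on the Month column is shown because it does not depend on how the rows are named.

```r
# Hypothetical stand-in for the data set; all values are invented.
flowers <- data.frame(
  Month  = c("october", "november", "december"),
  Roses  = c(30, 25, 45),
  Tulips = c(0, 0, 5)
)

# A column by position, by name, or with the $ shortcut: all three are equal.
flowers[, 3]
flowers[, "Tulips"]
flowers$Tulips

# A single value: the roses sold in November.
november_roses <- flowers[flowers$Month == "november", "Roses"]
november_roses  # 25
```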

Remarks

  1. When calculating the median, if there is no single value in the middle, the mean of the two middle values is presented as the median.
  2. This section shows that it is better to remove the row of totals from the original Excel or csv file. If you leave the row of totals in the file, it is included when using the command sum, which gives a result that is twice as big.
  3. The standard deviation and variance are mostly used in advanced statistical methods. For non-statisticians these values are less important, but they might be interesting to know.
  4. When looking up information about a category, place the name of the category/variable between quotation marks.
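Remarks 1 and 2 can be checked with a short sketch. The numbers are invented, and with_total is a hypothetical column in which the last entry is a row of totals that was left in the file.

```r
# Remark 1: with an even number of values, the median is the mean
# of the two middle values (here 20 and 30).
median(c(10, 20, 30, 40))  # 25

# Remark 2: a totals row that is left in the data doubles the sum.
violets    <- c(2, 4, 6, 8)
with_total <- c(violets, sum(violets))  # last entry is the totals row

sum(violets)     # 20
sum(with_total)  # 40, twice as big

# Dropping the last entry restores the correct total.
sum(head(with_total, -1))  # 20
```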







