Saturday, December 21, 2013

1.2 Preparing the dataset for usage with R

Checking the data set


Data files come in different shapes and sizes. The introduction to this manual demonstrates how to convert an .xls file to a .csv file. Without the right packages installed, R may not be able to read .xls files. R can always read .csv files.
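
As a minimal sketch of that last point, the base R function read.csv() loads a .csv file without any extra packages. The file below is a hypothetical stand-in written to a temporary location, and the numbers in it are made up for illustration.

```r
# Write a small, made-up CSV file to a temporary location (illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Months,Roses,Tulips,Violets",
  "January,120,80,45",
  "February,150,95,60",
  "March,130,110,70"
), csv_path)

# read.csv() is part of base R, so no package needs to be installed.
# header = TRUE tells R that the first row holds the column names.
flowers <- read.csv(csv_path, header = TRUE)

# Inspect the column names and types that R assigned.
str(flowers)
```

For your own data, replace csv_path with the path to your file.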

Four criteria to check the data set

It is also important to check the content of the .xls file or .csv file to determine whether the data set
is well suited for analysis. To determine whether the data is of good
quality, the following four criteria can be used:

Accuracy:

Check the correctness and reliability of the data set.

Timeliness:

Check whether the data is up to date and whether it covers the right period of time.

Completeness:

Check whether any data is missing and whether the data set is large enough to perform analysis on.

Consistency:

Check whether the data uses the same values and terms across different data sets and data sources.
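
Parts of the completeness and consistency checks can be done in R itself. The sketch below is only one possible approach, using base R and a hypothetical data frame with made-up numbers, one missing value, and one misspelled month. Accuracy and timeliness usually still need a manual comparison against the original source.

```r
# Hypothetical data with one missing value and one misspelled month.
flowers <- data.frame(
  Months  = c("January", "Febuary", "March"),
  Roses   = c(120, NA, 130),
  Tulips  = c(80, 95, 110),
  Violets = c(45, 60, 70)
)

# Completeness: count missing values per column and the number of rows.
missing_per_column <- colSums(is.na(flowers))
row_count <- nrow(flowers)

# Consistency: list month names that do not match R's built-in month.name.
unknown_months <- setdiff(as.character(flowers$Months), month.name)

print(missing_per_column)
print(row_count)
print(unknown_months)
```

A non-zero count in missing_per_column or a non-empty unknown_months points to rows that should be corrected before the analysis.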


Figure 5: Example of a simple data set for analysis with R

Transformation

To analyze a data file with R, it is recommended to keep the file as simple as possible. Any extra text, colors, or images should be removed from the file if you want the analysis to go smoothly. This avoids potential errors and other complications in R. Figure 5 shows an example of the simple file Flowersales.csv. The file was converted from an .xls file to a .csv file in the previous section.

Remarks:

  1. To enable R to read other types of files, additional packages can be installed. The Packages page of this tutorial explains how to install packages. The actions performed with R in this manual do not require installing any packages.
  2. Important! In the figure you can see that the data set contains a row with the totals of the different flowers. R reads the first row of the .csv file as categories (in this case Months, Roses, Tulips and Violets) and the other rows as the data about these categories. R does not recognize the row Total as a summary: it treats Total as if it were a thirteenth month. I therefore recommend removing the row containing the totals of the categories. This prevents problems during the analysis. R can calculate the totals itself when you enter the appropriate commands in the R console.
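
The second remark can be sketched in base R as follows. The data frame is a hypothetical stand-in for the flower sales file, with made-up numbers and only three months. If the Total row has already been deleted from the file, only the colSums() step is needed.

```r
# Hypothetical data that still contains the Total row (made-up numbers).
flowers <- data.frame(
  Months  = c("January", "February", "March", "Total"),
  Roses   = c(120, 150, 130, 400),
  Tulips  = c(80, 95, 110, 285),
  Violets = c(45, 60, 70, 175)
)

# Keep every row except the one labelled "Total".
flowers_clean <- flowers[flowers$Months != "Total", ]

# Let R compute the totals for the numeric columns.
totals <- colSums(flowers_clean[, c("Roses", "Tulips", "Violets")])
print(totals)
```

Computing the totals in R keeps them in sync with the data, so they do not need to be stored in the file.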
