Introduction to Data Analysis

2.1 Data Table

Data analysis begins with, well, data. Analyze the data values for at least one variable, such as the company’s employee annual salaries. Organize the data values into a specific kind of structure from which analysis proceeds. To use any data analysis system such as R, organize the data values into a table.

Video: Data Table [3:24]

Data table

Organize data values into a rectangular data table with the name of each variable at the top of a column followed by its data values in the remainder of the column.

Store the structured data values within a computer file located on your computer, an accessible local network, or the world wide web. Encode the data table in one of a variety of computer file formats. The formats we encounter are Excel files, indicated by a file type of .xlsx, and text files in the form of comma-separated value files (csv). Identify a text file with one of several potential file types, such as .txt, but usually .csv.

The Excel data table in Figure 2.1 contains four variables: Years, Gender, Dept, and Salary, plus an ID field called Name, for a total of five columns. Figure 2.1 displays their data values for the first six employees.

Describe the data table by its columns, rows, and cell entries.

Variable name

A short, concise word or abbreviation that identifies a column of data values in a data table.

Analysis of data can proceed only after you identify the data table and the relevant variables within it.

Analyze variables

All R functions analyze the data values within a data table for one or more specified variables, identified by their names, such as Salary.

Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.
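The following minimal sketch illustrates why capitalization matters. It assumes a tiny, hypothetical data frame with one variable named Salary, built directly in base R rather than read from a file.

```r
# Hypothetical data frame with a single variable named Salary
d <- data.frame(Salary = c(53788.26, 94494.58, 111074.86))

"Salary" %in% names(d)   # TRUE: the name matches exactly
"salary" %in% names(d)   # FALSE: lowercase s is a different name
is.null(d$salary)        # TRUE: no column with that spelling exists
mean(d$Salary)           # the analysis works with the exact name
```

R treats Salary and salary as two different names, so a reference to the misspelled name finds no data.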

Data value

The contents of a single cell of a data table, a specific measurement, except for the first row, which (usually) contains the variable names.

The term variable was chosen because the data values for a variable vary. Data analysis is the analysis of that variability, both among the data values of a single variable and in the relationships among the data values of different variables.

Variables define the columns of a data table. What about the rows?

Observation

A row of the data table that contains the data for a specific instance of a single person, organization, place, event, or whatever is the object of analysis.

Unfortunately, the reference for the rows of the data table is not standardized. Observations are also referred to as cases, examples, samples, and instances.

Consider employee Darnell Ritchie. He has worked at the company for seven years, identifies as a man, and works in administration with an annual salary of $53,788.26. Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.
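To make the vocabulary of columns, rows, and cells concrete, here is a sketch that builds the same six rows directly in base R. The values are copied from the head(d) listing in Section 2.3. Building the table by hand is only for illustration, as the chapter reads the data from a file instead.

```r
# The first six employees, typed in by hand for illustration only
d <- data.frame(
  Name   = c("Ritchie, Darnell", "Wu, James", "Hoang, Binh",
             "Jones, Alissa", "Downs, Deborah", "Afshari, Anbar"),
  Years  = c(7, NA, 15, 5, 7, 6),
  Gender = c("M", "M", "M", "W", "W", "W"),
  Dept   = c("ADMN", "SALE", "SALE", NA, "FINC", "ADMN"),
  Salary = c(53788.26, 94494.58, 111074.86, 53772.58, 57139.90, 69441.93)
)

names(d)         # the variable names, one per column
dim(d)           # number of rows (observations) and columns (variables)
d[1, ]           # one observation: the row for Darnell Ritchie
d$Salary         # one variable: a column of data values
d[1, "Salary"]   # one data value: a single cell
```

The two NA entries correspond to the missing values for James Wu's years and Alissa Jones's department.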

2.2 Read Data into R

To begin an analysis, read your data stored as a computer file into R. Your data, organized as a data table, exists as a data file stored on a computer system: your computer or a network, including the web. Here is how you access your data.

Read your data into R, then analyze

Read the data table into an R data frame (table) with the Read() function, then analyze specific variables in that data table, each referenced by its name.

The data table can exist in one of many different formats, including Excel. Figure 2.2 shows a data table as an Excel file named employee.xlsx stored on a (Macintosh) computer. Figure 2.1 shows the first several lines of this data table in detail.

Figure 2.2: Data table, named *employee.xlsx*, stored as an Excel file.

The data table is stored as a computer file. To analyze your data, read the data table from the computer file, which copies the data into a corresponding data table within a running R session.

Data frame

R's (and Python's) name for a data table stored within an active R (or Python) session, referenced by its name.

In the function call to read the data, reference the data table stored on a computer system, including the web, by its file name and location. Each variable in a data table has a name, and so does the data table itself. When read into R, name the data table, the R data frame, with a name of your choice. Regardless of the file name of your data on your computer system, typically name a data table within the active R session, the data frame, as simply d for data. Not only is d easy to type, but it is also the lessR default data frame name for the data processed by its various analysis functions.

When analyzing data read into R, the same data exists in two locations: a computer file on your computer system and an R data frame within a running R app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within an active R session, R identifies the same data from the data file by its data frame name as read into R, such as d.

Analogous to multiple Excel worksheets in a single Excel file, a running R session can contain multiple data frames, limited only by the amount of available memory.
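The distinction between a data file and a data frame can be sketched as follows. This sketch uses the base R function read.csv() and a small temporary csv file as stand-ins, so it runs without lessR or an Excel file. The names path and d2 are hypothetical.

```r
# Write a small stand-in data file to a temporary location
path <- tempfile(fileext = ".csv")
write.csv(data.frame(Years = c(7, 15, 5)), path, row.names = FALSE)

# Read the file: the same data now also exists inside the R session as d
d <- read.csv(path)

file.exists(path)    # the data on the computer system, found by its path name
is.data.frame(d)     # the data inside R, found by its data frame name

# A second, unrelated data frame can exist in the same session
d2 <- data.frame(Dept = c("ADMN", "SALE"))
sapply(list(d = d, d2 = d2), nrow)   # number of rows in each data frame
```

The file name is arbitrary here, which underlines the point that the data frame name d is chosen independently of the file name.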

Call a function to read the data from a file into a data frame of a running R session. Multiple read functions are available, both from R as downloaded and from different packages. We use the lessR function Read() for its simplicity and helpful output to better understand the data that R reads into a data frame.

2.2.1 Browse for the Data Table File

To read the data, direct R to the location of the data file. R cannot read the data file until it knows where the data is stored. One option is to browse for the location of the data file on your computer system. You navigate your file system until you locate the file.

Browse to locate your data file to read

To locate your data file by browsing through your file system, call the Read() function with an empty file reference, (""), nothing between the quotes: Read("").

As with all R (and Excel and Python and everything else) functions, the call to invoke the function includes a matching set of parentheses. Information within the parentheses specifies the information provided to the function for analysis.

If you are running R/RStudio in the cloud, your “local” computer is your cloud account, not the computer from which you are accessing the cloud. That “computer” could be any device, such as a tablet or an iPhone that does not even run R. First upload your data file to your cloud account, as shown in the previously referenced cloud directions.

The following Read() statement reads the data stored as a rectangular data table from an external file stored on your computer system such as an Excel file.

Video: Read Data [3:35]

Example 2.1

d <- Read("")

We need a way to instruct R where to store the data it reads. The Read() function does the reading from an external data file into R, but that data needs to be stored in a data frame within R. The above Read() statement reads the data from the file into an R data frame called d.

Assignment statement <-

The <- indicates to assign what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame stored within the R session.

The text output of any R function goes somewhere. If you do not specify an object to receive that output, it goes to the console. Doing a Read() without assigning the output to a data frame dumps the contents of the data frame to the R console, without access for later analysis.

You can also use an ordinary equals sign, =, to indicate the assignment, but the <- shows the flow of information in the assignment, and is more widely used by R practitioners.
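A short sketch of the difference between reading with and without assignment follows. It substitutes the base R function read.csv() and a temporary csv file for Read() and an Excel file, so that it runs on its own. The names path and d2 are hypothetical.

```r
# Create a small stand-in data file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(Years = c(7, 15, 5)), path, row.names = FALSE)

read.csv(path)         # no assignment: the contents print and are not stored

d  <- read.csv(path)   # <- stores the data that was read as the data frame d
d2 =  read.csv(path)   # = also assigns, though <- is the more common style

identical(d, d2)       # TRUE: both forms of assignment store the same data
```

Only the assigned versions leave a data frame in the session that later function calls can analyze.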

2.2.2 Specify Location of the Data Table File

Another way to locate a data file is to explicitly specify its location within the quotes and parentheses of the Read() function call. Specify either the full path name of a file on your computer system, or a web address that locates the data table on the web. Again, read the data into the d data frame, remembering to include the quotes.

Read data from a specified location

d <- Read("path name" or "web address")

With Excel, R, or any other computer apps that process data, enclose character string values, such as a file name or web address (URL), in quotes. For example, to read the data from the web data file employee.xlsx into the data frame d, invoke the following Read() function call.

Example 2.2

d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")

To specify the location of a data file on your computer, provide the full path name that locates and names your data file. To obtain this path name, first browse for the file with Read(""). The resulting output displays the path name of the identified file. Copy this path name and insert it between the quotes of Read(""). Save this and other R function calls in a text file so that future analyses do not require browsing for the file's location.
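One way to guard against a mistyped path name is to check that the file exists before reading it. The sketch below uses only base R functions, with read.csv() standing in for Read(). The folder, the file name employee.csv, and the name data_file are hypothetical.

```r
# Hypothetical location: join a folder and a file name into a full path name
data_file <- file.path(tempdir(), "employee.csv")

# Create a small stand-in file so that the sketch runs anywhere
write.csv(data.frame(Years = c(7, 15, 5)), data_file, row.names = FALSE)

# Confirm that a file exists at that location before reading it
if (file.exists(data_file)) {
  d <- read.csv(data_file)
} else {
  stop("No file found at: ", data_file)
}

nrow(d)   # number of rows read
```

Saving lines such as these in a text file preserves the path name for later sessions.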

In summary, with the Read() function, either put nothing between the quotes to browse for a data file, or specify the data file’s location on your computer system or the web. Direct the data read from a file into an R data frame, usually named d, though you can choose any valid name.

2.2.3 Output of `Read()`

R organizes analyses by variable name, so knowing the exact variable names is critical. This specification includes the pattern of capitalization. The Read() function automatically displays these names. The variables are in the columns, so to specify a variable is to select a column of data values.

Read() also displays how each variable is stored in the computer: as numbers with or without decimal digits, or as character strings in this example. Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values. Figure 2.3 lists the output from reading the employee.xlsx data file.

Figure 2.3: Annotated output of Read() function with the Variable Name column highlighted.

Always compare the output of Read() with the actual data file to ensure that your data was correctly read. Never analyze data read into R or any other system without first ensuring that the data values in the data table stored on some computer system correspond to the variables and data values read into an R data frame.

To allow for the display of many variables, Read() lists the information for each variable in a row. Of course, the data file organizes the variables by column. Compare the output of Read() with the description of the data file in Figure 2.1 and Figure 2.2.
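A rough base R approximation of this kind of per-variable summary is sketched below. It is not the Read() output itself, only an illustration of the same information, with one row per variable. The data values are taken from the head(d) listing in Section 2.3, and the name info is hypothetical.

```r
# Three of the variables for the first six employees
d <- data.frame(
  Years  = c(7, NA, 15, 5, 7, 6),
  Dept   = c("ADMN", "SALE", "SALE", NA, "FINC", "ADMN"),
  Salary = c(53788.26, 94494.58, 111074.86, 53772.58, 57139.90, 69441.93)
)

# One row of information per variable
info <- data.frame(
  Variable = names(d),
  Type     = sapply(d, class),                                  # storage type
  Missing  = colSums(is.na(d)),                                 # missing values
  Unique   = sapply(d, function(x) length(unique(na.omit(x)))), # distinct values
  row.names = NULL
)
info
```

Comparing such a summary against the original file is a quick check that each variable was read with the expected name and type.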

2.3 Display the Data

To analyze data, first understand the data. You should know what the data values look like for each variable, and you should know the variable names. The output of the lessR function Read() assists this understanding, but often you want to view the data directly.

Examine your data

After reading the data into R, you should view all or some of the contents of the newly created data frame to better understand and verify your data.

One way to view the contents of any R object, of which there are many types, is to enter the name of the object at the console, in response to the command prompt >.

Video: Display the Data [1:49]

Example 2.3

d

Of course, for even medium-sized data tables we typically do not need or want to view the entire data table. Use the R head() function to list the variable names and, by default, the first six rows of data, here for the data frame d.

R also provides a corresponding function tail() that lists the data values at the end of the data frame.

Example 2.4

head(d)

              Name Years Gender Dept    Salary JobSat Plan Pre Post
1 Ritchie, Darnell     7      M ADMN  53788.26    med    1  82   92
2        Wu, James    NA      M SALE  94494.58    low    1  62   74
3      Hoang, Binh    15      M SALE 111074.86    low    3  96   97
4    Jones, Alissa     5      W <NA>  53772.58   <NA>    1  65   62
5   Downs, Deborah     7      W FINC  57139.90   high    2  90   86
6   Afshari, Anbar     6      W ADMN  69441.93   high    2 100  100

Compare this output, the representation of the data within R, to the data table in Figure 2.1 as an Excel file. Same data, different locations.
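Both head() and tail() accept an n argument to change the number of rows listed. The sketch below assumes a small data frame built by hand from two of the columns shown above.

```r
# Two variables for the six employees listed above
d <- data.frame(
  Name  = c("Ritchie, Darnell", "Wu, James", "Hoang, Binh",
            "Jones, Alissa", "Downs, Deborah", "Afshari, Anbar"),
  Years = c(7, NA, 15, 5, 7, 6)
)

head(d, n = 3)   # the first three rows instead of the default six
tail(d, n = 2)   # the last two rows
```

For a large data frame, viewing both ends is a quick check that the entire file was read.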

Another option to view the data read into R invokes the Base R View() function, which works directly from within RStudio.

View(d)

One advantage of this form of viewing the data is that you can view much data just by scrolling.

Figure 2.4 shows the display of data within RStudio with View(), with the scroll bar at the right side of the window pane.

Figure 2.4: The `View()` data display from within RStudio.

The separation of data from the instructions to process that data is a welcome benefit of R over Excel. You should, however, view your data on a regular basis in order to understand what you are analyzing.

When something does not work the way you expected it to work, look at your data! Often the problem arises because the computer stored your data differently from the way you thought the data would be stored. Viewing your data as you proceed with an analysis is a crucial step toward successful analysis.

To fix an error, begin with your data

Instead of trying to fix a problem by guessing, first look at your data.

A discrepancy between what you thought was your data and what actually is your data is often the source of an error in a data analysis.

Note also the representation of missing data within an R data frame.

R missing data code

NA and <NA>, for not available, indicate missing data for numerical and non-numerical variables, respectively.

The blank cells in the Excel file, Figure 2.1, are replaced with either NA for the numerical variable Years, or <NA> for the variable Dept with non-numerical values.
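The following sketch shows how missing values appear and how to count them with base R functions. The data frame is built by hand from the Years and Dept values listed by head(d) above.

```r
# Years and Dept for the first six employees, each with one missing value
d <- data.frame(
  Years = c(7, NA, 15, 5, 7, 6),
  Dept  = c("ADMN", "SALE", "SALE", NA, "FINC", "ADMN")
)

d                             # Years displays NA, Dept displays <NA>
is.na(d$Years)                # TRUE marks each missing value
colSums(is.na(d))             # number of missing values for each variable
mean(d$Years)                 # NA: a missing value propagates by default
mean(d$Years, na.rm = TRUE)   # 8: the mean of the five recorded values
```

Many base R functions return NA when given missing data unless told to remove it, which is one more reason to check for missing values before an analysis.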

2.4 Two Types of Variables

Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.

Continuous (quantitative) variable

A numerical variable with many possible values.

Categorical (qualitative) variable

A variable with relatively few unique labels as data values.

Examples of continuous variables are Salary or Time, defined on a numerical scale with many unique values. Examples of categorical variables are Gender or State of Residence. Each categorical variable has relatively few possible values compared to a continuous variable. This distinction of continuous and categorical variables is common to virtually every data analysis project.

Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be quantitative or qualitative. For example, sometimes Man, Woman, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. Best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as Gender, for example, with M, W, and O for Other.

To distinguish between continuous and categorical variables, determine if the values of a variable are on a numerical scale, presumably with a relatively large number of unique values that can be ordered from smallest to largest value. Coding a categorical variable such as Gender numerically, however, does not imply that the values are on a numerical scale. For example, Woman, coded as 1, is not more than Man, coded as 0, or vice versa.

Recognize integer categorical variables

A variable with integer data values is categorical if there is a small number of unique data values compared to the total number of data values.

For example, Gender encoded as 0, 1, and 2 has three possible values, and usually tens if not hundreds of rows of data, each of which contains one of those three values.
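This rule of thumb can be sketched in base R by counting unique values for each variable. The data below are hypothetical, with Gender coded as 0, 1, and 2 as in the text. The conversion to a factor is one possible way to relabel the codes, not a required step.

```r
# Hypothetical data: Gender coded as integers, Salary on a numerical scale
d <- data.frame(
  Gender = c(0, 1, 1, 0, 2, 1, 0, 1, 0, 1),
  Salary = c(53788.26, 94494.58, 111074.86, 53772.58, 57139.90,
             69441.93, 61250.10, 48300.75, 72810.40, 66120.00)
)

# Few unique values relative to the number of rows suggests a categorical variable
n_unique <- sapply(d, function(x) length(unique(x)))
n_unique             # Gender has 3 unique values, Salary has 10
n_unique / nrow(d)   # proportion of unique values for each variable

# Replace the integer codes with non-numeric labels
d$Gender <- factor(d$Gender, levels = c(0, 1, 2), labels = c("M", "W", "O"))
table(d$Gender)      # count of rows in each category
```

With realistic data of hundreds of rows, the contrast between three unique values for Gender and nearly as many unique values as rows for Salary becomes even sharper.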
