Intro R¶
R is a programming environment for statistics and graphics
- Does basically everything, can also be extended
- It’s the default when statisticians implement new methods
- Free, open-source
But
- Steeper learning curve than e.g. Excel, Stata
- Command-line driven (programming, not drop-down menus)
- Gives only what you ask for!
1. What is RStudio?¶
Often, learning a programming language is made worse by an unintuitive and unhelpful user interface. For our workshop, we will be using RStudio, a graphical user interface (front-end) for R that is slightly more user-friendly than ‘Classic’ R’s GUI.
1.1 The console window¶
There are some useful features available in the console:
- Use the UP arrow to see past commands you have typed
- Help window appears as you type to help you complete your thought
- Tab autocomplete
- When just typing, use tab autocomplete to fill in object names for you
- When inside quotes ' ', you can press tab to help you spell folder names correctly
1.2 Trying out the Console¶
We’ll use the Console window first – as a (fancy!) calculator
2+2
# [1] 4
2^5+7
# [1] 39
2^(5+7)
# [1] 4096
exp(pi)-pi
# [1] 19.9991
log(20+pi)
# [1] 3.141632
0.05/1E6 # a comment; note 1E6 = 1,000,000
# [1] 5e-08
- All common math functions are available; parentheses (round brackets) work as per high school math
- Try to get used to bracket matching. A + prompt means the line isn’t finished – hit Escape to get out, then try again.
We can also compare things in R using operators.
We can see if things are equal by using two equal signs ==. To see if two things are NOT equal, we use !=. On its own, the exclamation mark ! negates a logical value, turning TRUE into FALSE and vice versa.
3 == 2
# [1] FALSE
3-1 == 2
# [1] TRUE
TRUE == TRUE
# [1] TRUE
TRUE == FALSE
# [1] FALSE
'a' == 'a'
# [1] TRUE
'abc' != 'ABC'
# [1] TRUE
!is.na(NA)
# [1] FALSE
Note
We can represent missing data with NA, and use a function/command called is.na() to ask if data is missing. More on this later.
We can use greater than > or less than < signs as you would expect.
300 > 200
# [1] TRUE
0 > 999
# [1] FALSE
- Exercise - 1
Which of the following will NOT return TRUE?
- FALSE == FALSE
- 10-5 == sqrt(25)
- TRUE > FALSE
- 'a' > 'b'
1.3 Storing Data¶
We can quickly make comparisons, but we usually want to do things more sophisticated than that. For example, instead of typing “This is an important string that we want to do analysis on” into the console over and over again, we might want to give it a shorter name and then reference it later.
# Use the **UP** arrow to see past commands you have typed
x <- "This is an important string that we want to do analysis on"
This shows up in the Environment tab in RStudio. This is very useful, because now when we want to print out this string, we can just type x into the Console.
x
# [1] "This is an important string that we want to do analysis on"
R stores data (and everything else) as objects. New objects are created when we assign them values;
x <- 3
y <- 2 # now check the Environment window
x+y
# [1] 5
1.4 Using the script window¶
While fine for occasional use, entering every command by hand is error-prone, and quickly gets tedious. A much better approach is to use a Script window – open one with Ctrl-Shift-N, or the drop-down menus
- Opens a nice editor, enables saving code (.R extension)
- Run current line (or selected lines) with Ctrl-Enter, or Ctrl-R
Important
From now on, we assume you are using a script editor.
- First-time users tend to be reluctant to switch! – but it’s worth it, ask any experienced user
- Scripts make it easy to run slightly modified code, without re-typing everything – remember to save them as you work
- Also remember the Escape key, if e.g. your bracket-matching goes wrong
For a very few jobs, e.g. changing directories, we’ll still use drop-down menus. But commands are available, for all tasks.
We can save our scripts wherever we want, but it makes it easier if we set a working directory in R. This makes it easier to find files, and also can make research more reproducible because it gives you the ability to share data structure with a collaborator.
Before we can set the working directory, we need to know where we are on our computer right now. Just like the command line’s pwd command, R has a command called getwd(). Notice that it returns the absolute path of the current working directory, which in this example is the home directory.
getwd()
# [1] "/Users/lori"
You can point to files from anywhere on the computer RELATIVE to your current position. If you need to change this working directory, such as to go into the new folder, you can do so with setwd(). Let’s try this. Make sure you put the path in quotes.
You can use tab complete in RStudio, so once you open the quotes, press tab to see all the files and directories listed for you. If you type a letter, that list will shorten.
Note
You can also use the Files tab in RStudio. Your home directory can be found by clicking the Home button.
setwd("/Users/lori/Documents")
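As a minimal sketch of a round trip between working directories: setwd() invisibly returns the directory you were in before the change, so you can always get back. Here tempdir() is only a stand-in for a folder that is guaranteed to exist on any machine, and the names start and old are made up for this example.

```r
start <- getwd()            # remember where we are right now
old <- setwd(tempdir())     # move to a scratch folder; setwd() returns the previous directory
getwd()                     # now prints the path of the scratch folder
file.exists(".")            # "." is a relative path meaning "the current working directory"
setwd(old)                  # go back to where we started
identical(getwd(), start)   # TRUE
```

Saving the old directory like this is handy in scripts, because it avoids hard-coding a path that only exists on your own computer.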
- EXERCISE - 2
What is the output when we execute the following code?
x <- 3
y <- 2
y <- 17.4
x+y
A. [1] 3 2 17.4
B. [1] 22.4
C. [1] 20.4
D. [1] 5
Warning
Assigning new values to existing objects over-writes the old version – and be aware there is no Ctrl-Z ‘undo’
y <- 17.4 # check the Environment window again
x+y
# [1] 20.4
Note
- Anything after a hash (#) is ignored – e.g. comments
- Spaces don’t matter outside of quotes, except inside the <- symbol: x < -3 is a comparison, not an assignment
- Capital letters do matter
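The three points in this note can be checked directly in the Console. This is a small sketch; the object names Total and total are made up for the example.

```r
x <- 3           # assignment: x now holds 3
x < -3           # a space inside the arrow turns it into a comparison: FALSE
x<-5             # no spaces at all is still an assignment, x is now 5
Total <- 10
total <- 20      # a different object, because capital letters matter
Total == total   # FALSE
1 + 1 # everything after the hash is ignored by R
```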
Tip
What’s a good name for my new object?
- Something memorable (!) and not easily-confused with other objects, e.g. X isn’t a good choice if you already have x
- Names must start with a letter or period (“.”); after that any letter, number, period or underscore is okay
- Avoid other characters; they get interpreted as math (“-”, “*”) or are hard to read (“ ”) so should not be used in names
- Avoid names of existing functions – e.g. summary. Some one-letter choices (c, C, F, t, T and S) are already used by R, so it’s best to avoid these too
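If you are unsure whether a name is legal, one way to check (a sketch, not the only approach) is the base R function make.names(), which shows how R would repair a candidate name. The candidate names below are invented for illustration.

```r
# Candidate object names, some legal and some not
candidates <- c("my.data", "my_data", "my-data", "my data", "2nd.try")

# make.names() replaces illegal characters with "." and prepends "X" where needed
repaired <- make.names(candidates)
repaired
# [1] "my.data"  "my_data"  "my.data"  "my.data"  "X2nd.try"

candidates == repaired   # TRUE only for names that were already legal
```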
2. Data Types¶
There are 6 main types: double (numeric), integer, complex, logical and character. The sixth one “raw” will not be discussed in this workshop.
R provides many functions to examine features of vectors and other objects, for example
- class() - what kind of object is it (high-level)?
- typeof() - what is the object’s data type (low-level)?
- length() - how long is it? What about two dimensional objects?
- attributes() - does it have any metadata?
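To see what these four functions report, here is a short sketch on a named numeric vector and on a small matrix (matrices are covered in section 3.3). The objects v and m and their contents are made up for the example.

```r
v <- c(a = 1.5, b = 2, c = 10)   # a numeric vector whose elements have names
class(v)        # "numeric"
typeof(v)       # "double"
length(v)       # 3
attributes(v)   # a list holding the names: "a" "b" "c"

m <- matrix(1:6, nrow = 2)       # a two-dimensional object
length(m)       # 6: length() counts every cell
dim(m)          # 2 3: rows, then columns
```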
2.1 Character¶
Surround with quotes, can be any keyboard character
ch <- 'Hello world! 123' # named ch rather than c, because c() is an R function
class(ch)
# [1] "character"
typeof(ch)
# [1] "character"
2.2 Double (numeric)¶
No quotes, can be any real number, decimal, or whole numbers
n <- 3.4
class(n)
# [1] "numeric"
typeof(n)
# [1] "double"
2.3 Integer¶
No quotes, can be any whole number. Place an L after it, otherwise R will store it as a double (numeric)
i <- 2L
class(i)
# [1] "integer"
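The effect of the L suffix is easiest to see side by side. In this sketch the names whole and int are made up for the example.

```r
whole <- 2      # no L: stored as a double
int   <- 2L     # with L: stored as an integer
typeof(whole)   # "double"
typeof(int)     # "integer"
whole == int            # TRUE: the values are equal even though the types differ
identical(whole, int)   # FALSE: identical() also compares the type
typeof(int / 2L)        # "double": division always returns a double
```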
2.4 Complex¶
Can use notation like + -, and values like i for imaginary units in complex numbers.
comp <- 1+4i
class(comp)
# [1] "complex"
2.5 Logical¶
Are equal to either TRUE or FALSE in all caps
l <- TRUE
l <- FALSE
class(l)
# [1] "logical"
3. Data Structures¶
3.1 Atomic Vector¶
Use c() notation (stands for combine). All elements of a vector have to be of the same data type.
log_vector <- c(TRUE, TRUE, FALSE, TRUE)
char_vector <- c("Uwe", "Gaius", "Liz")
char_vector <- c(char_vector, "Helper1", NA) # NA represents missing data
char_vector
# [1] "Uwe" "Gaius" "Liz" "Helper1" NA
length(char_vector)
# [1] 5
class(char_vector)
# [1] "character"
anyNA(char_vector)
# [1] TRUE
Given that atomic vectors must be of all one data type, what will happen when data is mixed?
mixed <- c("True", TRUE)
mixed
# [1] "True" "TRUE"
typeof(mixed)
# [1] "character"
#It has converted the logical to a character
R will create a resulting vector with a mode that can most easily accommodate all the elements it contains. This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them.
- EXERCISE - 3
Uncover the hierarchy of the data types using the elements in this vector “anothermixed”.
anothermixed <- c("Stanford",FALSE, 2L, 3.14)
test_c_i <- c("Stanford", 2L)
typeof(test_c_i)
# [1] "character"
test_c_d <- c("Stanford", 3.14)
typeof(test_c_d)
# [1] "character"
test_l_i <- c(FALSE, 2L)
typeof(test_l_i)
# [1] "integer"
test_l_d <- c(FALSE, 3.14)
typeof(test_l_d)
# [1] "double"
test_i_d <- c(2L, 3.14)
typeof(test_i_d)
# [1] "double"
Using an as.<datatype>() function (as.logical(), as.character(), as.factor(), etc.) will make R try to force an object to be that data type.
as.logical(mixed)
# [1] TRUE TRUE
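Forced conversion does not always succeed. The sketch below uses made-up values to show what R does when a value cannot be converted: it returns NA for that element rather than stopping.

```r
nums <- as.numeric(c("1", "2.5", "three"))   # warning: NAs introduced by coercion
nums                                         # 1.0 2.5 NA
as.integer(3.9)                              # 3: the decimal part is dropped, not rounded
lgl <- as.logical(c("yes", "T", "false"))    # only a few spellings are recognised
lgl                                          # NA TRUE FALSE
as.character(c(TRUE, FALSE))                 # "TRUE" "FALSE"
```

Checking for unexpected NA values with anyNA() after a conversion is a cheap way to catch data that did not convert cleanly.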
3.2 List¶
Lists are like vectors except that you can use multiple data types. Make a list using the list() function.
mylist <- list(chars = 'coffee', nums = c(1.4, 5), logicals=TRUE, anotherList = list(a = 'a', b = 2))
mylist
# $chars
# [1] "coffee"
# $nums
# [1] 1.4 5.0
# $logicals
# [1] TRUE
# $anotherList
# $anotherList$a
# [1] "a"
# $anotherList$b
# [1] 2
Holds multiple of the above data types, including other lists.
class(mylist)
# [1] "list"
str(mylist) # compactly displays internal structure of R object
# List of 4
# $ chars : chr "coffee"
# $ nums : num [1:2] 1.4 5
# $ logicals : logi TRUE
# $ anotherList:List of 2
# ..$ a: chr "a"
# ..$ b: num 2
Warning
Don’t forget that the command str() also lists the class of each column within a data frame. It is good to use to make sure all of your data was imported correctly.
We can access a value of a list by referencing the index or by using the label preceded by the dollar sign ‘$’.
mylist[1]
# $chars
# [1] "coffee"
mylist$nums
# [1] 1.4 5.0
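Besides single brackets and the dollar sign, lists can also be indexed with double brackets. The following is a small sketch on a shortened copy of mylist, showing that single brackets return a smaller list while double brackets (like $) return the element itself.

```r
mylist <- list(chars = 'coffee', nums = c(1.4, 5), logicals = TRUE)

class(mylist[2])      # "list": single brackets keep the list wrapper
class(mylist[[2]])    # "numeric": double brackets pull out the element itself
identical(mylist[[2]], mylist$nums)   # TRUE
mylist[["nums"]][2]   # 5: the second value of the nums vector
sub <- mylist[c("chars", "logicals")] # single brackets can return several elements at once
length(sub)           # 2
```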
- EXERCISE - 4
What is the difference in the returned objects? # reinforce that lists are made up of other vectors and lists
mylist[3]
mylist$logicals
3.3 Matrices¶
Matrices are 2 dimensional structures that hold only one data type. Using ncol and nrow, you can define its shape. You can fill in the matrix by passing values to the data argument. By default, it fills in by column, but you can change this using the byrow argument.
m <- matrix(nrow=2, ncol=3)
m
# [,1] [,2] [,3]
# [1,] NA NA NA
# [2,] NA NA NA
m <- matrix(data=1:6, nrow=2, ncol=3)
m
# [,1] [,2] [,3]
# [1,] 1 3 5
# [2,] 2 4 6
m <- matrix(data=1:6, nrow=2, ncol=3, byrow=TRUE)
m
# [,1] [,2] [,3]
# [1,] 1 2 3
# [2,] 4 5 6
Note
You can also have multi-dimensional structures called arrays. You can create this using the array() function, but it is outside the scope of this course.
3.4 Data Frames¶
Data Frames are like matrices, but can hold multiple data types.
Vectors are to Lists as Matrices are to Data Frames
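# stringsAsFactors=TRUE stores id as a factor (the default before R 4.0), matching the output below
df <- data.frame(id=letters[1:10], x=1:10, y=11:20, stringsAsFactors=TRUE)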
df
# id x y
# 1 a 1 11
# 2 b 2 12
# 3 c 3 13
# 4 d 4 14
# 5 e 5 15
# 6 f 6 16
# 7 g 7 17
# 8 h 8 18
# 9 i 9 19
# 10 j 10 20
class(df)
# [1] "data.frame"
typeof(df)
# [1] "list"
head(df)
# id x y
# 1 a 1 11
# 2 b 2 12
# 3 c 3 13
# 4 d 4 14
# 5 e 5 15
# 6 f 6 16
tail(df)
# id x y
# 5 e 5 15
# 6 f 6 16
# 7 g 7 17
# 8 h 8 18
# 9 i 9 19
# 10 j 10 20
nrow(df)
# [1] 10
ncol(df)
# [1] 3
str(df)
# 'data.frame': 10 obs. of 3 variables:
# $ id: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10
# $ x : int 1 2 3 4 5 6 7 8 9 10
# $ y : int 11 12 13 14 15 16 17 18 19 20
summary(df)
# id x y
# a :1 Min. : 1.00 Min. :11.00
# b :1 1st Qu.: 3.25 1st Qu.:13.25
# c :1 Median : 5.50 Median :15.50
# d :1 Mean : 5.50 Mean :15.50
# e :1 3rd Qu.: 7.75 3rd Qu.:17.75
# f :1 Max. :10.00 Max. :20.00
# (Other):4
names(df)
# [1] "id" "x" "y"
3.5 Factors¶
Factors are very useful when running statistics, and also clog up less memory than character vectors.
They do this by storing each unique value as an integer, which takes up less space in memory than characters in a string. Then it references that integer to the corresponding string so that it is human readable.
state <- factor(c("Arizona", "Colorado", "Arizona"))
state
# [1] Arizona Colorado Arizona
# Levels: Arizona Colorado
nlevels(state)
# [1] 2
levels(state)
# [1] "Arizona" "Colorado"
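To make the "integer plus label" idea concrete, this sketch looks underneath the same state factor and then rebuilds the labels from the integer codes.

```r
state <- factor(c("Arizona", "Colorado", "Arizona"))
typeof(state)                      # "integer": the underlying storage
as.integer(state)                  # 1 2 1: positions within levels(state)
levels(state)[as.integer(state)]   # "Arizona" "Colorado" "Arizona"
as.character(state)                # the same labels, the usual way to get them back
```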
Factors by default don’t actually have hierarchy. That is to say, Arizona is not more or less than Colorado. But sometimes we want factors to have hierarchy (e.g. low comes before medium comes before high).
ratings <- factor(c("low", "high", "medium", "low"))
ratings
# [1] low high medium low
# Levels: high low medium
If we look for the minimum of the factors, we get an error because they are not ordered
min(ratings)
# Error in Summary.factor(c(2L, 1L, 3L, 2L), na.rm = FALSE) :
# ‘min’ not meaningful for factors
levels(ratings)
# [1] "high" "low" "medium"
We can add an order by putting ordered=TRUE into the arguments of the factor() function. Then when we run min(), it understands that “low” is the minimum value. Notice that the Levels change to less than symbols, showing there is a hierarchy.
ratings <- factor(ratings, levels=c("low", "medium", "high"), ordered=TRUE)
levels(ratings)
# [1] "low" "medium" "high"
min(ratings)
# [1] low
# Levels: low < medium < high
When we run the str() function on a dataframe with factors, notice that it lists the type as a Factor and tells us how many levels it has. summary lists each factor level and tells us how many are in each group.
survey <- data.frame(number=c(1, 2, 2, 1, 2), group=c("A", "B", "A", "A", "B"), stringsAsFactors=TRUE) # needed since R 4.0 to store group as a factor
str(survey)
# 'data.frame': 5 obs. of 2 variables:
# $ number: num 1 2 2 1 2
# $ group : Factor w/ 2 levels "A","B": 1 2 1 1 2
summary(survey)
# number group
# Min. :1.0 A:3
# 1st Qu.:1.0 B:2
# Median :2.0
# Mean :1.6
# 3rd Qu.:2.0
# Max. :2.0
A useful command to count how many values overlap is the table() function. Here we see that 2 rows in the table have a 1 in the number column and an A in the group column, but there are 0 rows that have a B and a 1.
table(survey$number, survey$group)
# A B
# 1 2 0
# 2 1 2
- EXERCISE - 5
Create the following data frame in R:
| Day | Magnification | Observation |
|---|---|---|
| 1 | 2 | Growth |
| 2 | 10 | Death |
| 3 | 5 | No Change |
| 4 | 2 | Death |
| 5 | 5 | Growth |
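One possible way to build this table is sketched below, with one vector per column. The object name observations is made up for the example, and whether Observation is stored as character or as a factor depends on your R version and on the stringsAsFactors argument.

```r
observations <- data.frame(
  Day = 1:5,
  Magnification = c(2, 10, 5, 2, 5),
  Observation = c("Growth", "Death", "No Change", "Death", "Growth")
)
str(observations)                 # check the column types
table(observations$Observation)   # count how often each observation occurs
```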
4. Reading in Data¶
Let’s move the gapminder file into our intro_R directory using command line
First, let’s see how we can read in data using base R, using the read.csv() command:
gapminder <- read.csv(file = "gapminder.txt", header=TRUE, sep = "\t", stringsAsFactors = FALSE)
After successfully reading in the data;
- The environment now includes a gapminder object – or whatever you called the data read from file
- A copy of the data can be examined in the Excel-like data viewer – if it looks weird, find out why & fix it!
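If you do not have the gapminder file at hand, the same read.csv() call can be tried on a tiny tab-separated file. This is a sketch only: the rows are invented, and the names tmp and toy are made up for the example.

```r
# Write a tiny tab-separated file to a temporary location (made-up rows)
tmp <- tempfile(fileext = ".txt")
writeLines(c("country\tyear\tlifeExp",
             "Atlantis\t1952\t40.5",
             "Atlantis\t1957\t42.1"), tmp)

# Same arguments as above: a header row, tabs as separators, strings kept as characters
toy <- read.csv(file = tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(toy)      # 2 observations of 3 variables
unlink(tmp)   # remove the temporary file again
```

If sep does not match the file, everything typically ends up in a single column, which is one of the first things to check when imported data "looks weird".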
What can I do with my data?
Well, you can do several things. To operate on data, type commands in the Console window, just like our earlier calculator-style approach;
summary(gapminder)
# country continent year lifeExp pop
# Length:1704 Length:1704 Min. :1952 Min. :23.60 Min. :6.001e+04
# Class :character Class :character 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06
# Mode :character Mode :character Median :1980 Median :60.71 Median :7.024e+06
# Mean :1980 Mean :59.47 Mean :2.960e+07
# 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07
# Max. :2007 Max. :82.60 Max. :1.319e+09
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 6 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
- summary() summarizes the object and provides basic summary statistics for each column within your data
- str() tells us the structure of an object (i.e., its dimensions/size and the class of each data column)
We can also use these commands on any object – e.g. the single numbers we created earlier (try it!)
There are also commands to get these statistics alone. For this we use the $ symbol to tell R which column we are interested in.
min(gapminder$lifeExp)
# [1] 23.599
median(gapminder$lifeExp)
# [1] 60.7125
max(gapminder$lifeExp)
# [1] 82.603
These are called FUNCTIONS (we will see more on this later), and are used to do a particular task on a set of data. Here we are accessing columns by using the dollar sign. We are telling R that we are only interested in one column.
We can also do more sophisticated things with these commands. Let’s try a simple plot:
plot(gapminder$lifeExp, gapminder$gdpPercap)
The gapminder data we just imported is in an object called a Data Frame. A data frame holds data in a table format, like what you might be used to in Excel. A “tidy” data frame has columns that each represent a variable and rows which hold one observation.
As we saw before, individual columns in data frames are identified using the $ symbol – just seen in the str() output.
Think of $ as apostrophe-S, i.e. gapminder’S lifeExp
New columns are created when you assign their values – here containing the life expectancy in months instead of years;
gapminder$lifeExpMonths <- gapminder$lifeExp*12
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 7 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent : chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num 28.8 30.3 32 34 36.1 ...
# $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
# $ gdpPercap : num 779 821 853 836 740 ...
# $ lifeExpMonths: num 346 364 384 408 433 ...
summary(gapminder$lifeExpMonths)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 283.2 578.4 728.5 713.7 850.1 991.2
- Assigning values to existing columns over-writes existing values – again, with no warning
- With e.g. gapminder$newcolumn <- 0, the new column has every entry zero; R recycles this single value, for every entry
- It’s unusual to delete columns… but if you must, use gapminder$lifeExpMonths <- NULL
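The three points above can be tried safely on a small data frame before touching real data. In this sketch, toy and its columns are made up for the example.

```r
toy <- data.frame(id = 1:4, score = c(10, 20, 30, 40))
toy$flag <- 0                  # a single value, recycled to all four rows
toy$double <- toy$score * 2    # computed element by element
names(toy)                     # "id" "score" "flag" "double"
toy$flag <- NULL               # delete the column again
names(toy)                     # "id" "score" "double"
```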
Other functions useful for summarizing data frames, and their columns;
names(gapminder)
# [1] "country" "continent" "year" "lifeExp" "pop" "gdpPercap"
# [7] "lifeExpMonths"
dim(gapminder) # dim is short for dimension
# [1] 1704 7
length(gapminder$lifeExp) # how many rows in our dataset?
# [1] 1704
min(gapminder$lifeExp)
# [1] 23.599
max(gapminder$lifeExp)
# [1] 82.603
range(gapminder$lifeExp)
# [1] 23.599 82.603
mean(gapminder$lifeExp)
# [1] 59.47444
sd(gapminder$lifeExp) # sd is short for standard deviation
# [1] 12.91711
median(gapminder$lifeExp)
# [1] 60.7125
median(gapminder$li) # partial matching of the column name; works on a plain data.frame, but is hard to debug later
# [1] 60.7125
- EXERCISE - 6
Import the gapminder data frame again.
Use str() to look at the structure of the dataframe and summary() to get information about the variables.
- What are its columns?
- How many rows and columns are there?
- What is the earliest year in the year column?
- What is the average life expectancy?
- What is the largest population?
library(readr) # read_delim() comes from the readr package
gapminder <- read_delim("datasets/02_gapminder.txt",
    "\t", escape_double = FALSE, trim_ws = TRUE)
str(gapminder)
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1704 obs. of 6 variables:
# $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
# $ continent: chr "Asia" "Asia" "Asia" "Asia" ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ lifeExp : num 28.8 30.3 32 34 36.1 ...
# $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
# $ gdpPercap: num 779 821 853 836 740 ...
dim(gapminder)
5.1 Subsetting¶
5.1.1 Base R¶
Suppose we were interested in the life expectancy (i.e. 4th column) for Afghanistan in the years 1952, 1962, and 1972 (i.e. rows 1, 3, and 5). How do we select these multiple elements?
gapminder[c(1, 3, 5), 4]
# A tibble: 3 × 1
# lifeExp
# <dbl>
# 1 28.801
# 2 31.997
# 3 36.088 # check these against data view
But what is c(1,3,5)? It’s a vector of numbers – c() is for combine;
length(c(1, 3, 5))
# [1] 3
str(c(1, 3, 5))
# num [1:3] 1 3 5
We can select these rows and all the columns;
gapminder[c(1, 3, 5),]
# A tibble: 3 × 6
# country continent year lifeExp pop gdpPercap
# <chr> <chr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1962 31.997 10267083 853.1007
# 3 Afghanistan Asia 1972 36.088 13079460 739.9811
A very useful special form of vector;
1:10
# [1] 1 2 3 4 5 6 7 8 9 10
6:2
# [1] 6 5 4 3 2
-1:-3
# [1] -1 -2 -3
R expects you to know this shorthand – see e.g. its use of 1:3 in the output from str() above. For a ‘rectangular’ selection of rows and columns;
gapminder[20:22, 3:4]
# A tibble: 3 x 2
# year lifeExp
# <int> <dbl>
# 1 1987 72.000
# 2 1992 71.581
# 3 1997 72.950
Negative values correspond to dropping those rows/columns;
gapminder[-3:-1704,] # everything but the first two rows will be dropped
# A tibble: 2 x 6
# country continent year lifeExp pop gdpPercap
# <chr> <chr> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530
As well as storing numbers and character strings (like “United States”, “Canada”) R can also store logicals – TRUE and FALSE. To make a new vector, with elements that are TRUE if life expectancy is above 71.5 and FALSE otherwise;
is.above.avg <- gapminder$lifeExp > 71.5
Let’s see how many of the total were TRUE and how many were FALSE using the table() function. The table() function will create a count table from a vector of categorical data.
table(is.above.avg)
# is.above.avg
# FALSE TRUE
# 1329 375
Which countries and during what years were these? (And what was the avg. life expectancy?)
gapminder[is.above.avg,] # just the rows for which is.above.avg is TRUE
# A tibble: 375 x 6
# country continent year lifeExp pop gdpPercap
# <chr> <chr> <int> <dbl> <int> <dbl>
# 1 Albania Europe 1987 72.000 3075321 3738.933
# 2 Albania Europe 1992 71.581 3326498 2497.438
# 3 Albania Europe 1997 72.950 3428038 3193.055
# 4 Albania Europe 2002 75.651 3508512 4604.212
# 5 Albania Europe 2007 76.423 3600523 5937.030
# 6 Algeria Africa 2007 72.301 33333216 6223.367
# 7 Argentina Americas 1992 71.868 33958947 9308.419
# 8 Argentina Americas 1997 73.275 36203463 10967.282
# 9 Argentina Americas 2002 74.340 38331121 8797.641
# 10 Argentina Americas 2007 75.320 40301927 12779.380
gapminder[is.above.avg,4] # combining TRUE/FALSE (rows) and numbers (columns)
# A tibble: 375 x 1
# lifeExp
# <dbl>
# 1 72.000
# 2 71.581
# 3 72.950
# 4 75.651
# 5 76.423
# 6 72.301
# 7 71.868
# 8 73.275
# 9 74.340
# 10 75.320
One final method… for now!
Instead of specifying rows/columns of interest by number, or through vectors of TRUEs/FALSEs, we can also just give the names – as character strings, or vectors of character strings.
gapminder[,'lifeExp']
# A tibble: 1,704 x 1
# lifeExp
# <dbl>
# 1 28.801
# 2 30.332
# 3 31.997
# 4 34.020
# 5 36.088
# 6 38.438
# 7 39.854
# 8 40.822
# 9 41.674
# 10 41.763
# # ... with 1,694 more rows
gapminder[gapminder$country == 'Gabon',c("lifeExp","gdpPercap")]
# A tibble: 12 x 2
# lifeExp gdpPercap
# <dbl> <dbl>
# 1 37.003 4293.476
# 2 38.999 4976.198
# 3 40.489 6631.459
# 4 44.598 8358.762
# 5 48.690 11401.948
# 6 52.790 21745.573
# 7 56.564 15113.362
# 8 60.190 11864.408
# 9 61.366 13522.158
# 10 60.461 14722.842
# 11 56.761 12521.714
# 12 56.735 13206.485
gapminder[gapminder$country == 'Gabon',4] # okay to mix & match
# A tibble: 12 x 1
# lifeExp
# <dbl>
# 1 37.003
# 2 38.999
# 3 40.489
# 4 44.598
# 5 48.690
# 6 52.790
# 7 56.564
# 8 60.190
# 9 61.366
# 10 60.461
# 11 56.761
# 12 56.735
This is more typing than the other options, but is much easier to debug/reuse.
Base R vs tidyverse
- You know how when you get a new smartphone, it comes with an email and calendar app… but they’re not the greatest? I usually download the Google Calendar and Gmail apps on my phone because, even though they technically do the same thing, they do it better. R is similar in this way.
- When you downloaded R, it came with capabilities to import, analyze, and export data.
- But since R’s creation, users have created packages which act like plug-ins or addons or apps. These add or improve the functionality of R. We’ll be using a suite of packages called the tidyverse that tries to make R more straightforward for beginners.
- The tidyverse has two main goals:
1. Work with tidy (not messy) data
2. Make code more human readable
- Each package within the tidyverse is meant to do a particular thing, but each ultimately goes back to those two goals. We’ll be using two packages in the tidyverse, called dplyr (for manipulating tidy data in R), and ggplot2 (for visualizing tidy data in R).
5.1.2 Dplyr¶
Remember how we mentioned earlier that data should be “tidy”, that is each variable should be represented in one column and each row represents one observation. The tidyverse has a package to help us work with data in a tidy way. We are now going to discuss a package that helps you to manipulate your data, dplyr.
If you haven’t already, install dplyr and load the package so we can use its functionality
install.packages('dplyr')
library(dplyr)
dplyr works by piping commands, like you learned to do in the command line. Instead of the pipe |, we use %>%.
Subsetting in dplyr uses two functions:
- select()
- filter()
## Using select()
If, for example, we wanted to move forward with only a few of the variables in our dataframe we could use the select() function. This will subset the dataframe by columns.
# without pipes
select(gapminder,year,country,gdpPercap)
# with pipes
gapminder %>% select(year,country,gdpPercap)
To help you understand why we wrote that in that way, let’s walk through it step by step.
First we summon the gapminder dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function.
In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe.
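To make the step-by-step reading concrete, here is a small sketch showing that the piped form and the nested form run exactly the same steps. It assumes dplyr is installed; the data frame toy and the names nested and piped are made up so the example runs without the gapminder file.

```r
library(dplyr)

# A tiny made-up data frame
toy <- data.frame(year = c(1952, 1957, 1962),
                  country = c("A", "B", "C"),
                  pop = c(10, 20, 30))

nested <- head(select(toy, year, country), 2)        # read from the inside out
piped  <- toy %>% select(year, country) %>% head(2)  # read from left to right

identical(nested, piped)   # TRUE: both forms run the same two steps
```

The pipe simply hands the result on its left to the function on its right as the first argument, which is why the data object is not repeated inside select().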
Important
An important difference between dplyr and base R is that when we refer to column names we don’t need to enclose them in quotation marks as we did above (e.g. gapminder[, 'year'])
## Using filter()
Now what about subsetting rows? For this we use the filter command:
gapminder %>% filter(lifeExp > 71.5)
# A tibble: 375 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Albania Europe 1987 72.000 3075321 3738.933 864.000
# 2 Albania Europe 1992 71.581 3326498 2497.438 858.972
# 3 Albania Europe 1997 72.950 3428038 3193.055 875.400
# 4 Albania Europe 2002 75.651 3508512 4604.212 907.812
# 5 Albania Europe 2007 76.423 3600523 5937.030 917.076
# 6 Algeria Africa 2007 72.301 33333216 6223.367 867.612
# 7 Argentina Americas 1992 71.868 33958947 9308.419 862.416
# 8 Argentina Americas 1997 73.275 36203463 10967.282 879.300
# 9 Argentina Americas 2002 74.340 38331121 8797.641 892.080
# 10 Argentina Americas 2007 75.320 40301927 12779.380 903.840
# ... with 365 more rows
If we now wanted to subset by columns and rows, we can combine select and filter
year_country_gdp_mex <- gapminder %>%
    select(year,country,gdpPercap) %>%
    filter(country=="Mexico")
year_country_gdp_mex # print the result
# A tibble: 12 x 3
# year country gdpPercap
# <int> <fct> <dbl>
# 1 1952 Mexico 3478
# 2 1957 Mexico 4132
# 3 1962 Mexico 4582
# 4 1967 Mexico 5755
# 5 1972 Mexico 6809
# 6 1977 Mexico 7675
# 7 1982 Mexico 9611
# 8 1987 Mexico 8688
# 9 1992 Mexico 9472
#10 1997 Mexico 9767
#11 2002 Mexico 10742
#12 2007 Mexico 11978
- EXERCISE - 7
Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for lifeExp, country and year, but not for other Continents. How many rows does your dataframe have and why?
## Solution to Exercise 7
year_country_lifeExp_Africa <- gapminder %>%
    filter(continent == "Africa") %>%
    select(year, country, lifeExp)
If we want to select all columns except 1, we can do that with the - operator.
gapminder %>% select(-gdpPercap)
# A tibble: 1,704 x 5
# country continent year lifeExp pop
# <chr> <chr> <int> <dbl> <int>
# 1 Afghanistan Asia 1952 28.801 8425333
# 2 Afghanistan Asia 1957 30.332 9240934
# 3 Afghanistan Asia 1962 31.997 10267083
# 4 Afghanistan Asia 1967 34.020 11537966
# 5 Afghanistan Asia 1972 36.088 13079460
# 6 Afghanistan Asia 1977 38.438 14880372
# 7 Afghanistan Asia 1982 39.854 12881816
# 8 Afghanistan Asia 1987 40.822 13867957
# 9 Afghanistan Asia 1992 41.674 16317921
# 10 Afghanistan Asia 1997 41.763 22227415
# ... with 1,694 more rows
If we want to make a new column, use mutate. Don’t forget we have to assign it if we want to keep the changes
gapminder <- gapminder %>% mutate(lifeExpMonths= lifeExp*12)
gapminder
# A tibble: 1,704 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Afghanistan Asia 1952 28.801 8425333 779.4453 345.612
# 2 Afghanistan Asia 1957 30.332 9240934 820.8530 363.984
# 3 Afghanistan Asia 1962 31.997 10267083 853.1007 383.964
# 4 Afghanistan Asia 1967 34.020 11537966 836.1971 408.240
# 5 Afghanistan Asia 1972 36.088 13079460 739.9811 433.056
# 6 Afghanistan Asia 1977 38.438 14880372 786.1134 461.256
# 7 Afghanistan Asia 1982 39.854 12881816 978.0114 478.248
# 8 Afghanistan Asia 1987 40.822 13867957 852.3959 489.864
# 9 Afghanistan Asia 1992 41.674 16317921 649.3414 500.088
# 10 Afghanistan Asia 1997 41.763 22227415 635.3414 501.156
# ... with 1,694 more rows
We can pipe several commands, just like with the command line:
gapminder %>% select(lifeExp, country) %>% filter(lifeExp > 71.5) %>% mutate(lifeExpdays = lifeExp * 365)
# A tibble: 375 x 3
# lifeExp country lifeExpdays
# <dbl> <chr> <dbl>
# 1 72.000 Albania 26280.00
# 2 71.581 Albania 26127.07
# 3 72.950 Albania 26626.75
# 4 75.651 Albania 27612.61
# 5 76.423 Albania 27894.40
# 6 72.301 Algeria 26389.87
# 7 71.868 Argentina 26231.82
# 8 73.275 Argentina 26745.38
# 9 74.340 Argentina 27134.10
# 10 75.320 Argentina 27491.80
# ... with 365 more rows
We can also use outside information to help subset data.
two.countries <- c('Kenya', 'Gibon')
gapminder %>% filter(country %in% two.countries)
# A tibble: 12 x 7
# country continent year lifeExp pop gdpPercap lifeExpMonths
# <chr> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 Kenya Africa 1952 42.270 6464046 853.5409 507.240
# 2 Kenya Africa 1957 44.686 7454779 944.4383 536.232
# 3 Kenya Africa 1962 47.949 8678557 896.9664 575.388
# 4 Kenya Africa 1967 50.654 10191512 1056.7365 607.848
# 5 Kenya Africa 1972 53.559 12044785 1222.3600 642.708
# 6 Kenya Africa 1977 56.155 14500404 1267.6132 673.860
# 7 Kenya Africa 1982 58.766 17661452 1348.2258 705.192
# 8 Kenya Africa 1987 59.339 21198082 1361.9369 712.068
# 9 Kenya Africa 1992 59.285 25020539 1341.9217 711.420
# 10 Kenya Africa 1997 54.407 28263827 1360.4850 652.884
# 11 Kenya Africa 2002 50.992 31386842 1287.5147 611.904
# 12 Kenya Africa 2007 54.110 35610177 1463.2493 649.320
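%in% checks every entry in the column country against all character strings in the two.countries vector and returns TRUE wherever it finds one of them. Note that 'Gibon' is misspelled, so it never matches and only the Kenya rows are returned; %in% gives no warning about values that match nothing.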
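On plain vectors, %in% can be tried without any data frame at all. The vectors countries and wanted in this sketch are made up for the example.

```r
countries <- c("Kenya", "Gabon", "Peru", "Kenya")
wanted <- c("Kenya", "Gabon")

countries %in% wanted                # TRUE TRUE FALSE TRUE: one answer per element of countries
countries[countries %in% wanted]     # "Kenya" "Gabon" "Kenya"
countries %in% c("Kenya", "Gibon")   # TRUE FALSE FALSE TRUE: a misspelling simply never matches
```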
- EXERCISE - 8
Create a new dataframe that contains the total GDP for years after 1980 for countries in Europe.
## Solution to Exercise 8
EU_gdp <- gapminder %>% filter(continent == 'Europe' & year > 1980) %>% mutate(GDP = pop * gdpPercap)
## Using group_by() and summarize()
Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that, since we would have to repeat the above for each continent. Instead of filter(), which will only pass observations that meet your criteria (in the above: continent == "Africa"), we can use group_by(), which will essentially use every unique value that you could have used in filter().
str(gapminder)
#'data.frame': 1704 obs. of 6 variables:
# $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
# $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
# $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#$ lifeExp : num 28.8 30.3 32 34 36.1 ...
#$ gdpPercap: num 779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
#Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
#$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
#$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#$ lifeExp : num 28.8 30.3 32 34 36.1 ...
#$ gdpPercap: num 779 821 853 836 740 ...
#- attr(*, "vars")= chr "continent"
#- attr(*, "drop")= logi TRUE
#- attr(*, "indices")=List of 5
#..$ : int 24 25 26 27 28 29 30 31 32 33 ...
#..$ : int 48 49 50 51 52 53 54 55 56 57 ...
#..$ : int 0 1 2 3 4 5 6 7 8 9 ...
#..$ : int 12 13 14 15 16 17 18 19 20 21 ...
#..$ : int 60 61 62 63 64 65 66 67 68 69 ...
#- attr(*, "group_sizes")= int 624 300 396 360 24
#- attr(*, "biggest_group_size")= int 624
#- attr(*, "labels")='data.frame': 5 obs. of 1 variable:
#..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
#..- attr(*, "vars")= chr "continent"
#..- attr(*, "drop")= logi TRUE
You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of continent (at least in the example above).
## Using summarize()
The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the group_by() function, we split our original dataframe into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
#continent mean_gdpPercap
# <fctr> <dbl>
#1 Africa 2193.755
#2 Americas 7136.110
#3 Asia 7902.150
#4 Europe 14469.476
#5 Oceania 18621.609
That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
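For instance, summarize() can compute several statistics per group in a single call. The sketch below assumes dplyr is installed and uses a made-up data frame (toy, with invented continent labels "A" and "B") so that it runs without the gapminder file; n() is the dplyr helper that counts the rows in each group.

```r
library(dplyr)

# Made-up values: two groups with two and three rows
toy <- data.frame(
  continent = c("A", "A", "B", "B", "B"),
  gdpPercap = c(100, 300, 10, 20, 30)
)

toy_summary <- toy %>%
  group_by(continent) %>%
  summarize(mean_gdpPercap = mean(gdpPercap),   # one mean per group
            sd_gdpPercap   = sd(gdpPercap),     # one standard deviation per group
            n_rows         = n())               # number of rows in each group
toy_summary
```

Each argument to summarize() becomes one column of the result, and the result has one row per group.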
- EXERCISE - 9
Calculate the average life expectancy per country. Which nation has the longest average life expectancy and which has the shortest average life expectancy?
## Solution to Exercise 9
lifeExp_bycountry <- gapminder %>%
    group_by(country) %>%
    summarize(mean_lifeExp=mean(lifeExp))
lifeExp_bycountry %>%
    filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))
# A tibble: 2 x 2
# country mean_lifeExp
# <fct> <dbl>
#1 Iceland 76.5
#2 Sierra Leone 36.8