Introduction to R and RStudio
What is R? What is RStudio?
- R is a (mostly statistical) programming language and the software that interprets scripts written in R
- RStudio is an interface used to interact with R
Why learn R?
Increased reproducibility
When using code, your analysis can be clearly understood and easily rerun.
Increased extensibility
Combine R with 10,000+ packages to extend capabilities including image analysis, time-series, population genetics, additional languages, making websites, and more.
Works with a variety of data types, shapes, and sizes
R has special data structures and data types to handle all sorts of data. R can also connect to spreadsheets and databases, and read in special data formats.
Can produce publication-quality graphics
Base functionality and packages allow you to create a variety of graphics, while controlling minute details.
Has a large and welcoming community
The R user community is very active, helpful, and welcoming.
Many resources have been developed to help people learn how to use R.
Knowing your way around RStudio
RStudio is an Integrated Development Environment (IDE) for working with R and can be used to do many things. We will cover things in bold and italics.
- write code and scripts
- run code
- navigate files on your computer
- inspect variables and data objects
- visualize plots
- enable version control
- develop packages
- write Shiny apps
RStudio is divided into 4 “panes”:
- The Source for your scripts and documents (top-left, in the default layout)
- Your Environment/History (top-right) which shows all the objects in your working space (Environment) and your command history (History)
- Your Files/Plots/Packages/Help/Viewer (bottom-right)
- The R Console (bottom-left)
Getting Set Up
We want to keep our work (data, analyses, text) self-contained in a single directory, called the working directory. This allows us to use relative paths to other files in our scripts within that working directory so that we can move our project to other locations and share our project with others without breaking any scripts.
We can do this using “Projects” in RStudio.
- Start RStudio.
- Under the File menu, click on New Project. Choose New Directory, then New Project.
- Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/data-carpentry).
- Click on Create Project.
- Download the code handout, place it in your working directory and rename it (e.g., data-carpentry-script.R).
- (Optional) Set Preferences to ‘Never’ save workspace in RStudio.
The Working Directory
The working directory is the place where R (and your computer) will look for and save files and directories.
It’s good practice to develop a common structure for your projects to:
- keep things organized
- ensure you and others can find things in various projects
- increase the portability of scripts that might be useful in multiple projects
Directory | Contains |
---|---|
scripts/ | R scripts for analyses, plotting, etc. |
data/ | Raw data and processed data (probably should be kept separately) |
documents/ | Outlines, drafts, other texts |
Creating our directory structure
For our purposes, we’ll create the following structure: a working directory called data-carpentry with 3 subdirectories, data_raw, data, and fig.
Let’s do this programmatically in the Console.
- Use the R command called dir.create, passing the value “~/data-carpentry” as the “path”.
dir.create("~/data-carpentry")
- Create the subdirectories.
dir.create("~/data-carpentry/data_raw")
dir.create("~/data-carpentry/data")
dir.create("~/data-carpentry/fig")
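The same structure can also be built in a loop. The sketch below is one possible approach, not the lesson's own method. It works inside a temporary directory (an assumption, so that it runs anywhere without touching your home folder), and the object names project_dir, subdirs, and created are illustrative.

```r
# Build the project skeleton inside a temporary directory (assumption:
# replace tempdir() with "~" to create it in your home folder instead).
project_dir <- file.path(tempdir(), "data-carpentry")
subdirs <- c("data_raw", "data", "fig")

# recursive = TRUE also creates the parent directory if it is missing;
# showWarnings = FALSE keeps R quiet when a directory already exists.
for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Check which of the expected directories now exist.
created <- dir.exists(file.path(project_dir, subdirs))
created
```

Using file.path() to join the pieces avoids typing the separators by hand, which helps when the same script runs on different operating systems.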
Interacting with R
In order to do anything in R we need to tell the computer what to do. We do this with a command. An example is the dir.create command that you used to create the directory structure just now. We can do this in 2 ways:
- using the console (as you did above)
- using the script pane above
What’s the benefit of using the script?
RStudio allows you to execute commands directly from the script editor.
OS | Shortcut |
---|---|
Windows | Ctrl + Enter |
macOS | Ctrl + Enter OR Cmd + Return |
You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.
Seeking help
If you need help with a specific function, for instance mean(), you can type ?mean and press Enter. Documentation on the function will pop up in the Help window.
Automatic Code Completion
When you write code in RStudio, you can use its automatic code completion to remind yourself of a function’s name or arguments.
- Start typing the function name and pay attention to the suggestions that pop up.
- Use the up and down arrows to select a suggested code completion and Tab to apply it.
Dealing with error messages
We WILL encounter errors, and a lot of them! Watch for red “X”s next to the line number in RStudio to catch them.
If you need more help
- Google the error
- Ask a question (with appropriate context and an example) on Stack Overflow
Creating objects in R
You can get output from R simply by typing math in the console:
3 + 5
[1] 8
12 / 7
[1] 1.714286
This can be more useful if we store these values as objects, which would allow us to refer back to these objects later. We do this by using the assignment operator <-.
For example, if we wanted to create an object named weight_kg and assign it the value of 55, we could type the following:
weight_kg <- 55
When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:
weight_kg <- 55    # doesn't print anything
(weight_kg <- 55)  # but putting parentheses around the call prints the value of `weight_kg`
[1] 55
weight_kg          # and so does typing the name of the object
[1] 55
Now that R has weight_kg in memory, we can do arithmetic with it. Let’s convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
2.2 * weight_kg
[1] 121
We can also change an object’s value by assigning it a new one:
weight_kg <- 57.5
2.2 * weight_kg
[1] 126.5
Assigning a value to one object does not change the values of other objects.
For example, let’s store the animal’s weight in pounds in a new object, weight_lb:
weight_lb <- 2.2 * weight_kg
Now, let’s change weight_kg to 100.
weight_kg <- 100
What do you think is the current content of the object weight_lb? 126.5 or 220?
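One way to check your answer is to run the steps in order and look at weight_lb before and after recalculating it. This is only a sketch of that check; the names lb_before and lb_after are introduced here for illustration.

```r
weight_kg <- 57.5
weight_lb <- 2.2 * weight_kg   # weight_lb stores the result, not the formula

weight_kg <- 100               # reassigning weight_kg ...
lb_before <- weight_lb         # ... does not touch weight_lb
lb_before

weight_lb <- 2.2 * weight_kg   # rerun the calculation to bring it up to date
lb_after <- weight_lb
lb_after
```

An object only changes when you assign to it, so any value derived from weight_kg has to be recomputed after weight_kg changes.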
Saving your code
Until now, we’ve used the console. Since we want to save our work, let’s
- Open up a script – by pressing Ctrl + Shift + N.
- Save the script – by pressing Ctrl + S and selecting the save location, and naming it.
Adding Comments
In R the comment character is #. Anything to the right of # in a script will be ignored by R. We use comments to leave notes and explanations in scripts.
Challenge
Q&A: What are the values after each statement in the following?
mass <- 47.5            # mass?
age  <- 122             # age?
mass <- mass * 2.0      # mass?
age  <- age - 20        # age?
mass_index <- mass/age  # mass_index?
Functions and their arguments
Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc.
- Usually take one or more inputs called arguments
- Often return a value (but not always)
- Example: sqrt() takes a number as the argument and returns the square root of the number
Executing a function (‘running it’) is called calling the function. An example of a function call is:
weight_kg <- sqrt(10)
Let’s break it down
- the value of 10 is given to the sqrt() function
- the sqrt() function calculates the square root
- the function returns the value, which is then assigned to the object weight_kg
Return values
As noted earlier, a value is not always returned. If returned, the return ‘value’
- can be any value, or thing
- can be multiple values (a set of things)
- can even be a dataset
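The three kinds of return value listed above can be seen with functions that ship with R. This is a small illustrative sketch; range(), head(), and the built-in mtcars dataset are not otherwise used in this lesson, and the object names are made up for the example.

```r
# A single value
root <- sqrt(16)
root

# Several values at once: range() returns the minimum and the maximum
limits <- range(c(50, 60, 65, 82))
limits

# A whole dataset: head() returns the first rows of a data frame
# (mtcars is an example dataset that comes with R)
first_rows <- head(mtcars, n = 3)
first_rows
```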
Arguments
Arguments can be anything, not only numbers or filenames, but also other objects.
- Each argument can differ based on the function, and must be looked up in the documentation (see below).
- Some functions take multiple arguments
- Some arguments MUST be specified by the user when calling the function
- Some arguments might take on a default value (these are called options) if left out
Let’s try a function that can take multiple arguments: round().
round(3.14159)
[1] 3
We called round() with just one argument, 3.14159, and it returned the value 3. That’s because the default is to round to the nearest whole number.
Let’s modify the number of digits we want in the answer. How do we find more information about the round function?
We can use args(round) to find what arguments it takes.
args(round)
function (x, digits = 0)
NULL
We can also look at the help for this function using ?round.
?round
We see that if we want a different number of digits, we can type digits = 2 or however many we want. Let’s try it.
round(3.14159, digits = 2)
[1] 3.14
If you provide the arguments in the exact same order as they are defined you don’t have to name them:
round(3.14159, 2)
[1] 3.14
And if you do name the arguments, you can switch their order:
round(digits = 2, x = 3.14159)
[1] 3.14
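The same rules about defaults, position, and names apply to other functions. As a hedged illustration, the sketch below uses seq(), a base R function that is not part of this lesson, whose first three arguments are from, to, and by. The object names are made up for the example.

```r
# `by` is left out, so the sequence goes up in steps of 1
default_step <- seq(1, 10)
default_step

# Arguments matched by position: from, to, by
by_position <- seq(1, 10, 3)
by_position

# Arguments matched by name, so the order no longer matters
by_name <- seq(by = 3, to = 10, from = 1)
by_name
```

Naming arguments takes a little more typing but makes a call easier to read later, especially for functions with many arguments.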
Vectors and data types
A vector is the most common and basic data type in R.
- Can be a series of values (numbers or characters).
- Assigned using the c() function
For example, let’s create a vector of animal weights and assign it to an object weight_g:
weight_g <- c(50, 60, 65, 82)
weight_g
[1] 50 60 65 82
A vector can also contain characters:
animals <- c("mouse", "rat", "dog")
animals
[1] "mouse" "rat" "dog"
The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
length(weight_g)
[1] 4
length(animals)
[1] 3
All of the elements are the same type of data. The function class() indicates what kind of object you are working with:
class(weight_g)
[1] "numeric"
class(animals)
[1] "character"
The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
str(weight_g)
num [1:4] 50 60 65 82
str(animals)
chr [1:3] "mouse" "rat" "dog"
You can use the c() function to add other elements to your vector:
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
[1] 30 50 60 65 82 90
Let’s break this down:
- In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g.
- Then we add the value 30 to the beginning, again saving the result back into weight_g.
We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.
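Base R also has append(), which does the same job as c() and can additionally insert in the middle of a vector. The sketch below is an alternative shown for illustration, not the approach used in this lesson.

```r
weight_g <- c(50, 60, 65, 82)

# With its default `after`, append() adds to the end, like c(weight_g, 90)
weight_g <- append(weight_g, 90)

# after = 0 adds to the beginning, like c(30, weight_g)
weight_g <- append(weight_g, 30, after = 0)

# after = 2 inserts the new value after the second element
weight_g <- append(weight_g, 45, after = 2)
weight_g
# [1] 30 50 45 60 65 82 90
```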
An atomic vector is the simplest R data type and is a linear vector of a single type. These are the basic building blocks that all R objects are built from.
There are 6 main atomic vector types that R uses:
- "character" for character (A) or string values (Apple)
- "numeric" (or "double") for all real numbers with or without decimal values
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L, the L suffix indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
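As a quick illustration, typeof() can be tried on one value of each of the six types. The example values below are arbitrary.

```r
typeof("Apple")      # "character"
typeof(3.14)         # "double"
typeof(TRUE)         # "logical"
typeof(2L)           # "integer"
typeof(1 + 4i)       # "complex"
typeof(as.raw(255))  # "raw"

# class() reports "numeric" for a vector of doubles,
# while typeof() reports the underlying storage type "double"
weight_g <- c(50, 60, 65, 82)
class(weight_g)
typeof(weight_g)
```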
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
Challenge
Q&A: What happens if we try to mix these types in a single vector?
Q&A: What will happen in each of these examples? (hint: use class() to check the data type of your objects):
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
Q&A: Why do you think it happens?
Q&A: How many values in combined_logical are "TRUE" (as a character) in the following example (reusing the 2 ..._logicals from above):
combined_logical <- c(num_logical, char_logical)
Q&A: You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
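Coercion can also be requested explicitly. The as.character() and as.numeric() functions used below are part of base R but are not covered in this lesson, so treat this as an optional sketch; the object names are illustrative.

```r
# Numbers become text
as_text <- as.character(c(1, 2, 3))
class(as_text)

# Text that looks like numbers can be turned back into numbers
as_number <- as.numeric(c("1", "2", "3"))
class(as_number)

# Text that cannot be read as a number becomes NA
# (R also issues a warning, which is silenced here)
not_a_number <- suppressWarnings(as.numeric("a"))
is.na(not_a_number)
```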
Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices (positions of elements in the object, starting with 1) in square brackets.
For instance:
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
[1] "rat"
animals[c(3, 2)]
[1] "dog" "rat"
We can also repeat the indices to create an object with more elements than the original one:
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
[1] "mouse" "rat" "dog" "rat" "mouse" "cat"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
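A few more index patterns follow from counting at 1. This is a short sketch; the colon shorthand 2:3 is not introduced elsewhere in this lesson.

```r
animals <- c("mouse", "rat", "dog", "cat")

animals[1]                # first element (index 1, not 0)
animals[length(animals)]  # last element
animals[2:3]              # 2:3 is shorthand for c(2, 3)
animals[0]                # index 0 selects nothing: character(0)
```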
Conditional subsetting
Another common way of subsetting is by using a logical vector.
- TRUE will select the element with the same index
- FALSE will not select the element with the same index
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
[1] 21 54 55
Some functions and logical tests will output vectors of logical values, which can be useful.
For example, if you wanted to select only the values above 50:
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
[1] FALSE FALSE FALSE TRUE TRUE
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
[1] 54 55
Let’s break it down.
1. The weight_g > 50 inside the brackets is evaluated and returns a logical vector that is TRUE at each index where the value is greater than 50.
2. The outside weight_g is subsetted based on that returned vector, pulling out only the elements at the indices that are TRUE.
You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
weight_g[weight_g > 30 & weight_g < 50]
[1] 34 39
weight_g[weight_g <= 30 | weight_g == 55]
[1] 21 55
weight_g[weight_g >= 30 & weight_g == 21]
numeric(0)
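The logical vector produced by a test can be stored in its own object and reused. The sketch below assumes the illustrative name heavy; sum() and which() are base R functions not otherwise covered here.

```r
weight_g <- c(21, 34, 39, 54, 55)

heavy <- weight_g > 50   # logical vector: FALSE FALSE FALSE TRUE TRUE
weight_g[heavy]          # same result as weight_g[weight_g > 50]
sum(heavy)               # TRUE counts as 1, so this is the number of matches: 2
which(heavy)             # positions of the TRUE values: 4 5
```

Storing the test separately can make longer conditions easier to read and lets you check the intermediate TRUE/FALSE values before subsetting.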
A quick overview of some operators
Operator | Meaning |
---|---|
& | and |
| | or |
> | greater than |
< | less than |
>= | greater than or equal to |
<= | less than or equal to |
== | is equal to |
The function %in% allows you to test if any of the elements of a search vector are found:
animals <- c("mouse", "rat", "dog", "cat", "cat")

# return both rat and cat
animals[animals == "cat" | animals == "rat"]
[1] "rat" "cat" "cat"
# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")
[1] FALSE TRUE TRUE TRUE TRUE
# use the logical vector created by %in% to return elements from animals
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")]
[1] "rat" "dog" "cat" "cat"
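The result of %in% can be combined with other logical tools. In this sketch the lookup vector pets is made up for illustration, and any() and the ! (NOT) operator are base R features not introduced above.

```r
animals <- c("mouse", "rat", "dog", "cat", "cat")
pets <- c("dog", "cat", "fish")   # illustrative lookup vector

is_pet <- animals %in% pets       # one TRUE/FALSE per element of animals
is_pet

any(is_pet)                       # TRUE if at least one animal is in pets
animals[!is_pet]                  # keep the elements NOT found in pets
```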
Challenge
Q&A: Can you figure out why "four" > "five" returns TRUE?
Missing data
Missing data are represented in vectors as NA. When doing operations on numbers, most functions will return NA if the data you are working with include missing values.
For example, the mean() below:
heights <- c(2, 4, 4, NA, 6)
mean(heights)
[1] NA
This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result as if the missing values were removed (rm stands for ReMoved) first.
heights <- c(2, 4, 4, NA, 6)
mean(heights, na.rm = TRUE)
[1] 4
If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples.
## Extract those elements which are not missing values.
heights[!is.na(heights)]
[1] 2 4 4 6
## Returns the object with incomplete cases removed.
## The returned object is an atomic vector of type `"numeric"` (or `"double"`).
na.omit(heights)
[1] 2 4 4 6
attr(,"na.action")
[1] 4
attr(,"class")
[1] "omit"
## Extract those elements which are complete cases.
## The returned object is an atomic vector of type `"numeric"` (or `"double"`).
heights[complete.cases(heights)]
[1] 2 4 4 6
Recall that you can use the typeof() function to find the type of your atomic vector.
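Before removing missing values it can help to find out how many there are and where they sit. The sketch below uses the same small heights vector as the examples above; anyNA(), sum(), which(), and max() are base R functions that are not otherwise discussed here.

```r
heights <- c(2, 4, 4, NA, 6)

is.na(heights)           # TRUE where a value is missing
anyNA(heights)           # TRUE if there is at least one NA
n_missing <- sum(is.na(heights))      # number of missing values: 1
n_missing
where_missing <- which(is.na(heights)) # position of the missing value: 4
where_missing

# Several summary functions accept na.rm = TRUE, not only mean()
tallest <- max(heights, na.rm = TRUE)
tallest
```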
Challenge
Q&A:
1. Using this vector of heights in inches, create a new vector, heights_no_na, with the NAs removed.
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
2. Use the function median() to calculate the median of the heights vector.
3. Use R to figure out how many people in the set are taller than 67 inches.
Citations
- Data Analysis and Visualization in R for Ecologists. https://datacarpentry.org/R-ecology-lesson/index.html