Large projects = many people = diverse skills. Poses some risks, but also opportunities. Version control is essential for complex programming projects with multiple inter-dependencies.
Necessary items:
Good coding style will make you more efficient even if you are the only person who reads it. When your code is read by multiple readers or you are developing code with co-workers, having a consistent style is even more important.
First we’ll separate the workflow (personal taste and habits, e.g., IDE) from the product (logic and output). The product, in our case raw and cleaned data, should not be influenced by workflow-related operations.
Use them!
Do not save your work environment - you should rerun your code each time to make sure you haven’t broken anything. If the analyses are lengthy or complex then save the output, which can then be read back in. In RStudio you can go to
Tools > Global Options > Save workspace to .RData on exit
and change it to Never, and your workspace will not be saved.
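One way to save lengthy output without saving the whole workspace is sketched below. It uses base R's saveRDS() and readRDS(); the object name `fit` and the temporary file are placeholders for illustration, not part of the original workflow.

```r
# Stand-in for a lengthy analysis; `fit` is a hypothetical result object
fit <- lm(mpg ~ wt, data = mtcars)

# Save only the output (a temporary file is used here; a project would
# normally write to an output folder)
out_file <- file.path(tempdir(), "fit.rds")
saveRDS(fit, out_file)

# In a later session, read the saved output instead of rerunning the analysis
fit_saved <- readRDS(out_file)
coef(fit_saved)
```

Saving individual objects this way keeps each script rerunnable from a clean session while skipping only the slow step.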
These habits are synergistic: you’ll get the biggest payoff if you practice all of them together.
Bad methods
Meh methods
Helpful package imports (aren’t always helpful…): the method below is a halfway step to an R package but may lead to unintended conflicts.
libs <- c("tidyverse", "RODBC", "lubridate")
# find the packages that are not installed yet (checked once, then reused)
missing_libs <- libs[!(libs %in% rownames(installed.packages()))]
if (length(missing_libs) > 0) {
  install.packages(missing_libs)
}
# attach every package; invisible() hides the list that lapply() returns
invisible(lapply(libs, library, character.only = TRUE))
Good methods
Use an appropriate IDE organized into self-contained projects. RStudio
is extremely well set up for a project-oriented workflow. Using
is another option.
Using here(): Do not use absolute paths.
[1] "C:/Users/Ben.Williams/Work"
There is a bit more to using here
which can be found
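As a small illustration of project-relative paths, the sketch below builds a file path with here::here(). It assumes the here package is installed and that the code is run from inside a project; the folder layout (year/data/output) is only an example.

```r
library(here)

# here() returns the project root; the value depends on your machine
here::here()

# Build a path relative to the project root instead of hard-coding it.
# The folder layout below is assumed for illustration.
fsc_path <- here::here("2022", "data", "output", "fish_size_comp.csv")
fsc_path
```

Because the path is assembled from the project root, the same script runs unchanged on a co-worker's computer.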
If you are using the R GUI - well, just stop, there are much better tools (RStudio, VSCode, Emacs) that all support projects.
|-- year
|--catch.csv # cleaned catch data
|--fsh_age_comp.csv # fishery age comp data
The general structure of a script should have:
# load ----
# data ----
Section comments that end in ---- create breaks that make it easy to navigate
within your scripts and make lengthy analyses much easier to
follow. For example:
# notes ----
# This is a demonstration of how scripts should be setup
# Author: Ben Williams
# contact:
# Last edited: 2022-04
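# load ----
library(tidyverse) # provides %>%, rename_all() and ggplot()
library(scico)     # provides scale_color_scico_d()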
# data ----
# typically this would be read_csv("data/iris.csv")
# but the iris dataset is built in
# change names
iris %>%
  rename_all(tolower) -> iris
# eda ----
iris %>%
  ggplot(aes(sepal.length, sepal.width, color = species)) +
  geom_point() +
  ylab('Sepal Width') +
  xlab('Sepal Length') +
  scale_color_scico_d(palette = 'roma')
# analysis ----
Consistent scripts are handy; however, we are building an R package, so we will want to follow some package development protocols.
Do use: lower_case_and_underscore.R
for different search strings
If you are changing files or versions often then adding a date can be
or add a version: data_2017_01_20_v2.csv
Do not use: DontUseMixedCases.R
no yelling
Definitely don’t use a name that only you can understand: X10_20.16_T_AND_Y410.csv
Be consistent within and across projects. For example, all data file names contain the same information: datasource_briefdescription_firstyear_lastyear.csv
Example: survey_bio_1988_2016.csv
However, this will be challenging when developing functions to work for multiple years, in which case it may be easier to use simpler names (e.g., survey_bio.csv); then your folder structure becomes very important.
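The naming pattern above can also be generated in code so it stays consistent. A minimal sketch, where `data_file_name()` is a hypothetical helper written for this example:

```r
# Hypothetical helper: builds datasource_briefdescription_firstyear_lastyear.csv
data_file_name <- function(datasource, description, first_year, last_year) {
  paste0(paste(datasource, description, first_year, last_year, sep = "_"), ".csv")
}

data_file_name("survey", "bio", 1988, 2016)
# "survey_bio_1988_2016.csv"
```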
Rename all columns on import of a dataset. Do this at the beginning of a script (or function) to make sure that files can be joined without naming conflicts.
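A minimal base R sketch of renaming on import follows. The data frame stands in for a file that was just read in, and its column names are made up for the example.

```r
# Small stand-in for an imported file; the column names are made up
raw <- data.frame(YEAR = 2015:2017, `Survey CPUE` = c(10.2, 11.5, 9.8),
                  check.names = FALSE)

# Standardize names right after import: lower case, underscores for spaces
names(raw) <- gsub(" ", "_", tolower(names(raw)))
names(raw)
# "year" "survey_cpue"
```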
Keep names short, but descriptive, e.g., Year or std.cpue (update: I’ve moved away from using . and now use _).
No spaces.
catch works better than c, and you can search for and change catch (plus c() is a function command…).
Know your data types. If an object is a character or factor use a capital letter; if it is numeric (double or integer) use a lower case name. Define data types at the beginning of a script.
For example, you have the integer year in your data but you create figures based on the factor. If you keep the two separate you have the ability to easily plot and run various analyses without getting errors, plus:
Year <- factor(year)
is quite a bit different than
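To illustrate why keeping the integer and factor versions separate matters, here is a short base R sketch with made-up years:

```r
year <- c(2015L, 2016L, 2017L)  # integer, used for calculations
Year <- factor(year)            # factor, used for grouping and plotting

mean(year)                      # works on the integer version: 2016
as.integer(Year)                # returns the level codes 1, 2, 3, not the years
as.integer(as.character(Year))  # recovers 2015, 2016, 2017
```

Converting a factor straight to a number silently returns level codes, which is the kind of error the naming convention helps avoid.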
Instead of writing catch_kg, make a note at the beginning of the script that catch is in kg unless otherwise stated, then use catch. Making notes in your code files is just good practice in general.
Believe it or not there is an international date/time standard (ISO 8601); the format is: yyyy-mm-dd
Use it! Dates regularly have confounding errors - clean them at the beginning of a script and make all formats consistent (because what people send you will not be).
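A minimal base R sketch of cleaning dates into the ISO 8601 format; the input values and their formats are made up for the example.

```r
# Dates as they might arrive from different sources (made-up values)
us_style <- "01/20/2017"   # mm/dd/yyyy
eu_style <- "20.01.2017"   # dd.mm.yyyy

# Parse each with its own format; Date objects print as yyyy-mm-dd
d1 <- as.Date(us_style, format = "%m/%d/%Y")
d2 <- as.Date(eu_style, format = "%d.%m.%Y")

format(d1)  # "2017-01-20"
d1 == d2    # TRUE
```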
Place spaces around all binary operators (=, -, <-, etc.). Do not place a space before a comma, but always place one after. Place a space before a left parenthesis, except in a function call. There should be a hard return after each pipe %>%
What does all that mean?
Do this
mtcars %>%
  tibble::rownames_to_column('car') %>%
  mutate(ratio = mpg / hp,
         Cyl = factor(cyl)) %>%
  ggplot(aes(qsec, ratio, color = Cyl)) +
  geom_point() +
  scale_color_scico_d(palette = 'roma')
Not this
The function() call in R is itself a function. All R functions have three parts: the formals(), the body(), and the environment(). Since we are creating an R package the majority of our functions will be in the package environment (as opposed to global). Since an R function returns the value of its last evaluated expression, we will not use return() at the end of our functions. The function formals() are the variables, TRUE/FALSE flags, or NULL values.
If a variable must have a value then it should be named or named with a value (e.g., year or year = 2017). If a variable will be TRUE or FALSE then set it at the most often used status (e.g., by_strata = FALSE). If the function is dependent upon a variable that may or may not be included then use NULL (e.g., alt_file = NULL).
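The three kinds of arguments can be sketched in one small function. `read_comp()` and the file names are hypothetical and exist only for this example.

```r
# Hypothetical function illustrating the three kinds of arguments
read_comp <- function(year, by_strata = FALSE, alt_file = NULL) {
  # use the alternate file only when one is supplied
  file <- if (is.null(alt_file)) "comp.csv" else alt_file
  # the last evaluated expression is returned, so no return() is needed
  list(year = year, by_strata = by_strata, file = file)
}

names(formals(read_comp))                          # "year" "by_strata" "alt_file"
read_comp(year = 2017)$file                        # "comp.csv"
read_comp(year = 2017, alt_file = "alt.csv")$file  # "alt.csv"
```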
What does this mean?
Do this
double <- function(x){
  x * 2
}
Not this
double <- function(x){
  d <- x * 2
  return(d)
}
We will use roxygen2 documentation - a standard function will look something like:
#' query fishery catch data
#' @param year max year to retrieve data from
#' @param species species group code e.g., "DUSK"
#' @param area sample area "GOA", "AI" or "BS" (or combos)
#' @param db data server to connect to
#' @param save saves a file to the data/raw folder, otherwise sends output to global enviro (default: TRUE)
#' @export
#' @examples
#' \dontrun{
#' q_catch(year = 2022, species = 'NORK', area = 'GOA', db = akfin, save = TRUE)
#' }
q_catch <- function(year, species, area, db, save = TRUE) {
  # function code...
}
We will use informative error messages throughout our functions -
this can be done using the stop() function:
q_catch <- function(year, species, area, db, save = TRUE) {
  area = toupper(area)
  # all() lets the check handle a combo of areas, e.g., c("BS", "AI")
  if (!all(area %in% c("GOA", "AI", "BS"))) {
    stop("the area name is incorrect it must be 'GOA', 'BS', 'AI', or a combo e.g., ('BS', 'AI')")
  }
  # function code...
}
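To see how such a message reaches the user, the sketch below pulls the same area check into a stand-alone function. `check_area()` is a hypothetical name used only here.

```r
# Hypothetical stand-alone check, mirroring the area validation above
check_area <- function(area) {
  area <- toupper(area)
  if (!all(area %in% c("GOA", "AI", "BS"))) {
    stop("the area name is incorrect it must be 'GOA', 'BS', 'AI', or a combo e.g., ('BS', 'AI')")
  }
  area
}

check_area(c("bs", "ai"))  # passes and returns the upper-case codes

# tryCatch() captures the error message so it can be inspected
msg <- tryCatch(check_area("ebs"), error = function(e) conditionMessage(e))
msg
```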
When building an R package that uses functions from other packages they must be called explicitly:
fsc_table <- function(year, folder){
  options(scipen = 999) # avoid scientific notation in the written table
  fsc = vroom::vroom(here::here(year, "data", "output", "fish_size_comp.csv"))
  # sample sizes, transposed so that each year becomes a column
  fsc %>%
    tidytable::select(n_s, n_h) %>%
    t(.) %>%
    as.data.frame() %>% # assumed step: rownames_to_column() needs a data frame
    tibble::rownames_to_column("name") -> samps
  # size compositions, reshaped to one column per year
  fsc %>%
    tidytable::select(-n_s, -n_h, -AA_Index) %>%
    tidytable::pivot_longer(-year) %>%
    tidytable::pivot_wider(names_from = year, values_from = value, names_prefix = "y") %>%
    tidytable::mutate(tidytable::across(tidytable::where(is.numeric), round, digits = 4)) %>%
    tidytable::mutate(name = gsub("X", "", name),
                      name = ifelse(tidytable::row_number() == tidytable::n(),
                                    paste0(name, "+"), name )) %>%
    tidytable::rename_with(~stringr::str_replace(., "y", "")) -> comp
  names(samps) <- base::names(comp)
  tidytable::bind_rows(comp, samps) %>%
    vroom::vroom_write(here::here(year, folder, "tables", "tbl_10_07.csv"), delim = ",")
}
In order to run this function we need to have five packages loaded
(vroom, here, tidytable, tibble, stringr). Note that base
is also called, but not technically necessary.
The packages can be added to the R package by listing them as dependencies in the DESCRIPTION
file. Once packages are added, the package needs to be updated.
We want to keep the number of dependencies to a minimum.
Establish assessments on an annual basis.
|-- raw_data Rpackage-1 # specific to data query
|-- cleaned_data Rpackage-2 # basic data cleaning
|-- eda Rpackage-2 # visualization of raw data
|-- assessment_functions Rpackage-2 # comps, .dat builder, process output, etc.
|-- safe_figs Rpackage-3 # consistent figures (both safe & presentation)