Prepare input data and parameters for REMA model

After the data are read into R (either manually from a .csv or other data file, or by using read_admb_re), this function prepares the data and parameter settings for fit_rema. The model can be set up to run in single-survey mode with one or more strata, or in multi-survey mode, which uses an additional relative abundance index (i.e. cpue) to inform predicted biomass. The optional inputs described below that relate to the CPUE survey data or the scaling parameter q, such as cpue_dat and q_options, are only used when multi_survey = 1. The function structure and documentation are modeled after wham::prepare_wham_input.

Usage

prepare_rema_input(
  model_name = "REMA for unnamed stock",
  multi_survey = 0,
  admb_re = NULL,
  biomass_dat = NULL,
  cpue_dat = NULL,
  sum_cpue_index = FALSE,
  start_year = NULL,
  end_year = NULL,
  wt_biomass = NULL,
  wt_cpue = NULL,
  PE_options = NULL,
  q_options = NULL,
  zeros = NULL,
  extra_biomass_cv = NULL,
  extra_cpue_cv = NULL
)

Arguments

model_name

name of stock or other identifier for REMA model

multi_survey

switch to run model in single or multi-survey mode. 0 (default) = single survey, 1 = multi-survey.

admb_re

list object returned from read_admb_re, which includes biomass survey data (admb_re$biomass_dat), optional cpue survey data (admb_re$cpue_dat), years for model predictions (admb_re$model_yrs), and model predictions of log biomass by strata in the correct format for input into REMA (admb_re$init_log_biomass_pred). If supplied, the user does not need to enter biomass_dat or cpue_dat.

biomass_dat

data.frame of biomass survey data in long format with the following columns:

strata: character; the survey name, survey region, management unit, or depth stratum. Note that the user must include this column even if there is only one survey stratum
year: integer; survey year. Note that the user only needs to include years for which there are observations (i.e. there is no need to supply NULL or NA values for missing survey years)
biomass: numeric; the biomass estimate/observation (e.g. bottom trawl survey biomass in mt). By default, if biomass == 0, this value will be treated as an NA (i.e., a failed survey). If the user wants to make other assumptions about zeros (e.g. adding a small constant), they must define it in the data manually.
cv: numeric; the coefficient of variation (CV) of the biomass estimate (i.e. sd(biomass)/biomass)
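
As a minimal sketch of the long format described above, the data.frame below holds two strata with observed years only. The stratum names and all numbers are made up for illustration.

```r
# Hypothetical survey values, for illustration only.
biomass_dat <- data.frame(
  strata  = c("Shelf", "Shelf", "Shelf", "Slope", "Slope"),
  year    = c(2015L, 2017L, 2019L, 2015L, 2019L), # observed years only
  biomass = c(1200.5, 980.2, 1105.0, 310.7, 0),    # the 0 is treated as NA by default
  cv      = c(0.15, 0.22, 0.18, 0.30, 0.35),
  stringsAsFactors = FALSE
)
str(biomass_dat)
```

Missing survey years (2016 and 2018 here) are simply left out rather than entered as NA.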

cpue_dat

(optional) data.frame of relative abundance index (i.e. cpue) data in long format with the following columns:

strata: character; the survey name, survey region, management unit, or depth stratum (note that the user must include this column even if there is only one survey stratum)
year: integer; survey year. Note that the user only needs to include years for which there are observations (i.e. there is no need to supply NULL or NA values for missing survey years)
cpue: numeric; the cpue estimate/observation (e.g. longline survey cpue or relative population number). By default, if cpue == 0, this value will be treated as an NA (i.e., a failed survey). If the user wants to make other assumptions about zeros (e.g. adding a small constant), they must define it in the data manually.
cv: numeric; the coefficient of variation (CV) of the cpue estimate (i.e. sd(cpue)/cpue)
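
The sketch below shows a single-stratum CPUE index, where the strata column is still required. The stratum name and values are hypothetical.

```r
# Hypothetical CPUE index with one stratum; the strata column is still present.
cpue_dat <- data.frame(
  strata = "Shelf",
  year   = 2015:2019,                      # integer years
  cpue   = c(4.1, 3.8, 3.5, 3.9, 4.4),
  cv     = c(0.10, 0.12, 0.11, 0.10, 0.13),
  stringsAsFactors = FALSE
)
str(cpue_dat)
```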

sum_cpue_index

logical (TRUE/FALSE or 1/0); whether the CPUE survey index can be summed across strata to obtain a total CPUE survey index. For example, longline survey relative population numbers (RPNs) are summable, but longline survey numbers per hachi (CPUE) are not. Default = FALSE.

start_year

(optional) integer value specifying the start year for estimation in the model; if admb_re is supplied, this value defaults to start_year = min(admb_re$model_yrs); if admb_re is not supplied, this value defaults to the first year in either biomass_dat or cpue_dat

end_year

(optional) integer value specifying the last year for estimation in the model; if admb_re is supplied, this value defaults to end_year = max(admb_re$model_yrs); if admb_re is not supplied, this value defaults to the last year in either biomass_dat or cpue_dat

wt_biomass

(optional) a multiplier on the biomass survey data component of the negative log likelihood. For example, nll = wt_biomass * nll. Defaults to wt_biomass = 1

wt_cpue

(optional) a multiplier on the CPUE survey data component of the negative log likelihood. For example, nll = wt_cpue * nll. Defaults to wt_cpue = 1

PE_options

(optional) customize implementation of process error (PE) parameters, including options to share PE across biomass survey strata, change starting values, fix parameters, and add penalties or priors (see details)

q_options

(optional) customize implementation of scaling parameters (q), including options to define q by biomass or CPUE survey strata, change starting values, fix parameters, and add penalties or priors (see details). Only used when multi_survey = 1

zeros

(optional) define assumptions about how to treat zero biomass or CPUE observations, including treating zeros as NAs, changing the zeros to small constants with fixed CVs, or modeling the zeros using a Tweedie distribution (see details).

extra_biomass_cv

(optional) estimate additional observation error for the biomass survey data (see details). By default, assumption = "extra_cv" will estimate one extra CV parameter, regardless of the number of biomass survey strata.

extra_cpue_cv

(optional) estimate additional observation error for the CPUE survey data (see details). By default, assumption = "extra_cv" will estimate one extra CV parameter, regardless of the number of CPUE survey strata.

Value

This function returns a named list with the following components:

data: Named list of data, passed to TMB::MakeADFun
par: Named list of parameters, passed to TMB::MakeADFun
map: Named list defining how to optionally collect and fix parameters, passed to TMB::MakeADFun
random: Character vector of parameters to treat as random effects, passed to TMB::MakeADFun
model_name: Name of stock or other identifier for REMA model
biomass_dat: A tidied long format data.frame of the biomass survey observations and associated CVs by strata. This data.frame will be 'complete' in that it will include all modeled years, with missing values treated as NAs. Note that this data.frame could differ from the admb_re$biomass_dat or input biomass if assumptions about zero biomass observations are different between the ADMB model and what the user specifies for REMA. The user can change their assumptions about zeros using the zeros argument.
cpue_dat: If optional CPUE survey data are provided and multi_survey = 1, this will be a tidied long-format data.frame of the CPUE survey observations and associated CVs by strata. This data.frame will be 'complete' in that it will include all modeled years, with missing values treated as NAs. Note that this data.frame could differ from the admb_re$cpue_dat or input cpue_dat if assumptions about zero CPUE observations are different between the ADMB model and what the user specifies for REMA. The user can change their assumptions about zeros using the zeros argument. If optional CPUE survey data are not provided or multi_survey = 0, this object will be NULL.
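
To illustrate what a 'complete' data.frame means here, the base R sketch below expands hypothetical observations to every stratum-by-year combination and fills unobserved years with NA. It is an illustration of the idea only, not the package's internal code.

```r
# Hypothetical observations with gaps in the survey years.
obs <- data.frame(
  strata  = c("A", "A", "B"),
  year    = c(2018L, 2020L, 2019L),
  biomass = c(100, 120, 50),
  cv      = c(0.20, 0.25, 0.30),
  stringsAsFactors = FALSE
)
model_yrs <- 2018:2020

# Every strata x year combination, then a left join onto the observations.
grid <- expand.grid(strata = unique(obs$strata), year = model_yrs,
                    stringsAsFactors = FALSE)
complete_dat <- merge(grid, obs, by = c("strata", "year"), all.x = TRUE)
complete_dat <- complete_dat[order(complete_dat$strata, complete_dat$year), ]
print(complete_dat)
```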

Details

PE_options allows the user to specify options for process error (PE) parameters. If NULL, default PE specifications are used: one PE parameter is estimated for each biomass survey stratum, initial values for log_PE are set to 1, and no penalties or priors are added. The user can modify the default PE_options using the following list of entries:

$pointer_PE_biomass

An index to customize the assignment of PE parameters to individual biomass strata. Vector with length = number of biomass strata, starting with an index of 1 and ending with the number of unique PE estimated. For example, if there are three biomass survey strata and the user wants to estimate only one PE, they would specify pointer_PE_biomass = c(1, 1, 1). By default there is one unique log_PE estimated for each unique biomass survey stratum

$initial_pars

A vector of initial values for log_PE. The default initial value for each log_PE is 1.

$fix_pars

Option to fix PE parameters, where the user specifies the index value of the PE parameter they would like to fix at the initial value. For example, if there are three biomass survey strata, and the user wants to fix the log_PE for the second stratum but estimate the log_PE for the first and third strata, they would specify fix_pars = c(2). Note that this option is not recommended.

$penalty_options

Warning: the following options are experimental and not well-tested. Options for penalizing the PE likelihood or adding a prior on log_PE include the following:

"none": (default) no penalty or prior used
"wt": a multiplier on the PE and random effects component of the negative log likelihood. For example, nll = wt * nll, where wt = 1.5 is specified as a single value in the penalty_values argument
"squared_penalty": As implemented in an earlier version of the RE.tpl, this penalty prevents the PE from shrinking to zero. For example, nll = nll + (log_PE + squared_penalty)^2, where squared_penalty = 1.5. A vector of squared_penalty values is specified for each PE in the penalty_values argument
"normal_prior": Normal prior in log space, where nll = nll - dnorm(log_PE, pmu_log_PE, psig_log_PE, 1) and pmu_log_PE and psig_log_PE are specified for each PE parameter in the penalty_values argument

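The penalty formulas above can be worked through numerically. In this sketch nll stands for the PE and random effects component of the negative log likelihood, and its value, like log_PE, is a made-up number. R's dnorm(..., log = TRUE) plays the role of the trailing 1 in the dnorm call shown above.

```r
log_PE <- -1.2  # hypothetical current value of the parameter
nll    <- 10    # hypothetical unpenalized PE/random effects component

# "wt": multiply the component by a weight
nll_wt <- 1.5 * nll

# "squared_penalty": add (log_PE + squared_penalty)^2
squared_penalty <- 1.5
nll_sq <- nll + (log_PE + squared_penalty)^2

# "normal_prior": subtract the log density of the prior
pmu_log_PE  <- 1.0
psig_log_PE <- 0.08
nll_prior <- nll - dnorm(log_PE, mean = pmu_log_PE, sd = psig_log_PE, log = TRUE)

c(wt = nll_wt, squared_penalty = nll_sq, normal_prior = nll_prior)
```

Because log_PE sits far from the prior mean in this example, the prior adds a large amount to the negative log likelihood.
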
$penalty_values

User-defined values for the penalty_options. Each penalty type is entered as follows:

"none": (default) NULL. For example, penalty_values = NULL
"wt": a single numeric value. For example, penalty_values = 1.5
"squared_penalty": a vector of numeric values with length = number of PE parameters. For example, if three PE parameters are being estimated and the user wants the same penalty for each one, they would use penalty_values = c(1.5, 1.5, 1.5)
"normal_prior": a vector of paired values for each PE parameter, where each pair is the prior mean of log_PE (pmu_log_PE) and the associated standard deviation (psig_log_PE). For example, if three PE parameters are being estimated and the user wants them to have the same normal prior of log_PE ~ N(1.0, 0.08), penalty_values = c(c(1.0, 0.08), c(1.0, 0.08), c(1.0, 0.08))

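A PE_options list might be assembled as in the sketch below, assuming three biomass strata where the first two share a PE parameter. The specific values are illustrative, and the length of initial_pars is assumed to match the number of unique PE parameters.

```r
# Three biomass strata, two unique PE parameters (strata 1 and 2 share one).
pointer <- c(1, 1, 2)
n_PE <- length(unique(pointer))

PE_options <- list(
  pointer_PE_biomass = pointer,
  initial_pars       = rep(1, n_PE),          # assumed: one value per unique log_PE
  penalty_options    = "normal_prior",
  penalty_values     = c(c(1.0, 0.08), c(1.0, 0.08))  # (mean, sd) pair per PE
)

# Nested c() calls flatten, so the pairs are stored as one plain vector.
PE_options$penalty_values
```
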
q_options allows the user to specify options for the CPUE survey scaling parameters (q). If multi_survey = 0 (default), no q parameters are estimated regardless of what the user defines in q_options. If multi_survey = 1 and q_options = NULL, default q specifications are used: one q parameter is estimated for each CPUE survey stratum, biomass and CPUE surveys are assumed to share strata definitions (i.e., biomass_dat and cpue_dat have the same number of strata and those strata represent the same units), initial values for log_q are set to 1, and no penalties or priors are added. The user can modify the default q_options using the following list of entries:

$pointer_q_cpue

An index to customize the assignment of q parameters to individual CPUE survey strata. Vector with length = number of CPUE strata, starting with an index of 1 and ending with the number of unique q parameters estimated. For example, if there are three CPUE survey strata and the user wanted to estimate only one q, they would specify pointer_q_cpue = c(1, 1, 1). The recommended model configuration is to estimate one log_q for each CPUE survey stratum.

$pointer_biomass_cpue_strata

An index to customize the assignment of biomass predictions to individual CPUE survey strata. Vector with length = the number of biomass survey strata, starting with an index of 1 and ending with the number of unique CPUE survey strata. This pointer only needs to be defined if the number of biomass and CPUE strata are not equal. The pointer_biomass_cpue_strata option allows the user to calculate predicted biomass at the CPUE survey strata level under the scenario where the biomass survey strata are at a higher resolution than the CPUE survey strata. For example, if there are 3 biomass survey strata that are represented by only 2 CPUE survey strata, the user may specify pointer_biomass_cpue_strata = c(1, 1, 2). This specification would assign the first 2 biomass strata to the first CPUE stratum, and the third biomass stratum to the second CPUE stratum. If there are no CPUE data to complement a specific biomass stratum, the user can populate these with NAs. For example, if pointer_biomass_cpue_strata = c(1, NA, 3), it means there are CPUE data for biomass strata 1 and 3 but not 2. NOTE: there cannot be a scenario where there are more CPUE survey strata than biomass survey strata because the CPUE survey is used to inform the biomass survey trend. An error will be thrown if q_options$pointer_biomass_cpue_strata is not defined and the biomass and CPUE survey strata definitions are not the same.
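
The c(1, 1, 2) mapping described above can be pictured with the base R sketch below, which sums hypothetical biomass predictions up to the CPUE strata level. The stratum names and predicted values are invented, and the aggregation shown is only an illustration of the idea, not the model's internal code.

```r
# Three biomass strata mapped onto two coarser CPUE strata.
cpue_strata <- c("EastCentral", "West")
pointer_biomass_cpue_strata <- c(1, 1, 2)

# Hypothetical predicted biomass by biomass stratum.
pred_biomass <- c(East = 100, Central = 250, West = 75)

# Sum biomass predictions within each CPUE stratum.
pred_at_cpue <- tapply(pred_biomass,
                       cpue_strata[pointer_biomass_cpue_strata],
                       sum)
print(pred_at_cpue)

q_options <- list(pointer_biomass_cpue_strata = pointer_biomass_cpue_strata)
```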

$initial_pars

A vector of initial values for log_q. The default initial value for each log_q is 1.

$fix_pars

Option to fix q parameters, where the user specifies the index value of the q parameter they would like to fix at the initial value. For example, if there are three CPUE survey strata, and the user wants to fix the log_q for the second stratum but estimate the log_q for the first and third strata, they would specify fix_pars = c(2)

$penalty_options

Options for penalizing the q likelihood or adding a prior on log_q include the following:

"none": (default) no penalty or prior used
"normal_prior": Warning, experimental and not well-tested. Normal prior in log space, where nll = nll - dnorm(log_q, pmu_log_q, psig_log_q, 1) and pmu_log_q and psig_log_q are specified for each q parameter in the penalty_values argument

$penalty_values

User-defined values for the penalty_options. Each penalty type is entered as follows:

"none": (default) NULL. For example, penalty_values = NULL
"normal_prior": a vector of paired values for each q parameter, where each pair is the prior mean of log_q (pmu_log_q) and the associated standard deviation (psig_log_q). For example, if 2 q parameters are being estimated and the user wants them to have the same normal prior of log_q ~ N(1.0, 0.05), penalty_values = c(c(1.0, 0.05), c(1.0, 0.05))

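As a sketch of fixing one scaling parameter, the q_options list below assumes three CPUE strata with one log_q each and holds the second at its initial value. The initial values are illustrative, and the list is only used when multi_survey = 1.

```r
# Three CPUE strata, one log_q per stratum; fix the second at its initial value.
q_options <- list(
  pointer_q_cpue = c(1, 2, 3),
  initial_pars   = c(1, 0.5, 1),  # hypothetical starting values for log_q
  fix_pars       = c(2)           # index of the q parameter to fix
)

# The value the fixed parameter will be held at.
q_options$initial_pars[q_options$fix_pars]
```
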
zeros allows the user to specify options for how to treat zero biomass or CPUE survey observations. By default, zero observations are treated as NAs and a warning message to that effect is returned to the console. zeros allows the user to specify non-default zero assumptions using the following list of entries:

$assumption: character, name of the assumption to use. Only three alternatives are currently implemented, zeros = list(assumption = c("NA", "small_constant", "tweedie")). "NA" is the default; this option assumes the zero estimates are failed surveys and removes them. "small_constant" is an ad hoc method where a fixed value is added to the zero with an assumed CV. By default, the small constant = 0.0001 and the CV is the value entered by the user in the data. The user can change the assumed value and CV using options_small_constant. "tweedie" uses the Tweedie as the assumed error distribution of the survey data, which allows zeros. This alternative estimates one additional power parameter. The assumed CV for zero biomass or zero cpue survey observations defaults to 1.5. The user can change this assumed CV, change initial values for the inverse logit transformed power parameter, or fix it at initial values using options_tweedie.
$options_small_constant: a numeric vector of length two. The first value is the small constant to add to the zero observation; the second is the user-defined coefficient of variation (CV) for this value. The user can specify the small value but use the input CV by specifying an NA for the second value. E.g., 'options_small_constant = c(0.0001, NA)'.
$options_tweedie: a list of entries to control initial or fixed values for Tweedie parameters. Currently, this argument accepts the following entries:
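
For the "small_constant" alternative, a zeros list might look like the sketch below. The constant 0.01 is an arbitrary illustrative choice, and the NA keeps the CV supplied in the data.

```r
# Replace zeros with a small constant and keep the CV from the input data.
zeros <- list(
  assumption             = "small_constant",
  options_small_constant = c(0.01, NA)  # (constant, CV); NA = use input CV
)
str(zeros)
```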

extra_biomass_cv allows the user to specify options for estimating an additional CV parameter (log_tau_biomass in the source code, estimated in log-space) for the biomass survey observations. If extra_biomass_cv = NULL (default), no extra CV is estimated. The user can modify the default extra_biomass_cv options using the following list of entries:

$assumption: A string identifying what assumption is used for the biomass survey observations. Options include "none" (default in which no extra CV is estimated) or "extra_cv". If assumption = "extra_cv", by default only one extra CV will be estimated, regardless of how many biomass strata are defined. If extra_biomass_cv is not NULL, the user must define an appropriate assumption.
$pointer_extra_biomass_cv: An index to customize the assignment of extra CV parameters to individual biomass survey strata. Vector with length = number of biomass strata, starting with an index of 1 and ending with the number of unique extra CV parameters estimated. If there are three biomass survey strata and the user wants to estimate an extra CV per stratum, they would specify pointer_extra_biomass_cv = c(1, 2, 3). By default, only one additional parameter is estimated, regardless of how many strata are defined (i.e. pointer_extra_biomass_cv = c(1, 1, 1)).
$initial_pars: A vector of initial values for the extra biomass log_tau_biomass. The default initial value for each log_tau_biomass is log(1e-7) (approximately 0 on the arithmetic scale).
$fix_pars: Option to fix extra biomass CV parameters, where the user specifies the index value of the parameter they would like to fix at the initial value. For example, if there are three biomass survey strata defined in pointer_extra_biomass_cv, and the user wants to fix the log_tau_biomass for the second stratum but estimate the log_tau_biomass for the first and third strata they would specify fix_pars = c(2).
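
An extra_biomass_cv list could be put together as in the sketch below, assuming three biomass strata where the first two share an extra CV parameter. The pointer layout is illustrative, and the initial values repeat the documented default of log(1e-7).

```r
# Three biomass strata, two unique extra CV parameters.
pointer <- c(1, 1, 2)

extra_biomass_cv <- list(
  assumption               = "extra_cv",
  pointer_extra_biomass_cv = pointer,
  initial_pars             = rep(log(1e-7), length(unique(pointer)))
)

# Back-transform to confirm the starting values are near zero on the arithmetic scale.
exp(extra_biomass_cv$initial_pars)
```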

extra_cpue_cv allows the user to specify options for estimating an additional CV parameter (log_tau_cpue in the source code, estimated in log-space) for the cpue survey observations. If extra_cpue_cv = NULL (default), no extra CV is estimated. The user can modify the default extra_cpue_cv options using the following list of entries:

$assumption: A string identifying what assumption is used for the cpue survey observations. Options include "none" (default in which no extra CV is estimated) or "extra_cv". If assumption = "extra_cv", by default only one extra CV will be estimated, regardless of how many cpue strata are defined. If extra_cpue_cv is not NULL, the user must define an appropriate assumption.
$pointer_extra_cpue_cv: An index to customize the assignment of extra CV parameters to individual cpue survey strata. Vector with length = number of cpue strata, starting with an index of 1 and ending with the number of unique extra CV parameters estimated. If there are three cpue survey strata and the user wants to estimate an extra CV per stratum, they would specify pointer_extra_cpue_cv = c(1, 2, 3). By default, only one additional parameter is estimated, regardless of how many strata are defined (i.e. pointer_extra_cpue_cv = c(1, 1, 1)).
$initial_pars: A vector of initial values for the extra cpue log_tau_cpue. The default initial value for each log_tau_cpue is log(1e-7) (approximately 0 on the arithmetic scale).
$fix_pars: Option to fix extra cpue CV parameters, where the user specifies the index value of the parameter they would like to fix at the initial value. For example, if there are three cpue survey strata defined in pointer_extra_cpue_cv, and the user wants to fix the log_tau_cpue for the second stratum but estimate the log_tau_cpue for the first and third strata they would specify fix_pars = c(2).

Examples

if (FALSE) { # \dontrun{
# place holder for example code
} # }
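
A minimal single-survey call might look like the sketch below. The data are invented, the arguments follow the Usage section above, and the call is wrapped in requireNamespace() so the snippet still runs where the rema package is not installed.

```r
# Hypothetical biomass survey data for two strata.
biomass_dat <- data.frame(
  strata  = rep(c("Shelf", "Slope"), each = 3),
  year    = rep(c(2015L, 2017L, 2019L), times = 2),
  biomass = c(1200, 980, 1105, 310, 290, 335),
  cv      = c(0.15, 0.22, 0.18, 0.30, 0.28, 0.35),
  stringsAsFactors = FALSE
)

input <- NULL
if (requireNamespace("rema", quietly = TRUE)) {
  # Single-survey mode (multi_survey = 0 is the default), with one PE
  # parameter shared across both strata.
  input <- rema::prepare_rema_input(
    model_name  = "Example stock",
    biomass_dat = biomass_dat,
    PE_options  = list(pointer_PE_biomass = c(1, 1))
  )
  print(names(input))
}
```

The returned list is then intended to be passed to fit_rema.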