secondary documentation for ropensci std
pachadotdev committed Aug 15, 2024
1 parent bc374ea commit cff8bb5
Showing 12 changed files with 162 additions and 31 deletions.
25 changes: 23 additions & 2 deletions R/feglm.R
@@ -14,7 +14,9 @@
#' out. It is also possible to pass clustering variables to \code{\link{feglm}}
#' as \code{y ~ x | k | c}.
#' @param data an object of class \code{"data.frame"} containing the variables
#' in the model.
#' in the model. The expected input is a dataset with the variables specified
#' in \code{formula} and a number of rows at least equal to the number of
#' variables in the model.
#' @param family the link function to be used in the model. Similar to
#' \code{\link[stats]{glm.fit}} this has to be the result of a call to a family
#' function. Default is \code{gaussian()}. See \code{\link[stats]{family}} for
@@ -34,7 +36,26 @@
#' category. In this case, you should carefully inspect your model
#' specification.
#'
#' @return A named list of class \code{"feglm"}.
#' @return A named list of class \code{"feglm"}. The list contains the following
#' fifteen elements:
#' \item{coefficients}{a named vector of the estimated coefficients.}
#' \item{eta}{a vector of the linear predictor.}
#' \item{weights}{a vector of the weights used in the estimation.}
#' \item{hessian}{a matrix with the numerical second derivatives.}
#' \item{deviance}{the deviance of the model.}
#' \item{null_deviance}{the null deviance of the model.}
#' \item{conv}{a logical indicating whether the model converged}
#' \item{iter}{the number of iterations needed to converge}
#' \item{nobs}{a named vector with the number of observations used in the
#' estimation indicating the dropped and perfectly predicted observations}
#' \item{lvls_k}{a named vector with the number of levels in each fixed
#' effect}
#' \item{nms_fe}{a list with the names of the fixed effects variables}
#' \item{formula}{the formula used in the model}
#' \item{data}{the data used in the model after dropping non-contributing
#' observations}
#' \item{family}{the family used in the model}
#' \item{control}{the control list used in the model}
#'
#' @references Gaure, S. (2013). "OLS with Multiple High Dimensional Category
#' Variables". Computational Statistics and Data Analysis, 66.
21 changes: 19 additions & 2 deletions R/felm.R
@@ -5,8 +5,23 @@
#'
#' @inheritParams feglm
#'
#' @return The function \code{\link{felm}} returns a named list of class
#' \code{"felm"}.
#' @return A named list of class \code{"felm"}. The list contains the following
#' eleven elements:
#' \item{coefficients}{a named vector of the estimated coefficients.}
#' \item{fitted.values}{a vector of the estimated dependent variable.}
#' \item{weights}{a vector of the weights used in the estimation.}
#' \item{hessian}{a matrix with the numerical second derivatives.}
#' \item{null_deviance}{the null deviance of the model.}
#' \item{nobs}{a named vector with the number of observations used in the
#' estimation indicating the dropped and perfectly predicted observations}
#' \item{lvls_k}{a named vector with the number of levels in each fixed
#' effect}
#' \item{nms_fe}{a list with the names of the fixed effects variables}
#' \item{formula}{the formula used in the model}
#' \item{data}{the data used in the model after dropping non-contributing
#' observations}
#' \item{control}{the control list used in the model}
#'
#' @references Gaure, S. (2013). "OLS with Multiple High Dimensional Category
#' Variables". Computational Statistics and Data Analysis, 66.
@@ -36,6 +51,8 @@ felm <- function(formula = NULL, data = NULL, weights = NULL) {
names(reslist)[which(names(reslist) == "eta")] <- "fitted.values"

# reslist[["hessian"]] <- NULL
reslist[["conv"]] <- NULL
reslist[["iter"]] <- NULL
reslist[["family"]] <- NULL
reslist[["deviance"]] <- NULL
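
The hunk above shows how `felm` reshapes the list it receives from the GLM internals: `eta` is renamed to `fitted.values` and the GLM-only entries are dropped. A minimal base-R sketch of that list manipulation follows; the contents of `reslist` here are made-up placeholders (an assumption for illustration), not actual output from the package.

```r
# Hypothetical stand-in for the list returned by the fitting internals;
# the values are placeholders chosen only for illustration.
reslist <- list(
  coefficients = c(x = 1.5),
  eta = c(0.1, 0.2),
  conv = TRUE,
  iter = 3L,
  family = "gaussian",
  deviance = 0.5
)

# Rename the linear predictor, as in the hunk above
names(reslist)[which(names(reslist) == "eta")] <- "fitted.values"

# Assigning NULL with `[[<-` removes the element from the list
reslist[["conv"]] <- NULL
reslist[["iter"]] <- NULL
reslist[["family"]] <- NULL
reslist[["deviance"]] <- NULL

# Names of the elements that survive the clean-up
remaining <- names(reslist)
```

Because assigning `NULL` deletes the element rather than storing a `NULL` value, the documented `felm` return value lists neither `conv`, `iter`, `family`, nor `deviance`.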

28 changes: 27 additions & 1 deletion R/fenegbin.R
@@ -1,11 +1,15 @@
#' @title Negative Binomial model fitting with high-dimensional k-way fixed
#' effects
#'
#' @description A routine that uses the same internals as \code{\link{feglm}}.
#'
#' @inheritParams feglm
#'
#' @param init_theta an optional initial value for the theta parameter (see
#' \code{\link[MASS]{glm.nb}}).
#' @param link the link function. Must be one of \code{"log"}, \code{"sqrt"}, or
#' \code{"identity"}.
#'
#' @examples
#' # same as the example in fepoisson but with overdispersion/underdispersion
#' mod <- fenegbin(
@@ -15,7 +19,29 @@
#'
#' summary(mod)
#'
#' @return A named list of class \code{"feglm"}.
#' @return A named list of class \code{"feglm"}. The list contains the following
#' eighteen elements:
#' \item{coefficients}{a named vector of the estimated coefficients.}
#' \item{eta}{a vector of the linear predictor.}
#' \item{weights}{a vector of the weights used in the estimation.}
#' \item{hessian}{a matrix with the numerical second derivatives.}
#' \item{deviance}{the deviance of the model.}
#' \item{null_deviance}{the null deviance of the model.}
#' \item{conv}{a logical indicating whether the model converged}
#' \item{iter}{the number of iterations needed to converge}
#' \item{theta}{the estimated theta parameter}
#' \item{iter.outer}{the number of outer iterations}
#' \item{conv.outer}{a logical indicating whether the outer loop converged}
#' \item{nobs}{a named vector with the number of observations used in the
#' estimation indicating the dropped and perfectly predicted observations}
#' \item{lvls_k}{a named vector with the number of levels in each fixed
#' effect}
#' \item{nms_fe}{a list with the names of the fixed effects variables}
#' \item{formula}{the formula used in the model}
#' \item{data}{the data used in the model after dropping non-contributing
#' observations}
#' \item{family}{the family used in the model}
#' \item{control}{the control list used in the model}
#'
#' @export
fenegbin <- function(
3 changes: 3 additions & 0 deletions R/fepoisson.R
@@ -1,7 +1,10 @@
#' @title Poisson model fitting with high-dimensional k-way fixed effects
#'
#' @description A wrapper for \code{\link{feglm}} with
#' \code{family = poisson()}.
#'
#' @inheritParams feglm
#'
#' @examples
#' # same as the example in feglm but with less typing
#' mod <- fepoisson(
14 changes: 7 additions & 7 deletions R/srr-stats-standards.R
@@ -18,7 +18,7 @@
#' @srrstats {G1.5} *Software should include all code necessary to reproduce results which form the basis of performance claims made in associated publications.*
#' @srrstats {G1.6} *Software should include code necessary to compare performance claims with alternative implementations in other R packages.*
#' @srrstats {G2.0} *Implement assertions on lengths of inputs, particularly through asserting that inputs expected to be single- or multi-valued are indeed so.*
#' @srrstatsTODO {G2.0a} Provide explicit secondary documentation of any expectations on lengths of inputs
#' @srrstats {G2.0a} Provide explicit secondary documentation of any expectations on lengths of inputs
#' @srrstats {G2.1} *Implement assertions on types of inputs (see the initial point on nomenclature above).*
#' @srrstats {G2.1a} *Provide explicit secondary documentation of expectations on data types of all vector inputs.*
#' @srrstats {G2.2} *Appropriately prohibit or restrict submission of multivariate input to parameters expected to be univariate.*
@@ -52,24 +52,22 @@
#' @srrstats {G5.1} *Data sets created within, and used to test, a package should be exported (or otherwise made generally available) so that users can confirm tests and run examples.*
#' @srrstats {G5.2} *Appropriate error and warning behaviour of all functions should be explicitly demonstrated through tests. In particular,*
#' @srrstats {G5.2a} *Every message produced within R code by `stop()`, `warning()`, `message()`, or equivalent should be unique*
#' @srrstatsTODO {G5.2b} *Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.*
#' @srrstats {G5.2b} *Explicit tests should demonstrate conditions which trigger every one of those messages, and should compare the result with expected values.*
#' @srrstatsTODO {G5.3} *For functions which are expected to return objects containing no missing (`NA`) or undefined (`NaN`, `Inf`) values, the absence of any such values in return objects should be explicitly tested.*
#' @srrstatsTODO {G5.4} **Correctness tests** *to test that statistical algorithms produce expected results to some fixed test data sets (potentially through comparisons using binding frameworks such as [RStata](https://github.com/lbraglia/RStata)).*
#' @srrstats {G5.4a} *For new methods, it can be difficult to separate out correctness of the method from the correctness of the implementation, as there may not be reference for comparison. In this case, testing may be implemented against simple, trivial cases or against multiple implementations such as an initial R implementation compared with results from a C/C++ implementation.*
#' @srrstats {G5.4b} *For new implementations of existing methods, correctness tests should include tests against previous implementations. Such testing may explicitly call those implementations in testing, preferably from fixed-versions of other software, or use stored outputs from those where that is not possible.*
#' @srrstats {G5.4c} *Where applicable, stored values may be drawn from published paper outputs where code from original implementations is not available*
#' @srrstatsTODO {G5.6} **Parameter recovery tests** *to test that the implementation produces expected results given data with known properties. For instance, a linear regression algorithm should return expected coefficient values for a simulated data set generated from a linear model.*
#' @srrstatsTODO {G5.6a} *Parameter recovery tests should generally be expected to succeed within a defined tolerance rather than recovering exact values.*
#' @srrstatsTODO {G5.6b} *Parameter recovery tests should be run with multiple random seeds when either data simulation or the algorithm contains a random component. (When long-running, such tests may be part of an extended, rather than regular, test suite; see G5.10-G5.12, below).*
#' @srrstatsTODO {G5.7} **Algorithm performance tests** *to test that implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.*
#' @srrstats {G5.7} **Algorithm performance tests** *to test that implementation performs as expected as properties of data change. For instance, a test may show that parameters approach correct estimates within tolerance as data size increases, or that convergence times decrease for higher convergence thresholds.*
#' @srrstatsTODO {G5.8} **Edge condition tests** *to test that these conditions produce expected behaviour such as clear warnings or errors when confronted with data with extreme properties including but not limited to:*
#' @srrstats {G5.8a} *Zero-length data*
#' @srrstats {G5.8b} *Data of unsupported types (e.g., character or complex numbers for functions designed only for numeric data)*
#' @srrstats {G5.8c} *Data with all-`NA` fields or columns or all identical fields or columns*
#' @srrstats {G5.8d} *Data outside the scope of the algorithm (for example, data with more fields (columns) than observations (rows) for some regression algorithms)*
#' @srrstatsTODO {G5.9} **Noise susceptibility tests** *Packages should test for expected stochastic behaviour, such as through the following conditions:*
#' @srrstatsTODO {G5.9a} *Adding trivial noise (for example, at the scale of `.Machine$double.eps`) to data does not meaningfully change results*
#' @srrstatsTODO {G5.9b} *Running under different random seeds or initial conditions does not meaningfully change results*
#' @srrstats {G5.10} *Extended tests should be included and run under a common framework with other tests but be switched on by flags such as a `<MYPKG>_EXTENDED_TESTS="true"` environment variable.* - The extended tests can then be run automatically by GitHub Actions, for example by adding the following to the `env` section of the workflow:
#' @srrstats {G5.11} *Where extended tests require large data sets or other assets, these should be provided for downloading and fetched as part of the testing workflow.*
#' @srrstatsTODO {G5.12} *Any conditions necessary to run extended tests such as platform requirements, memory, expected runtime, and artefacts produced that may need manual inspection, should be described in developer documentation such as a `CONTRIBUTING.md` or `tests/README.md` file.*
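
Standard G5.10 in the hunk above describes switching extended tests on through an environment variable. A minimal base-R sketch of such a switch is shown here; the helper name `run_extended_tests()` and the `MYPKG` prefix are hypothetical and are not taken from this package.

```r
# Hypothetical helper: returns TRUE only when the flag variable is set to
# exactly "true". The "MYPKG" prefix is a placeholder for a package name.
run_extended_tests <- function(prefix = "MYPKG") {
  flag <- Sys.getenv(paste0(prefix, "_EXTENDED_TESTS"))
  identical(flag, "true")
}

# With the variable unset, Sys.getenv() returns "" and the check is FALSE
Sys.unsetenv("MYPKG_EXTENDED_TESTS")
flag_when_unset <- run_extended_tests()

# With the variable set to "true", the check is TRUE
Sys.setenv(MYPKG_EXTENDED_TESTS = "true")
flag_when_set <- run_extended_tests()

# Clean up so the session is left as it was found
Sys.unsetenv("MYPKG_EXTENDED_TESTS")
```

A test file could wrap long-running checks in `if (run_extended_tests()) { ... }` so that they are skipped in the regular suite.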
@@ -81,7 +79,6 @@
#' @srrstats {RE1.4} *Regression Software should document any assumptions made with regard to input data; for example distributional assumptions, or assumptions that predictor data have mean values of zero. Implications of violations of these assumptions should be both documented and tested.*
#' @srrstats {RE2.0} *Regression Software should document any transformations applied to input data, for example conversion of label-values to `factor`, and should provide ways to explicitly avoid any default transformations (with error or warning conditions where appropriate).*
#' @srrstats {RE2.1} *Regression Software should implement explicit parameters controlling the processing of missing values, ideally distinguishing `NA` or `NaN` values from `Inf` values (for example, through use of `na.omit()` and related functions from the `stats` package).*
#' @srrstatsTODO {RE2.2} *Regression Software should provide different options for processing missing values in predictor and response data. For example, it should be possible to fit a model with no missing predictor data in order to generate values for all associated response points, even where submitted response values may be missing.*
#' @srrstats {RE2.3} *Where applicable, Regression Software should enable data to be centred (for example, through converting to zero-mean equivalent values; or to z-scores) or offset (for example, to zero-intercept equivalent values) via additional parameters, with the effects of any such parameters clearly documented and tested.*
#' @srrstats {RE2.4} *Regression Software should implement pre-processing routines to identify whether aspects of input data are perfectly collinear, notably including:*
#' @srrstats {RE2.4a} *Perfect collinearity among predictor variables*
@@ -130,9 +127,12 @@ NULL
#' to `@srrstatsNA`, and placed together in this block, along with explanations
#' for why each of these standards have been deemed not applicable.
#' (These comments may also be deleted at any time.)
#' @srrstatsNA {G5.5} *Correctness tests should be run with a fixed random seed*
#' @srrstatsNA {G2.14c} *replace missing data with appropriately imputed values*
#' @srrstatsNA {G2.16} *All functions should also provide options to handle undefined values (e.g., `NaN`, `Inf` and `-Inf`), including potentially ignoring or removing such values.*
#' @srrstatsNA {G5.4} **Correctness tests** *to test that statistical algorithms produce expected results to some fixed test data sets (potentially through comparisons using binding frameworks such as [RStata](https://github.com/lbraglia/RStata)).*
#' @srrstatsNA {G5.5} *Correctness tests should be run with a fixed random seed*
#' @srrstatsNA {G5.9b} *Running under different random seeds or initial conditions does not meaningfully change results*
#' @srrstatsNA {G5.11a} *When any downloads of additional data necessary for extended tests fail, the tests themselves should not fail, rather be skipped and implicitly succeed with an appropriate diagnostic message.*
#' @srrstatsNA {RE2.2} *Regression Software should provide different options for processing missing values in predictor and response data. For example, it should be possible to fit a model with no missing predictor data in order to generate values for all associated response points, even where submitted response values may be missing.*
#' @noRd
NULL
25 changes: 23 additions & 2 deletions man/feglm.Rd

Some generated files are not rendered by default.

23 changes: 20 additions & 3 deletions man/felm.Rd

Some generated files are not rendered by default.
