Skip to content

Commit

Permalink
Add check for large files in 'data/' directory (Bioconductor#67)
Browse files Browse the repository at this point in the history
* Add checkLargeFiles to checks.R that checks if files with more than
5 MB exist in 'data/' and if yes recommends ExperimentHub/AnnotationHub
* Add unit test: creates large file with for loop
* Add call to BiocCheck.R that calls checkLargeFiles() on real package.
Uses same flag as checkIndivFileSizes
  • Loading branch information
const-ae committed May 18, 2020
1 parent 4687a90 commit 5fad21c
Show file tree
Hide file tree
Showing 3 changed files with 45 additions and 0 deletions.
6 changes: 6 additions & 0 deletions R/BiocCheck.R
Original file line number Diff line number Diff line change
Expand Up @@ -168,6 +168,12 @@ BiocCheck <- function(package=".", ...)
checkIndivFileSizes(package_dir)
}

if (is.null(dots[["no-check-file-size"]])){
handleCheck("Checking files sizes in data...")
checkLargeFiles(package_dir)
}


if (is.null(dots[["no-check-bioc-views"]]))
{
handleCheck("Checking biocViews...")
Expand Down
23 changes: 23 additions & 0 deletions R/checks.R
Original file line number Diff line number Diff line change
Expand Up @@ -219,6 +219,29 @@ checkIndivFileSizes <- function(pkgdir)
}
}

checkLargeFiles <- function(pkgdir){
pkgType <- getPkgType(pkgdir)
if (is.na(pkgType) || pkgType == "Software") {
maxSize <- 5*10^6 ## 5MB
allFiles <- list.files(file.path(pkgdir, "data"), all.files=TRUE)
allFilesFullName <- file.path(pkgdir, "data", allFiles)

This comment has been minimized.

Copy link
@grimbough

grimbough May 21, 2020

I haven't run this, but are you appending the full path to the file names here? If so, you can use list.files(..., full.names = TRUE) to do this in one step.

This comment has been minimized.

Copy link
@lshep

lshep May 21, 2020

I can't remember exactly without running it and debugging it -- I think it had to be done this way depending on whether BiocCheck was run in R as function vs command line and that it potentially be in a temporary directory so the full path wasn't quite what we wanted -- but I could be wrong and it might be possible to simplify -- wouldn't know without fully testing.


sizes <- file.size(allFilesFullName)
largeFiles <- paste(allFiles[sizes > maxSize], collapse=" ")
if (any(sizes > maxSize)) {
handleWarning(
"Found the following large files (over 5MB) in the ",
"'data/' folder: ",
paste0("'", largeFiles, "'", collapse = " "),
"\nConsider uploading them to ExperimentHub ",
"<https://www.bioconductor.org/packages/ExperimentHub/> or ",
"AnnotationHub <https://bioconductor.org/packages/AnnotationHub/>."
)
return(TRUE)
}
}
}

checkBiocViews <- function(pkgdir)
{
dirty <- FALSE
Expand Down
16 changes: 16 additions & 0 deletions inst/unitTests/test_BiocCheck.R
Original file line number Diff line number Diff line change
Expand Up @@ -275,6 +275,22 @@ test_badFiles <- function(){
unlink(pkgdir)
}

test_largeDataFile <- function(){
pkgdir <- create_test_package("testpkg")
dir.create(file.path(pkgdir, "data"))
datafile <- file.path(pkgdir, "data/large_data.rda")
file.create(datafile)
zz <- base::file(datafile, "w") # open an output file connection
for(iter in seq_len(3e5)){ # 1e5 iter = 2MB
cat("Line number: ", iter, "\n", file = zz)
}
close(zz)
BiocCheck:::checkLargeFiles(pkgdir)
checkEquals(1, .warning$getNum())
.zeroCounters()
unlink(pkgdir)
}

test_checkBBScompatibility <- function()
{
pkgdir <- UNIT_TEST_TEMPDIR
Expand Down

1 comment on commit 5fad21c

@lshep
Copy link

@lshep lshep commented on 5fad21c May 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the code it might make more sense to implement the check for large data files directly in the checkIndivFileSizes function since currently both functions would pick up on large data files as you pointed out.
You could filter the files picked up by the list.files for any that are in the data (maybe also inst/extdata?) directory vs other (make sure to test that it functions properly with or without a data directory present)

The above is probably recommended. If you continue with two seperate checks with is not as ideal:

  1. I would move the BiocCheck.R function into the section for
 if (is.null(dots[["no-check-file-size"]])){
} 
  1. consider changing the name perhaps checkLargeDataFiles or something indicating data
  2. make sure the two checks don't pick up the same files

Be sure to test the edge case of if the data directory is present or not
present since it does not have to exist in a package.

Unit Testing:
Do not create a large file. This should be avoided as we don't want to use up tmp space on the builders or anyones individual machines.
I would suggest making a helper function to pass in file sizes and a max size. Then in the test make the max size low to trigger the failure for testing.

Please sign in to comment.