- The function `get_dupes()` now uses tidyselect specification, the same as many tidyverse functions such as `dplyr::select()`. This allows columns to be excluded from consideration using `-column_name`, and supports the matching functions `starts_with()`, `ends_with()`, `contains()`, and `matches()`.
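
As a rough illustration of the tidyselect interface described above, the sketch below runs `get_dupes()` on a small made-up data.frame. The `people` data and its column names are hypothetical and exist only for this example.

```r
library(janitor)

# Hypothetical example data: rows 1 and 2 share the same id and name_first.
people <- data.frame(
  id         = c(1, 1, 2),
  name_first = c("Ann", "Ann", "Bob"),
  name_last  = c("Lee", "Ray", "Kim")
)

# Name the columns to check explicitly...
dupes_named <- get_dupes(people, id, name_first)

# ...or exclude a column from consideration with a minus sign.
dupes_excluded <- get_dupes(people, -name_last)

# tidyselect helpers also work. This checks id plus both name_ columns,
# and no combination of all three repeats in this data.
dupes_helper <- get_dupes(people, id, starts_with("name"))
```

The first two calls should flag the same two rows, since excluding `name_last` leaves exactly `id` and `name_first` to be compared.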
- A `quiet` argument was added to `remove_empty()` and `remove_constant()`, providing more information (when `FALSE`) (#70, thanks to @jbkunst for suggesting and @billdenney for implementing).
- `row_to_names()` will now work on matrix input (#320, thanks to @billdenney for suggesting and implementing).
- The new function `signif_half_up()` rounds a numeric vector to the specified number of significant digits with halves rounded up (#314, thanks to @khueyama for suggesting and implementing).
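
A minimal sketch of `signif_half_up()`, using values chosen because they are exactly representable in binary floating point. The vector `x` is just illustrative input.

```r
library(janitor)

# Illustrative values that sit exactly on a rounding boundary
# at two significant digits.
x <- c(12.5, 0.125)

# Halves are rounded up: 12.5 becomes 13 and 0.125 becomes 0.13.
rounded <- signif_half_up(x, digits = 2)
rounded

# For comparison, base R's signif() gives 12 for 12.5.
signif(12.5, 2)
```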
- The `name` argument to `adorn_totals()` is correctly applied to 3-way tabyls (#306). Thanks to @jzadra for reporting.
- `remove_constant()` works correctly with tibbles, in addition to data.frames and matrices, which already worked (thanks to @billdenney for implementing).
- The new function `make_clean_names()` takes a character vector and returns the cleaned text, with the same functionality as the existing `clean_names()`, which runs on a data.frame, manipulating its names (#197, thanks @tazinho and everyone who contributed to the discussion).
  This function can be supplied as a value for the `.name_repair` argument of `as_tibble()` in the `tibble` package. For example: `as_tibble(iris, .name_repair = make_clean_names)`.
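
The following sketch shows both uses mentioned above: cleaning a plain character vector, and passing the function as a name-repair strategy. The input names and the `messy` data.frame are made up for illustration.

```r
library(janitor)
library(tibble)

# Hypothetical messy names, including a camelCase name and a duplicate.
raw_names <- c("First Name", "variableName", "a", "a")

cleaned <- make_clean_names(raw_names)
cleaned
# "first_name" "variable_name" "a" "a_2"

# The same function used as a name repair strategy in tibble.
messy <- data.frame(`First Name` = 1:2, `variableName` = 3:4, check.names = FALSE)
repaired <- as_tibble(messy, .name_repair = make_clean_names)
names(repaired)
```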
- The new function `compare_df_cols()` compares the names and classes of columns in a set of supplied data.frames or tibbles, reporting on the specific columns that are or are not similar. This is for the common use case where a set of data files should all have the same specifications but, in practice, may not. A companion function `compare_df_cols_same()` gives a `TRUE`/`FALSE` result indicating if the columns are the same (and therefore bindable, though `FALSE` is not definitive that binding will fail).
  - Its helper function `describe_class()` is exported for developers who wish to extend it so that the `compare_df_` functions treat their custom classes appropriately.
  This feature (#50) took almost 3 years from conception to implementation. Major thanks to @billdenney for making it happen!
- A new function `round_to_fraction()` allows rounding to a fraction with specified denominator, e.g., to the nearest 1/7 (#235, thanks to @billdenney for suggesting & implementing).
- The functions `janitor::chisq.test()` and `janitor::fisher.test()` enable running these statistical tests from the base `stats` package on two-way `tabyl` objects. While the package loading message says the base functions are masked, the base tests still run on `table` objects (#255, thanks @juba for implementing).
- `remove_empty()` now has a companion function `remove_constant()` which removes columns containing only a single unique value, optionally ignoring `NA` (#222, thanks to @billdenney for suggesting & implementing).
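
To make the `compare_df_cols()` use case from this list concrete, here is a small sketch with three made-up monthly extracts. The data.frames `jan`, `feb`, and `mar` and their columns are hypothetical.

```r
library(janitor)

# Hypothetical monthly extracts that are supposed to share one layout.
jan <- data.frame(id = 1:3, amount = c(10.5, 20, 7.25))
feb <- data.frame(id = c("4", "5"), amount = c(3, 9), note = c("late", "ok"))
mar <- data.frame(id = 6:7, amount = c(1, 2))

# One row per column name, with the class of that column in each input.
comparison <- compare_df_cols(jan, feb)
comparison

# id is integer in jan but character in feb, so these are reported as not the same.
jan_feb_same <- compare_df_cols_same(jan, feb)

# jan and mar have identical column names and classes.
jan_mar_same <- compare_df_cols_same(jan, mar)
```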
- `excel_numeric_to_date()` now returns a POSIXct object and includes a time zone. (#225, thanks to @billdenney for the feature.)
- `clean_names()` can now be called on a simple features object from the `sf` package. (#247, thanks to @JosiahParry for suggesting & implementing.)
- `adorn_totals()` gains an argument `"name"` that allows the user to specify a value other than "Total" to appear as the name of the added row and/or column (#263). Thanks to @StephieLaPugh for suggesting and @daniel-barnett for implementing.
- `remove_empty()` and `remove_constant()` now work with matrices (returning a matrix). (#215) Thanks to @jsta for reporting and @billdenney for patching.
- If the third variable in a three-way tabyl is a factor, the resulting list is sorted in order of its levels (#250). Empty factor levels in the 3rd variable are still omitted regardless of the value of `show_missing_levels`.
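
As a brief sketch of the `name` argument to `adorn_totals()` listed above, the example below labels the totals row of a two-way tabyl built from the built-in `mtcars` data. The label "Grand total" is an arbitrary choice for illustration.

```r
library(janitor)

# Count cylinders by gears, then add a totals row with a custom label.
totals <- mtcars %>%
  tabyl(cyl, gear) %>%
  adorn_totals("row", name = "Grand total")

totals
```

The custom label appears in the first column of the added row in place of the default "Total".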
- `excel_numeric_to_date()` no longer gives an overflow error for integer input (for dates since 1968). (#241) Thanks to @hideaki for reporting and @billdenney for patching.
- `clean_names()` and `make_clean_names()` now support 'none' as a case option, passed through to `snakecase::to_any_case()`. (#269) Thanks to @andrewbarros for reporting and patching.
Patches a bug introduced in version 1.1.0 where `excel_numeric_to_date()` would fail if given an input vector containing an `NA` value.
`excel_numeric_to_date()` again handles `NA` correctly; in version 1.1.0 the function would error if any values of the input vector were `NA` (#220). Thanks @emilelatour for reporting and @billdenney for patching.
This release was requested by CRAN to address some minor package dependency issues. It also contains several updates and additions described below.
The new function `row_to_names()` handles the case where a dirty data file is read in with its names stored as a row of the data.frame, rather than in the names. This function sets the names of the data.frame to this row and optionally cleans up the rows above and including where the names were stored. Thanks to @billdenney for writing this feature.
`excel_numeric_to_date()` can now convert fractions of a day to time, e.g., `excel_numeric_to_date(43001.01, include_time = TRUE)` returns the POSIXlt value `"2017-09-23 00:14:24"`. Thanks to @billdenney.
As part of `excel_numeric_to_date()` now handling times, if a Date-only result is requested (the default behavior of `include_time = FALSE`), any fractional part of the date is now removed. The printed date itself is identical, but the internal representation of this object now contains only the integer part of the date. For example, while under both the old and new versions of this function the call `excel_numeric_to_date_old(42001.1)` would return the Date object `"2014-12-28"`, calling `as.numeric` on this Date result would previously return `16432.1`, while now it returns `16432`.
This is an improved behavior, as now `excel_numeric_to_date(42001.1, include_time = FALSE) == as.Date("2014-12-28")` returns TRUE, while previously it would appear to be equivalent from the printed value but this comparison would return FALSE.
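
The behavior described above can be checked with a short sketch like the following. It assumes a janitor version with the new behavior installed; the variable names are illustrative only.

```r
library(janitor)

# Date-only conversion (the default): the fractional day is dropped.
d <- excel_numeric_to_date(42001.1)

d == as.Date("2014-12-28")  # TRUE
as.numeric(d)               # 16432

# Requesting the time keeps the fractional part of the day as a time of day.
with_time <- excel_numeric_to_date(43001.01, include_time = TRUE)
with_time
```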
A stable version 1.0.0, with a new `tabyl` API and with breaking changes to the output of `clean_names()`.
This builds on the original functionality of janitor, with similar-but-improved tools and significantly-changed implementation.
`tabyl()` is now a single function that can count combinations of one, two, or three variables, à la base R's `table()`. The resulting `tabyl` data.frames can be manipulated and formatted using a family of `adorn_` functions. See the tabyls vignette for more.
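
A short sketch of the unified `tabyl()` interface and a typical `adorn_` chain, using the built-in `mtcars` data. The particular choice and order of `adorn_` calls is just one reasonable combination, not the only one.

```r
library(janitor)

# One-, two-, and three-way counts from the same function.
one_way   <- tabyl(mtcars, cyl)
two_way   <- tabyl(mtcars, cyl, gear)
three_way <- tabyl(mtcars, cyl, gear, am)  # a list of two-way tabyls, one per value of am

# Format the two-way result for presentation.
formatted <- two_way %>%
  adorn_totals("row") %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

formatted
```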
The now-redundant legacy functions `crosstab()` and `adorn_crosstab()` have been deprecated, but remain in the package for now. Existing code that relies on the version of `tabyl` present in janitor versions <= 0.3.1 will break if the `sort` argument was used, as that argument no longer exists in `tabyl` (use `dplyr::arrange()` instead).
`clean_names()` now detects and preserves camelCase inputs, allows multiple options for case outputs of the cleaned names, and preserves whether there's space between letters and numbers. It also transliterates accented letters and turns `#` into `"number"`.
These changes may cause old code to break. E.g., a raw column name `variableName` would now be converted to `variable_name` (or `variableName`, `VariableName`, etc. depending on your preference), where previously it would have been converted to `variablename`.
To minimize this inconvenience, there's a quick fix for compatibility: you can find-and-replace to insert the argument `case = "old_janitor"`, preserving the old behavior of `clean_names()` as of janitor version 0.3.1 (and thus not have to redo your scripts beyond that).
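
The difference between the new default and the compatibility option can be sketched as follows. The `raw` data.frame is hypothetical, and the example assumes the installed janitor version still accepts `case = "old_janitor"`.

```r
library(janitor)

# Hypothetical data.frame with a camelCase column name.
raw <- data.frame(variableName = 1:2)

# New default behavior: camelCase is detected and split.
new_names <- names(clean_names(raw))
new_names  # "variable_name"

# Compatibility option: reproduces the pre-1.0.0 result.
old_names <- names(clean_names(raw, case = "old_janitor"))
old_names  # "variablename"
```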
No further changes are planned to `clean_names()` and its results should be stable from version 1.0.0 onward.
- `clean_names()` transliterates accented letters, e.g., `çãüœ` becomes `cauoe` (#120). Thanks to @fernandovmacedo.
- `clean_names()` offers multiple options for variable name styling. In addition to `snake_case` output you can select `smallCamelCase`, `BigCamelCase`, `ALL_CAPS` and others. (#131).
  - Thanks to @tazinho, who wrote the snakecase package that janitor depends on to do this, as well as the patch to incorporate it into `clean_names()`. And thanks to @maelle for proposing this feature.
- Launched the janitor documentation website: http://sfirke.github.io/janitor. Thanks to the pkgdown package.
- Deprecated the functions `remove_empty_rows()` and `remove_empty_cols()`, which are replaced by the single function `remove_empty()`. (#100)
  - To encourage transparency, `remove_empty()` prints a message if no value is supplied for the `which` argument; to suppress this, supply a value to `which`, even if it's the default `c("rows", "cols")`.
- The new `adorn_title()` function adds the name of the 2nd `tabyl` variable (i.e., the name of the column variable). This un-tidies the data.frame but makes the result clearer to readers (#77)
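
As an illustration of the `remove_empty()` replacement described in this list, the sketch below supplies `which` explicitly, as the note on transparency suggests. The `sparse` data.frame is made up for the example.

```r
library(janitor)

# Hypothetical data with one entirely empty row (row 2) and one empty column (b).
sparse <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z")
)

# Supplying `which` explicitly avoids the informational message.
cleaned <- remove_empty(sparse, which = c("rows", "cols"))

dim(cleaned)    # 2 rows, 2 columns
names(cleaned)  # "a" "c"
```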
- The utility function `round_half_up()` is now exported for public use. It's an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.
- `tabyl` objects now print with row numbers suppressed
- `clean_names()` now retains the character `#` as `"number"` in the resulting names
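
For readers unfamiliar with why `round_half_up()` is useful, this small sketch contrasts it with base R's `round()`, which rounds exact halves to the nearest even number. The input vector is illustrative.

```r
library(janitor)

# Values that sit exactly halfway between two integers.
x <- c(0.5, 1.5, 2.5, -2.5)

base_rounded <- round(x)          # 0  2  2 -2  (halves go to the even neighbor)
half_up      <- round_half_up(x)  # 1  2  3 -3  (halves go away from zero)

base_rounded
half_up
```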
- `adorn_totals("row")` handles quirky variable names in 1st column (#118)
- `get_dupes()` returns the correct result when a variable in the input data.frame is already called `"n"` (#162)
This is a bug-fix release with no new functionality or changes. It fixes a bug where `adorn_crosstab()` failed if the `tibble` package was version > 1.4.
Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.
The primary purpose of this release is to maintain accuracy given breaking changes to the dplyr package, upon which janitor is built, in dplyr version >0.6.0. This update also contains a number of minor improvements.
Critical: if you update the package `dplyr` to version >0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor's `tabyl()` function. This is due to a change in the behavior of dplyr's `_join` functions (discussed in #111).
janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr >0.6.0.
- The functions `add_totals_row` and `add_totals_col` were combined into a single function, `adorn_totals()`. (#57). The `add_totals_` functions are now deprecated and should not be used.
- The first argument of `adorn_crosstab()` is now "dat" instead of "crosstab" (indicating that the function can be called on any data.frame, not just a result of `crosstab()`)
- Exported the `%>%` pipe from magrittr (#107).
Deprecated the following functions:
- `use_first_valid_of()` - use `dplyr::coalesce()` instead
- `convert_to_NA()` - use `dplyr::na_if()` instead
- `add_totals_row()` and `add_totals_col()` - replaced by the single function `adorn_totals()`
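
For code being migrated off the deprecated helpers, the suggested dplyr replacements can be sketched as below. The vectors are made up, and this shows only the basic call patterns, not a full equivalence with the old janitor functions.

```r
library(dplyr)

# Hypothetical vectors: take the first non-missing value across sources.
primary <- c(NA, "b", NA)
backup  <- c("x", "y", NA)

first_valid <- coalesce(primary, backup, "unknown")
first_valid  # "x" "b" "unknown"

# Convert a placeholder string to NA.
with_na <- na_if(c("a", "N/A", "c"), "N/A")
with_na  # "a" NA "c"
```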
- `adorn_totals()` and `ns_to_percents()` can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
- `adorn_totals("col")` retains factor class in 1st column if 1st column in the input data.frame was a factor
- `clean_names()` now handles leading spaces (#85)
- `adorn_crosstab()` and `ns_to_percents()` work on a 2-column data.frame (#89)
- `adorn_totals()` now works on a grouped tibble (#97)
- Long variable names with spaces no longer break `tabyl()` and `crosstab()` (#87)
- An `NA_` column in the result of a `crosstab()` will appear at the last column position (#109)
- `tabyl()` and `crosstab()` now appear in the package manual (#65)
- Fixed minor bug per CRAN request - `tabyl()` and `crosstab()` failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
- `add_totals_row()` works on two-column data.frame (#69)
- `use_first_valid_of()` returns POSIXct-class result when given POSIXct inputs
Submitted to CRAN!
- The count in `tabyl()` for factor levels that aren't present is now `0` instead of `NA` (#48)
- Can call `tabyl()` on the result of a `tabyl()`, e.g., `mtcars %>% tabyl(mpg) %>% tabyl(n)` (#54)
- `get_dupes()` now works on variables with spaces in column names (#62)
- Reached 100% unit test code coverage
- Added a function `adorn_crosstab()` that formats the results of a `crosstab()` for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g., `mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab()`.
- `crosstab()` can be called in a `%>%` pipeline, e.g., `mtcars %>% crosstab(cyl, gear)`. Thanks to @chrishaid (#34)
- `tabyl()` can also be called in a `%>%` pipeline, e.g., `mtcars %>% tabyl(cyl)` (#35)
- Added `use_first_valid_of()` function (#32)
- Added minor functions for manipulating numeric data.frames for presentation: `ns_to_percents()`, `add_totals_row()`, `add_totals_col()`
- `crosstab()` returns 0 instead of NA when there are no instances of a variable combination.
- A call like `tabyl(df$vecname)` retains the more-descriptive `$` symbol in the column name of the result - if you want a legal R name in the result, call it as `df %>% tabyl(vecname)`
- Single and double quotation marks are handled by `clean_names()`
- Added codecov to measure test coverage
- Added unit test coverage
- Added Travis-CI for continuous integration
- Initial draft of skeleton package on GitHub