
Releases: tidyverse/readr

readr 2.1.5

10 Jan 23:29
  • No major user-facing changes. Patch release with housekeeping changes and
    internal changes requested by CRAN around format specification in compiled
    code.

readr 2.1.4

10 Feb 15:57
  • No user-facing changes. Patch release with internal changes requested by CRAN.

readr 2.1.3

01 Oct 15:21
  • Help files below man/ have been re-generated, so that they give rise to valid HTML5. (This is the impetus for this release, to keep the package safely on CRAN.)

  • mini-gapminder-africa.csv and friends are new example datasets accessible via readr_example(), which have been added to illustrate reading multiple files at once, into a single data frame.

readr 2.1.2

30 Jan 23:29
  • read_table(), read_log(), and read_delim_chunked() (and friends) gain the show_col_types argument found elsewhere. All read_*() functions now respect the show_col_types argument or option, even when using the first edition parsing engine (#1331).

  • show_progress() uses rlang::is_interactive() instead of base::interactive() (#1356).

  • read_builtin() does more argument checking, so that we catch obviously malformed input before passing along to utils::data() (#1361).

  • chickens.csv and whitespace-sample.txt are new example datasets accessible via readr_example() (#1354).

readr 2.1.1

30 Nov 17:52
  • Jenny Bryan is now the maintainer.

  • Fix buffer overflow when trying to parse an integer from a field that is over 64 characters long (#1326)

readr 2.1.0

11 Nov 18:55
  • All readr functions again read eagerly by default. Unfortunately many users
    experienced frustration from the drawbacks of lazy reading, in particular
    locking files on Windows, so lazy reading is no longer the default.
    However, options(readr.read_lazy = TRUE) can be used to make lazy reading the default if desired.
  • New readr.read_lazy global option to control whether readr reads files lazily (#1266)

readr 2.0.2

27 Sep 20:19
  • Minor test tweak for compatibility with testthat 3.1.0 (@lionel-, #1304)

  • write_rds() gains a text argument, to control using a text-based object representation, like the ascii argument in saveRDS() (#1270)
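A minimal sketch of the new text argument (the output path is illustrative):

```r
library(readr)

# Serialize using a text-based (ASCII) representation instead of binary
write_rds(mtcars, "mtcars.rds", text = TRUE)

# The file round-trips like a normal .rds file
identical(read_rds("mtcars.rds"), mtcars)
```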

readr 2.0.1

27 Aug 13:39
  • options(readr.show_col_types = FALSE) now works as intended (#1250)
  • read_delim_chunked() now again correctly respects the chunk_size parameter (#1248)
  • type_convert() gains a guess_integer argument, passed to guess_parser() (@jmbarbone, #1264)
  • read_tsv() now correctly passes the quote and na arguments to vroom::vroom() (#1254, #1255)
  • Avoid spurious byte compilation errors due to the programmatically generated spec_*() functions.

readr 2.0.0

20 Jul 15:25

second edition changes

readr 2.0.0 is a major release of readr and introduces a new second edition parsing and writing engine implemented via the vroom package.

This engine takes advantage of lazy reading, multi-threading and performance characteristics of modern SSD drives to significantly improve the performance of reading and writing compared to the first edition engine.

We will continue to support the first edition for a number of releases, but eventually this support will be first deprecated and then removed.

You can use the with_edition() or local_edition() functions to temporarily change the edition of readr for a section of code.

e.g.

  • with_edition(1, read_csv("my_file.csv")) will read my_file.csv with the first edition of readr.

  • readr::local_edition(1) placed at the top of your function or script will use the first edition for the rest of the function or script.
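For example, to read a single file with the first edition engine without changing the session default (a sketch using the built-in example file):

```r
library(readr)

# Parse one call with the edition-one engine; the default edition
# is restored as soon as the call returns
df <- with_edition(1, read_csv(readr_example("mtcars.csv")))
```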

Lazy reading

Edition two uses lazy reading by default.
When you first call a read_*() function the delimiters and newlines throughout the entire file are found, but the data is not actually read until it is used in your program.
This can provide substantial speed improvements for reading character data.
It is particularly useful during interactive exploration of only a subset of a full dataset.

However, this also means that problematic values are not necessarily seen
immediately; they surface only when the data is actually read.
Because of this, a warning is issued the first time a problem is encountered,
which may happen after the initial read.

Run problems() on your dataset to read the entire dataset and return all of the problems found.
Run problems(lazy = TRUE) if you only want to retrieve the problems found so far.
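A sketch of that workflow:

```r
library(readr)

df <- read_csv(readr_example("mtcars.csv"))

problems(df)              # reads the whole file and reports every problem found
problems(df, lazy = TRUE) # reports only the problems found so far
```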

Deleting files after reading is also impacted by laziness.
On Windows, open files cannot be deleted while a process has them open.
Because readr keeps the file open when reading lazily, you cannot read a file and then immediately delete it.
readr will in most cases close the file once it has been completely read.
However, if you know you want to delete the file after reading it, it is best to pass lazy = FALSE when reading the file.
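For example, reading eagerly makes it safe to delete the file immediately afterwards (a sketch using a temporary file):

```r
library(readr)

path <- tempfile(fileext = ".csv")
write_csv(data.frame(x = 1:3), path)

df <- read_csv(path, lazy = FALSE) # eager read: the file handle is released
file.remove(path)                  # safe to delete now, even on Windows
```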

Reading multiple files at once

Edition two has built-in support for reading sets of files with the
same columns into one output table in a single command.
Just pass a vector of the filenames to the reading function.

First we generate some files to read by splitting the nycflights dataset by
airline.

library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames
directly to readr.

files <- fs::dir_ls(glob = "flights*tsv")
files
readr::read_tsv(files)

If the filenames contain data, such as the date when the sample was collected,
use the id argument to include the paths as a column in the data.
You will likely have to post-process the paths to keep only the portion relevant to your use case.
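For example (the column name "path" is our choice):

```r
files <- fs::dir_ls(glob = "flights*tsv")

# Each row records which file it came from in a "path" column
df <- readr::read_tsv(files, id = "path")
```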

Delimiter guessing

Edition two supports automatic guessing of delimiters.
Because of this you can now use read_delim() without specifying a delim argument in many cases.

x <- read_delim(readr_example("mtcars.csv"))

Literal data

In edition one, the reading functions treated any input containing a newline, or any vector of length > 1, as literal data.
In edition two, vectors of length > 1 are assumed to correspond to multiple files.
Because of this, there is now a more explicit way to represent literal data: wrap the input in I().

readr::read_csv(I("a,b\n1,2"))

License changes

We are systematically re-licensing tidyverse and r-lib packages to use the MIT license, to make our package licenses as clear and permissive as possible.

To this end the readr and vroom packages are now released under the MIT license.

Deprecated or superseded functions and features

  • melt_csv(), melt_delim(), melt_tsv() and melt_fwf() have been superseded by functions of the same name in the meltr package.
    The versions in readr have been deprecated.
    These functions rely on the first edition parsing code and would be challenging to update to the new parser.
    When the first edition parsing code is eventually removed from readr they will be removed.

  • read_table2() has been renamed to read_table(), as most users expect read_table() to work like utils::read.table().
    If you want the previous strict behavior of read_table() you can use read_fwf() with fwf_empty() directly (#717).

  • Normalizing newlines in files with just carriage returns \r is no longer supported.
    The last major OS to use only CR as the newline was 'classic' Mac OS, which had its final release in 2001.

Other second edition changes

  • read_*_chunked() functions now include their specification as an attribute (#1143)

  • All read_*() functions gain a col_select argument to more easily choose which columns to select.

  • All read_*() functions gain an id argument to optionally store the file paths when reading multiple files.

  • All read_*() functions gain a name_repair argument to control how column names are repaired.

  • All read_*() and write_*() functions gain a num_threads argument to control the number of processing threads they use (#1201)

  • All write_*() and format_*() functions gain quote and escape arguments, to explicitly control how fields are quoted and how double quotes are escaped. (#653, #759, #844, #993, #1018, #1083)

  • All write_*() functions gain a progress argument and display a progress bar when writing (#791).

  • write_excel_csv() now defaults to quote = "all" (#759)

  • write_tsv() now defaults to quote = "none" (#993)

  • read_table() now handles skipped lines with unpaired quotes properly (#1180)

Additional features and fixes

  • The BH package is no longer a dependency.
    The boost C++ headers in BH have thousands of files, so can take a long time to extract and compiling them takes a great deal of memory, which made readr difficult to compile on systems with limited memory (#1147).

  • readr now uses the tzdb package when parsing date-times (@DavisVaughan, tidyverse/vroom#273)

  • Chunked readers now support files with more than INT_MAX (~2 billion) lines (#1177)

  • Memory no longer inadvertently leaks when reading from R connections (#1161)

  • Invalid date formats no longer can potentially crash R (#1151)

  • col_factor() now throws a more informative error message if given non-character levels (#1140)

  • problems() now takes .Last.value as its default argument.
    This lets you run problems() without an argument to see the problems in the previously read dataset.

  • read_delim() no longer fails when the sample of parsing problems contains non-ASCII characters (@hidekoji, #1136)

  • read_log() gains a trim_ws argument (#738)

  • read_rds() and write_rds() gain a refhook argument, to pass functions that handle reference objects (#1206)

  • read_rds() can now read .Rds files from URLs (#1186)

  • read_*() functions gain a show_col_types argument; if set to FALSE, it unconditionally turns off showing the column types.

  • type_convert() now throws a warning if the input has no character columns (#1020)

  • write_csv() now errors if given a matrix column (#1171)

  • write_csv() now again is able to write data with duplicated column names (#1169)

  • write_file() now forces its argument before opening the output file (#1158)
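As a sketch of the show_col_types argument described above:

```r
library(readr)

# Suppress the column-specification message for a single call
df <- read_csv(readr_example("mtcars.csv"), show_col_types = FALSE)

# Or suppress it for the whole session
options(readr.show_col_types = FALSE)
```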

readr 1.4.0

06 Oct 13:23

Breaking changes

  • write_*() functions' first argument is now file instead of path, for consistency with the read_*() functions.
    path has been deprecated and will be removed in a future version of readr (#1110, @brianrice2)

  • write_*() functions now output any NaN values in the same way as NA values, controlled by the na argument (#1082).

New features

  • It is now possible to generate a column specification from any tibble (or data.frame) with as.col_spec() and convert any column specification to a short representation with as.character()

    s <- as.col_spec(iris)
    s
    #> cols(
    #>   Sepal.Length = col_double(),
    #>   Sepal.Width = col_double(),
    #>   Petal.Length = col_double(),
    #>   Petal.Width = col_double(),
    #>   Species = col_factor(levels = c("setosa", "versicolor", "virginica"), ordered = FALSE, include_na = FALSE)
    #> )
    as.character(s)
    #> [1] "ddddf"
    
  • The cli package is now used for all messages.

  • The runtime performance for tables with an extreme number of columns is greatly improved (#825)

  • Compressed files are now detected by magic numbers rather than by the file extension (#1125)

  • A memory leak when reading files is now fixed (#1092)

  • write_*() functions gain an eol argument to control the end-of-line character used (#857).
    This allows writing CSV files with Windows newlines (CRLF) if desired.

  • The Rcpp dependency has been removed in favor of cpp11.

  • The build system has been greatly simplified so should work on more systems.
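The eol argument mentioned above can be used like this (the output path is illustrative):

```r
library(readr)

# Write a CSV with Windows (CRLF) line endings
write_csv(mtcars, "mtcars-crlf.csv", eol = "\r\n")
```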

Additional features and fixes

  • The full problem field is now displayed in the problems tibble, as intended (#444).

  • New %h placeholder for parsing unrestricted hours (<0 and >23) to support parsing durations (#549, @krlmlr).

  • as.character.col_spec() now handles logical columns as well (#1127)

  • The end argument of fwf_positions() no longer has a default and must be specified (#996)

  • guess_parser() gains a na argument and removes NA values before guessing (#1041).

  • parse_guess() now passes the na argument to guess_parser()

  • read_*() functions now properly close all connections, including on errors such as HTTP errors when reading from a URL (@cderv, #1050).

  • read_delimited() no longer mistakenly stats literal filenames (#1063)

  • read_lines() now ignores quotations when skipping lines (#991).

  • read_lines(skip_empty_rows = TRUE) no longer crashes if a file ends with an empty line (#968)

  • write_*() functions now invisibly return the input data frame unchanged, rather than a version with factors and dates converted to strings. (@jesse-ross, #975).

  • write_csv2() now formats decimal numbers more consistently with utils::write.csv2() (#1087)

  • write_csv2() and format_csv2() no longer pad number columns with whitespaces (@keesdeschepper, #1046).

  • write_excel_csv() no longer outputs a byte order mark when appending to a file (#1075).

  • Uses of tibble::data_frame updated to tibble::tibble (tidyverse/dplyr#4069, @thays42, #1124, @brianrice2)

  • read_delimited() now returns an empty tibble::data_frame() rather than signaling an error when given a connection with an empty file (@pralitp, #963).

  • More helpful error when trying to write out data frames with list columns (@ellessenne, #938)

  • type_convert() removes a 'spec' attribute, because the current columns likely have modified data types. The 'spec' attribute is set by functions like read_delim() (@jimhester, @wibeasley, #1032).

  • write_rds() can now specify the Rds version to use. The default value is 2, as it is compatible with R versions prior to 3.5.0 (@shrektan, #1001).
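A sketch of the %h placeholder, which (unlike %H) accepts hours outside 0–23 and so can represent durations:

```r
library(readr)

# Parse an elapsed time of 26 hours, 30 minutes
parse_time("26:30:00", format = "%h:%M:%S")
```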