[R] read_parquet performs too slow #38032

Open
kostovasandra opened this issue Oct 5, 2023 · 47 comments

@kostovasandra

Describe the bug, including details regarding any error messages, version, and platform.

I have been using the read_parquet() function to read a file (compressed or uncompressed) from S3, but it is too slow: reading a 700 MB file takes about 10 minutes. I tried setting arrow::set_cpu_count(2) and arrow.use_threads = FALSE, but it is still slow.
Writing the same file takes 1-2 minutes, which is still not great.
Below is the code:
library(arrow)

s3_init <- function() {
  env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                      "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))

  bucket <- arrow::s3_bucket(env["ECS_S3_BUCKET"],
                             access_key = env["ECS_ACCESS_KEY_ID"],
                             secret_key = env["ECS_SECRET_ACCESS_KEY"],
                             endpoint_override = env["ECS_S3_ENDPOINT_URL"],
                             region = "")
  return(bucket)
}

# Write a data frame to S3 as gzip-compressed Parquet
s3_save_rds <- function(file, s3_path) {
  bucket <- s3_init()
  write_parquet(file, bucket$path(s3_path), compression = "gzip")
}

# Read a Parquet object from S3 back into R
s3_read_rds <- function(s3_path) {
  bucket <- s3_init()
  file <- read_parquet(bucket$path(s3_path))
  return(file)
}
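
For reference, a minimal timing sketch of how these helpers are exercised (the object key "train.parquet" and the small sample data frame are placeholders; the real file is ~700 MB):

library(arrow)

df <- data.frame(x = rnorm(1e5), y = sample(letters, 1e5, replace = TRUE))  # small placeholder data

system.time(s3_save_rds(df, "train.parquet"))     # writing the real file reportedly takes 1-2 minutes
system.time(out <- s3_read_rds("train.parquet"))  # reading it reportedly takes ~10 minutes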

Component(s)

R

@mapleFU
Member

mapleFU commented Oct 5, 2023

It seems the slow part is reading from S3.
Would you mind trying some S3-related configuration, like the pre_buffer option or environment variables (for example, AWS_EC2_METADATA_DISABLED)?

@kostovasandra
Author

kostovasandra commented Oct 5, 2023

I tried:

  • Sys.setenv("AWS_EC2_METADATA_DISABLED" = "true")
  • file = read_parquet(bucket$path(s3_path), pre_buffer = TRUE)

But this did not help.

@mapleFU
Member

mapleFU commented Oct 5, 2023

Hmm, would you mind printing the file metadata? I want to take a look at the file's info.

@kostovasandra
Author

The metadata is huge. It is a data frame of 122,635 rows and 6,135 columns (~700 MB). The column types are mixed (character, numeric), and NULLs can of course be present.
Also worth mentioning: saving the data frame to S3 takes ~1 minute, but reading it takes 10 minutes.

@mapleFU
Member

mapleFU commented Oct 6, 2023

How many row groups does this file contain? This affects how the file is fetched.
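
A minimal sketch of one way to check this from R, assuming a local copy of the file and that arrow's ParquetFileReader exposes num_row_groups:

library(arrow)

reader <- ParquetFileReader$create("train_local.parquet")  # hypothetical local copy of the file
reader$num_row_groups  # how many row groups the writer produced
reader$GetSchema()     # column names and types, without reading the data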

@kostovasandra
Author

The file has only one row group

@mapleFU
Member

mapleFU commented Oct 6, 2023

Another thing to check:

class PARQUET_EXPORT ArrowReaderProperties {
 public:
  explicit ArrowReaderProperties(bool use_threads = kArrowDefaultUseThreads)
      : use_threads_(use_threads),
        read_dict_indices_(),
        batch_size_(kArrowDefaultBatchSize),
        pre_buffer_(true),
        cache_options_(::arrow::io::CacheOptions::LazyDefaults()),
        coerce_int96_timestamp_unit_(::arrow::TimeUnit::NANO) {}

We have a cache_options property; can you change it to ::arrow::io::CacheOptions::Defaults() (with pre_buffer enabled), or otherwise disable the lazy cache option? This makes the S3 requests run concurrently before Parquet reads the file (otherwise the Parquet reader only sends an S3 request when it touches an IO range).

Reading a 700 MB file in 10 minutes is too slow. This could be optimized considerably with different fetch options.

@kostovasandra
Author

The file I saved in S3 is RDS. I followed the advice above, but reading the file still did not improve (it still takes 10 minutes).
Below is the code:

s3_path <- "train.rds"
CacheOptions.lazy <- FALSE  # note: this only creates an unrelated R variable; it is not passed to the reader
bucket <- s3_init()         # initialize the S3 connection
file <- read_parquet(bucket$path(s3_path), pre_buffer = TRUE)

@mapleFU
Member

mapleFU commented Oct 6, 2023

Hmm, would CacheOptions.lazy = FALSE even reach the read_parquet internals?

https://arrow.apache.org/docs/r/reference/read_parquet.html I'm not familiar with R, but it seems that argument should be passed via ParquetArrowReaderProperties?
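
For what it's worth, a minimal sketch of what passing reader properties from R might look like; this assumes read_parquet()'s props argument and ParquetArrowReaderProperties$create(use_threads = ...), and whether pre_buffer or the cache options can be reached this way in 13.0.0 is exactly what is unclear:

library(arrow)

bucket <- s3_init()
props <- ParquetArrowReaderProperties$create(use_threads = TRUE)
file <- read_parquet(bucket$path("train.parquet"), props = props)  # "train.parquet" is a placeholder key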

@kostovasandra
Author

I guess not, as the time did not improve.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Hmm, can we find a way to figure out how many IO requests are sent to S3, and what their byte ranges are? I just want to understand how the IO works 🤔, since 10 minutes is too long.

You may also take a look at #38032 (comment). I'm not familiar with R, but I guess CacheOptions and pre_buffer are part of ParquetArrowReaderProperties, so they might not be picked up as plain arguments?

@kostovasandra
Author

I tried to fill in the argument in ParquetArrowReaderProperties, but it throws errors. Also, I cannot find how many IO requests are sent. Keeping in mind that the write function works fine, I assume the number of IO reads/writes may be irrelevant.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Keeping in mind that the write function works fine, I assume the number of IO reads/writes may be irrelevant.

Hmm, you can use the default arguments first. Also cc @thisisnic for advice; I'm not familiar with R 🤔

@kostovasandra
Author

I changed to this call:
file = read_parquet(bucket$path(s3_path), pre_buffer = TRUE, as_data_frame = FALSE, .lazy = FALSE), and the time improved by about 1 minute, but it is still slow.

@mapleFU
Member

mapleFU commented Oct 6, 2023 via email

@kostovasandra
Author

As far as I can see there, nobody is providing clear instructions; it is on a PR and there is no other answer.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Oh, sorry for that. s3_init has a possible ArrowS3GlobalOptions argument that allows setting the S3 log level. You can set the level to debug/trace or similar to help us understand what is happening when reading the file from S3.

@kostovasandra
Author

Can you be more specific about the exact command, please?
s3_init() works, but s3_init with the parameter you wrote does not work (the parameter is not recognized). I also tried s3_init("debug"), which does not work either.

@mapleFU
Member

mapleFU commented Oct 6, 2023

@paleolimbot @thisisnic I'm not familiar with R, do you know how to set the pre_buffer argument and S3 logging in the R S3 FS SDK?

@paleolimbot
Member

how to set the pre_buffer argument

While I know that it is TRUE by default, I'm not actually sure how to set it from open_dataset() (@thisisnic has spent the most time here recently).

s3 logging

I think S3 logging in R is controlled by an environment variable but I forget if @amoeba had time to implement that before 13.0.0!

@amoeba
Member

amoeba commented Oct 6, 2023

...had time to implement that before 13.0.0!

Unfortunately, no. And I don't think we have a way to adjust log level from R at the moment. Just Python and C++.

To that point, I think it would be useful here if we could reproduce the same performance from Python (PyArrow) or Arrow C++ since we can set the log level in those environments and they're exercising similar code paths. Something like this should work:

import os

# Workaround from https://github.com/apache/arrow/issues/35575 until fixed
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)

import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    region="",
    access_key=os.getenv("ECS_ACCESS_KEY_ID"),
    secret_key=os.getenv("ECS_SECRET_ACCESS_KEY"),
    endpoint_override=os.getenv("ECS_S3_ENDPOINT_URL")
)

table = pq.read_table(os.getenv("ECS_S3_BUCKET") + "/path/to/object", filesystem=s3)

@kostovasandra would you be willing to give that a try for us? I think it'd be a big help here.

@kostovasandra
Author

Hey, I have tried it in Python and it takes 30s.

@mapleFU
Member

mapleFU commented Oct 10, 2023

In Python, would you mind setting pre_buffer=True for read_table? It will default to True in 14.0.0, but it still defaults to False in 13.0.

@kostovasandra
Author

I did it, and it takes 10s

@mapleFU
Member

mapleFU commented Oct 10, 2023

OK, what does the S3 debug logging look like? Does it contain some info about where the 10s is spent?

@kostovasandra
Author

After I ran the command it did not show any log messages. How should I display the log?

@mapleFU
Member

mapleFU commented Oct 10, 2023

Hmm, sorry for the confusion. This is a bit unclear because I don't know how many requests are sent to S3 or what they look like.

10s is an improvement; I think it's because the Default prefetch driver works a bit better than Lazy, since it tries its best to send IO requests up front. But I don't know how long these requests take.

I'm not so familiar with the S3 SDK, so this might take some time to debug, sorry for that. Would you mind enabling TRACE as the log level (https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html and set initialize_s3(pyarrow._s3fs.S3LogLevel.Trace))? I just want to know the IO-merging status.

Sorry again for the slow progress.

@kostovasandra
Author

With this exercise I confirmed that Python performs better than R for reading from S3. Could you maybe advise on how to improve the R functions?

@mapleFU
Member

mapleFU commented Oct 10, 2023

10s is an improvement; I think it's because the Default prefetch driver works a bit better than Lazy, since it tries its best to send IO requests up front. But I don't know how long these requests take.

I think they both use the same underlying C++ implementation, but with different arguments. The R package should ideally have a way to set the logging and CacheOptions... I think the performance gain comes from pre_buffer.

@kostovasandra
Author

I inserted this: pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
So how can I see the log?

@mapleFU
Member

mapleFU commented Oct 10, 2023

It should print some [INFO] / [DEBUG] logs to the console (since it uses stdout as output).

@amoeba
Member

amoeba commented Oct 10, 2023

Right, running pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug) should immediately start printing to stdout. For reference, with pyarrow 13 I see:

Python 3.11.0 (main, Oct 26 2022, 04:18:06) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow._s3fs
>>>
>>> pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)
[INFO] 2023-10-10 21:06:50.573 Aws_Init_Cleanup [281473376709936] Initiate AWS SDK for C++ with Version:1.10.55
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Environment value for variable HOME is /root
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Home directory is missing the final / appending one to normalize
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Final Home Directory is /root/
[INFO] 2023-10-10 21:06:50.573 Aws::Config::AWSConfigFileProfileConfigLoader [281473376709936] Initializing config loader against fileName /root/.aws/credentials and using profilePrefix = 0
// continues...

@apache apache deleted a comment from kostovasandra Oct 12, 2023
@kostovasandra
Author

kostovasandra commented Oct 12, 2023

During read_parquet, as I mentioned, I cannot print the log out because it contains sensitive data. Also, the file is big, so the log automatically fills up with too much information from the dataset.

@mapleFU
Member

mapleFU commented Oct 12, 2023

I just want to know the number of IOs and the time each IO takes.

@kostovasandra
Author

The time is 27s and the number of records is 217,035.

@kostovasandra
Author

Sorry, do you have any update on this?

@mapleFU
Member

mapleFU commented Oct 16, 2023

I haven't gathered enough information yet, but I think it would help to allow configuring prefetching in R.

@amoeba
Member

amoeba commented Oct 17, 2023

Hi @kostovasandra, thanks for the patience. Support for modifying the S3 log level was just merged so that should go into the Arrow 15 release. Until then, what could really help would be if you could give us a way to reproduce your issue ourselves. Is it possible to share the Parquet file or for you to write a script that generates a Parquet file that reproduces the issue that you could share?

I did a test with a 17-million-row (~600 MB) Parquet file on S3 and I get almost exactly the same timing (~17 sec) in either R or Python against us-west-2. So while tweaking Parquet reader options might help, it seems to me like some characteristic of your custom S3 endpoint or your file may be involved here.

@kostovasandra
Author

Hey, thank you for your answer. I cannot share the file or a reproducer, as the issue is the slow time when reading a file from S3.
Also, I am reading a .rds file (a data frame in R); could you maybe test it on a .rds file?
Can you also give me other hints on how to tweak the read_parquet function?
I am not sure the endpoint URL is the problem, as writing the RDS file works fine.

@kostovasandra
Author

Do you have an update on this?

@amoeba
Member

amoeba commented Oct 20, 2023

Hi @kostovasandra, a few thoughts:

  • Can you elaborate on how RDS is involved here? Looking at your code, I don't see any functions related to reading/writing RDS files, though I do see that your code uses the string 'rds' in a few places, despite using only read_parquet and write_parquet
  • Have you tried reading the file on other S3-compatible storage and possibly even from your local machine?
  • Without being able to turn on S3 logging at the moment, could you find some way to track how many packets and how much data is transferred during the 10 minutes?
  • Can you try any other files on the same storage, or a subset of your whole file? I wonder if all reads are slow or if there is some pattern (a minimal timing sketch follows after this list).
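
A minimal timing sketch along the lines of the second and fourth points (the local path and object key are placeholders; col_select reads only a slice of the 6,135 columns):

library(arrow)

bucket <- s3_init()
system.time(read_parquet("train_local.parquet"))                             # local copy: decode cost only
system.time(read_parquet(bucket$path("train.parquet")))                      # S3: decode + transfer
system.time(read_parquet(bucket$path("train.parquet"), col_select = 1:50))   # narrow read of the first 50 columns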

@kostovasandra
Author

kostovasandra commented Oct 24, 2023

  • The data frames in R are .rds files, so I use the read_parquet / write_parquet functions to read/write them to/from S3.
  • Yes, I tried reading from local storage, but the read_parquet function still performs slowly.
  • During the 10 minutes, the ~700 MB file is transferred, which is 217K rows.
  • I tried different files of similar size (to make sure the file is not corrupted), but got the same result.

I hope my answer helps.

@amoeba
Member

amoeba commented Oct 26, 2023

Hi @kostovasandra, just for good measure can you give us your OS and arrow R package version?

Other things to try:

  • Do you get the same timing when you run read_parquet(..., as_data_frame = FALSE)? (A minimal timing sketch follows after this list.)
  • Do you get the same timing when you use open_dataset instead of read_parquet?
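
For example, a sketch with a placeholder object key; calling as.data.frame() on the Arrow Table afterwards times the R conversion separately:

library(arrow)

bucket <- s3_init()
system.time(tab <- read_parquet(bucket$path("train.parquet"), as_data_frame = FALSE))  # returns an Arrow Table
system.time(df <- as.data.frame(tab))                                                  # cost of converting to a data.frame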

@kostovasandra
Author

The Arrow version is 13.0.0. The OS is Debian GNU/Linux 10.
I tried read_parquet(..., as_data_frame = FALSE) but I get the same results, maybe a 30s improvement.
I did not try open_dataset because it is for reading multiple files from a directory, which is not my case.

@paleolimbot
Member

I did not try open_dataset because it is for reading multiple files from a directory, which is not my case

I think Bryce suggested this because it uses a different Parquet implementation that is more similar to what Python is doing, and it would help us to narrow down whether this is a problem with the R package or a problem with the ParquetFileReader implementation! You can use open_dataset() to open a single parquet file as well 🙂
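
For example, a minimal single-file sketch (the local path is a placeholder; a filesystem path or URI also works, as shown further below):

library(arrow)

ds <- open_dataset("train_local.parquet", format = "parquet")  # one file, not a directory
df <- dplyr::collect(ds)                                       # materialize into a data frame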

@kostovasandra
Author

This is what I tried, but I am getting an error:

bucket = s3_init()
file = open_dataset(bucket$path("train_arrow.rds"))
Error in open_dataset():
! IOError: Error creating dataset. Could not read schema from 'train_arrow.rds/'. Is this a 'parquet' file?:

Could you show me how to read one Parquet file using open_dataset?

@amoeba
Member

amoeba commented Oct 27, 2023

I think the easiest thing would be to pass a URI for the file. You can construct the string for the URI following this pattern, replacing the appropriate values with your own:

open_dataset("s3://access_key:secret_key@yourbucketname/your_file.parquet?endpoint_override=yourendpoint")

So for your case you could do,

glue::glue("s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}")
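
Putting it together, a sketch of the full call (the env vector mirrors s3_init() above, and the object name is a placeholder; credentials containing special characters may need URL-encoding):

library(arrow)

env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                    "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))

uri <- glue::glue(
  "s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}"
)

df <- dplyr::collect(open_dataset(uri, format = "parquet"))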
