[R] read_parquet performs too slow #38032

Open
kostovasandra opened this issue Oct 5, 2023 · 47 comments

@kostovasandra

Describe the bug, including details regarding any error messages, version, and platform.

I have been using the read_parquet() function to read a file (compressed or uncompressed) from S3, but it is too slow: reading a 700 MB file takes about 10 minutes. I tried setting arrow::set_cpu_count(2) and arrow.use_threads = FALSE, but it is still slow.
Writing the same file takes 1-2 minutes, which is still not great.
Below is the code:
library(arrow)

s3_init <- function() {
  env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                      "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))

  bucket <- arrow::s3_bucket(env["ECS_S3_BUCKET"],
                             access_key = env["ECS_ACCESS_KEY_ID"],
                             secret_key = env["ECS_SECRET_ACCESS_KEY"],
                             endpoint_override = env["ECS_S3_ENDPOINT_URL"],
                             region = "")
  return(bucket)
}

# Write a data frame to S3 as gzip-compressed Parquet
s3_save_rds <- function(file, s3_path) {
  bucket <- s3_init()
  write_parquet(file, bucket$path(s3_path), compression = "gzip")
}

# Read a Parquet object from S3 back into R
s3_read_rds <- function(s3_path) {
  bucket <- s3_init()
  file <- read_parquet(bucket$path(s3_path))
  return(file)
}
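
For reference, a minimal timing sketch of how these helpers are exercised (the object key "train.parquet" and the small sample data frame are placeholders; the real file is ~700 MB):

library(arrow)

df <- data.frame(x = rnorm(1e5), y = sample(letters, 1e5, replace = TRUE))  # small placeholder data

system.time(s3_save_rds(df, "train.parquet"))     # writing the real file reportedly takes 1-2 minutes
system.time(out <- s3_read_rds("train.parquet"))  # reading it reportedly takes ~10 minutes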

Component(s)

R

@mapleFU
Member

mapleFU commented Oct 5, 2023

It seems the slow part is reading from S3.
Would you mind trying some S3-related configuration, like the pre_buffer option or environment variables (for example, AWS_EC2_METADATA_DISABLED)?

@kostovasandra
Author

kostovasandra commented Oct 5, 2023

I tried:

  • Sys.setenv("AWS_EC2_METADATA_DISABLED" = "true")
  • file = read_parquet(bucket$path(s3_path), pre_buffer = TRUE)

But this did not help.

@mapleFU
Member

mapleFU commented Oct 5, 2023

Hmm, would you mind printing the file metadata? I want to take a look at the file's info.

@kostovasandra
Author

The metadata is huge. It is a data frame of 122,635 rows and 6,135 columns (~700 MB). The column types are mixed (character, numeric), and NULLs can of course be present.
Also worth mentioning: saving the data frame to S3 takes ~1 minute, but reading it takes 10 minutes.

@mapleFU
Member

mapleFU commented Oct 6, 2023

How many row groups does this file contain? This affects how the file is fetched.
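
A minimal sketch of one way to check this from R, assuming a local copy of the file and that arrow's ParquetFileReader exposes num_row_groups:

library(arrow)

reader <- ParquetFileReader$create("train_local.parquet")  # hypothetical local copy of the file
reader$num_row_groups  # how many row groups the writer produced
reader$GetSchema()     # column names and types, without reading the data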

@kostovasandra
Author

The file has only one row group

@mapleFU
Member

mapleFU commented Oct 6, 2023

Another thing to check:

class PARQUET_EXPORT ArrowReaderProperties {
 public:
  explicit ArrowReaderProperties(bool use_threads = kArrowDefaultUseThreads)
      : use_threads_(use_threads),
        read_dict_indices_(),
        batch_size_(kArrowDefaultBatchSize),
        pre_buffer_(true),
        cache_options_(::arrow::io::CacheOptions::LazyDefaults()),
        coerce_int96_timestamp_unit_(::arrow::TimeUnit::NANO) {}

We have a cache_options property; can you change it to ::arrow::io::CacheOptions::Defaults() (with pre_buffer enabled), or otherwise disable the lazy cache option? This makes the S3 requests run concurrently before Parquet reads the file (otherwise the Parquet reader only sends an S3 request when it touches an IO range).

Reading a 700 MB file in 10 minutes is too slow. This could be optimized considerably with different fetch options.

@kostovasandra
Author

The file I saved in S3 is RDS. I followed the advice above, but reading the file still did not improve (it still takes 10 minutes).
Below is the code:

s3_path <- "train.rds"
CacheOptions.lazy <- FALSE  # note: this only creates an unrelated R variable; it is not passed to the reader
bucket <- s3_init()         # initialize the S3 connection
file <- read_parquet(bucket$path(s3_path), pre_buffer = TRUE)

@mapleFU
Member

mapleFU commented Oct 6, 2023

Hmm, would CacheOptions.lazy = FALSE even reach the read_parquet internals?

https://arrow.apache.org/docs/r/reference/read_parquet.html I'm not familiar with R, but it seems that argument should be passed via ParquetArrowReaderProperties?
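
For what it's worth, a minimal sketch of what passing reader properties from R might look like; this assumes read_parquet()'s props argument and ParquetArrowReaderProperties$create(use_threads = ...), and whether pre_buffer or the cache options can be reached this way in 13.0.0 is exactly what is unclear:

library(arrow)

bucket <- s3_init()
props <- ParquetArrowReaderProperties$create(use_threads = TRUE)
file <- read_parquet(bucket$path("train.parquet"), props = props)  # "train.parquet" is a placeholder key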

@kostovasandra
Author

I guess not, as the time did not improve.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Hmm, can we find a way to figure out how many IO requests are sent to S3, and what their byte ranges are? I just want to understand how the IO works 🤔, since 10 minutes is too long.

You may also take a look at #38032 (comment). I'm not familiar with R, but I guess CacheOptions and pre_buffer are part of ParquetArrowReaderProperties, so they might not be picked up as plain arguments?

@kostovasandra
Author

I tried to fill in the argument in ParquetArrowReaderProperties, but it throws errors. Also, I cannot find how many IO requests are sent. Keeping in mind that the write function works fine, I assume the number of IO reads/writes may be irrelevant.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Keeping in mind that the write function works fine, I assume the number of IO reads/writes may be irrelevant.

Hmm, you can use the default arguments first. Also cc @thisisnic for advice; I'm not familiar with R 🤔

@kostovasandra
Author

I changed to this call:
file = read_parquet(bucket$path(s3_path), pre_buffer = TRUE, as_data_frame = FALSE, .lazy = FALSE), and the time improved by about 1 minute, but it is still slow.

@mapleFU
Member

mapleFU commented Oct 6, 2023 via email

@kostovasandra
Author

As far as I can see there, nobody is providing clear instructions; it is on a PR and there is no other answer.

@mapleFU
Member

mapleFU commented Oct 6, 2023

Oh, sorry for that. s3_init has a possible ArrowS3GlobalOptions argument that allows setting the S3 log level. You can set the level to debug/trace or similar to help us understand what is happening when reading the file from S3.

@kostovasandra
Author

Can you be more specific about the exact command, please?
s3_init() works, but s3_init with the parameter you wrote does not work (the parameter is not recognized). I also tried s3_init("debug"), which does not work either.

@mapleFU
Member

mapleFU commented Oct 6, 2023

@paleolimbot @thisisnic I'm not familiar with R, do you know how to set the pre_buffer argument and S3 logging in the R S3 FS SDK?

@paleolimbot
Member

how to set the pre_buffer argument

While I know that it is TRUE by default, I'm not actually sure how to set it from open_dataset() (@thisisnic has spent the most time here recently).

s3 logging

I think S3 logging in R is controlled by an environment variable but I forget if @amoeba had time to implement that before 13.0.0!

@amoeba
Member

amoeba commented Oct 6, 2023

...had time to implement that before 13.0.0!

Unfortunately, no. And I don't think we have a way to adjust log level from R at the moment. Just Python and C++.

To that point, I think it would be useful here if we could reproduce the same performance from Python (PyArrow) or Arrow C++ since we can set the log level in those environments and they're exercising similar code paths. Something like this should work:

import os

# Workaround from https://github.com/apache/arrow/issues/35575 until fixed
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)

import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    region="",
    access_key=os.getenv("ECS_ACCESS_KEY_ID"),
    secret_key=os.getenv("ECS_SECRET_ACCESS_KEY"),
    endpoint_override=os.getenv("ECS_S3_ENDPOINT_URL")
)

table = pq.read_table(os.getenv("ECS_S3_BUCKET") + "/path/to/object", filesystem=s3)

@kostovasandra would you be willing to give that a try for us? I think it'd be a big help here.

@kostovasandra
Author

Hey, I have tried it in Python and it takes 30s.

@mapleFU
Member

mapleFU commented Oct 10, 2023

In Python, would you mind setting pre_buffer=True for read_table? It will default to True in 14.0.0, but it still defaults to False in 13.0.

@kostovasandra
Author

I did it, and it takes 10s

@mapleFU
Member

mapleFU commented Oct 10, 2023

OK, what does the S3 debug logging look like? Does it contain some info about where the 10s is spent?

@kostovasandra
Author

After I ran the command it did not show any log messages. How should I display the log?

@mapleFU
Member

mapleFU commented Oct 10, 2023

Hmm, sorry for the confusion. This is a bit unclear because I don't know how many requests are sent to S3 or what they look like.

10s is an improvement; I think it's because the Default prefetch driver works a bit better than Lazy, since it tries its best to send IO requests up front. But I don't know how long these requests take.

I'm not so familiar with the S3 SDK, so this might take some time to debug, sorry for that. Would you mind enabling TRACE as the log level (https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html and set initialize_s3(pyarrow._s3fs.S3LogLevel.Trace))? I just want to know the IO-merging status.

Sorry again for the slow progress.

@kostovasandra
Author

With this exercise I confirmed that Python performs better than R for reading from S3. Could you maybe advise on how to improve the R functions?

@mapleFU
Member

mapleFU commented Oct 10, 2023

10s is an improvement; I think it's because the Default prefetch driver works a bit better than Lazy, since it tries its best to send IO requests up front. But I don't know how long these requests take.

I think they both use the same underlying C++ implementation, but with different arguments. The R package should ideally have a way to set the logging and CacheOptions... I think the performance gain comes from pre_buffer.

@kostovasandra
Author

I inserted this: pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
So how can I see the log?

@mapleFU
Member

mapleFU commented Oct 10, 2023

It should print some [INFO] / [DEBUG] logs to the console (since it uses stdout as output).

@amoeba
Member

amoeba commented Oct 10, 2023

Right, running pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug) should immediately start printing to stdout. For reference, with pyarrow 13 I see:

Python 3.11.0 (main, Oct 26 2022, 04:18:06) [GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow._s3fs
>>>
>>> pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)
[INFO] 2023-10-10 21:06:50.573 Aws_Init_Cleanup [281473376709936] Initiate AWS SDK for C++ with Version:1.10.55
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Environment value for variable HOME is /root
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Home directory is missing the final / appending one to normalize
[DEBUG] 2023-10-10 21:06:50.573 FileSystemUtils [281473376709936] Final Home Directory is /root/
[INFO] 2023-10-10 21:06:50.573 Aws::Config::AWSConfigFileProfileConfigLoader [281473376709936] Initializing config loader against fileName /root/.aws/credentials and using profilePrefix = 0
// continues...

@apache apache deleted a comment from kostovasandra Oct 12, 2023
@kostovasandra
Author

kostovasandra commented Oct 12, 2023

During read_parquet, as I mentioned, I cannot print the log out because it contains sensitive data. Also, the file is big, so the log automatically fills up with too much information from the dataset.

@mapleFU
Member

mapleFU commented Oct 12, 2023

I just want to know the number of IOs and the time each IO takes.

@kostovasandra
Author

The time is 27s and the number of records is 217,035.

@kostovasandra
Author

Sorry, do you have any update on this?

@mapleFU
Member

mapleFU commented Oct 16, 2023

I haven't gathered enough information yet, but I think it would help to allow configuring prefetching in R.

@amoeba
Member

amoeba commented Oct 17, 2023

Hi @kostovasandra, thanks for the patience. Support for modifying the S3 log level was just merged so that should go into the Arrow 15 release. Until then, what could really help would be if you could give us a way to reproduce your issue ourselves. Is it possible to share the Parquet file or for you to write a script that generates a Parquet file that reproduces the issue that you could share?

I did a test with a 17-million-row (~600 MB) Parquet file on S3 and I get almost exactly the same timing (~17 sec) in either R or Python against us-west-2. So while tweaking Parquet reader options might help, it seems to me like some characteristic of your custom S3 endpoint or your file may be involved here.

@kostovasandra
Author

Hey, thank you for your answer. I cannot share the file or a reproducer, as the issue is the slow time when reading a file from S3.
Also, I am reading a .rds file (a data frame in R); could you maybe test it on a .rds file?
Can you also give me other hints on how to tweak the read_parquet function?
I am not sure the endpoint URL is the problem, as writing the RDS file works fine.

@kostovasandra
Author

Do you have an update on this?

@amoeba
Member

amoeba commented Oct 20, 2023

Hi @kostovasandra, a few thoughts:

  • Can you elaborate on how RDS is involved here? Looking at your code, I don't see any functions related to reading/writing RDS files, though I do see that your code uses the string 'rds' in a few places, despite using only read_parquet and write_parquet
  • Have you tried reading the file on other S3-compatible storage and possibly even from your local machine?
  • Without being able to turn on S3 logging at the moment, could you find some way to track how many packets and how much data is transferred during the 10 minutes?
  • Can you try any other files on the same storage, or a subset of your whole file? I wonder if all reads are slow or if there is some pattern (a minimal timing sketch follows after this list).
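
A minimal timing sketch along the lines of the second and fourth points (the local path and object key are placeholders; col_select reads only a slice of the 6,135 columns):

library(arrow)

bucket <- s3_init()
system.time(read_parquet("train_local.parquet"))                             # local copy: decode cost only
system.time(read_parquet(bucket$path("train.parquet")))                      # S3: decode + transfer
system.time(read_parquet(bucket$path("train.parquet"), col_select = 1:50))   # narrow read of the first 50 columns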

@kostovasandra
Author

kostovasandra commented Oct 24, 2023

  • The data frames in R are .rds files, so I use the read_parquet / write_parquet functions to read/write them to/from S3.
  • Yes, I tried reading from local storage, but the read_parquet function still performs slowly.
  • During the 10 minutes, the ~700 MB file is transferred, which is 217K rows.
  • I tried different files of similar size (to make sure the file is not corrupted), but got the same result.

I hope my answer helps.

@amoeba
Member

amoeba commented Oct 26, 2023

Hi @kostovasandra, just for good measure can you give us your OS and arrow R package version?

Other things to try:

  • Do you get the same timing when you run read_parquet(..., as_data_frame = FALSE)? (A minimal timing sketch follows after this list.)
  • Do you get the same timing when you use open_dataset instead of read_parquet?
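
For example, a sketch with a placeholder object key; calling as.data.frame() on the Arrow Table afterwards times the R conversion separately:

library(arrow)

bucket <- s3_init()
system.time(tab <- read_parquet(bucket$path("train.parquet"), as_data_frame = FALSE))  # returns an Arrow Table
system.time(df <- as.data.frame(tab))                                                  # cost of converting to a data.frame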

@kostovasandra
Author

The Arrow version is 13.0.0. The OS is Debian GNU/Linux 10.
I tried read_parquet(..., as_data_frame = FALSE) but I get the same results, maybe a 30s improvement.
I did not try open_dataset because it is for reading multiple files from a directory, which is not my case.

@paleolimbot
Member

I did not try open_dataset because it is for reading multiple files from a directory, which is not my case

I think Bryce suggested this because it uses a different Parquet implementation that is more similar to what Python is doing, and it would help us to narrow down whether this is a problem with the R package or a problem with the ParquetFileReader implementation! You can use open_dataset() to open a single parquet file as well 🙂
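
For example, a minimal single-file sketch (the local path is a placeholder; a filesystem path or URI also works, as shown further below):

library(arrow)

ds <- open_dataset("train_local.parquet", format = "parquet")  # one file, not a directory
df <- dplyr::collect(ds)                                       # materialize into a data frame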

@kostovasandra
Author

This is what I tried, but I am getting an error:

bucket = s3_init()
file = open_dataset(bucket$path("train_arrow.rds"))
Error in open_dataset():
! IOError: Error creating dataset. Could not read schema from 'train_arrow.rds/'. Is this a 'parquet' file?:

Could you show me how to read one Parquet file using open_dataset?

@amoeba
Member

amoeba commented Oct 27, 2023

I think the easiest thing would be to pass a URI for the file. You can construct the string for the URI following this pattern, replacing the appropriate values with your own:

open_dataset("s3://access_key:secret_key@yourbucketname/your_file.parquet?endpoint_override=yourendpoint")

So for your case you could do,

glue::glue("s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}")
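
Putting it together, a sketch of the full call (the env vector mirrors s3_init() above, and the object name is a placeholder; credentials containing special characters may need URL-encoding):

library(arrow)

env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                    "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))

uri <- glue::glue(
  "s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}"
)

df <- dplyr::collect(open_dataset(uri, format = "parquet"))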
