[R] read_parquet performs too slowly #38032
Comments
It seems the issue is that reading from S3 is slow.
I tried:
Hmm, would you mind printing the file metadata? I want to take a look at the file's info.
The metadata list is huge. It is a dataframe of 122635 rows and 6135 columns (~700MB). The data types of the columns differ (character, numbers), and of course NULLs can be included.
How many row groups does this file contain? This will affect how the file is fetched.
The file has only one row group.
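For reference, a hedged sketch (not from the thread) of how the row-group count and basic metadata could be inspected from R; it assumes the s3_init() helper defined in the issue body below, uses a placeholder object key, and the exact ParquetFileReader bindings may vary by arrow version:

library(arrow)

bucket <- s3_init()                              # helper from the issue body below
f <- bucket$OpenInputFile("your_file.parquet")   # placeholder object key
reader <- ParquetFileReader$create(f)

reader$num_row_groups   # row groups in the file
reader$num_rows         # total rows
reader$num_columns      # total columns
reader$GetSchema()      # column names and types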
Another thing: we have cache_options; can you try changing this? Reading a 700MB file in 10 minutes is too slow. This could likely be optimized with different fetch options.
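For context, a hedged sketch of generic IO knobs that are reachable from R today; the value 16 is only an example, and whether any of this helps with this particular fetch pattern is an assumption, not a finding from the thread:

library(arrow)

# Enlarge the IO thread pool used for concurrent reads (e.g. S3 range requests).
set_io_thread_count(16)
io_thread_count()   # confirm the current setting

# Reading into an Arrow Table first skips the data.frame conversion:
# tbl <- read_parquet(bucket$path(s3_path), as_data_frame = FALSE)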
The file I saved in S3 is rds. I followed the advice above, but reading the file still did not improve (it still takes 10 minutes).
Hmm, would https://arrow.apache.org/docs/r/reference/read_parquet.html help? I'm not familiar with R, but it seems that the argument should be filled in.
I guess not, as the time did not improve.
Hmm, can we find a way to figure out how many IO requests are sent to S3, and the byte range of each? I just want to understand how the IO works 🤔, since 10 minutes is too long. You may also take a look at #38032 (comment); I'm not familiar with R, but I guess the argument mentioned there should be filled in.
I tried to fill in the argument in ParquetArrowReaderProperties, but it throws errors. I also cannot find how many IO requests are sent. Given that the write function works fine, I assume the number of IO reads/writes may be irrelevant.
Hmm, you can use the default arguments first. Also cc @thisisnic for some advice, I'm not familiar with R 🤔
I changed to this function: file = read_parquet(bucket$path(s3_path), pre_buffer=TRUE, as_data_frame=FALSE, .lazy=FALSE), and the time improved by 1 minute, but it is still slow.
S3 allows a log setting; would you mind trying to turn this on (maybe related to #35260)? That way we can see the details of the IO requests.
As far as I can see there, nobody is providing clear instructions; it is an open PR with no other answer.
Oh, sorry for that.
Can you be more specific about the write command, please?
@paleolimbot @thisisnic I'm not familiar with R. Do you know how to set the pre-buffer argument and S3 logging in the R S3 filesystem SDK?
While I know that it is ..., I think S3 logging in R is controlled by an environment variable, but I forget whether @amoeba had time to implement that before 13.0.0!
Unfortunately, no. And I don't think we have a way to adjust the log level from R at the moment, just Python and C++. To that point, I think it would be useful here if we could reproduce the same performance from Python (PyArrow) or Arrow C++, since we can set the log level in those environments and they're exercising similar code paths. Something like this should work:

import os
import pyarrow.parquet as pq

# Workaround from https://github.com/apache/arrow/issues/35575 until fixed
import pyarrow._s3fs
pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Debug)

from pyarrow import fs

s3 = fs.S3FileSystem(
    region="",
    access_key=os.getenv("ECS_ACCESS_KEY_ID"),
    secret_key=os.getenv("ECS_SECRET_ACCESS_KEY"),
    endpoint_override=os.getenv("ECS_S3_ENDPOINT_URL")
)

table = pq.read_table(os.getenv("ECS_S3_BUCKET") + "/path/to/object", filesystem=s3)

@kostovasandra would you be willing to give that a try for us? I think it'd be a big help here.
Hey, I have tried it in Python and it takes 30s.
In Python, would you mind setting it?
I did it, and it takes 10s.
OK, what does the S3 debug logging look like? Does it contain some info about how it ends up taking 10s?
After I ran the command it did not show any log messages. How should I display the log?
Hmm, sorry for the confusion. This is a bit confusing because I don't know how many requests were sent to S3 or what they look like. Getting to 10s is already an improvement, and I think it comes from that change. I'm not so familiar with the S3 SDK, so debugging might take some time, sorry for that. Would you mind enabling TRACE as the log level (https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/logging.html) and setting that? Sorry again for the slow progress.
With this exercise I confirmed that Python performs better than R at reading from S3. Could you maybe give advice on how to improve the R functions?
I think they both use the same underlying C++ implementation, but with different arguments. The R SDK should really have a way to set the logging and CacheOptions... I think the performance gain comes from PreBuffer.
I inserted this: pyarrow._s3fs.initialize_s3(pyarrow._s3fs.S3LogLevel.Trace)
It would have some output.
Right, running that.
During read_parquet, as I mentioned, I cannot print it out because it contains secure data. Also, the file is big, so it automatically starts filling up with too much info from the dataset.
I just want to know the number of IO requests and the time each one takes.
The time is 27s and the number of records is 217035.
Sorry, do you have any update on this?
I haven't gathered enough information yet, but I think it would help to allow configuring prefetching in R.
Hi @kostovasandra, thanks for the patience. Support for modifying the S3 log level was just merged, so that should go into the Arrow 15 release. Until then, what would really help is if you could give us a way to reproduce your issue ourselves. Is it possible to share the Parquet file, or could you write a script that generates a Parquet file that reproduces the issue and share that? I did a test with a 17 million row (~600MB) Parquet file on S3 and I get almost exactly the same timing (~17sec) in either R or Python.
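If it helps, here is a minimal hedged sketch of how such a side-by-side timing could be scripted in R; it reuses the s3_init() helper from the issue body below and a placeholder object key:

library(arrow)
library(dplyr)

bucket <- s3_init()               # helper from the issue body below
key <- "your_file.parquet"        # placeholder object key

# Time read_parquet() reading straight into an Arrow Table.
system.time(tbl <- read_parquet(bucket$path(key), as_data_frame = FALSE))

# Time the Datasets API on the same object for comparison.
system.time(df <- open_dataset(bucket$path(key), format = "parquet") %>% collect())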
Hey, thank you for your answer. I cannot share the file or reproduce the error for you, as the problem is simply the slow time of reading a file from S3.
Do you have an update on this?
Hi @kostovasandra, a few thoughts:
I hope my answer helps.
Hi @kostovasandra, just for good measure, can you give us your OS and arrow R package version? Other things to try:
The Arrow version is 13.0.0. The OS is Debian GNU/Linux 10.
I think Bryce suggested this because it uses a different Parquet implementation that is more similar to what Python is doing, and it would help us narrow down whether this is a problem with the R package or with the ParquetFileReader implementation! You can use open_dataset() for this.
This is what I tried, but I am getting an error. Could you show me how to read one Parquet file using open_dataset()?
I think the easiest thing would be to pass a URI for the file. You can construct the string for the URI following this pattern, replacing the appropriate values with your own:

open_dataset("s3://access_key:secret_key@yourbucketname/your_file.parquet?endpoint_override=yourendpoint")

So for your case you could do:

glue::glue("s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}")
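Putting the two pieces together, a hedged usage sketch (not from the thread); the environment variable names come from the issue's s3_init() helper and "your_file.parquet" is a placeholder key:

library(arrow)
library(dplyr)
library(glue)

env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                    "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))

# Build the credentialed S3 URI suggested above, then read it as a dataset.
uri <- glue("s3://{env['ECS_ACCESS_KEY_ID']}:{env['ECS_SECRET_ACCESS_KEY']}@{env['ECS_S3_BUCKET']}/your_file.parquet?endpoint_override={env['ECS_S3_ENDPOINT_URL']}")
df <- open_dataset(uri, format = "parquet") %>% collect()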
Describe the bug, including details regarding any error messages, version, and platform.
I have been using the read_parquet() function to read a file (compressed or uncompressed) from S3, but it is too slow (a 700MB file takes 10 minutes to read). I tried setting the params arrow::set_cpu_count(2) and arrow.use_threads = FALSE, but it is still slow.
Writing the same file takes 1-2 minutes, which is still not great.
Below is the code:
library(arrow)

s3_init <- function() {
  env <- Sys.getenv(c("ECS_ACCESS_KEY_ID", "ECS_SECRET_ACCESS_KEY",
                      "ECS_S3_ENDPOINT_URL", "ECS_S3_BUCKET"))
  bucket <- arrow::s3_bucket(env["ECS_S3_BUCKET"],
                             access_key = env["ECS_ACCESS_KEY_ID"],
                             secret_key = env["ECS_SECRET_ACCESS_KEY"],
                             endpoint_override = env["ECS_S3_ENDPOINT_URL"],
                             region = "")
  return(bucket)
}

# Writes a data frame to S3 as gzip-compressed Parquet.
s3_save_rds <- function(file, s3_path) {
  bucket <- s3_init()
  write_parquet(file, bucket$path(s3_path), compression = "gzip")
}

# Reads the Parquet file back from S3 (this is the slow call).
s3_read_rds <- function(s3_path) {
  bucket <- s3_init()
  file <- read_parquet(bucket$path(s3_path))
  return(file)
}
Component(s)
R