RHive query results transfer extremely slow #50
Comments
rhive.query function was badly designed. |
+1 ;) |
FYI, using RJDBC with HiveServer2 is also painfully slow. It is less slow than RHive, but still extremely slow, and much slower than with other JDBC drivers. So I guess the problem comes from HiveServer. Right now the only acceptable way to transfer data from Hive to R is to export the data from Hive to a local CSV file and load it using data.table's fread. |
A simple solution I came up with is to pipe your Hive query through the command line. You can then use readLines() and split the fields on the '\t' delimiter, creating a data.table out of your query results.
require(data.table)
hive_pipe <- pipe("hive -e 'SELECT my, table, columns FROM my_table'")
This is significantly faster than rhive.query, and it doesn't require writing your query results out to a CSV file first. If I have the time, I may fork rhive and add this solution to my version. |
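For readers who want to try the pipe approach described above, here is a minimal sketch in base R. The helper name `read_tab_output` and its `col_names` argument are hypothetical (not part of RHive or data.table), and it assumes the command writes plain tab-separated rows to standard output. The `printf` command is only a stand-in so the sketch runs without Hive installed.

```r
# Hypothetical helper: run a shell command and parse its tab-separated stdout.
read_tab_output <- function(cmd, col_names = NULL) {
  con <- pipe(cmd, open = "r")          # start the command and read its stdout
  on.exit(close(con), add = TRUE)       # always close the pipe when done
  df <- read.delim(con, header = FALSE, sep = "\t", quote = "",
                   stringsAsFactors = FALSE)
  if (!is.null(col_names)) names(df) <- col_names
  df
}

# With the Hive CLI available (query text taken from the comment above):
# results <- read_tab_output(
#   "hive -e 'SELECT my, table, columns FROM my_table'",
#   col_names = c("my", "table", "columns"))

# Stand-in command producing two tab-separated rows:
demo <- read_tab_output("printf '1\\ta\\n2\\tb\\n'",
                        col_names = c("id", "label"))
print(demo)
```

The command is run by the shell of the R process, so the `hive` executable has to be on that process's PATH. The result is a plain data frame; `data.table::as.data.table()` can convert it if a data.table is preferred.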
Hi sherath21, can you please explain how you connected from RStudio to the Hive CLI? When I tried to proceed as you did, I got an error on the line results=data.table(readLines(hive_pipe)). The error says: sh: 1: hive: not found. My code is the following:
#Initialize rhive
library(data.table)
#Establish the connection
hive_pipe=pipe("hive -e 'USE hello_db; SELECT * FROM table_txt limit 10'")
Any idea? Thanks |
I think you may be missing the fact that my solution is a way to bypass [...] |
Hi,
library(data.table) If thinking about RStudio ignore the paths or rstudio user: |
@imanopholist For this reason, RHive has a function Thanks. |
@DrakeMin thanks, I'll try that right away ! |
Hi @DrakeMin, any ideas? I just want to add that fetching a small table works with big.query, with no need to use load.table2. Thank you |
Hi, |
@imanopholist Can you provide the full log of the hive-server and/or the related MR job (the related Hive move task)? |
Hi @DrakeMin thanks for responding. converting to local hdfs://localhost:9000/rhive/lib/2.0-0.4/rhive_udf.jar #rhive.load.table2 (added comment not in result)--> Query ID = hadoopuser_20160516091619_2c275aec-9cf1-4597-aca1-9b209d1fd45c |
@imanopholist hmm. It's weird. The query is complete, temp data is stored in How about the HDFS Namenode log at that time? I think move task will be a |
@DrakeMin Log : 2016-05-16 16:27:12,770 INFO logs: Aliases are enabled |
@imanopholist Sorry, I currently have no idea what causes this error. I'll try to reproduce it on our test bed. Thanks. |
@DrakeMin thank you ! |
Hi @sherath21, do you have an explanation for this? Imane |
Hello,
When I do a simple
rhive.query("select * from X limit 10000")
it takes 90s to answer once the query is completed on the hiveserver (OK displayed on the console).
It increases linearly with data size, always exactly 9 ms per line, and it does not depend on the line length.
It is several orders of magnitude slower than any other kind of data transfer between R and anything else. My guess is that there is some kind of timeout somewhere.
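As a quick check of the figures above, the following R sketch multiplies the reported per-line cost by the row count. The 1,000,000-line extrapolation is only an assumption that the linear trend continues; it was not measured.

```r
# Reported figures: 10000 lines fetched, about 9 ms of transfer time per line.
lines_fetched <- 10000
ms_per_line   <- 9

transfer_seconds <- lines_fetched * ms_per_line / 1000
print(transfer_seconds)   # 90, matching the 90 s observed

# Assumed extrapolation: the same cost applied to one million lines.
hours_for_million <- 1e6 * ms_per_line / 1000 / 3600
print(hours_for_million)  # 2.5
```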