Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions About xls objects #160

Open
colearendt opened this issue Apr 20, 2020 · 12 comments
Open

Questions About xls objects #160

colearendt opened this issue Apr 20, 2020 · 12 comments

Comments

@colearendt
Copy link
Owner

@RE-Chief opening a separate thread to discuss since #159 is unrelated

Hi Cole,

I have older xls files that I am trying to read into R. I cannot read them into R using any other means so I am trying to use your read.xlsx2 to read them.

I get the following error which I do not understand are you able to help me out please?

Regards Charles

Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

@colearendt
Copy link
Owner Author

colearendt commented Apr 20, 2020

@RE-Chief I am not super familiar with .xls objects (as they are quite old), however, I believe read.xlsx, write.xlsx and friends are not going to be super helpful (as is indicated by their being named after the xlsx format). However, I know that Apache POI does support the XLS format (which it calls HSSF for "Horrible Spreadsheet Format" 😂 ).

There may be other libraries to work with this file format too.

You can interact with the documents directly using something like this (note I need to create one from scratch since I do not have any handy):

library(xlsx)

new_wb <- createWorkbook(type="xls")
new_wb
#> [1] "Java-Object{org.apache.poi.hssf.usermodel.HSSFWorkbook@65e98b1c}"

s1 <- createSheet(new_wb)
addDataFrame(iris, s1)

new_wb_path <- tempfile(fileext = ".xls")
saveWorkbook(new_wb, new_wb_path)

read_wb <- loadWorkbook(new_wb_path)
all_sheets <- getSheets(read_wb)

head(xlsx::readColumns(all_sheets[[1]], 1, 5, 1))
#>   X. Sepal.Length Sepal.Width Petal.Length Petal.Width
#> 1  1          5.1         3.5          1.4         0.2
#> 2  2          4.9         3.0          1.4         0.2
#> 3  3          4.7         3.2          1.3         0.2
#> 4  4          4.6         3.1          1.5         0.2
#> 5  5          5.0         3.6          1.4         0.2
#> 6  6          5.4         3.9          1.7         0.4

Created on 2020-04-19 by the reprex package (v0.3.0)

@RE-Chief
Copy link

RE-Chief commented Apr 20, 2020

Many thanks for your reply Cole. My xls files must be pre-97 as I ran the code you kindly provided and I was able to open the created xls file with either read.xlsx or read.xlsx2. I cannot open my xls files with either, yet they will open with excel so I know they are valid files. I can pay for some software to do this for me but the money being asked for as crazy dollars. Thanks again for your efforts, stay well.

Kind regards Charles

@colearendt
Copy link
Owner Author

colearendt commented Apr 20, 2020

@RE-Chief Did you try using loadWorkbook() as I showed above? Can you share the results of your output, the errors, etc.? My expectation is that loadWorkbook() along with readRows(), readColumns() or readRange() should work for .xls files. read.xlsx or read.xlsx2 will not work.

@colearendt
Copy link
Owner Author

Also, you might try readxl, which seems to support xls as well: https://readxl.tidyverse.org/

@RE-Chief
Copy link

Thanks Cole, I have tried readxl previously.

read_wb <- loadWorkbook("C:/Users/Aquatic/Desktop/EP Snowy Model/OutputMS1/1.xls")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.IllegalArgumentException: Your InputStream was neither an OLE2 stream, nor an OOXML stream

Kind regards Charles

@colearendt
Copy link
Owner Author

Interesting!! Well that is a different one! Any chance these files are misnamed and their encoding is actually some other type of file?

I.e. you could try opening the file in a text editor like Notepad or Wordpad. If it is a CSV file (a text file with common separated values), opening it with xlsx or readxl in this manner will fail, but it will open just fine in Excel.

@RE-Chief
Copy link

Hi Cole, I have attached an example file. It opens in notepad but as gobbledigook. It definitely opens in excel and I can save it as an xlsx file which will then open in R. But the whole purpose of this is to be able write a loop in R to open the xls files and save them as xlsx files.

1.zip

Kind regards Charles

@AmmarAli92
Copy link

Hi Cole, I have attached an example file. It opens in notepad but as gobbledigook. It definitely opens in excel and I can save it as an xlsx file which will then open in R. But the whole purpose of this is to be able write a loop in R to open the xls files and save them as xlsx files.

1.zip

Kind regards Charles

Hi Chief, I'm facing the same problem with .XLS files that I'm trying to process in R, I also tried readxl, XLConnect, xlsx and got no chance to open them in R. As you said, I can open them manually in excel and save them in xlsx format which will open in R but I have too many files that I need to process. any chances you solved this problem?

Thank you.

@RE-Chief
Copy link

@AmmarAli92 this issue was not resolved through R. I did manage to find a solution using libreoffice. This problem has reared its head again for me and I am currently working on a better solution. It may be a couple of weeks before it is resolved but I will let you know if and when that happens.

Kind regards Charles

@colearendt
Copy link
Owner Author

colearendt commented Apr 29, 2022

Finally got around to digging into this a smidge. It looks like this particular .xls file example is "invalid" according to the (newer, at least) Apache POI library, so will not be parsable without some modifications.

I tested this by digging in a little bit lower level, and seeing that the binary data format is not what Apache POI is expecting. Docs here and here

library(xlsx)
library(rJava)
myf <- .jnew("java/io/File", normalizePath("~/Downloads/bad.xls"))
poif <- .jnew("org/apache/poi/poifs/filesystem/POIFSFileSystem", myf)
#> Error in .jnew("org/apache/poi/poifs/filesystem/POIFSFileSystem", myf): org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x000A000200040009, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document

Created on 2022-04-29 by the reprex package (v0.3.0)

Unfortunately, it looks like the mechanism that is producing your .xls files is either invalid or quite outdated (perhaps using OLE1?)

If the .xls file is produced by Excel, for instance, things seem to work just fine. (Looking at the binary diff, though - things do differ quite a bit)

library(xlsx)
loadWorkbook("~/Documents/bad2.xls")
#> [1] "Java-Object{org.apache.poi.hssf.usermodel.HSSFWorkbook@1951b871}"

Created on 2022-04-29 by the reprex package (v0.3.0)

Some possible solutions I unfortunately don't have the time to dig into directly:

  • try using a very old version of the Apache POI library directly with rJava - it's possible that support for this older file format existed and was dropped at some point. The mechanism to load and resave the files could be pretty straightforward
  • if this file format predates Apache POI, then you may be looking at lower level file format conversions or another library with support
  • it may be possible to automate the conversion of these files directly inside of Excel using VBA 🙈 It is not a fun language to program in, but again just for opening / re-saving / closing files, it may be the easier path for these older formats.

I hope that helps!! Sorry for the trouble!

@jiangfenglyu
Copy link

Hi RE-Chief
I'm working on xls file which is not produced by EXCEL, and also have the same problem, same error code, and try readxl Package which isn't working.

And I ressolve this problem, here is my analysis:
Issue Analysis:
when xls file is produced by other software which is not by EXCEL, it will occur message box and tell you file format and extension does not match, file maybe corrupted, and so on
I think when R package,like readxl, xlsx , processing the xls file, it can't processing this message box.
Solution:
So I try another way
First, you need install RDCOMClient, here is link https://stackoverflow.com/questions/73190036/how-to-convert-xls-to-xlsx-using-r
and use function convert_XLS_File_To_XLSX as literally
Attention: two parameters, the former one is old file,like "C:\oldfile.xls", the latter one like "C\newfile.xlsx"
Second use read_xlsx or read.xlsx or readxl functions
and it worked.

Thks Emmanuel Hamel

@RE-Chief
Copy link

RE-Chief commented Nov 16, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants