Chapter 3 - Error executing get.msg() #4
Comments
I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483 |
Sorry about the lag on this, all. We'll look into it more this weekend and report back. |
I am having trouble replicating the error. The current version of the code in the repository reads as follows:
It runs fine for me on OS X and Ubuntu. So, perhaps the issue is the use of |
I still get the errors when using file.path. These are the errors I get: Error in seq.default(which(text == "")[1] + 1, length(text), 1) : The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book. I'm working on OS X with R 2.15.0. |
What operating system and version of R are you using? -- John On Apr 21, 2012, at 9:01 AM, Cesar L. B. Silveira wrote:
|
I'm using OS X Lion with R 2.15.0 (installed from MacPorts). |
I also have this error. |
That's because of the data files, not the code. Open and check data/spam/000*: it is not an email but a file list. |
@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter to exclude files which begin with '0000.': spam.docs <- spam.docs[which( !str_detect(spam.docs,"^0000\\.") & spam.docs != 'cmds' )] |
It's an encoding problem. readLines should work whether or not the file is an email. |
The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function: get.msg <- function(path) { I have changed encoding to "utf-8" and "latin1", and nothing changes. Same error: Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.) I also applied the suggestions by foxet and quasiben. In fact, my spam folder does not have the file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all. What am I missing, folks? |
Do not define the "encoding" parameter; just use con <- file(path, open="rt") 2012/11/1 Kingshuk Chatterjee
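The filtering idea in the comment above can be sketched in base R, with `grepl` swapped in for `str_detect` so that no extra package is needed. The three file names are taken from this thread and stand in for the output of `dir(spam.path)`; this is an illustration under that assumption, not the repository's code.

```r
# Stand-in for dir("data/spam/"); names are the ones mentioned in this thread
spam.docs <- c("0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1",
               "00136.faa39d8e816c70f23b4bb8758d8a74f0",
               "cmds")

# Keep entries that are not 'cmds' and do not start with the literal prefix "0000."
# (the dot is escaped so it matches only a real dot, not any character)
keep <- spam.docs != "cmds" & !grepl("^0000\\.", spam.docs)
spam.docs <- spam.docs[keep]
```

Escaping the dot matters little for this data set, but it keeps the pattern from accidentally matching names such as `00001...`.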
|
Alright, I did this now: (Removed the encoding parameter) get.msg <- function(path) { Ran the whole bunch again. The outcome:
So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole bunch:
|
You should check whether length(text) > 1. haoyuan hu On Thursday, November 1, 2012 at 11:24 PM, Kingshuk Chatterjee wrote:
|
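One way to apply the check suggested above is to guard the indexing step, since `which(text == "")[1]` is `NA` when a file has no blank line and `seq()` cannot start from `NA`. The sketch below is a hedged variant of the book's function: the name `get.msg.safe` is hypothetical, the `encoding` argument is omitted as an assumption following the suggestions in this thread, and returning an empty string for files without a body is a design choice made here, not something the book prescribes.

```r
# Hypothetical guarded variant of get.msg (name and behavior are assumptions)
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  text <- readLines(con, warn = FALSE)
  close(con)

  # Index of the first empty line; NA if the file has none (or is empty)
  first.blank <- which(text == "")[1]

  # No header/body separator, or nothing after it: there is no body to return
  if (is.na(first.blank) || first.blank >= length(text)) {
    return("")
  }

  # The message body is everything after the first empty line
  msg <- text[(first.blank + 1):length(text)]
  paste(msg, collapse = "\n")
}
```

Files that are not real emails then yield `""` instead of stopping the whole `sapply` call, so they can be filtered out afterwards.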
Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:
This I am presuming is because the underlying code failed to close all the File Sockets? It does not happen all the time though. |
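The warning text itself did not survive in this thread, so the following assumes it is R's notice about a connection that was left open. That can happen when an error is raised between `file()` and `close()`. A minimal sketch of one common remedy is to register the close with `on.exit`; the helper name `read.lines.closed` is hypothetical.

```r
# Hypothetical helper: read a file and always release the connection
read.lines.closed <- function(path) {
  con <- file(path, open = "rt")
  # Registered right after opening, so it runs on normal return and on error
  on.exit(close(con), add = TRUE)
  readLines(con, warn = FALSE)
}
```

With this pattern the connection is closed even if a later step in the function fails.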
Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information. Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2 |
Can confirm that I am seeing a similar issue as others above - Solved by dropping the encoding on |
I have a problem with the following code: get.msg <- function(path) # The message always begins after the first full line break msg <- text[seq(which(text == "")[1] + 1, length(text), 1)] What can I do? Please, somebody help me! |
I want to add that if I do not use the encoding parameter, it works. The error output is the following: Does anyone else have the same situation? Thanks |
I have the same issue as y1239051. My system is Win7, 32-bit, R version 3.0.2, RStudio version 0.98.490. |
Oops! I tried again on an XP system and got the same error! |
I found a solution following these steps:
The solution made the program run normally, but the results are a little different.
|
@y1239051
|
For those of you who still have this problem, I'd suggest trying to remove the |
After I changed the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0' " at the end of the output. Does anyone know what causes this warning? |
Hi Donnie @Donnie-Liu, I tested your solution; however, I get: Error: inherits(doc, "TextDocument") is not TRUE. Could you paste the full text of your get.tdm definition? |
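On the warning asked about above: `readLines` reports "incomplete final line" when the last line of a file has no terminating newline. The line is still read, and the `warn = FALSE` argument suppresses the message. A small sketch using a temporary file (the file contents are made up for illustration):

```r
tmp <- tempfile()
# Write two lines, deliberately leaving the last one without a newline
cat("first line\nlast line without newline", file = tmp)

con <- file(tmp, open = "rt")
# warn = FALSE silences the "incomplete final line" warning; both lines are returned
text <- readLines(con, warn = FALSE)
close(con)
unlink(tmp)
```

So the warning is harmless for this use case; it only signals how the file ends.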
Same thing here okamipride |
library(tm)
#defining paths
spam.path <- "data/spam/"
#creating get.msg function
get.msg <- function(path) {
# The message always begins after the first full line break
msg <- text[seq(which(text=="")[1]+1,length(text),1)]
#creating spam training dataset
spam.docs <- dir(spam.path)
get.tdm <- function(doc.vec) {
spam.matrix <- as.matrix(spam.tdm)
#creating easyham.df
easyham.docs <- dir(easyham.path)
get.tdm <- function(doc.vec) {
easyham.matrix <- as.matrix(easyham.tdm)
#creating the classifier
classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
# Find intersections of words
msg.match <- intersect(names(msg.freq), training.df$term)
#Testing the classifier
hardham.docs <- dir(hardham.path)
Use this code in chapter 3 and delete the file spam/00002.d94f1b97e48ed3b553b3508d116e6a09. I hope you find this solution good enough. |
Hello guys,
Great book :-)
Right now, I am in the 3rd chapter (e-mail classification).
I am executing the R commands one by one and I am having a problem getting the list of spam documents (page 81).
The command is : all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))
and the error I get is
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)
Any clue?
Thank you very much