Chapter 3 - Error executing get.msg() #4

Open
erwtokritos opened this issue Mar 28, 2012 · 28 comments

Comments

@erwtokritos

Hello guys,

Great book :-)
Right now, I am in the 3rd chapter (e-mail classification).
I am executing the R commands one by one, and I am having a problem getting the list of spam documents (page 81).
The command is: all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))

and the error I get is:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)

Any clue?
Thank you very much

@cesarblum

I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483

@johnmyleswhite
Owner

Sorry about the lag on this, all. We'll look into it more this weekend and report back.

@drewconway
Collaborator

I am having trouble replicating the error. The current version of the code in the repository reads as follows:

# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))

It runs fine for me on OS X and Ubuntu. So perhaps the issue is the use of paste rather than file.path, or an operating system issue. The paste function does appear in the text of the book, which should be fixed in future editions.
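
For anyone comparing the two ways of building the path, here is a minimal sketch of how paste() and file.path() differ. The file name is hypothetical. Since file.path() inserts the separator itself, spam.path would not need a trailing slash when it is used; with a trailing slash the result contains a double slash, which is usually harmless on Unix-like systems.

```r
# Hypothetical file name, for illustration only.
p <- "00001.example"

# paste() only concatenates, so the directory must already end in "/".
path.from.paste <- paste("data/spam/", p, sep = "")

# file.path() adds the "/" separator itself.
path.from.file.path <- file.path("data/spam", p)

# Both approaches should give the same path here.
identical(path.from.paste, path.from.file.path)
```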

@cesarblum

I still get the errors when using file.path. These are the errors I get:

Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)
In addition: Warning messages:
1: In readLines(con) :
invalid input found on input connection 'data/spam//00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
invalid input found on input connection 'data/spam//00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
invalid input found on input connection 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
incomplete final line found on 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
invalid input found on input connection 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
incomplete final line found on 'data/spam//00035.7ce3307b56dd90453027a6630179282e'

The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book.

I'm working on OS X with R 2.15.0.
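
A small sketch of how the seq() call inside get.msg can fail, using made-up header lines rather than a real message. If readLines() stops early (the "invalid input" warnings above suggest it may), text can end up with no empty line at all, or with the empty line as its last element. The exact error wording depends on the R version, so the sketch only captures the messages instead of asserting them.

```r
# Made-up input: headers only, no blank line separating headers from body.
text.no.blank <- c("From: someone@example.com", "Subject: hello")
first.blank <- which(text.no.blank == "")[1]   # NA: no empty line found

err.no.blank <- tryCatch(
  seq(first.blank + 1, length(text.no.blank), 1),
  error = function(e) conditionMessage(e)
)

# Made-up input: the blank line is the last line, so 'from' exceeds 'to'.
text.blank.last <- c("From: someone@example.com", "")
err.blank.last <- tryCatch(
  seq(which(text.blank.last == "")[1] + 1, length(text.blank.last), 1),
  error = function(e) conditionMessage(e)
)

print(err.no.blank)
print(err.blank.last)
```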

@johnmyleswhite
Owner

What operating system and version of R are you using?

-- John


@cesarblum

I'm using OS X Lion with R 2.15.0 (installed from MacPorts).

@hanfeisun

I also have this error.

@foxet

foxet commented Jul 22, 2012

That's because of the data files, not the code. Open and check data/spam/000*: one of them is not an email but a file list.

@quasiben

quasiben commented Sep 3, 2012

@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter to exclude files whose names begin with '0000.' (str_detect is from the stringr package):

spam.docs <- spam.docs[which( !str_detect(spam.docs,"^0000\\.") & spam.docs != 'cmds' )]
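
A base-R variant of the same idea, as a sketch only. It assumes, based on the file names quoted in this thread, that every real message is named with five digits, a dot, and a 32-character hexadecimal hash. That naming pattern is an assumption and should be checked against the local copy of the data; anything else, such as cmds or desktop.ini, is dropped.

```r
# Example names taken from this thread, plus two non-message files.
spam.docs <- c("00006.5ab5620d3d7c6c0db76234556a16f6c1",
               "0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1",
               "cmds",
               "desktop.ini")

# Assumed naming pattern: five digits, a literal dot, 32 hex characters.
msg.pattern <- "^[0-9]{5}\\.[0-9a-f]{32}$"

# Keep only the names that match the assumed pattern.
spam.docs <- spam.docs[grepl(msg.pattern, spam.docs)]
spam.docs
```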

@adayone

adayone commented Oct 26, 2012

It's an encoding problem. readLines should work whether or not the file is an email.
Using
con <- file(path, open="rt")
instead of
con <- file(path, open="rt", encoding="utf-8")
will work.

@ceekr

ceekr commented Nov 1, 2012

The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function:

    get.msg <- function(path) {
        con <- file(path, open="rt", encoding="native.enc")
        text <- readLines(con)
        # The message always begins after the first full line break
        msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
        close(con)
        return(paste(msg, collapse="\n"))
    }

I have changed encoding to "utf-8", "latin1" and nothing happens. Same error.

Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

I also applied the suggestions by foxet and quasiben. The fact is my spam folder does not have this file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all.

What am I missing, folks?

@adayone

adayone commented Nov 1, 2012

Do not set the "encoding" parameter; just use

con <- file(path, open="rt")


@ceekr

ceekr commented Nov 1, 2012

Alright, I did this now: (Removed the encoding parameter)

    get.msg <- function(path) {
        con <- file(path, open="rt")
        text <- readLines(con)
        # The message always begins after the first full line break
        msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
        close(con)
        return(paste(text, collapse="\n"))
    }

Ran the whole bunch again. The outcome:

            Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole bunch:

           spam.path <- "datasets/spam/"
           easyham.path <- "datasets/easy_ham/"
           hardham.path <- "datasets/hard_ham/"

           get.msg <- function(path) {
                    con <- file(path, open="rt")
                    text <- readLines(con)
                    # The message always begins after the first full line break
                    msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
                    close(con)
                    return(paste(text, collapse="\n"))
            }

            spam.docs <- dir(spam.path)
            spam.docs <- spam.docs[which(spam.docs!="cmds")]
            spam.docs <- paste(spam.path, spam.docs, sep="")
            all.spam.msgs <- sapply(spam.docs, get.msg)  # This is the line that throws the above error

@adayone

adayone commented Nov 1, 2012

You should check whether length(text) > 1.
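
One way to apply that check, as a sketch rather than the book's code. get.msg.safe is a hypothetical name, and returning an empty string for files that have no body is an assumption about what is wanted downstream.

```r
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  text <- readLines(con, warn = FALSE)
  close(con)
  first.blank <- which(text == "")[1]
  # No blank line, or nothing after it: there is no message body to extract.
  if (is.na(first.blank) || first.blank >= length(text)) {
    return("")
  }
  msg <- text[seq(first.blank + 1, length(text), 1)]
  paste(msg, collapse = "\n")
}

# Quick check on temporary files with made-up content.
good <- tempfile()
writeLines(c("Subject: hi", "", "Body line 1", "Body line 2"), good)
bad <- tempfile()
writeLines(c("[.ShellClassInfo]", "IconIndex=0"), bad)

get.msg.safe(good)
get.msg.safe(bad)
```

With this guard, a stray non-email file such as desktop.ini yields an empty string instead of stopping the whole sapply call.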


@ceekr

ceekr commented Nov 1, 2012

Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:

             Warning message: closing unused connection 3 (datasets/spam/desktop.ini) 

This I am presuming is because the underlying code failed to close all the File Sockets? It does not happen all the time though.
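
R prints that warning when it garbage-collects a connection that was opened but never closed. In get.msg, an error in seq() occurs before close(con) runs, so the warning is probably left over from a run that failed partway. A sketch of the usual guard, on.exit(); read.lines.closing is a hypothetical stand-in for get.msg, not code from the book.

```r
read.lines.closing <- function(path) {
  con <- file(path, open = "rt")
  # Registered immediately, so the connection is closed on normal exit
  # and also when a later line raises an error.
  on.exit(close(con))
  text <- readLines(con, warn = FALSE)
  if (length(text) == 0) stop("empty file: ", path)
  text
}

# Quick check on a temporary file with made-up content.
tmp <- tempfile()
writeLines(c("a", "b"), tmp)
lines.read <- read.lines.closing(tmp)
lines.read
```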

@jamesbconner

Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information.

Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2

@almartin82

I can confirm that I am seeing a similar issue to the others above:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
wrong sign in 'by' argument

Solved by dropping the encoding on con in get.msg. R 3.0.0 on Windows 7, 64 bit.

@y1239051

I have a problem with the following code:

    get.msg <- function(path)
    {
        con <- file(path, open = "rt", encoding = "latin1")
        text <- readLines(con)
        # The message always begins after the first full line break
        msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
        close(con)
        return(paste(msg, collapse = "\n"))
    }

What can I do? Could somebody please help me?

@y1239051

I want to add that if I do not use the encoding parameter, it works,
but when I enter
spam.tdm <- get.tdm(all.spam)

the error output is the following:
Error in tolower(txt) : invalid multibyte string 1

Does anyone have the same situation?
Please help me!

Thanks

@Donnie-Liu

I have the same issue as y1239051. My system is Win7, 32-bit, R version 3.0.2, RStudio version 0.98.490.
However, it seems OK on my old XP system. But the command "spam.tdm <- get.tdm(all.spam)" took so long that I aborted it.
I will try again.

@Donnie-Liu

Oops! I tried the XP system again and got the same error!

@Donnie-Liu

I found a solution following these steps:

  1. Remove "encoding='latin1'" in function get.msg()
  2. In function get.tdm(), add
    doc.corpus <- tm_map(doc.corpus, function(x) iconv(x, to='UTF-8', sub='byte'))
    before
    doc.dtm <- TermDocumentMatrix(doc.corpus, control)

The solution made the program run normally, but the results are a little different.

head(spam.df[with(spam.df,order(-occurrence)),])
        term frequency     density occurrence
7471   email       813 0.005859586      0.566
18382 please       425 0.003063129      0.508
14339   list       409 0.002947811      0.444
26848   will       828 0.005967697      0.422
2831    body       379 0.002731591      0.408
9124    free       539 0.003884769      0.390
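
The iconv() step can also be tried on the character vector before the corpus is built, which avoids going through tm_map at all. A sketch with a made-up string; treating the raw bytes as latin1 is an assumption that matches the encoding used in the book's get.msg, not something verified against every file. validUTF8() needs R 3.3.0 or later.

```r
# Made-up example: "cafe" with the last letter as the latin1 byte 0xE9.
x <- "caf\xe9"

# Declare the source encoding and convert. sub = "byte" replaces any
# unconvertible bytes with an escape such as <e9> instead of returning NA.
y <- iconv(x, from = "latin1", to = "UTF-8", sub = "byte")

validUTF8(x)   # FALSE: a lone 0xE9 byte is not valid UTF-8
validUTF8(y)   # TRUE
tolower(y)
```

Applied to the thread's data, this would presumably look like all.spam <- iconv(all.spam, from = "latin1", to = "UTF-8", sub = "byte") before calling get.tdm(all.spam).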

@laocan

laocan commented Mar 31, 2014

@y1239051
After I changed the function 'get.msg' to {... con <- file(path, open = "rt") ...}
and deleted the wrongly encoded text (just one sentence) in the file "00136.faa39d8e816c70f23b4bb8758d8a74f0",
the command:
    all.spam <- sapply(spam.docs,
                   function(p) get.msg(file.path(spam.path, p)))
works.
But the following command:
    spam.tdm <- get.tdm(all.spam)
hit the same problem:
    Error in .tolower(txt) : invalid multibyte string 1
How did you fix it?
Thanks.

@jnjcc

jnjcc commented Jul 29, 2014

For those of you who still have this problem, I'd suggest trying to remove the
"open" parameter from the file function. It worked for me on
R 3.0.3, Win7 x64, and didn't break anything on R 3.1.1, Ubuntu 12.04.

@okamipride

After I changed the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0'" at the end. Does anyone know what's wrong here?
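
That warning is generally harmless: readLines() emits it when the last line of a file has no trailing newline, and the lines are still returned. A sketch with a temporary file and made-up content; warn is a documented readLines argument, and setting it to FALSE silences the message.

```r
tmp <- tempfile()
# cat() writes no trailing newline, so the final line is "incomplete".
cat("line one\nline two", file = tmp)

con <- file(tmp, open = "rt")
text <- readLines(con, warn = FALSE)  # same lines, no warning
close(con)
text
```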

@bluesilence

Hi Donnie @Donnie-Liu,

I tested your solution; however,
your change to get.tdm causes this error:

Error: inherits(doc, "TextDocument") is not TRUE

Could you paste the full text of your get.tdm definition?

@IbrahimZamit

Same thing here, okamipride.
What is the solution to this warning?

@divyanshofficials

library(tm)
library(ggplot2)

# Defining paths

spam.path <- "data/spam/"
spam2.path <- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham2.path <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"

# Creating the get.msg function

get.msg <- function(path) {
    con <- file(path, open="rt", encoding="native.enc")
    text <- readLines(con)
    # The message always begins after the first full line break
    msg <- text[seq(which(text=="")[1]+1,length(text),1)]
    close(con)
    return(paste(msg, collapse="\n"))
}

# Creating the spam training dataset

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path, p, sep="")))

# Build a term-document matrix from a vector of messages
get.tdm <- function(doc.vec) {
    doc.corpus <- Corpus(VectorSource(doc.vec))
    control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE,
                    minDocFreq=2)
    doc.dtm <- TermDocumentMatrix(doc.corpus, control)
    return(doc.dtm)
}
spam.tdm <- get.tdm(all.spam)

spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
spam.df <- data.frame(cbind(names(spam.counts),
                            as.numeric(spam.counts)), stringsAsFactors=FALSE)
names(spam.df) <- c("term","frequency")
spam.df$frequency <- as.numeric(spam.df$frequency)
# Share of spam documents in which each term appears
spam.occurrence <- sapply(1:nrow(spam.matrix),
    function(i) {length(which(spam.matrix[i,] > 0))/ncol(spam.matrix)})
spam.density <- spam.df$frequency/sum(spam.df$frequency)
spam.df <- transform(spam.df, density=spam.density,
                     occurrence=spam.occurrence)

# Creating easyham.df (get.tdm is reused from above)

easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs!="cmds")]
all.easyham <- sapply(easyham.docs, function(p) get.msg(paste(easyham.path, p, sep="")))[1:500]

easyham.tdm <- get.tdm(all.easyham)

easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
easyham.df <- data.frame(cbind(names(easyham.counts),
                               as.numeric(easyham.counts)), stringsAsFactors=FALSE)
names(easyham.df) <- c("term","frequency")
easyham.df$frequency <- as.numeric(easyham.df$frequency)
# Share of easy ham documents in which each term appears
# (divides by the easy ham document count, not the spam one)
easyham.occurrence <- sapply(1:nrow(easyham.matrix),
    function(i) {length(which(easyham.matrix[i,] > 0))/ncol(easyham.matrix)})
easyham.density <- easyham.df$frequency/sum(easyham.df$frequency)
easyham.df <- transform(easyham.df, density=easyham.density,
                        occurrence=easyham.occurrence)

# Creating the classifier

classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
    msg <- get.msg(path)
    msg.tdm <- get.tdm(msg)
    msg.freq <- rowSums(as.matrix(msg.tdm))
    # Find intersections of words
    msg.match <- intersect(names(msg.freq), training.df$term)
    if(length(msg.match) < 1) {
        return(prior*c^(length(msg.freq)))
    } else {
        match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
        return(prior * prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
    }
}

# Testing the classifier

hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
hardham.spamtest <- sapply(hardham.docs,
    function(p) classify.email(paste(hardham.path, p, sep=""),
                               training.df=spam.df))
hardham.hamtest <- sapply(hardham.docs,
    function(p) classify.email(paste(hardham.path, p, sep=""),
                               training.df=easyham.df))
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
summary(hardham.res)

Use this code for chapter 3.
It includes code that builds easyham.df, which is not given in the book, so you can use this as a complete script.
The encoding is changed from "latin1" to "native.enc".
Also, a file in the spam folder is corrupted, which is causing the errors. So a better alternative is to delete that file and then run the code.

Delete this file: spam/00002.d94f1b97e48ed3b553b3508d116e6a09.
Also, as written in the book, use only the first 500 sample mails from the easyham folder for better results.

I hope you find this solution useful.
