Chapter 3 - Error executing get.msg() #4

erwtokritos · 2012-03-28T14:20:31Z

Hello guys,

Great book :-)
Right now, I am in the 3rd chapter (e-mail classification).
I am executing the R commands one by one and I am having a problem getting the list of spam documents (page 81).
The command is: all.spam <- sapply(spam.docs, function(p) get.msg(paste(spam.path,p,sep="")))

and the error I get is
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)

Any clue?
Thank you very much

cesarblum · 2012-04-14T03:41:06Z

I wish there was some way to upvote an issue. I'm having the exact same problem. I figured out that the problem seems to be with the "encoding" argument to the "file" function. If you remove it, it works, but the results you get are somewhat different from those in the book. Also, some weird tokens appear in the list of words found in the corpus. Someone also reported this problem at the Unconfirmed Errata page for the book at O'Reilly: http://oreilly.com/catalog/errataunconfirmed.csp?isbn=0636920018483

johnmyleswhite · 2012-04-14T13:03:37Z

Sorry about the lag on this, all. We'll look into it more this weekend and report back.

drewconway · 2012-04-20T19:07:28Z

I am having trouble replicating the error. The current version of the code in the repository reads as follows:

# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))

It runs fine for me on OS X and Ubuntu. So, perhaps the issue is the use of paste rather than the file.path function, or an operating system issue. The paste function does appear in the text of the book, which should be fixed in future editions.

cesarblum · 2012-04-21T13:01:29Z

I still get the errors when using file.path. These are the errors I get:

Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
invalid (to - from)/by in seq(.)
In addition: Warning messages:
1: In readLines(con) :
invalid input found on input connection 'data/spam//00006.5ab5620d3d7c6c0db76234556a16f6c1'
2: In readLines(con) :
invalid input found on input connection 'data/spam//00009.027bf6e0b0c4ab34db3ce0ea4bf2edab'
3: In readLines(con) :
invalid input found on input connection 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
4: In readLines(con) :
incomplete final line found on 'data/spam//00031.a78bb452b3a7376202b5e62a81530449'
5: In readLines(con) :
invalid input found on input connection 'data/spam//00035.7ce3307b56dd90453027a6630179282e'
6: In readLines(con) :
incomplete final line found on 'data/spam//00035.7ce3307b56dd90453027a6630179282e'

The problem seems to be with the encoding argument of the file function called in get.msg. If I remove encoding="latin1", the code runs without errors, but the results are quite different from those presented in the book.

I'm working on OS X with R 2.15.0.

johnmyleswhite · 2012-04-21T14:42:12Z

What operating system and version of R are you using?

-- John


cesarblum · 2012-04-21T16:51:14Z

I'm using OS X Lion with R 2.15.0 (installed from MacPorts).

hanfeisun · 2012-05-24T21:30:29Z

I also have this error.

foxet · 2012-07-22T15:29:56Z

That's because of the data files, not the code. Open and check data/spam/000*, which is not an email but a file list.

quasiben · 2012-09-03T15:36:37Z

@foxet is right. The file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' causes the problem. I amended the filter to also exclude files whose names begin with '0000.':

library(stringr)  # provides str_detect
spam.docs <- spam.docs[which( !str_detect(spam.docs,"^0000\\.") & spam.docs != 'cmds' )]
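
A more general variant of this filter can be sketched in base R. It assumes that every genuine message in the corpus is named with five digits, a dot, and a 32-character hexadecimal hash, as in the file names quoted in the warnings earlier in this thread. The helper name is.email.file is hypothetical and not part of the book's code.

```r
# Hypothetical helper: TRUE for names shaped like "00006.<32 hex characters>".
is.email.file <- function(file.names) {
  grepl("^[0-9]{5}\\.[0-9a-f]{32}$", file.names)
}

# Example directory listing with the two non-email entries named in this thread.
listing <- c("00006.5ab5620d3d7c6c0db76234556a16f6c1",
             "0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1",
             "cmds")

# Keep only the names that look like message files.
spam.docs <- listing[is.email.file(listing)]
```

In the chapter's script, listing would be the result of dir(spam.path). A pattern-based filter also drops any other stray file in the folder, provided the naming assumption holds for the local copy of the data.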

adayone · 2012-10-26T06:14:51Z

This is an encoding problem. readLines should work whether or not the file is an email.
Using con <- file(path, open="rt") instead of
con <- file(path, open="rt", encoding="utf-8")
will work.

ceekr · 2012-11-01T02:46:03Z

The encoding change does NOT seem to alter the behavior. I am running this on R 2.15.2 on Windows 7 x64. Here is my function:

get.msg <- function(path) {
con <- file(path, open="rt", encoding="native.enc")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
close(con)
return(paste(msg, collapse="\n"))
}

I have changed encoding to "utf-8", "latin1" and nothing happens. Same error.

Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

I also applied the suggestions by foxet and quasiben. The fact is my spam folder does not have this file '0000.7b1b73cf36cf9dbc3d64e3f2ee2b91f1' at all.

What am I missing, folks?

adayone · 2012-11-01T05:21:10Z

Do not define parameter "encoding", just use

con <- file(path, open="rt")


ceekr · 2012-11-01T15:24:33Z

Alright, I did this now: (Removed the encoding parameter)

get.msg <- function(path) {
con <- file(path, open="rt")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
close(con)
return(paste(text, collapse="\n"))
}

Ran the whole bunch again. The outcome:

            Error in seq.default(which(text == "")[1] + 1, length(text), 1) : invalid (to - from)/by in seq(.)

So, like I said earlier, the encoding parameter does not seem to have any effect. Again, I am running this on Windows 7 x64. And here is my whole bunch:

           spam.path <- "datasets/spam/"
           easyham.path <- "datasets/easy_ham/"
           hardham.path <- "datasets/hard_ham/"

           get.msg <- function(path) {
                    con <- file(path, open="rt")
                    text <- readLines(con)
                    # The message always begins after the first full line break
                    msg <- text[seq(which(text=="")[1] + 1, length(text), 1)]
                    close(con)
                    return(paste(text, collapse="\n"))
            }

            spam.docs <- dir(spam.path)
            spam.docs <- spam.docs[which(spam.docs!="cmds")]
            spam.docs <- paste(spam.path, spam.docs, sep="")
            all.spam.msgs <- sapply(spam.docs, get.msg)  # This is the line that throws the above error

adayone · 2012-11-01T15:27:47Z

You should check whether length(text) > 1.

haoyuan hu
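
A check of this kind can be sketched as follows. This is one possible guard, not the book's code: the name get.msg.safe and the choice to return an empty string for files without a body are assumptions. It covers the two cases in which the original seq() call can fail: no empty line is found at all (which(text == "")[1] is NA), or the first empty line is the last line of the file.

```r
# Hypothetical guarded variant of get.msg (assumption: files without a
# header/body separator should yield an empty message instead of an error).
get.msg.safe <- function(path) {
  con <- file(path, open = "rt")
  # Close the connection even if reading fails part-way through.
  on.exit(close(con))
  text <- readLines(con)

  # Index of the first empty line, or NA if the file has none.
  first.blank <- which(text == "")[1]
  if (is.na(first.blank) || first.blank >= length(text)) {
    return("")
  }

  # The message body is everything after the first empty line.
  paste(text[(first.blank + 1):length(text)], collapse = "\n")
}
```

Because on.exit() closes the connection on every exit path, this form should also avoid leaving connections open when a file cannot be read.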

ceekr · 2012-11-01T15:48:05Z

Lovely, that works!! Thanks mon. One last question: I see (intermittently) the socket open warning:

             Warning message: closing unused connection 3 (datasets/spam/desktop.ini)

I presume this is because the underlying code failed to close all the file connections? It does not happen all the time, though.

jamesbconner · 2012-12-02T23:25:55Z

Is there a permanent fix for this issue? I'm having the same problem. If I remove the encoding on the file(), then the get.msg function will work, but obviously you lose some encoding information.

Using Win 7 (64bit), RStudio 0.96.331, R 2.15.2

almartin82 · 2013-04-16T03:38:48Z

I can confirm that I am seeing a similar issue to the others above:
Error in seq.default(which(text == "")[1] + 1, length(text), 1) :
wrong sign in 'by' argument

Solved by dropping the encoding on con in get.msg. R 3.0.0 on Windows 7, 64 bit.

y1239051 · 2013-06-23T07:57:02Z

I have a problem with the following code:

get.msg <- function(path)
{
con <- file(path, open = "rt", encoding = "latin1")
text <- readLines(con)
# The message always begins after the first full line break
msg <- text[seq(which(text == "")[1] + 1, length(text), 1)]
close(con)
return(paste(msg, collapse = "\n"))
}

What can I do? Please, somebody help me!

y1239051 · 2013-06-23T12:34:17Z

I want to add that if I do not use the encoding parameter, it works,
but when I enter
spam.tdm <- get.tdm(all.spam)

I get the following error:
Error in tolower(txt) : invalid multibyte string 1

Does anyone have the same situation?
Please help me!

Thanks

Donnie-Liu · 2014-02-06T03:45:02Z

I have the same issue as y1239051. My system is Win7, 32bit, R version 3.0.2, RStudio Version 0.98.490.
However, it seems OK on my old XP system. Also, the command "spam.tdm <- get.tdm(all.spam)" took so long that I aborted it.
I will try again.

Donnie-Liu · 2014-02-06T07:27:33Z

Oops! I tried the XP system again and got the same error!

Donnie-Liu · 2014-02-06T09:48:38Z

I found a solution with these steps:

1. Remove "encoding='latin1'" in function get.msg()
2. In function get.tdm(), add
doc.corpus <- tm_map(doc.corpus, function(x) iconv(x, to='UTF-8', sub='byte'))
before
doc.dtm <- TermDocumentMatrix(doc.corpus, control)

With this change the program runs normally, but the results are a little different.

head(spam.df[with(spam.df,order(-occurrence)),])
        term frequency     density occurrence
7471   email       813 0.005859586      0.566
18382 please       425 0.003063129      0.508
14339   list       409 0.002947811      0.444
26848   will       828 0.005967697      0.422
2831    body       379 0.002731591      0.408
9124    free       539 0.003884769      0.390
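
As a hedged alternative to the tm_map() call in this comment, the same iconv() conversion can be applied to the plain character vector before it is passed to Corpus(). The sketch below uses base R only. The helper name clean.encoding and the default source encoding "latin1" (the encoding named in the book's get.msg) are assumptions, so the result may differ from what the tm_map() version produces.

```r
# Hypothetical helper: re-encode a character vector to UTF-8.
# Bytes that cannot be converted are replaced by their hex code, e.g. "<e9>",
# instead of making iconv() return NA for the whole string.
clean.encoding <- function(x, from = "latin1") {
  iconv(x, from = from, to = "UTF-8", sub = "byte")
}

# "\xe9" is the latin1 byte for e-acute; on its own it is not valid UTF-8.
raw.msgs <- c("plain ascii text", "caf\xe9 offer")
clean.msgs <- clean.encoding(raw.msgs)
```

Inside get.tdm(), this would correspond to building the corpus with Corpus(VectorSource(clean.encoding(doc.vec))), so that later steps such as tolower() receive valid UTF-8.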

laocan · 2014-03-31T15:43:53Z

@y1239051
After I changed the function 'get.msg' to {... con <- file(path, open = "rt") ...}
and deleted the badly encoded text (just one sentence) in the file "00136.faa39d8e816c70f23b4bb8758d8a74f0",
the command:
all.spam <- sapply(spam.docs,
               function(p) get.msg(file.path(spam.path, p)))
works.
But the following command:
spam.tdm <- get.tdm(all.spam)
fails with the same error:
Error in .tolower(txt) : invalid multibyte string 1
How did you fix it?
Thanks.

jnjcc · 2014-07-29T10:53:13Z

For those of you who still have this problem, I'd suggest removing the
"open" parameter from the file function. It worked for me on
R 3.0.3, Win7 x64, and didn't break anything on R 3.1.1, Ubuntu 12.04.

okamipride · 2015-03-10T10:47:50Z

After I changed the encoding parameter to con <- file(path, open = "rt", encoding ="native.enc"), the program runs; however, it still shows the warning "incomplete final line found on 'data/spam/00136.faa39d8e816c70f23b4bb8758d8a74f0' " at the end of the output. Does anyone know what causes this warning?
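
For context, readLines() emits this warning when the last line of a file has no end-of-line marker; on a blocking connection such as the one opened by file() here, that line is still returned. A minimal sketch, assuming the file is otherwise read completely and the warning is the only concern, passes warn = FALSE. The helper name read.all.lines is hypothetical.

```r
# Hypothetical helper: read every line of a file without the
# "incomplete final line" warning.
read.all.lines <- function(path) {
  con <- file(path, open = "rt")
  # Close the connection on every exit path.
  on.exit(close(con))
  readLines(con, warn = FALSE)
}

# Demonstration file whose last line has no trailing newline.
demo.path <- tempfile()
cat("header\n\nlast line without newline", file = demo.path)
demo.lines <- read.all.lines(demo.path)
```

If the warning instead signals that reading stopped early, suppressing it hides a real problem, so it is worth comparing length(demo.lines) against the file's contents for at least one affected message.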

bluesilence · 2015-04-22T11:22:26Z

Hi Donnie @Donnie-Liu,

I tested your solution; however,
your change to get.tdm causes this error:

Error: inherits(doc, "TextDocument") is not TRUE

Could you paste the full text of your get.tdm definition?

IbrahimZamit · 2016-05-08T16:55:36Z

Same thing here, @okamipride.
What is the solution to this warning?

divyanshofficials · 2018-03-12T09:35:32Z

library(tm)
library(ggplot2)

#defining paths

spam.path<- "data/spam/"
spam2.path<- "data/spam_2/"
easyham.path <- "data/easy_ham/"
easyham2.path <- "data/easy_ham_2/"
hardham.path <- "data/hard_ham/"
hardham2.path <- "data/hard_ham_2/"

#creating get.msg function

get.msg <- function(path) {
con <- file(path, open="rt", encoding="native.enc")
text <- readLines(con)

# The message always begins after the first full line break

msg <- text[seq(which(text=="")[1]+1,length(text),1)]
close(con)
return(paste(msg, collapse="\n"))
}

#creating spam training dataset

spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs!="cmds")]
all.spam <- sapply(spam.docs,function(p) get.msg(paste(spam.path, p,sep="")))

get.tdm <- function(doc.vec) {
doc.corpus <- Corpus(VectorSource(doc.vec))
control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE,
minDocFreq=2)
doc.dtm <- TermDocumentMatrix(doc.corpus, control)
return(doc.dtm)
}
spam.tdm <- get.tdm(all.spam)

spam.matrix <- as.matrix(spam.tdm)
spam.counts <- rowSums(spam.matrix)
spam.df <- data.frame(cbind(names(spam.counts),
as.numeric(spam.counts)), stringsAsFactors=FALSE)
names(spam.df) <- c("term","frequency")
spam.df$frequency <- as.numeric(spam.df$frequency)
spam.occurrence <- sapply(1:nrow(spam.matrix),
function(i) {length(which(spam.matrix[i,] > 0))/ncol(spam.matrix)})
spam.density <- spam.df$frequency/sum(spam.df$frequency)
spam.df <- transform(spam.df, density=spam.density,
occurrence=spam.occurrence)

#creating easyham.df

easyham.docs <- dir(easyham.path)
easyham.docs <- easyham.docs[which(easyham.docs!="cmds")]
all.easyham <- sapply(easyham.docs, function(p) get.msg(paste(easyham.path,p,sep="")))[1:500]

get.tdm <- function(doc.vec) {
doc.corpus <- Corpus(VectorSource(doc.vec))
control <- list(stopwords=TRUE, removePunctuation=TRUE, removeNumbers=TRUE,
minDocFreq=2)
doc.dtm <- TermDocumentMatrix(doc.corpus, control)
return(doc.dtm)
}
easyham.tdm <- get.tdm(all.easyham)

easyham.matrix <- as.matrix(easyham.tdm)
easyham.counts <- rowSums(easyham.matrix)
easyham.df <- data.frame(cbind(names(easyham.counts),
as.numeric(easyham.counts)), stringsAsFactors=FALSE)
names(easyham.df) <- c("term","frequency")
easyham.df$frequency <- as.numeric(easyham.df$frequency)
easyham.occurrence <- sapply(1:nrow(easyham.matrix),
function(i) {length(which(easyham.matrix[i,] > 0))/ncol(easyham.matrix)})
easyham.density <- easyham.df$frequency/sum(easyham.df$frequency)
easyham.df <- transform(easyham.df, density=easyham.density,
occurrence=easyham.occurrence)

#creating the classifier

classify.email <- function(path, training.df, prior=0.5, c=1e-6) {
msg <- get.msg(path)
msg.tdm <- get.tdm(msg)
msg.freq <- rowSums(as.matrix(msg.tdm))

# Find intersections of words

msg.match <- intersect(names(msg.freq), training.df$term)
if(length(msg.match) < 1) {
return(prior*c^(length(msg.freq)))
}
else {
match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
return(prior * prod(match.probs) * c^(length(msg.freq)-length(msg.match)))
}
}

#Testing the classifier

hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]
hardham.spamtest <- sapply(hardham.docs,
function(p) classify.email(paste(hardham.path, p, sep=""),
training.df=spam.df))
hardham.hamtest <- sapply(hardham.docs,
function(p) classify.email(paste(hardham.path, p, sep=""),
training.df=easyham.df))
hardham.res <- ifelse(hardham.spamtest > hardham.hamtest, TRUE, FALSE)
summary(hardham.res)

Use this code for chapter 3.
It includes code to create easyham.df, which is not given in the book, so you can use it as complete code together with the code for creating the easy ham data.
The encoding is changed from "latin1" to "native.enc".
Also, a file in the spam folder is corrupted, which is causing the errors. So the better alternative is to delete that file and then run the code.

Delete this file: spam/00002.d94f1b97e48ed3b553b3508d116e6a09.
Also, as written in the book, use only the first 500 sample emails from the easyham folder for better results.

I hope you find this solution useful.

jnjcc mentioned this issue Jul 29, 2014

Fix email_classify.R under windows #23

Open
