Chapter 3: Contents of spam.df don't match output in book #35

Open
ChrisHowlin opened this issue Oct 17, 2015 · 2 comments

Comments

@ChrisHowlin

In Chapter 3 we construct a spam filter based on the data in the folder:

ML_for_Hackers/03-Classification/data/spam

In the book, the terms in these emails are ordered by occurrence with the command below. The book lists the following table with html at the top:

head(spam.df[with(spam.df, order(-occurrence)),])

      term frequency     density occurrence
2122  html       377 0.005665595      0.338
538   body       324 0.004869105      0.298
4313 table      1182 0.017763217      0.284
1435 email       661 0.009933576      0.262
1736  font       867 0.013029365      0.262
1942  head       254 0.003817138      0.246

When I run the code directly, my output does not match the book's: email is at the top instead.

        term frequency     density occurrence
7781   email       813 0.005853680      0.566
18809 please       425 0.003060042      0.508
14720   list       409 0.002944840      0.444
27309   will       828 0.005961681      0.422
3060    body       379 0.002728837      0.408
9457    free       539 0.003880853      0.390

This seems to be explained by how the document vectors are processed with the removePunctuation setting. Punctuation is deleted outright, so terms that were separated only by punctuation are fused into a single new term. For example, <html><head> becomes htmlhead. As a result, instead of html being listed as a common term in many of the emails, we get lots of low-frequency combinations of html with other HTML tag keywords.
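
The fusing effect can be reproduced on a toy string with base R alone. This is only a sketch: the sample string and the variable names are made up, the gsub call approximates what removePunctuation does rather than calling the tm package itself, and replacing punctuation with a space is an assumed workaround, not something taken from the book or confirmed to reproduce its table.

```r
# Toy input (hypothetical): HTML tags with no whitespace between them.
raw <- "<html><head><title>Offer</title></head>"

# Deleting punctuation outright, which approximates removePunctuation,
# fuses the adjacent tag names into one long token.
deleted <- gsub("[[:punct:]]+", "", raw)

# Assumed workaround: replace punctuation with a space instead,
# so each tag name survives as its own term.
spaced <- gsub("[[:punct:]]+", " ", raw)
tokens <- strsplit(trimws(spaced), "\\s+")[[1]]

print(deleted)
print(tokens)
```

With the second approach html stays a separate term, so it would be counted once per email that contains the tag instead of disappearing into rare combined terms.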

@IbrahimZamit

@ChrisHowlin, what seems to be the solution for this issue, in order to obtain the same results as the book?

@NumberOne925

Do you guys know why I get different results than in the book? Why is this happening?

@pythonandr

@NumberOne925
I got the same result as yours.
I think it is normal.
