
Md2 #1 (Open)

wants to merge 39 commits into base: main

Changes from 1 commit

Commits (39)
d56466c
intial commit of scrapy project
shreya025 Dec 3, 2020
edeef88
Add files via upload
angela81ku Dec 4, 2020
99c55a6
added web scraping data-cleaning EDA files
sourav-naskar Dec 12, 2020
3ec4cdd
added web scraping files
sourav-naskar Dec 12, 2020
70ea705
added Web scraping data cleaning EDA files
sourav-naskar Dec 12, 2020
a74f34d
Add the webscraping and EDA tutorials
elateifsara Dec 12, 2020
9daa6a2
Pull the changes
elateifsara Dec 12, 2020
253db76
Clean up the folder and restructure
elateifsara Dec 12, 2020
2d94b47
Module2
YasaminAbbaszadegan Dec 14, 2020
799dd0f
updated webscraping data cleaning EDA files
sourav-naskar Dec 19, 2020
6821c53
updated files
sourav-naskar Dec 19, 2020
2fb5b70
updated files
sourav-naskar Dec 19, 2020
efd1c36
updated files
sourav-naskar Dec 19, 2020
0ebda3a
updated files
sourav-naskar Dec 19, 2020
79abadf
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
YasaminAbbaszadegan Dec 19, 2020
5addf75
Add text file
elateifsara Dec 20, 2020
621a626
Add Yasamin notebook to our repo
elateifsara Dec 20, 2020
7b46144
Add Flowster Forum Scraping Example notebook
elateifsara Dec 21, 2020
1724a71
Pull latest changes
elateifsara Dec 21, 2020
7d17c49
Remove sara_elateif folder (served as demonstration example)
elateifsara Dec 21, 2020
dcacd44
initial commit
Sachitt Jan 2, 2021
d0f918f
updated files
sourav-naskar Jan 2, 2021
56474e9
module two - forum spider and data cleaning
shreya025 Jan 2, 2021
eca50f1
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
shreya025 Jan 2, 2021
5dff8ce
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
Sachitt Jan 2, 2021
2a7b9ef
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
YasaminAbbaszadegan Jan 5, 2021
7aa8c37
Final_WebScraping_Version
YasaminAbbaszadegan Jan 5, 2021
ca52f8a
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
Sachitt Jan 11, 2021
4dce683
Restructuring file system
Sachitt Jan 11, 2021
56753d5
restructuring file system
Sachitt Jan 11, 2021
6e8b2d7
Added information acquired by scrolling
Sachitt Jan 11, 2021
825f033
More EDA
Sachitt Jan 11, 2021
5fbda64
Final version of Webscraping DataCleaning EDA files
sourav-naskar Jan 12, 2021
508729a
Merge branch 'md2' of https://github.com/mentorchains/level1_post_rec…
Sachitt Jan 13, 2021
e7f2681
renaming files and adding cleaneddata.csv to be used in md3
Sachitt Jan 13, 2021
f77f4c3
Added a csv file with stopwords to be used in md3
Sachitt Jan 16, 2021
9e46560
adjusted to clean amazon data
Sachitt Jan 30, 2021
1eda659
Add assets folder
elateifsara Jun 15, 2021
6bded57
Add new stuff from md2
elateifsara Jun 15, 2021
updated files
sourav-naskar committed Jan 2, 2021
commit d0f918fe57484b509e5e18d7a5cbb8ce78b465f4
30,839 changes: 0 additions & 30,839 deletions Sourav_Naskar/Codeacademy Data cleaning & EDA.ipynb

This file was deleted.

2,370 changes: 0 additions & 2,370 deletions Sourav_Naskar/Codeacademy Webscraping.ipynb

This file was deleted.

19,685 changes: 19,685 additions & 0 deletions Sourav_Naskar/Codeacademy20210102131605.csv

Large diffs are not rendered by default.

26,470 changes: 0 additions & 26,470 deletions Sourav_Naskar/Codeacademy_Discuss.csv

This file was deleted.

10,533 changes: 10,533 additions & 0 deletions Sourav_Naskar/Data_cleaning & EDA_Codeacademy.ipynb

Large diffs are not rendered by default.

42,532 changes: 42,532 additions & 0 deletions Sourav_Naskar/Webscraping_Codeacademy.ipynb

Large diffs are not rendered by default.

219 changes: 219 additions & 0 deletions Sourav_Naskar/Webscraping_Codeacademy.py
@@ -0,0 +1,219 @@
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from datetime import datetime
import os
import pandas as pd

class CodeacademyWebscraper:
    def __init__(self, webdriverPath):
        # Set up a headless Chrome webdriver
        options = webdriver.ChromeOptions()
        options.add_argument('--ignore-certificate-errors')  # Ignore security certificates
        options.add_argument('--incognito')                  # Use Chrome in Incognito mode
        options.add_argument('--headless')                   # Run in the background
        self.driver = webdriver.Chrome(
            executable_path=webdriverPath,
            options=options)

        # Dictionary of all topics and their attributes
        self.topicDict = {}

        # Pandas dataframe of all topic attributes
        self.topicDataframe = pd.DataFrame(columns=[
            'Topic Title',
            'Category',
            'Tags',
            'Leading Comment',
            'Other Comments',
            'Likes',
            'Views'])

    def get_title(self, topicSoup):
        topicName = topicSoup.find('a', class_='fancy-title').text

        # Remove leading and trailing spaces and newlines
        topicName = topicName.replace('\n', '').strip()
        return topicName

    def get_category_and_tags(self, topicSoup):
        topicCategoryDiv = topicSoup.find('div', class_='topic-category ember-view')
        tagAnchors = topicCategoryDiv.find_all('span', class_='category-name')

        tagList = [anchor.text for anchor in tagAnchors]

        # The first entry is the category; any remaining entries are tags
        category = tagList[0]
        tags = tagList[1:]
        return category, tags


    def get_comments(self, topicSoup):
        # Get the text of each post on the topic page
        # (appending per element avoids the earlier bug where every comment
        # also accumulated the text of all comments before it)
        commentDivs = topicSoup.find_all('div', class_='cooked')
        comments = [div.get_text() for div in commentDivs]

        # The first post is the leading comment; the rest are replies
        if not comments:
            return '', []
        return comments[0], comments[1:]


    def get_views(self, topicSoup):
        views = topicSoup.find('li', class_='secondary views')
        if views is None:
            return str(0)
        return views.span.text

    def get_likes(self, topicSoup):
        likes = topicSoup.find('li', class_='secondary likes')
        if likes is None:
            return str(0)
        return likes.span.text

    def runApplication(self, baseURL):
        # Open the forum in the Selenium-driven Chrome client and get the page source
        self.driver.get(baseURL)
        baseHTML = self.driver.page_source

        # Generate a soup object from the base HTML
        baseSoup = BeautifulSoup(baseHTML, 'html.parser')

        # Find all anchor tags that contain category information
        categoryAnchors = baseSoup.find_all('a', class_='category-title-link')

        # Append each hyperlink reference to the base URL to get the category page URLs
        categoryPageURLs = [baseURL + anchor['href'] for anchor in categoryAnchors]

        # First loop: iterate over all categories
        for categoryURL in categoryPageURLs:
            # Access the category webpage
            self.driver.get(categoryURL)

            # Load the entire webpage by scrolling until the page height stops growing
            lastHeight = self.driver.execute_script("return document.body.scrollHeight")
            while True:
                # Scroll to the bottom of the page
                self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

                # Wait for the new page segment to load
                time.sleep(0.5)

                # Compare the new scroll height with the last scroll height
                newHeight = self.driver.execute_script("return document.body.scrollHeight")
                if newHeight == lastHeight:
                    break
                lastHeight = newHeight


# Generate category soup object
categoryHTML = self.driver.page_source
categorySoup = BeautifulSoup(categoryHTML, 'html.parser')

# Find all anchor objects that contain topic information
topicAnchors = categorySoup.find_all('a', class_='title raw-link raw-topic-link')

# Get hyperlink references and append it to the base URL to get the topic page URLs
topicPageURLs = []
for i in range(len(topicAnchors)):
href = topicAnchors[i]['href']
topicPageURLs.append(baseURL + href)


            # Second loop: iterate over all topics in the category
            for topicURL in topicPageURLs:
                # Get the topic HTML text and generate the topic soup object
                self.driver.get(topicURL)
                topicHTML = self.driver.page_source
                topicSoup = BeautifulSoup(topicHTML, 'html.parser')

                # Scrape all topic attributes of interest
                topicTitle = self.get_title(topicSoup)
                category, tags = self.get_category_and_tags(topicSoup)
                leadingComment, otherComments = self.get_comments(topicSoup)
                numLikes = self.get_likes(topicSoup)
                numViews = self.get_views(topicSoup)

                # Create an attribute dictionary for the topic
                attributeDict = {
                    'Topic Title': topicTitle,
                    'Category': category,
                    'Tags': tags,
                    'Leading Comment': leadingComment,
                    'Other Comments': otherComments,
                    'Likes': numLikes,
                    'Views': numViews}

                # Add the new entry to the topic dictionary and the Pandas dataframe
                # (DataFrame.append was removed in pandas 2.0, so concatenate instead)
                self.topicDict[topicTitle] = attributeDict
                self.topicDataframe = pd.concat(
                    [self.topicDataframe, pd.DataFrame([attributeDict])],
                    ignore_index=True)

                # Debug output: show the scraped attributes for each topic
                print('Topic Title:', topicTitle)
                print('Category:', category)
                print('Tags:', tags)
                print('Leading Comment:', leadingComment)
                print('Other Comments:', otherComments)
                print('Likes:', numLikes)
                print('Views:', numViews)


        # Get a unique timestamp for this scraping run
        timeStamp = datetime.now().strftime('%Y%m%d%H%M%S')

        # Save the data as a CSV file in the same folder as this program
        csvFilename = 'Codeacademy' + timeStamp + '.csv'
        csvFileFullPath = os.path.join(os.path.dirname(os.path.abspath(__file__)), csvFilename)
        self.topicDataframe.to_csv(csvFileFullPath)



if __name__ == '__main__':
    # Local path to the Chrome webdriver executable (raw string so the
    # backslashes are not treated as escape sequences)
    webdriverPath = r'C:\Program Files (x86)\chromedriver.exe'

    # Codeacademy Discuss forum base URL
    baseURL = 'https://discuss.codecademy.com/'

    # Create the Codeacademy webscraper object
    codeacademyWebscraper = CodeacademyWebscraper(webdriverPath)

    # Run the webscraper and save the data
    codeacademyWebscraper.runApplication(baseURL)
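
For reviewers, a minimal sketch of how the timestamped CSV written by runApplication (such as the Codeacademy20210102131605.csv added in this commit) could be loaded for the data-cleaning and EDA step. The column names come from topicDataframe above; the literal_eval round-trip is an assumption about how pandas serializes the list-valued columns to CSV.

import ast
import pandas as pd

# Load a scraped CSV produced by runApplication (the file name carries the run's timestamp)
df = pd.read_csv('Codeacademy20210102131605.csv', index_col=0)

# Quick sanity checks on the scraped topics
print(df.shape)
print(df[['Topic Title', 'Category', 'Likes', 'Views']].head())

# 'Tags' and 'Other Comments' hold Python lists, which to_csv stores as their
# string representations; literal_eval converts them back to lists
# (an assumption about the serialized format, not part of the original script)
df['Tags'] = df['Tags'].apply(ast.literal_eval)
df['Other Comments'] = df['Other Comments'].apply(ast.literal_eval)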