Python-based cross-platform tool for mining text data (html, transcript, problems) of edX MOOCs on a user's dashboard. It is an extension of edx-dl.

TokyoTechX-TAs/web-crawler

edx-crawler

edx-crawler is a Python-based cross-platform tool for mining text data from the edX and edX Edge courses that a user is enrolled in and that appear on the user's dashboard. It was developed by teaching assistants at the Tokyo Tech Online Education Development Office as an extension of edx-dl.

Prerequisites

Python libraries and modules:

  • Python - version 3.5+
  • beautifulsoup - a Python library for pulling data out of HTML and XML files
  • webvtt-py - a Python module for reading/writing WebVTT caption files
  • youtube-dl - command-line program to download videos from YouTube.com
  • ffmpeg-python - a Python wrapper around ffmpeg, used for analyzing video (MPEG) files

Multimedia framework:

  • ffmpeg - a command-line program to record, convert and stream audio and video
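
Before running the crawler, it can help to confirm that the prerequisites above are present. The following is a minimal sketch, not part of edx-crawler itself: the function name `check_prerequisites` is hypothetical, and the import names (`bs4`, `webvtt`, `youtube_dl`, `ffmpeg`) are the ones these packages normally install under, which may differ from the names used in this list.

```python
import importlib.util
import shutil
import sys

# Assumed import names for the packages listed above
# (the import name can differ from the package name).
REQUIRED_MODULES = {
    "beautifulsoup": "bs4",
    "webvtt-py": "webvtt",
    "youtube-dl": "youtube_dl",
    "ffmpeg-python": "ffmpeg",
}


def check_prerequisites():
    """Return a dict mapping each prerequisite to True (found) or False (missing)."""
    status = {"python>=3.5": sys.version_info >= (3, 5)}
    for package, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module cannot be located.
        status[package] = importlib.util.find_spec(module) is not None
    # The ffmpeg binary itself must be on PATH, separately from ffmpeg-python.
    status["ffmpeg (binary)"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("{:<16} {}".format(name, "ok" if ok else "MISSING"))
```

The check only reports whether each module can be located; it does not verify versions other than that of Python.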

How to run

Run the Python script edx_crawler.py, passing the edX course link (-url), your username (-u) and your password (-p) as parameters:

python edx_crawler.py -url [course_url] -u [edx_user_name] -p [edx_user_password]

OPTIONS

-url, --course-urls		Specify the target course URLs, as listed on the edX dashboard
-u, --username			Specify your edX username (email)
-p, --password			Input your edX password
-d, --html-dir			Specify directory to store data
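
For readers who want to see how these options fit together, here is a minimal argparse sketch that mirrors the table above. It is an illustration only and not the code used by edx_crawler.py: the function name `build_parser`, the choice of which options are required, the acceptance of several URLs at once, and the absence of a default for --html-dir are all assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented options (not the tool's own code)."""
    parser = argparse.ArgumentParser(prog="edx_crawler.py")
    parser.add_argument(
        "-url", "--course-urls",
        dest="course_urls",
        nargs="+",          # assumption: one or more URLs may be given
        required=True,
        help="target course URLs, as listed on the edX dashboard",
    )
    parser.add_argument("-u", "--username", required=True,
                        help="edX username (email)")
    parser.add_argument("-p", "--password", required=True,
                        help="edX password")
    parser.add_argument("-d", "--html-dir", dest="html_dir", default=None,
                        help="directory to store data (default is an assumption)")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv, for demonstration.
    demo = build_parser().parse_args(
        ["-url", "https://example.org/course", "-u", "me@example.org", "-p", "secret"]
    )
    print(demo.course_urls, demo.username, demo.html_dir)
```

The URL and credentials in the demonstration are placeholders.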

The output is stored in JSON files as follows:

  • all text components -> all_textcomp.json
  • all problem components -> all_probcomp.json
  • all video components -> all_videocomp.json
  • all components (text, quizzes, videos) -> all_comp.json
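
The files listed above can be read back with Python's standard json module. The sketch below is a hedged example: the helper name `load_outputs` is hypothetical, the directory holding the files is assumed to be passed in by the caller, and nothing is assumed about the internal structure of each file.

```python
import json
import os

# File names taken from the list above.
OUTPUT_FILES = [
    "all_textcomp.json",
    "all_probcomp.json",
    "all_videocomp.json",
    "all_comp.json",
]


def load_outputs(directory):
    """Load whichever of the documented output files exist in `directory`."""
    loaded = {}
    for name in OUTPUT_FILES:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8") as handle:
                loaded[name] = json.load(handle)
    return loaded


if __name__ == "__main__":
    # Assumption: the output files are in the current directory.
    for name, data in load_outputs(".").items():
        print(name, type(data).__name__)
```

Files that are missing are skipped, so the same helper works after a partial run.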

The raw HTML files corresponding to each Unit are backed up in sourcefile.tar.gz
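
To inspect the backup without unpacking it, the archive can be opened with Python's standard tarfile module. This is a small sketch under the assumption that sourcefile.tar.gz is an ordinary gzip-compressed tar archive; the helper name `list_backup_members` is hypothetical.

```python
import tarfile


def list_backup_members(archive_path="sourcefile.tar.gz"):
    """Return the names of the entries stored in the gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Listing the entries first is a cautious way to see what the archive contains before extracting anything.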

Extra files and folders

transcript_error_report.txt lists the videos whose transcripts are not provided by edX or YouTube.
