CompassCrawler-Reina-UC3M

Description

This project is under a bigger one the COMPASS project, aimed to represent learning opportunities with integrated learning outcomes/competence information. For universities around the world.

This little sub-project covers the UC3M learning opportunity specification, providing a tool to extract the courses information from the UC3M courses web pages, and exporting it to COMPASS information model.

Technically is a web crawler with conversion to COMPASS JSON.

Installation

This project can be launch without install, you just need Python(3.6) and Scrapy (1.5). With both installed you are ready to rock.

Usage

Go to CompassCrawler\course_scraper and run the python script main.py (python main.py). You'll be asked to insert the UC3M bachelors home pages (note is the bachelor page, not the course page), to extract the information from their courses as shown below. Once you finish, insert a double enter and the program will start.

$ python main.py
Welcome to COMPASS UC3M courses parser.
This tool helps you to extract the information from the UC3M website and export it to COMPASS JSON format.

Homepages of the bachelors you want the information, divided by a new line:

https://www.uc3m.es/ss/Satellite/Grado/en/Detalle/Estudio_C/1371212345976/1371212987094/Bachelor_s_Degree_in_Telecommunication_Technologies_Engineering
https://www.uc3m.es/ss/Satellite/Grado/en/Detalle/Estudio_C/1371212485394/1371212987094/Bachelor_s_Degree_in_Communication_System_Engineering

Extracting: https://www.uc3m.es/ss/Satellite/Grado/en/Detalle/Estudio_C/1371212345976/1371212987094/Bachelor_s_Degree_in_Telecommunication_Technologies_Engineering

Information extracted
Stored at: C:\Users\CURRENT_USER\...\course_scraper\Courses

All information will be exported in COMPASS JSON format to the default output folder course_scraper\Courses, with one JSON for each extracted course.

File format

File name

The generated file names are the name of the bachelor or master under-slash UC3M course id with JSON extension.

Fields

The fields from the UC3M system often needs translation, in order to be compatible with COMPASS. As shown below you can check the conversions.

COMPASS Field	UC3M Field	Notes
Title	Course name
Description	Qualification
Identifier		Generates a ID: university-courseID-degreeType_degreeAcronym Ex. for UC3M 205 Bachelor in Computer Science : UC3M-205-b_CS.
Publisher		Always "UC3M".
Creator	Coordinator
Competence	Prerequisite
Type		Always "Talk/Lecture" + "Class/Group based" as all courses in UC3M follows same pattern.
Level	UC3M Bachelor/Master
URL	UC3M Reina URL
Language		The text language of qualification.
Credits	Credits	Defined as double + "ECTS" Ex. "6.0 ECTS".
assessment	assessment

Fields not contemplated in the table are not implemented in the current script.

Contributing

You can modify this project as you want, if you want your changes to be applied to the master branch just make your improvements (well documented, good comments and descriptive variable names) and ask for merge.

Project structure

To make easier yor life you should learn a bit of Scrapy before get you hands dirty. Also this section is intended to help you understand the project and modify the code as you need.
Scrapy is a popular crawling library allowing easy web information extraction, this is possible just customizing a couple of files items.py, pipelines.py and the spiders beneath spiders folder in this case spider_courses.py.

items.py defines the items fields, an item is a information instance extracted from a page, in our case represents a course an its information.
spider_courses defines the spider which is going to fetch the information from a given page. It has to locate the information in the HTML structure of the page and store the information into a scrapy item instance. The locating of the information is archieved using XPath in code defined by selectors. In this case the spider also format (lightly) some of the values because XPath is not precise enough for the current page structure.
pipelines.py defines the processing over the extracted items, in this case it generates the COMPASS JSON using Utils.CourseToJSON a custom utility intended to translate the UC3M information to COMPASS model and store it in JSON files to later be added to COMPASS service. The supported fields are collected in the table above beneath the section files. All the conversions are explained in code.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
course_scraper		course_scraper
.gitignore		.gitignore
README.md		README.md
scrapy.cfg		scrapy.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CompassCrawler-Reina-UC3M

Description

Table of contents

Installation

Usage

File format

File name

Fields

Contributing

Project structure

License

About

Releases

Packages

Languages

JorgeFrias/CompassCrawler-Reina-UC3M

Folders and files

Latest commit

History

Repository files navigation

CompassCrawler-Reina-UC3M

Description

Table of contents

Installation

Usage

File format

File name

Fields

Contributing

Project structure

License

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages