Problem statement: Scrape the data from the Techolution Careers website and store it, ordered by posting date (oldest first), as a DataFrame saved to CSV.
Website URL: https://techolution.app.param.ai/jobs/
To solve this problem the steps were:
1) Opened the website URL.
2) While inspecting the website I found that it sent three requests; one of them queried the job type, description, and location.
3) We copied this link: https://techolution.app.param.ai/api/career/get_job/?query=&locations=&category=&job_types= and also noted that the content type was JSON.
4) To extract information from this endpoint we used the requests Python package.
5) We loaded the response body as JSON using json.loads.
import requests
import json
import pandas as pd

# Fetch the job listings from the careers API endpoint found while inspecting the site
r = requests.get('https://techolution.app.param.ai/api/career/get_job/?query=&locations=&category=&job_types=')
# Decode the response body and parse it into a Python dict
j = json.loads(r.content.decode('UTF-8'))
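As a quick sanity check (a small sketch, not part of the original steps), we can confirm that the request succeeded and that the server really does return JSON, matching what we saw while inspecting the network requests:

# Confirm the request succeeded and the response is JSON
print(r.status_code)                     # expected: 200
print(r.headers.get('Content-Type'))     # expected to mention 'application/json'
# requests can also parse the body directly, equivalent to json.loads above
print(type(r.json()))                    # expected: <class 'dict'>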
We saved the JSON response to a file named data_file.json:
# Dump the parsed response to disk for later inspection
with open("data_file.json", "w") as write_file:
    json.dump(j, write_file)
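As an optional check (a sketch, assuming data_file.json was written to the working directory), the file can be read back to confirm it round-trips:

# Read the dump back and verify it has the same top-level keys
with open("data_file.json") as read_file:
    j_check = json.load(read_file)
assert j_check.keys() == j.keys()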
# Iterating over the dict yields its top-level keys
for row in j:
    print(row)
fil_locations
data
fil_job_types
fil_category
query_str
total_jobs
In the JSON response we found six keys, as shown above. On examining them, we observed that the data key alone is sufficient to give all the job-related information.
The structure of the data key is:
- data
  - categories (one key per job category)
    - jobs (list of job postings)
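To see this structure concretely, a small exploratory snippet (a sketch that mirrors the access pattern used in the extraction loop below) prints each category under data together with the number of jobs it holds and the fields available per job:

# Each key under 'data' is a job category; each category holds a 'jobs' list
for category in j['data']:
    jobs = j['data'][category]['jobs']
    print(category, len(jobs))
    if jobs:
        # Field names available for every job posting
        print(list(jobs[0].keys()))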
Inside jobs we can find the information related to each job. We store every job in a list arr; since locations is an array, we convert it into a string (keeping its first entry).
arr = []
for i in j['data']:
    # Each key under 'data' is a category holding a list of job postings
    for obj in j['data'][i]['jobs']:
        # 'locations' is a list; keep its first entry as a plain string
        locations = obj['locations']
        obj['locations'] = locations[0]
        arr.append(obj)
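The loop above keeps only the first location of each posting. If a job ever lists several locations (or none), a slightly more defensive variant (a sketch, assuming every entry in locations is a plain string, as the table further below suggests) joins them into one comma-separated string:

# Alternative to the loop above: keep every location, not just the first
arr = []
for i in j['data']:
    for obj in j['data'][i]['jobs']:
        # Join all listed locations into a single string; an empty list becomes ''
        obj['locations'] = ', '.join(obj['locations'])
        arr.append(obj)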
We collected the column names in a list obj and created a DataFrame from arr using those columns.
# Take the column names from the first collected job record
obj = []
for col in arr[0]:
    obj.append(col)
df = pd.DataFrame(arr, columns=obj)
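Since arr is already a list of dicts, pandas can also infer the column names on its own; the following one-liner (an equivalent shortcut, not what was used above) builds the same DataFrame:

# pandas derives the columns from the keys of each record in arr
df = pd.DataFrame(arr)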
We need to sort the DataFrame by posting date, so we first convert the created_at column with pandas to_datetime and then sort the DataFrame on that column.
# Parse the posting date and sort ascending so the oldest posting comes first
df['created_at'] = pd.to_datetime(df['created_at'])
df = df.sort_values('created_at')
df.head()
| | id | title | req_id | slug | created_at | locations | description | job_type | min_exp | max_exp | added_by | added_by_email | category | business_unit_name | organization_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | a933f8e2-dd5d-4a82-ab10-1209634d31c7 | Engineering Lead | 1861 | engineering-lead | 2019-02-08T13:11:49.966886Z | mauritius | <p><strong style="color: rgb(0, 0, 0); backgro... | Full-time | 84 | 216 | Rekha Allam | [email protected] | Information Technology | Cloud Automation - Mauritius | Techolution Mauritius |
| 19 | 4e17f47b-7916-411a-b0cd-90d5eeb6346f | DevOps Architect | 1873 | devops-architect | 2019-02-11T12:00:25.061831Z | Hyderabad | <p><span style="color: rgb(0, 0, 0); backgroun... | Full-time | 60 | 180 | Nikhil Shekhar | [email protected] | Information Technology | Cloud Automation - India | Techolution Pvt Ltd |
| 26 | 4e641217-901a-4670-886e-dd2946bf5476 | Machine Learning Engineer | 1898 | machine-learning-engineer | 2019-02-14T16:13:38.000894Z | Hyderabad | <p><strong style="color: rgb(51, 51, 51);">Tit... | Full-time | 36 | 60 | Madhav Kommineni | [email protected] | Facial recognition | FaceOpen | Techolution LLC |
| 18 | d4847f54-7a0a-44dd-b3a6-b93c7fb3cb7d | Sr SDET | 1903 | sr-sdet | 2019-02-14T16:38:50.411436Z | New York | <p>Techolution is a premier cloud, user interf... | Full-time | 36 | 120 | Satish Kumar | [email protected] | Information Technology | UI/UX Modernization - US | Techolution LLC |
| 17 | c4daf0d7-f86f-4d86-b62e-ab9117ba2800 | OSS DevOps Engineer | 1905 | oss-devops-engineer | 2019-02-14T16:55:20.844881Z | Hyderabad | <p><strong>Title : OSS DevOps Engineer</s... | Full-time | 72 | 144 | Pavan Kumar | [email protected] | Information Technology | Cloud Automation - India | Techolution Pvt Ltd |
# Write the sorted DataFrame to CSV without the index column
df.to_csv("jobfile.csv", encoding='utf-8', index=False)
We save the DataFrame as jobfile.csv; this is the required output file.
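As a final optional check (a sketch, assuming jobfile.csv sits in the working directory), the file can be read back to confirm the rows really are ordered oldest first:

# Read the CSV back and verify ascending posting-date order
check = pd.read_csv("jobfile.csv", parse_dates=['created_at'])
assert check['created_at'].is_monotonic_increasing
print(check[['title', 'created_at']].head())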