Web Scraping with Python

50 lines of Python to scrape the jobs of your preference

“It is a capital mistake to theorize before one has data.”

                                                                                   – Arthur Conan Doyle

Web Scraping, Web Data Extraction, and Web Harvesting are common terms for the extraction of large amounts of data from websites, whereby the data is extracted and saved to a local file on your computer or to a database.

Web scraping automatically extracts data and presents it in a format you can easily make sense of (e.g. CSV, JSON, XML). In this tutorial, we’ll focus on one application: scraping Android developer jobs from the job-posting site Glassdoor. Web scraping can be used in a wide variety of situations:

  • E-commerce retailers and marketplaces use web scraping to monitor competitor prices and improve their own product attributes. They also collect product reviews for sentiment analysis.
  • Lawyers use web scraping to look up past judgment reports for case references.
  • Lead generation companies use it to scrape email addresses and phone numbers.
  • Recruiters use it to collect people’s profiles.
  • Travel companies collect data in real time to provide live tracking details.
  • Media companies collect trending topics and use hashtags to gather information from social media profiles.


Every business faces competition these days, so companies regularly scrape their competitors’ information to monitor their movements.

Let’s get started

We are going to use Python as our scraping language, together with a simple and powerful Python library, BeautifulSoup.

To get the most out of this post, you need basic knowledge of the following:

  • Installation of Python 3
  • Installation of BeautifulSoup
  • Basic understanding of the syntax of an HTML webpage.
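If BeautifulSoup is not installed yet, it can usually be added with pip, along with the lxml parser and pandas, which we also use below (e.g. pip install beautifulsoup4 lxml pandas).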

Inspecting the Glassdoor page:

Let’s take one page from the Glassdoor website as an example.

[Image: a Glassdoor job search results page]

From here, we need to scrape:

  • Job Title
  • Company Name
  • Salary
  • Job Location

[Image: the job data fields on the page]

Then we’ll transform all the data into CSV format.
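Before diving into the real script, here is a minimal sketch of how BeautifulSoup pulls such fields out of a listing-like HTML fragment. The markup below is invented for illustration; Glassdoor’s real markup is more complex and changes over time, but the class names mirror the ones the scraper targets later.

from bs4 import BeautifulSoup

# Invented, simplified HTML fragment for illustration only
html = """
<ul class="jlGrid hover">
  <li>
    <div class="flexbox"><a>Android Developer</a></div>
    <div class="flexbox empLoc">Acme Corp <span class="subtle loc">Boston, MA</span></div>
    <span class="green small">$80k-$110k</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, 'lxml')
for job in soup.find('ul', {'class': 'jlGrid hover'}).find_all('li'):
    title = job.find('div', {'class': 'flexbox'}).find('a').get_text()
    company = job.find('div', {'class': 'flexbox empLoc'}).find(text=True).strip()
    location = job.find('span', {'class': 'subtle loc'}).get_text()
    print(title, '|', company, '|', location)
# Android Developer | Acme Corp | Boston, MA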

Time to work with the actual code:

Now we know what to scrape and where to scrape it from. Open a text editor of your choice.

First, we need to import all the libraries that we are going to use.

# import libraries
import urllib.request
from bs4 import BeautifulSoup as bs
import pandas as pd
import re

Next, declare a variable for the URL of the page.

# Specify the url
list_url = ["https://www.glassdoor.co.in/Job/boston-android-developer-jobs-SRCH_IL.0,6_IC1154532_KO7,24_IP2.htm"]
# Here we go with only one url, but if you want to scrape multiple pages, put all the urls within "[]"

# Base url of glassdoor
base_url = 'https://www.glassdoor.co.in'

# List to store all the links found on the page
link = []
# Lists to store the scraped fields
Job_Type = []
CompanyName = []
Company_Location = []
Salary_paid = []

To fetch the HTML page at the given URL, use urllib; to parse it, use BeautifulSoup (a Python package for parsing HTML and XML documents).

for url in list_url:
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    response = urllib.request.urlopen(req).read()
    data = bs(response, 'lxml')

# try print(data) to see what you get
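One note of caution that the loop above skips: when fetching several pages, it is polite (and less likely to get you blocked) to pause between requests and to handle network errors. Here is a minimal variant of the same loop, using only the standard library:

import time
import urllib.error

for url in list_url:
    try:
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        response = urllib.request.urlopen(req).read()
    except urllib.error.URLError as e:
        print('Failed to fetch', url, ':', e)
        continue                 # skip pages that fail to load
    data = bs(response, 'lxml')
    time.sleep(2)                # pause between requests to be polite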

Now we have a variable, data, containing the HTML of the page. Here’s where we can start coding the part that extracts all the hrefs (the href attribute indicates a link’s destination) from the pagination controls.

# Collect the pagination links from the page
new = data.find('div', {'class': 'pagingControls cell middle'})
for data1 in new.find_all('li'):
    for new1 in data1.find_all('a'):
        link.append(new1.get('href'))
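Note that these hrefs are relative paths, which is why each one is prefixed with base_url before being fetched in the next step.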
     
# Start collecting data from each link
for m in link:
    newurl = base_url + m
    data2 = urllib.request.Request(newurl, headers={'User-Agent': 'Mozilla/5.0'})
    response2 = urllib.request.urlopen(data2).read()
    data4 = bs(response2, 'lxml')

    # Each job listing is an <li> in the results list
    jobType = data4.find('ul', {'class': 'jlGrid hover'})
    for jobs in jobType.find_all('li'):

        # Job title
        developer = jobs.find('div', {'class': 'flexbox'})
        Job_Text = developer.find('a').find(text=True)

        # Company name and location
        place = jobs.find('div', {'class': 'flexbox empLoc'})
        Company_Name = place.find(text=True)

        loc = place.find('span', {'class': 'subtle loc'})
        True_Location = loc.find(text=True)

        # Salary: pull out the "$..." tokens and join them
        salary = jobs.find('span', {'class': 'green small'})
        pay = str(salary)
        payscale = re.findall(r'\$\w+', pay)   # note the escaped "$"
        dataSalary = '-'.join(payscale)

        Job_Type.append(Job_Text)
        CompanyName.append(Company_Name)
        Salary_paid.append(dataSalary)
        Company_Location.append(True_Location)
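To see what the salary regex actually does, here is a quick standalone check (the sample string is made up for illustration):

import re

pay = '<span class="green small">$80k-$110k (Glassdoor est.)</span>'   # made-up sample
payscale = re.findall(r'\$\w+', pay)
print(payscale)             # ['$80k', '$110k']
print('-'.join(payscale))   # $80k-$110k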

Now that we have the data, it is time to save it. Create a CSV file containing all the data.

data = pd.DataFrame(Job_Type, columns=['Job_Title'])
data['ComName'] = CompanyName
data['Sal_Paid'] = Salary_paid
data['Com_location'] = Company_Location
data.to_csv('glassdoor_jobs.csv', index=False)   # file name is your choice
data

[Image: the resulting data]

In this tutorial, we scraped Glassdoor.com, one of the fastest-growing job recruiting sites. The scraper extracts the job title, company name, location, and salary for a given job name in a given location. You can extend it further with more columns, for example scraping jobs based on the number of miles from a particular city, the qualifications required for the job, and more. Give it a try…

GitHub link for this post: AiVentureO

Limitations

This scraper should work for extracting most job listings on Glassdoor unless the website structure changes drastically. If the code stops working for you, the website has probably changed its page structure. But if, instead of just copying the code, you learn the working structure behind it and how it really works, you will be able to tackle this sort of problem easily.
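For example, every find() call in the loop assumes the element exists; if Glassdoor renames a class, find() returns None and the next call on it raises an AttributeError. A slightly defensive variant of the first lookup (illustrative only) makes such breakage easier to diagnose:

jobType = data4.find('ul', {'class': 'jlGrid hover'})
if jobType is None:
    # The results list was not found - the page structure has probably changed
    raise RuntimeError('Could not find the job results list on ' + newurl)
for jobs in jobType.find_all('li'):
    developer = jobs.find('div', {'class': 'flexbox'})
    if developer is None or developer.find('a') is None:
        continue   # skip listings that do not match the expected markup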

Disclaimer: Any code provided in this tutorial is for illustration and learning purposes only. We are not responsible for how it is used and assume no liability for any detrimental usage of the source code.

About the author

Vikram Singh

Founder of Ai Venture,
an artificial intelligence specialist who teaches developers how to get results with modern AI methods via hands-on tutorials.
GANs are my favorite.
