Project 3 - Web Scraping LinkedIn.com

Background

LinkedIn.com presents its job postings in a well-structured format, which makes them a good candidate for web scraping. This is a demonstration of how to scrape the job postings page on LinkedIn.com using Python. We'll pull the first 1,000 job postings, since that is the maximum number of results the site will load for a search. Let's load the libraries we'll use.

In [ ]:
import pandas as pd
import bs4
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Using the WebDriver from the Selenium Library

First, we need a way to automate the browsing itself so that we can scrape each page of results; doing this by hand would take quite some time. Let's store our driver as an object.

In [ ]:
#Point Selenium at the locally downloaded ChromeDriver binary
browser = webdriver.Chrome(executable_path=r"/Users/chesterpoon/chromedriver")
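
If you are on a newer version of Selenium (4 and later), the executable_path argument has been phased out in favor of a Service object. A minimal sketch of the equivalent setup, assuming the same driver location as above:

from selenium.webdriver.chrome.service import Service

#Selenium 4 style: wrap the driver path in a Service object
browser = webdriver.Chrome(service=Service(r"/Users/chesterpoon/chromedriver"))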

With the next few lines of code, we will automatically log in to LinkedIn.com and navigate to the jobs page, where we'll instruct Selenium to search for "data scientist". To do this, we'll take the following steps:

  • Find the username element and enter a username
  • Find the password element and enter the password
  • Find the "login" button and click it
  • Find the jobs icon on the homepage and click it
  • Find the search box and enter "data scientist" in the field
  • Find the "Search" button and click it to initiate the search

The code below starts this process. You may notice a while loop added to the code. On some runs an error would be thrown saying the search box element could not be found, yet running the exact same code a second time worked; the most likely explanation is timing, with the lookup firing before the element had finished rendering. The while loop simply keeps retrying until no error is thrown. LinkedIn's job postings page has a couple of these flaky elements, so any while loop of this kind in the code is there to handle those errors. (A sketch of a more robust alternative using Selenium's explicit waits follows the code below.)

In [ ]:
#Log in to LinkedIn
browser.get('https://www.linkedin.com/')
log_name = browser.find_element_by_id('login-email')
log_name.send_keys('username@email.com')
log_pass = browser.find_element_by_id('login-password')
log_pass.send_keys('password')
login_but = browser.find_element_by_id('login-submit')
login_but.click()

#Navigate to the jobs page
jobs_link = browser.find_element_by_id('jobs-tab-icon')
jobs_link.click()

#Keep retrying until the search box has rendered
while True:
    try:
        search_jobs = browser.find_element_by_css_selector('input[id*=jobs-search-box-keyword-id-ember]')
    except:
        continue
    break

#Search for "data scientist" and submit the search
search_jobs = browser.find_element_by_css_selector('input[id*=jobs-search-box-keyword-id-ember]')
search_jobs.send_keys('data scientist')
search_go = browser.find_element_by_css_selector(
    'button.jobs-search-box__submit-button.button-secondary-large-inverse')
search_go.click()
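
As an aside, a more robust way to handle these flaky elements than a bare retry loop is Selenium's explicit wait, which polls for a condition and gives up cleanly after a timeout instead of looping forever. A minimal sketch, assuming the same CSS selector used above:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

#Wait up to 30 seconds for the search box to render, then return the element
search_jobs = WebDriverWait(browser, 30).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'input[id*=jobs-search-box-keyword-id-ember]')))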

Scraping the Data from Each Posting

Now comes the bulk of the work where we will need to iterate through each posting to gather the data. Here's how we will tackle this problem:

  • We'll create some empty vectors (lists in 'Python talk') to store the values we gather, so we can set up our data frame later.
  • We'll wrap all of the steps below in a loop that runs 40 times (40 pages of roughly 25 postings each) so we don't go over the 1,000-posting limit.
    • Because the job postings page is rendered with JavaScript, we need to force every posting on the page to load, so we automate scrolling to the bottom of the screen.
    • We then use the Selenium and Beautiful Soup libraries to grab the page source and parse the HTML.
    • Once the HTML is parsed, we loop through each posting to extract its list of required skills, assigning a job id, the job's primary industry, and the job title to each skill.
  • We'll then find the button element for the next page, click it, and repeat.
  • Throughout these steps, we'll build in "sleep" calls to give the browser time to render the HTML.

In [ ]:
skill_list = []
title_list = []
industry_list = []
job_id = []

def pg_dwn():
    #Page down in the search results pane, retrying until the pane can be found
    while True:
        try:
            browser.find_element_by_css_selector(
                'div[class*=jobs-search-results--is-two-pane]').send_keys(Keys.PAGE_DOWN)
        except:
            continue
        break

#Counter used to build a job id that stays unique across pages
j_id = 0

#This for-loop iterates the navigation of pages
for p in range(2,42):
    #Scroll down repeatedly so every posting on the page renders
    for i in range(25):
        pg_dwn()

    url = browser.current_url
    source = browser.page_source
    html = bs4.BeautifulSoup(source, "lxml")

    while True:
        try:
            #Find all the jobs
            job = browser.find_elements_by_class_name('job-card-search__title-line')
            jtitle1 = html.find_all(
                attrs={"class": "truncate-multiline--last-line-wrapper"})
            jtitle2 = html.find_all(
                attrs={"class": "job-card-search__title lt-line-clamp lt-line-clamp--multi-line ember-view"})
        except:
            continue
        break
    
    #Next few lines will combine promoted job ads and regular ones
    title1 = []
    title2 = []

    for a in range(len(jtitle1)):
        title1.append(jtitle1[a].getText())
    for b in range(len(jtitle2)):
        title2.append(jtitle2[b].getText())
    
    jtitle = title1 + title2
    
    #This code below clicks through each job ad and stores the information scraped.
    for i in range(len(job)-1):
        job[i].click()
        time.sleep(2)
        url = browser.current_url
        source = browser.page_source
        html = bs4.BeautifulSoup(source, "lxml")
        skills = html.find_all(attrs={"class": "jobs-ppc-criteria__value t-14 t-black t-normal ml2 block"})
        industry = html.find_all(attrs={"class": "jobs-box__list-item jobs-description-details__list-item"})
        j_id = j_id + 1
        
        #Store one row per skill, repeating the job id, title, and industry
        for j in range(len(skills)):
            s = skills[j].getText()
            t = jtitle[i]
            ind = industry[0].getText()
            skill_list.append(s)
            title_list.append(t)
            industry_list.append(ind)
            job_id.append("LI" + str(j_id))
    
    if p == 41:
        print("Last page complete")
        break
    
    #Find the button for the next page, falling back to the "…" button if the page number isn't shown
    try:
        page = browser.find_elements_by_xpath('//button/span[text()="' + str(p) +'"]')
    except:
        page = browser.find_elements_by_xpath('//button/span[text()="…"]')
    
    #The xpath can match more than one element, so try the second match before falling back to the first
    try:
        page[1].click()
    except:
        page[0].click()
    time.sleep(2)
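
As a side note, an alternative to the pg_dwn approach of sending page-down keystrokes is to scroll the results pane directly with JavaScript. A minimal sketch, assuming the same results-pane selector used above and that 25 scrolls are enough to render every posting on a page:

#Locate the results pane and scroll it to the bottom repeatedly,
#pausing briefly so lazily loaded postings have time to render
pane = browser.find_element_by_css_selector('div[class*=jobs-search-results--is-two-pane]')
for _ in range(25):
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollHeight;", pane)
    time.sleep(0.5)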

Loading Our Data into a Data Frame

Now that we have gathered our data, we can load it into a data frame and export it as a CSV file to prepare for the data cleaning process in R.

In [ ]:
skillsdf = pd.DataFrame(
    {'job_id': job_id,
     'skills': skill_list, 
     'title': title_list, 
     "industry": industry_list})
skillsdf.to_csv('skills.csv', index=False)
In [ ]:
skillsdf
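
As an optional sanity check before handing the file off to R, one could count how many skill rows were captured per posting; a short sketch using pandas:

#Count the number of skill rows captured for each job id
skillsdf.groupby('job_id').size().sort_values(ascending=False).head(10)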