LinkedIn.com presents its job postings in a consistent, well-structured format, which makes it a good candidate for web scraping. This is a demonstration of how to scrape the job postings page on LinkedIn.com using Python. We'll pull the first 1,000 job postings, since that's the maximum number of results the site will load. Let's load the libraries we'll use.
import pandas as pd
import bs4
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
First we need a way to automate the browsing itself so we can scrape each page of results; doing this by hand would take quite some time. Let's store our driver as an object.
browser = webdriver.Chrome(executable_path=r"/Users/chesterpoon/chromedriver")
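If you'd rather not watch the browser window while the script runs, Chrome can also be started with extra options. This is a minimal sketch, assuming the same chromedriver path as above; headless mode is optional and the rest of the walkthrough works either way.
# Sketch only: start Chrome with options (headless mode shown here).
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # run without opening a visible browser window
browser = webdriver.Chrome(
    executable_path=r"/Users/chesterpoon/chromedriver", options=options)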
With the next few lines of code, we will automatically log in to LinkedIn.com and navigate to the jobs page, where we'll instruct Selenium to search for "data scientist". To do this, we load the home page, fill in the username and password fields, click the login button, click the Jobs tab, type the search term into the keyword box, and submit the search.
The code below starts this process. You may notice a while loop in the code. Occasionally an error was thrown saying the search box element could not be found, most likely because the page's JavaScript had not finished rendering the element when Selenium looked for it; running the same code again without any changes worked. The while loop simply keeps retrying until no error is thrown. LinkedIn's job posting pages have a couple of these timing issues, so every while loop of this form in the code is there to absorb those errors. (A cleaner alternative using Selenium's explicit waits is sketched after this code block.)
browser.get('https://www.linkedin.com/')

# Fill in the login form and submit it
log_name = browser.find_element_by_id('login-email')
log_name.send_keys('username@email.com')
log_pass = browser.find_element_by_id('login-password')
log_pass.send_keys('password')
login_but = browser.find_element_by_id('login-submit')
login_but.click()

# Navigate to the Jobs tab
jobs_link = browser.find_element_by_id('jobs-tab-icon')
jobs_link.click()

# Keep retrying until the search box has rendered
while True:
    try:
        search_jobs = browser.find_element_by_css_selector(
            'input[id*=jobs-search-box-keyword-id-ember]')
    except:
        continue
    break

# Search for "data scientist" and submit the search
search_jobs = browser.find_element_by_css_selector(
    'input[id*=jobs-search-box-keyword-id-ember]')
search_jobs.send_keys('data scientist')
search_go = browser.find_element_by_css_selector(
    'button.jobs-search-box__submit-button.button-secondary-large-inverse')
search_go.click()
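As an aside, the intermittent "element not found" errors happen because the search box is injected by JavaScript after the page loads, so Selenium's explicit waits handle this more gracefully than a bare retry loop. A minimal sketch, assuming the same CSS selector used above:
# Sketch only: wait up to 10 seconds for the search box to appear instead of
# spinning in a while loop. Uses Selenium's built-in explicit waits.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

search_jobs = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'input[id*=jobs-search-box-keyword-id-ember]')))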
Now comes the bulk of the work: we need to iterate through each posting to gather the data. Here's how we'll tackle the problem:
- Because the job results are rendered with JavaScript, we first need to force every posting on the current page to load, so we automate scrolling to the bottom of the results pane (a JavaScript-based alternative is sketched just after this list).
- We then grab the page source through Selenium and parse the HTML with Beautiful Soup.
- Once the HTML is parsed, we loop over the postings and extract the list of required skills for each one, tagging every skill with a job id, the job's primary industry, and the job title.
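The same "render everything" effect can also be achieved by scrolling the results container directly with JavaScript instead of sending repeated PAGE_DOWN keys. This is a minimal sketch, assuming the same two-pane results selector used in the loop below; the full loop below sticks with the PAGE_DOWN approach.
# Sketch only: scroll the results pane to the bottom via JavaScript instead of
# sending PAGE_DOWN keystrokes. Assumes the same container selector as below.
results_pane = browser.find_element_by_css_selector(
    'div[class*=jobs-search-results--is-two-pane]')
browser.execute_script(
    "arguments[0].scrollTop = arguments[0].scrollHeight;", results_pane)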
skill_list = []
title_list = []
industry_list = []
job_id = []

#This for-loop iterates through the result pages
for p in range(2, 42):
    def pg_dwn():
        #Scroll the results pane down one page, retrying until it has rendered
        while True:
            try:
                browser.find_element_by_css_selector(
                    'div[class*=jobs-search-results--is-two-pane]').send_keys(Keys.PAGE_DOWN)
            except:
                continue
            break
    #Page down 25 times to render every posting on the current page
    for i in range(25):
        pg_dwn()
    url = browser.current_url
    source = browser.page_source
    html = bs4.BeautifulSoup(source, "lxml")
    while True:
        try:
            #Find all the jobs
            job = browser.find_elements_by_class_name('job-card-search__title-line')
            jtitle1 = html.find_all(
                attrs={"class": "truncate-multiline--last-line-wrapper"})
            jtitle2 = html.find_all(
                attrs={"class": "job-card-search__title lt-line-clamp lt-line-clamp--multi-line ember-view"})
        except:
            continue
        break
    #The next few lines combine promoted job ads and regular ones
    title1 = []
    title2 = []
    for a in range(len(jtitle1)):
        title1.append(jtitle1[a].getText())
    for b in range(len(jtitle2)):
        title2.append(jtitle2[b].getText())
    jtitle = title1 + title2
    j_id = 0
    #Click through each job ad on the page and store the scraped information
    for i in range(len(job) - 1):
        job[i].click()
        time.sleep(2)
        url = browser.current_url
        source = browser.page_source
        html = bs4.BeautifulSoup(source, "lxml")
        skills = html.find_all(attrs={"class": "jobs-ppc-criteria__value t-14 t-black t-normal ml2 block"})
        industry = html.find_all(attrs={"class": "jobs-box__list-item jobs-description-details__list-item"})
        j_id = j_id + 1
        for j in range(len(skills)):
            s = skills[j].getText()
            t = jtitle[i]
            ind = industry[0].getText()
            skill_list.append(s)
            title_list.append(t)
            industry_list.append(ind)
            job_id.append("LI" + str(j_id))
    if p == 41:
        print("Last page complete")
        break
    #Find the button for the next page and click it
    try:
        page = browser.find_elements_by_xpath('//button/span[text()="' + str(p) + '"]')
    except:
        page = browser.find_elements_by_xpath('//button/span[text()="…"]')
    try:
        page[1].click()
    except:
        page[0].click()
    time.sleep(2)
Now that we have our data, we can store it in a data frame and export it as a CSV file to prepare for the data cleaning process in R.
skillsdf = pd.DataFrame(
    {'job_id': job_id,
     'skills': skill_list,
     'title': title_list,
     'industry': industry_list})
skillsdf.to_csv('skills.csv', index=False)
skillsdf
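Before moving over to R, a quick sanity check in pandas can confirm how much we actually captured. This is a minimal sketch using the columns defined above:
# Sketch only: quick sanity check on the scraped frame before the R step.
print(skillsdf.shape)  # total number of (job, skill) rows collected
skillsdf['skills'].value_counts().head(10)  # most common skills so far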