Scraping News Articles from NPR and National Review

December 9, 2018

Introduction

The data I scrape will be used for a text analysis comparing a politically right-leaning news website with a left-leaning one.

Given the time constraints, I will sample one right-leaning site (National Review) and one left-leaning site (NPR). Let's begin by loading our libraries; our main scraping tools will be Selenium and BeautifulSoup4.

In [ ]:
import pandas as pd
import numpy as np
import bs4
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

Scraping NPR

NPR is first, and we'll go straight to its politics archive page. Let's set the browser driver and open a window pointed at NPR's site.

In [ ]:
#set browser driver and navigate to NPR
browser = webdriver.Chrome(executable_path=r"/Users/chesterpoon/chromedriver")
browser.get('https://www.npr.org/sections/politics/archive')
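Before scrolling, a quick optional sanity check confirms the browser actually landed on the archive page:

In [ ]:
#optional: confirm the archive page loaded as expected
print(browser.title)
print(browser.current_url)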

Manual navigation shows that older articles only load when you scroll to the bottom of the page. Doing this by hand gets tedious quickly, so we'll write a page-down function to speed things up.

In [ ]:
#set page down function
def pg_dwn():
    browser.find_element_by_css_selector(
        'body').send_keys(Keys.PAGE_DOWN)

Now let's call the function in a for loop. I'll cap the range at 50 and add a time.sleep() call to give the page time to render each new batch of articles.

In [ ]:
#page down 15 times, pause to let new articles render, and repeat 50 times
for _ in range(50):
    for _ in range(15):
        pg_dwn()
    time.sleep(2)
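If sending PAGE_DOWN keys ever proves unreliable (for example, when focus shifts away from the body element), an alternative sketch is to scroll with JavaScript instead; the 50 iterations here simply mirror the loop above.

In [ ]:
#alternative sketch: jump straight to the bottom of the page via JavaScript
for _ in range(50):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)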

To prepare for the pull, I'll first instantiate some empty lists; these will become the columns of our future data frame.

Now that the full list of articles has rendered, I'll write code to locate the link for each article. A quick glance at the DOM shows that every article title sits in an "h2" tag with the class "title". We'll parse the page source with BeautifulSoup, pull each link into a list, and store the link text, which happens to be the article title, in a second list.

In [ ]:
title_l = []
datepub_l = []
article_l = []
href_l = [] 

source = browser.page_source
html = bs4.BeautifulSoup(source, "lxml")
links_l = html.find_all('h2', attrs={'class': 'title'})
for l in links_l:
    href_l.append(l.a.get('href'))
for t in range(len(links_l)):
    tl = links_l[t].get_text()
    title_l.append(tl)
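As an optional sanity check, it's worth confirming that the selector matched something and that we collected exactly one link per title before opening any tabs:

In [ ]:
#optional sanity check: one link per title, plus a peek at the first few
print(len(title_l), len(href_l))
print(title_l[:3])
print(href_l[:3])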

From further manual navigation of the site, I noticed that if you click through to an article and then return to the archive page, the long list of articles is no longer rendered and we would have to loop through the page-down function all over again. To avoid this, we'll open each link we collected in a new browser tab instead.

In each tab we open, we'll scrape the article's publication date, which is stored in a "time" tag. Because this is NPR, several "time" tags on a page indicate the duration of an audio player rather than a date, so the correct one is most likely the last "time" tag on the page. We'll also collect the article body, which lives in "p" tags without a class name, into a single variable.

In [ ]:
for j in range(len(href_l)):
    browser.execute_script("window.open('" + href_l[j] + "', 'new_window')")
    browser.switch_to.window(browser.window_handles[1])

    #start NPR loop here
    source = browser.page_source
    html = bs4.BeautifulSoup(source, "lxml")
    datepub = html.find_all('time')[-1]['datetime']
    content = html.find_all('p',attrs={'class': None})

    article = ""

    for c in content:
        article = article + c.getText()


    datepub_l.append(datepub)
    article_l.append(article)
    
    browser.close()
    browser.switch_to.window(browser.window_handles[0])
    #     end loop here
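Because pd.DataFrame() will refuse to build a frame from columns of different lengths, a quick optional assertion confirms that every link produced a date and an article body:

In [ ]:
#all four lists must be the same length to build the data frame
assert len(title_l) == len(datepub_l) == len(article_l) == len(href_l)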

Now that all our data is collected, we'll create a data frame, clean up the publication dates (some rows picked up an audio duration, which starts with "P", instead of a date), and export the result to a CSV file.

In [ ]:
npr = pd.DataFrame({
    'title': title_l,
    'datepub': datepub_l,
    'article': article_l,
    'link': href_l
})

#some rows carry an audio duration (which starts with "P") rather than a
#publication date; blank those out before parsing. Note that assigning to
#the row returned by iterrows() would not write back to the data frame,
#so we use .loc instead.
npr.loc[npr['datepub'].str.startswith('P'), 'datepub'] = ''

npr['datepub'] = pd.to_datetime(npr['datepub'])

npr.to_csv('npr.csv',index=False)
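If any other non-date strings were to slip through, a more defensive variant of the parsing step would coerce them to NaT rather than raise an error:

In [ ]:
#defensive variant: unparseable dates become NaT instead of raising
npr['datepub'] = pd.to_datetime(npr['datepub'], errors='coerce')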

Scraping the National Review

The process largely mirrors the NPR scrape, so I will only outline a few critical differences.

In [ ]:
browser.get('https://www.nationalreview.com/politics-policy/')

ntitle_l = []
ndatepub_l = []
narticle_l = []
nhref_l = []

The first major difference is how additional articles are loaded. The National Review only loads more when you scroll to the bottom of the screen and click a "load more" button, so this time we'll loop over clicking that button.

In [ ]:
for i in range(60):
    nxt = browser.find_element_by_css_selector('span.button-text')
    nxt.click()
    time.sleep(2)
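A fixed time.sleep(2) works, but if the button is slow to re-render, the next click can fail. A more defensive sketch would use Selenium's WebDriverWait to wait until the button is actually clickable before each click:

In [ ]:
#defensive variant: explicitly wait for the "load more" button each time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

for i in range(60):
    nxt = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'span.button-text')))
    nxt.click()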

The National Review site carries many ads, which makes the pull a bit harder: sometimes a page never fully loads because one of its ads times out. It's also far slower than the NPR site, which I suspect is likewise due to the ads and clickbait. To handle this, I've built in exception handlers. The rest of the code is identical to NPR's scrape.

In [ ]:
#function to scrape each page
source = browser.page_source
html = bs4.BeautifulSoup(source, "lxml")
links = html.find_all('h4', attrs={'class': 'post-list-article__title'})

for t in range(len(links)):
    tl = links[t].get_text()
    ntitle_l.append(tl)

for l in links:
    nhref_l.append(l.a.get('href'))

for j in range(len(nhref_l)):
    browser.execute_script("window.open('" + nhref_l[j] + "', 'new_window')")
    browser.switch_to.window(browser.window_handles[1])

    #start loop here
    #handle time-out exceptions
    try:
        source = browser.page_source
    except:
        ndatepub_l.append('')
        narticle_l.append('')
        browser.close()
        browser.switch_to.window(browser.window_handles[0])
    else:
        html = bs4.BeautifulSoup(source, "lxml")
        #This handles a missing time tag exception
        try:
            datepub = html.find('time')['datetime'] 
        except:
            datepub = ''
        content = html.find_all('p',attrs={'class': None})

        article = ""

        for c in content:
            article = article + c.getText()

        ndatepub_l.append(datepub)
        narticle_l.append(article)

        browser.close()
        browser.switch_to.window(browser.window_handles[0])
    #     end loop here

natrev = pd.DataFrame({
    'title': ntitle_l,
    'datepub': ndatepub_l,
    'article': narticle_l,
    'link': nhref_l
})

natrev['datepub'] =  pd.to_datetime(natrev['datepub'])
natrev.to_csv('natrev.csv',index=False)
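As a quick final check before moving on to the text analysis, it's worth confirming that the two scrapes are roughly comparable in size and date range:

In [ ]:
#quick comparison of the two scrapes
for name, df in [('NPR', npr), ('National Review', natrev)]:
    print(name, df.shape[0], 'articles,',
          df['datepub'].min(), 'to', df['datepub'].max())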