An introduction to Web scraping using Python 3

In this article I will demonstrate how easy it is to perform basic text Web scraping using Python and just a few lines of code.

The example has been developed and tested using Python 3.5.2.

The first step is to see whether you already have the following third party libraries installed: Requests and Beautiful Soup 4. Start IDLE and try typing the following command:


import requests

After you press return, if you see no error message then requests is installed. If you see an error message saying the module could not be found, install it using pip from the command line as shown below.


pip install requests

Repeat the process to see whether you already have the Beautiful Soup library installed; fortunately you don’t have too much to type…


import bs4

Again, if Python complains that it can’t find the library, use pip from the command line to install it.


pip install beautifulsoup4
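
If you would rather check for both libraries in one go, a short script along these lines will report anything that is missing. This is just a convenience sketch and is not part of the scraping program itself:


import importlib

# the names used by pip differ slightly from the names used by import
libraries = {'requests': 'requests', 'bs4': 'beautifulsoup4'}

for module_name, pip_name in libraries.items():
    try:
        importlib.import_module(module_name)
        print(module_name + ' is installed')
    except ImportError:
        print(module_name + ' is missing - install it with: pip install ' + pip_name)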

With the libraries installed, here is a program that scrapes this site, returning the titles of the blog posts shown on its front page.

To demonstrate how this is achieved with just a few lines of code, here is the program without comments:


import requests, bs4

def getTitlesFromMySite(url):

    res = requests.get(url)
    res.raise_for_status()

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    elems = soup.select('.entry-title')

    return elems


titles = getTitlesFromMySite('https://oraclefrontovik.com')

for title in titles:
    print(title.text)

Now the same code, but this time with each section commented…


# import requests (for downloading web pages) and Beautiful Soup (for parsing HTML)
import requests, bs4

# create a function that takes a parameter containing the url to scrape
def getTitlesFromMySite(url):

    # download the webpage and store it in the res variable
    res = requests.get(url)

    # check for problems - if there are any, raise_for_status() raises an exception
    # and the program stops at this point
    res.raise_for_status()

    # running the downloaded webpage through Beautiful Soup returns a
    # Beautiful Soup object, which represents the HTML as a nested data structure
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # store in a list the elements that match this CSS selector
    # (how I obtained this selector is explained below)
    elems = soup.select('.entry-title')

    return elems

# call the function and store the results in titles
titles = getTitlesFromMySite('https://oraclefrontovik.com')

# loop through the list, printing out each title
for title in titles:
    print(title.text)

Running the example returns the following expected output…


Learn C# in One Day and Learn It Well – Review

Contributing to an Open Source Project

A step by step guide to building a Raspberry Pi Hedgehog camera

Is there more than one reason to use PL/SQL WHERE CURRENT OF ?

Structured Basis Testing

Raspberry Pi connected to WiFi but no internet access

The auditing capabilities of Flashback Data Archive in Oracle 12c.

DBMS_UTILITY.FORMAT_ERROR_BACKTRACE and the perils of the RAISE statement

Using INSERT ALL with related tables

The best lesson I learnt from Steve McConnell

To summarise, the code imports two third party libraries, Requests and Beautiful Soup 4, which perform the lion’s share of the work. In the example I use the Requests library to download a web page as HTML and then pass it to Beautiful Soup along with a CSS selector to extract the information I want from it.
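
As the comments above note, raise_for_status() stops the program if the download fails. If you would rather have the function carry on and simply return nothing, the download can be wrapped in a try/except along these lines (a sketch, not part of the original program; the timeout value is an arbitrary choice):


import requests, bs4

def getTitlesFromMySite(url):
    try:
        # the timeout stops the call hanging indefinitely if the site is unreachable
        res = requests.get(url, timeout=10)
        res.raise_for_status()
    except requests.exceptions.RequestException as err:
        # RequestException covers connection errors, timeouts and bad HTTP statuses
        print('Could not download {}: {}'.format(url, err))
        return []

    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    return soup.select('.entry-title')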

Obtaining the CSS selector

The code example has the following line, which extracts the part of the webpage we are interested in, namely the blog post titles:

elems = soup.select('.entry-title')
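
It helps to know what select() returns here: a list of Beautiful Soup Tag objects, one for each element that matches the selector. Printing a Tag shows its raw HTML, while its .text attribute gives just the visible text, which is what the program prints at the end. A small illustration, assuming the selector matched at least one element (the exact markup depends on the site’s theme):


elems = soup.select('.entry-title')

print(type(elems[0]))  # each match is a bs4.element.Tag
print(elems[0])        # the raw HTML of the matching element
print(elems[0].text)   # just the text inside the element - the post title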

Using Firefox, I obtained the CSS selector ‘.entry-title’ by:

  1. Navigating to the page of interest, in this case oraclefrontovik.com.
  2. Opening the Firefox developer tools (Ctrl + Shift + I).
  3. Highlighting the first title (which at the time of writing was Learn C# in One Day and Learn It Well – Review), right clicking and selecting Inspect Element.
  4. In the inspector, right clicking the highlighted element, selecting Copy and then choosing CSS Selector from the sub menu; the copied selector can then be checked as shown below.
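
Whichever selector you end up with, it is worth a quick sanity check in the interpreter before relying on it. Something like the following will do (a sketch; a count of 0 means the selector matches nothing on the page):


import requests, bs4

res = requests.get('https://oraclefrontovik.com')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# how many elements on the page match the copied selector?
print(len(soup.select('.entry-title')))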

At the time of writing, I was unable to get the same CSS selector using the native developer tools in Chrome. If you know of a way, please let me know in the comments.

Summary

In this post I have walked through the steps to perform basic text Web scraping using Python 3.
