Beautiful Soup Basics

Posted on Jun 14, 2018 in Notes • 15 min read

Beautiful Soup

Some notes on using Beautiful Soup, taken from Dataquest.

In [20]:
# requests downloads pages via `get()`; BeautifulSoup parses the HTML
import requests
from bs4 import BeautifulSoup
import pandas as pd

First, download the page's HTML with requests.get.

In [ ]:
page = requests.get('http://dataquestio.github.io/web-scraping-pages/simple.html')
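
A request can fail (404, 500, timeout), and in that case page.content holds an error page rather than the HTML you wanted. The sketch below wraps the download in a small helper that fails loudly instead. The name fetch_html and the 10 second timeout are assumptions made for illustration, not part of the original notes.

```python
import requests


def fetch_html(url, timeout=10):
    """Download url and return the raw response body as bytes.

    Hypothetical helper: raises requests.exceptions.HTTPError
    for 4xx/5xx responses instead of returning an error page.
    """
    page = requests.get(url, timeout=timeout)
    page.raise_for_status()
    return page.content


# Usage (needs network access, so it is left commented out here):
# html = fetch_html('http://dataquestio.github.io/web-scraping-pages/simple.html')
```

The returned bytes can be passed straight to BeautifulSoup in place of page.content.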

Basic tags

In [2]:
# raw HTML content of the page 
page.content
In [3]:
# Create an instance of BS class to parse our doc
soup = BeautifulSoup(page.content, 'html.parser')

# `prettify` method displays nicely formatted HTML
soup.prettify()
In [4]:
# Move through the structure one level down
soup.children # returns an iterator, so wrap it in `list()` to see the items
list(soup.children)

[type(item) for item in list(soup.children)]
Out[4]:
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
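
To make "one level down" concrete, here is a small sketch that walks from the document root to the p tag by hand. The inline HTML is an assumed stand-in with the same shape as simple.html, so the example runs without network access.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed stand-in for the page used above (not fetched from the network).
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only Tag children; this skips the Doctype and whitespace strings.
html_tag = [child for child in soup.children if isinstance(child, Tag)][0]
# The Tag children of <html> are <head> and <body>, in that order.
body_tag = [child for child in html_tag.children if isinstance(child, Tag)][1]
p_tag = [child for child in body_tag.children if isinstance(child, Tag)][0]

print(html_tag.name, body_tag.name, p_tag.name)
print(p_tag.get_text())
```

Walking children like this is mostly useful for understanding the tree; the search methods in the next section are far less fragile.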

Search by tags

  • use the find_all() method and pass in tag name as a str
In [5]:
# Find the first instance of a certain tag
soup.find('p') # returns bs4.element.Tag

# Find all instances of a certain tag
soup.find_all('p') # returns bs4.element.ResultSet
Out[5]:
[<p>Here is some simple content for this page.</p>]
In [6]:
# Access the text
soup.find_all('p')[0].get_text()
Out[6]:
'Here is some simple content for this page.'
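
One thing worth knowing before chaining calls: find() returns None when nothing matches, while find_all() returns an empty ResultSet. A minimal sketch, using an assumed one-paragraph document:

```python
from bs4 import BeautifulSoup

# Assumed minimal document with a single paragraph.
soup = BeautifulSoup(
    "<html><body><p>Here is some simple content for this page.</p></body></html>",
    "html.parser",
)

first_p = soup.find("p")             # a Tag, because a <p> exists
missing = soup.find("table")         # None, because there is no <table>
all_tables = soup.find_all("table")  # an empty ResultSet, never None

# Guard against None before calling get_text().
text = first_p.get_text() if first_p is not None else ""
table_text = missing.get_text() if missing is not None else ""

print(repr(text), repr(table_text), len(all_tables))
```

Without the guard, missing.get_text() would raise an AttributeError on None.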

Search by class & id

  • use the class_ or id keyword argument of find_all()
In [7]:
page = requests.get('http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')
In [9]:
# The class on this page is spelled with a hyphen: outer-text
outer_text = soup.find_all(class_='outer-text')
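
A short sketch of the class and id searches side by side. The markup is an assumption modelled on the ids_and_classes.html page, not the real file.

```python
from bs4 import BeautifulSoup

# Assumed markup modelled on the ids_and_classes.html page; not the real file.
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any one of an element's classes.
outer = soup.find_all(class_="outer-text")
# Tag name and class can be combined to narrow the search.
outer_p = soup.find_all("p", class_="outer-text")
# An id should be unique, so this list has at most one element.
first = soup.find_all(id="first")

print(len(outer), len(outer_p), first[0].get_text())
```

The trailing underscore in class_ is needed because class is a reserved word in Python.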

Search by CSS selectors

  • use the select() method; CSS selector examples:
    • p a — finds all a tags inside of a p tag.
    • body p a — finds all a tags inside of a p tag inside of a body tag.
    • html body — finds all body tags inside of an html tag.
    • p.outer-text — finds all p tags with a class of outer-text.
    • p#first — finds all p tags with an id of first.
    • body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
In [10]:
# Finds all p tags that are inside a div
soup.select("div p") # returns a python list
Out[10]:
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
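
A few of the selectors listed above, tried on a small document. The HTML and the texts() helper are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Assumed markup modelled on the ids_and_classes.html page; not the real file.
html = """
<html><body>
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second">First outer paragraph.</p>
<p class="outer-text">Second outer paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matched by a CSS selector."""
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


print(texts("div p"))
print(texts("p.outer-text"))
print(texts("p#first"))
print(texts("body p.outer-text"))
# select_one() returns the first match (or None) instead of a list.
print(soup.select_one("p#second").get_text())
```

select() always returns a list, even for an id selector, so index it or use select_one() when a single element is expected.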

General workflow for locating data on a page:

  1. inspect the HTML with Chrome DevTools
  2. click the target text
  3. find the "outermost" element that contains all of the target text
  4. explore that element (here a div) further
  5. use find() or find_all() to navigate to the target
  6. explore the tags that hold the target information and act accordingly
    • is it the text of a tag reachable by class or id? Use find().get_text()
    • is it stored in an attribute of a tag? Use find() and index the tag like a dict

find() and find_all() must be called on a bs4.element.Tag (or on the BeautifulSoup object itself), not on a bs4.element.ResultSet.
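
A minimal sketch of what goes wrong and how to avoid it. The two-item HTML is an assumed miniature of the forecast markup used below.

```python
from bs4 import BeautifulSoup

# Assumed miniature of the forecast markup; not the real page.
html = """
<div class="tombstone-container"><p class="period-name">Today</p></div>
<div class="tombstone-container"><p class="period-name">Tonight</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all(class_="tombstone-container")  # a ResultSet

try:
    items.find(class_="period-name")  # wrong: a ResultSet is list-like
    error = None
except AttributeError as exc:
    error = exc

# Right: index or loop to get Tags, then search inside each one.
names = [item.find(class_="period-name").get_text() for item in items]
print(type(error).__name__, names)
```

The fix is always the same: pick out a single Tag (by index or in a loop) before searching further.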

In [28]:
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

# Select the first element of the result set
tonight = forecast_items[0] # create a bs4.element.Tag
print(tonight.prettify())
<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 69 °F
 </p>
</div>
  • The name of the forecast item is in class_="period-name"
  • The short description of the conditions is in class_="short-desc"
  • The temperature is in class_="temp" (here class="temp temp-high"; low temperatures use class="temp temp-low")
In [12]:
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)
Today
Sunny
High: 69 °F
  • The full description of the conditions is in the title attribute of the img tag

How to extract the title attribute from the img tag

  1. treat the Tag object like a dictionary
  2. pass in the attribute we want as a key
In [13]:
img = tonight.find("img")
desc = img['title']

print(desc)
Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. 
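
Dictionary-style access has a few variants that are handy when an attribute may be absent. A sketch on an assumed, shortened copy of the img tag above:

```python
from bs4 import BeautifulSoup

# Assumed, shortened version of the <img> tag shown above.
html = (
    '<p><img alt="Today: Sunny" class="forecast-icon" '
    'src="newimages/medium/few.png" '
    'title="Today: Sunny, with a high near 69."/></p>'
)
img = BeautifulSoup(html, "html.parser").find("img")

title = img["title"]           # raises KeyError if the attribute is missing
width = img.get("width")       # returns None if the attribute is missing
classes = img["class"]         # class is multi-valued, so this is a list
has_src = img.has_attr("src")  # True or False

print(title, width, classes, has_src)
```

Use img.get(...) when scraping many tags and some of them may lack the attribute.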

Get ALL Information

Generalise the process above to pull each field for every forecast period at once.

In [17]:
# Use a CSS selector to get every .period-name element within a .tombstone-container element
period_tags = seven_day.select(".tombstone-container .period-name")
period_tags
Out[17]:
[<p class="period-name">Today<br/><br/></p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>]
In [34]:
periods = [pt.get_text() for pt in period_tags]
type(periods)
Out[34]:
list
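
Note that get_text() joins the text on either side of a br tag directly, which is why "Monday Night" ends up as "MondayNight" in the results. One possible fix, sketched on an assumed two-tag sample, is to pass a separator and strip whitespace:

```python
from bs4 import BeautifulSoup

# Assumed sample of two period-name tags copied from the output above.
html = """
<p class="period-name">Today<br/><br/></p>
<p class="period-name">Monday<br/>Night</p>
"""
soup = BeautifulSoup(html, "html.parser")
period_tags = soup.select(".period-name")

# Default: text pieces are concatenated with nothing in between.
plain = [pt.get_text() for pt in period_tags]
# separator joins the pieces; strip=True trims each piece and drops empty ones.
spaced = [pt.get_text(separator=" ", strip=True) for pt in period_tags]

print(plain)
print(spaced)
```

Here plain contains 'MondayNight' while spaced contains 'Monday Night'.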
In [19]:
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)
['Sunny', 'Patchy Fog', 'Breezy.Patchy Fogthen Sunny', 'PatchyDrizzle andPatchy Fog', 'PatchyDrizzle andPatchy Fogthen Sunny', 'PatchyDrizzle andPatchy Fog', 'PatchyDrizzle andPatchy Fogthen Sunny', 'Patchy Fog', 'Patchy Fogthen Sunny']
['High: 69 °F', 'Low: 54 °F', 'High: 65 °F', 'Low: 52 °F', 'High: 63 °F', 'Low: 52 °F', 'High: 60 °F', 'Low: 52 °F', 'High: 61 °F']
['Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. ', 'Tonight: Patchy fog after 11pm.  Otherwise, partly cloudy, with a low around 54. West southwest wind 16 to 21 mph decreasing to 10 to 15 mph after midnight. Winds could gust as high as 28 mph. ', 'Monday: Patchy fog before 11am.  Otherwise, mostly sunny, with a high near 65. Breezy, with a west wind 9 to 14 mph increasing to 18 to 23 mph in the afternoon. Winds could gust as high as 30 mph. ', 'Monday Night: Patchy drizzle and fog after 11pm.  Increasing clouds, with a low around 52. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ', 'Tuesday: Patchy drizzle and fog before 11am.  Mostly sunny, with a high near 63. West wind 16 to 21 mph, with gusts as high as 26 mph. ', 'Tuesday Night: Patchy drizzle and fog after 11pm.  Mostly cloudy, with a low around 52.', 'Wednesday: Patchy drizzle and fog before 11am.  Mostly sunny, with a high near 60.', 'Wednesday Night: Patchy fog after 11pm.  Otherwise, mostly cloudy, with a low around 52.', 'Thursday: Patchy fog before 11am.  Otherwise, mostly sunny, with a high near 61.']
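
The column-by-column selects above assume every forecast item has every field; if one item lacked a temperature, the lists would have different lengths and the rows would shift. An alternative sketch builds one record per item instead. The HTML (with a deliberately missing temp in the second item) and the text_or_none helper are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed miniature of the forecast markup; the second item has no temp on purpose.
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container">
    <p class="period-name">Today</p>
    <p class="short-desc">Sunny</p>
    <p class="temp temp-high">High: 69 °F</p>
  </div>
  <div class="tombstone-container">
    <p class="period-name">Tonight</p>
    <p class="short-desc">Patchy Fog</p>
  </div>
</div>
"""
seven_day = BeautifulSoup(html, "html.parser").find(id="seven-day-forecast")


def text_or_none(parent, css_class):
    """Return the stripped text of the first match inside parent, or None."""
    tag = parent.find(class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


# One dict per forecast item keeps the fields of each period together.
rows = [
    {
        "period": text_or_none(item, "period-name"),
        "short_desc": text_or_none(item, "short-desc"),
        "temp": text_or_none(item, "temp"),
    }
    for item in seven_day.find_all(class_="tombstone-container")
]
print(rows)
```

A list of dicts like rows can be passed directly to pd.DataFrame, with missing fields showing up as empty values rather than misaligned rows.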

Wrangle results

In [33]:
import re
# re.sub(pattern, replacement, string): remove every "e" and "l"
string = re.sub("e|l", "", "Hello people")
string
Out[33]:
'Ho pop'
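
The same function can tidy the scraped strings. As a rough sketch (the helper name and the pattern are my own assumptions, not from the tutorial), this inserts a space wherever a lowercase letter or a period is immediately followed by an uppercase letter:

```python
import re


def split_joined_words(text):
    """Insert a space where a lowercase letter or '.' is directly followed by an uppercase letter."""
    # Both parts of the pattern are zero-width lookarounds, so nothing is removed.
    return re.sub(r"(?<=[a-z.])(?=[A-Z])", " ", text)


print(split_joined_words("MondayNight"))
print(split_joined_words("Breezy.Patchy Fogthen Sunny"))
print(split_joined_words("PatchyDrizzle andPatchy Fog"))
```

This turns 'MondayNight' into 'Monday Night', but 'Fogthen' stays joined because both letters at the boundary are lowercase; a pattern cannot recover that split without more context.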

Store the result in pandas.DataFrame

In [21]:
weather = pd.DataFrame({
        "period": periods,
        "short_desc": short_descs,
        "temp": temps,
        "desc": descs
    })
weather
Out[21]:
   desc                                               period          short_desc                             temp
0  Today: Sunny, with a high near 69. West southw...  Today           Sunny                                  High: 69 °F
1  Tonight: Patchy fog after 11pm. Otherwise, pa...   Tonight         Patchy Fog                             Low: 54 °F
2  Monday: Patchy fog before 11am. Otherwise, mo...   Monday          Breezy.Patchy Fogthen Sunny            High: 65 °F
3  Monday Night: Patchy drizzle and fog after 11p...  MondayNight     PatchyDrizzle andPatchy Fog            Low: 52 °F
4  Tuesday: Patchy drizzle and fog before 11am. ...   Tuesday         PatchyDrizzle andPatchy Fogthen Sunny  High: 63 °F
5  Tuesday Night: Patchy drizzle and fog after 11...  TuesdayNight    PatchyDrizzle andPatchy Fog            Low: 52 °F
6  Wednesday: Patchy drizzle and fog before 11am....  Wednesday       PatchyDrizzle andPatchy Fogthen Sunny  High: 60 °F
7  Wednesday Night: Patchy fog after 11pm. Other...   WednesdayNight  Patchy Fog                             Low: 52 °F
8  Thursday: Patchy fog before 11am. Otherwise, ...   Thursday        Patchy Fogthen Sunny                   High: 61 °F

Write to Excel

In [ ]:
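# Write the weather DataFrame to Excel (needs the xlsxwriter package);
# the with block saves and closes the file on exit
with pd.ExcelWriter('file_name.xlsx', engine='xlsxwriter') as writer:
    weather.to_excel(writer)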

Analysis

For example, we can use a regex with the Series.str.extract method to pull out the numeric temperature values.

In [24]:
# Raw string so that \d reaches the regex engine unchanged
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums
Out[24]:
0    69
1    54
2    65
3    52
4    63
5    52
6    60
7    52
8    61
Name: temp_num, dtype: object
In [25]:
weather["temp_num"].mean()
Out[25]:
58.666666666666664
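
The overall mean mixes daytime highs and night-time lows. As a possible next step (a sketch under my own assumptions: the kind column and the two-group regex are not in the original notes), the same str.extract call can also capture the High/Low label so the two can be averaged separately:

```python
import pandas as pd

# Temperatures copied from the output above.
weather = pd.DataFrame({
    "temp": ["High: 69 °F", "Low: 54 °F", "High: 65 °F", "Low: 52 °F", "High: 63 °F",
             "Low: 52 °F", "High: 60 °F", "Low: 52 °F", "High: 61 °F"]
})

# One named group per output column; expand=True returns a DataFrame.
parts = weather["temp"].str.extract(r"(?P<kind>High|Low): (?P<temp_num>\d+)", expand=True)
weather["kind"] = parts["kind"]
weather["temp_num"] = parts["temp_num"].astype(int)

# Mean temperature for highs and lows separately.
means = weather.groupby("kind")["temp_num"].mean()
print(means)
```

For this data the highs average 63.6 and the lows 52.5.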