Beautiful Soup Basics

Posted on Jun 14, 2018 in Notes • 15 min read

Beautiful Soup¶

Some notes on using Beautiful Soup, taken while working through Dataquest material.

In [20]:

# To get content from webpages via `get()`
import requests
from bs4 import BeautifulSoup
import pandas as pd

First, fetch the page's HTML via `requests.get`¶

In [ ]:

page = requests.get('http://dataquestio.github.io/web-scraping-pages/simple.html')

Basic tags¶

In [2]:

# raw HTML content of the page 
page.content

In [3]:

# Create an instance of BS class to parse our doc
soup = BeautifulSoup(page.content, 'html.parser')

# `prettify` method displays nicely formatted HTML
soup.prettify()

In [4]:

# Move through the structure one level down
soup.children # returns an iterator, so wrap it in `list()` to see the items
list(soup.children)

[type(item) for item in list(soup.children)]

Out[4]:

[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
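The same walk can be tried without fetching anything. This is a minimal sketch: the `html` string is a made-up stand-in for the page rather than its exact markup, and filtering on `Tag` is just one way to skip the doctype and whitespace strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical stand-in for the downloaded page
html = """<!DOCTYPE html>
<html>
<head><title>A simple example page</title></head>
<body><p>Here is some simple content for this page.</p></body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

# Keep only real tags, skipping the doctype and whitespace strings
html_tag = [c for c in soup.children if isinstance(c, Tag)][0]

# Go down one level at a time: html -> body -> p
body = [c for c in html_tag.children if isinstance(c, Tag) and c.name == "body"][0]
p = [c for c in body.children if isinstance(c, Tag)][0]

print(p.get_text())
```

Walking `children` by hand gets tedious quickly, which is why the search methods are usually preferred.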

Search by tags¶

Use the find() or find_all() method and pass in the tag name as a str.

In [5]:

# Find the first instance of a certain tag
soup.find('p') # returns bs4.element.Tag

# Find all instances of a certain tag
soup.find_all('p') # returns bs4.element.ResultSet

Out[5]:

[<p>Here is some simple content for this page.</p>]

In [6]:

# Access the text
soup.find_all('p')[0].get_text()

Out[6]:

'Here is some simple content for this page.'
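One difference worth keeping in mind is what the two methods return when nothing matches. The sketch below uses a tiny inline document (an assumption, not the tutorial page) to show it.

```python
from bs4 import BeautifulSoup

# Hypothetical two-paragraph document
soup = BeautifulSoup("<p>First</p><p>Second</p>", "html.parser")

first = soup.find("p")               # a single Tag
all_p = soup.find_all("p")           # a list-like ResultSet of Tags
missing = soup.find("table")         # None when nothing matches
no_tables = soup.find_all("table")   # empty ResultSet when nothing matches

texts = [p.get_text() for p in all_p]

# Guard against None before calling get_text()
table_text = missing.get_text() if missing is not None else ""
```

Calling `get_text()` directly on a `find()` result that turned out to be `None` raises an `AttributeError`, so the guard matters on pages whose layout may change.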

Search by class & id¶

Use the class_ or id keyword argument of find_all().

In [7]:

page = requests.get('http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html')
soup = BeautifulSoup(page.content, 'html.parser')

In [9]:

# The class on the page is spelled with a hyphen: outer-text
outer_text = soup.find_all(class_='outer-text')
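As a rough illustration of how these keyword arguments combine, here is a sketch on inline markup. The HTML is hypothetical and only loosely modelled on the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the page above
html = """
<div>
  <p class="inner-text first-item" id="first">First paragraph.</p>
  <p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

outer = soup.find_all(class_="outer-text")          # any tag with that class
outer_p = soup.find_all("p", class_="outer-text")   # tag name and class together
first = soup.find_all(id="first")                   # match on id

outer_texts = [tag.get_text() for tag in outer]
print(outer_texts)
```

A tag with several classes, such as `outer-text first-item`, still matches a search for any one of them.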

Search by CSS selectors¶

Use the select() method. CSS selector examples:
- p a — finds all a tags inside of a p tag.
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.

In [10]:

# Finds all p tags that are inside a div
soup.select("div p") # returns a python list

Out[10]:

[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>, <p class="inner-text">
                 Second paragraph.
             </p>]
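A few of the selectors listed above can be tried on inline markup. This is a sketch under the assumption that the HTML below (made up for the example) resembles the page; `select_one()` is included because it returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for trying out selectors
html = """
<body>
  <div>
    <p class="inner-text first-item" id="first">First paragraph.</p>
    <p class="inner-text">Second paragraph.</p>
  </div>
  <p class="outer-text">Outer paragraph.</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")

in_div = [p.get_text() for p in soup.select("div p")]             # p inside a div
outer = [p.get_text() for p in soup.select("body p.outer-text")]  # class inside body
by_id = soup.select_one("p#first")                                # first match, or None
```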

Navigating the Web Structure¶

1. Inspect the HTML with Chrome DevTools.
2. Click the target text.
3. Find the "outermost" element that contains all of the text.
4. Explore that div further.
5. Use find() or find_all() to navigate to the target.
6. Explore the tags that hold the target information and act accordingly:
   - Are they accessible simply by class or id? Use find().get_text().
   - Are they within an attribute of a tag? Use find() and access it like a dict.

find() and find_all() must be called on a bs4.element.Tag (or on the BeautifulSoup object itself), not on a bs4.element.ResultSet.
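The Tag versus ResultSet distinction is easy to trip over, so here is a small sketch of the failure and the usual fix. The markup is a hypothetical, cut-down imitation of a forecast page.

```python
from bs4 import BeautifulSoup

# Hypothetical, cut-down forecast markup
html = """
<div id="seven-day-forecast">
  <div class="tombstone-container"><p class="period-name">Today</p></div>
  <div class="tombstone-container"><p class="period-name">Tonight</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all(class_="tombstone-container")   # a ResultSet

try:
    items.find(class_="period-name")    # a ResultSet has no find()
    failed = False
except AttributeError:
    failed = True

# Loop over the ResultSet and search each Tag instead
names = [item.find(class_="period-name").get_text() for item in items]
print(failed, names)
```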

In [28]:

page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")

# Select the first element of the result set
tonight = forecast_items[0] # indexing the ResultSet gives a bs4.element.Tag
print(tonight.prettify())

<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. " class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. "/>
 </p>
 <p class="short-desc">
  Sunny
 </p>
 <p class="temp temp-high">
  High: 69 °F
 </p>
</div>

The name of the forecast item is in class="period-name".
The short description of the conditions is in class="short-desc".
The temperature is in class="temp temp-high" for this item (low temperatures use class="temp temp-low").

In [12]:

period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()

print(period)
print(short_desc)
print(temp)

Today
Sunny
High: 69 °F

The full description of the conditions is in the title attribute of the img tag.

How to extract the title attribute from the img tag:

- treat the bs4.element.Tag object like a dictionary
- pass in the attribute we want as a key

In [13]:

img = tonight.find("img")
desc = img['title']

print(desc)

Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph.
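Dictionary-style access raises a `KeyError` when the attribute is absent, so `Tag.get()` can be the safer choice on pages that vary. A minimal sketch, using a shortened, hypothetical img tag:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the img tag above
html = '<p><img alt="Today: Sunny" class="forecast-icon" src="newimages/medium/few.png" title="Today: Sunny"/></p>'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

title = img["title"]                          # raises KeyError if the attribute is missing
width = img.get("width")                      # returns None if the attribute is missing
width_default = img.get("width", "unknown")   # or a default of our choosing
classes = img["class"]                        # class is multi-valued, so this is a list

print(title, width, width_default, classes)
```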

Get ALL Information¶

Generalise the process above to every forecast item.

In [17]:

# Use a CSS selector to get the period-name elements within every tombstone-container
period_tags = seven_day.select(".tombstone-container .period-name")
period_tags

Out[17]:

[<p class="period-name">Today<br/><br/></p>,
 <p class="period-name">Tonight<br/><br/></p>,
 <p class="period-name">Monday<br/><br/></p>,
 <p class="period-name">Monday<br/>Night</p>,
 <p class="period-name">Tuesday<br/><br/></p>,
 <p class="period-name">Tuesday<br/>Night</p>,
 <p class="period-name">Wednesday<br/><br/></p>,
 <p class="period-name">Wednesday<br/>Night</p>,
 <p class="period-name">Thursday<br/><br/></p>]

In [34]:

periods = [pt.get_text() for pt in period_tags]
type(periods)

Out[34]:

list

In [19]:

short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]

print(short_descs)
print(temps)
print(descs)

['Sunny', 'Patchy Fog', 'Breezy.Patchy Fogthen Sunny', 'PatchyDrizzle andPatchy Fog', 'PatchyDrizzle andPatchy Fogthen Sunny', 'PatchyDrizzle andPatchy Fog', 'PatchyDrizzle andPatchy Fogthen Sunny', 'Patchy Fog', 'Patchy Fogthen Sunny']
['High: 69 °F', 'Low: 54 °F', 'High: 65 °F', 'Low: 52 °F', 'High: 63 °F', 'Low: 52 °F', 'High: 60 °F', 'Low: 52 °F', 'High: 61 °F']
['Today: Sunny, with a high near 69. West southwest wind 9 to 14 mph increasing to 15 to 20 mph in the afternoon. Winds could gust as high as 26 mph. ', 'Tonight: Patchy fog after 11pm.  Otherwise, partly cloudy, with a low around 54. West southwest wind 16 to 21 mph decreasing to 10 to 15 mph after midnight. Winds could gust as high as 28 mph. ', 'Monday: Patchy fog before 11am.  Otherwise, mostly sunny, with a high near 65. Breezy, with a west wind 9 to 14 mph increasing to 18 to 23 mph in the afternoon. Winds could gust as high as 30 mph. ', 'Monday Night: Patchy drizzle and fog after 11pm.  Increasing clouds, with a low around 52. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ', 'Tuesday: Patchy drizzle and fog before 11am.  Mostly sunny, with a high near 63. West wind 16 to 21 mph, with gusts as high as 26 mph. ', 'Tuesday Night: Patchy drizzle and fog after 11pm.  Mostly cloudy, with a low around 52.', 'Wednesday: Patchy drizzle and fog before 11am.  Mostly sunny, with a high near 60.', 'Wednesday Night: Patchy fog after 11pm.  Otherwise, mostly cloudy, with a low around 52.', 'Thursday: Patchy fog before 11am.  Otherwise, mostly sunny, with a high near 61.']
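Strings such as 'Patchy Fogthen Sunny' appear because `get_text()` joins the text on either side of a `<br/>` with nothing in between. One possible fix, sketched here on hypothetical snippets shaped like the forecast markup, is to pass a separator and strip each piece.

```python
from bs4 import BeautifulSoup

# Hypothetical snippets shaped like the forecast markup
html = """
<p class="short-desc">Patchy Fog<br/>then Sunny</p>
<p class="period-name">Monday<br/>Night</p>
<p class="period-name">Today<br/><br/></p>
"""
soup = BeautifulSoup(html, "html.parser")

# Default behaviour: pieces are glued together
joined = [p.get_text() for p in soup.find_all("p")]

# Put a space between pieces and strip surrounding whitespace
spaced = [p.get_text(separator=" ", strip=True) for p in soup.find_all("p")]

print(joined)
print(spaced)
```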

Wrangle results¶

In [33]:

import re
string = re.sub("e|l", "", "Hello people")
string

Out[33]:

'Ho pop'
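Applied to scraped strings, the same `re` module can tidy whitespace or pull fields apart. The sketch below is only an illustration: the sample strings are copied from the output above, and the patterns are assumptions about what one might want to clean.

```python
import re

# Collapse runs of whitespace and trim the ends
raw = "Tonight: Patchy fog after 11pm.  Otherwise, partly cloudy, with a low around 54. "
clean = re.sub(r"\s+", " ", raw).strip()

# Split a temperature string into its label and its number
match = re.match(r"(High|Low): (\d+)", "High: 69 °F")
kind, degrees = match.group(1), int(match.group(2))

print(clean)
print(kind, degrees)
```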

Store the result in pandas.DataFrame¶

In [21]:

weather = pd.DataFrame({
        "period": periods,
        "short_desc": short_descs,
        "temp": temps,
        "desc": descs
    })
weather

Out[21]:

	desc	period	short_desc	temp
0	Today: Sunny, with a high near 69. West southw...	Today	Sunny	High: 69 °F
1	Tonight: Patchy fog after 11pm. Otherwise, pa...	Tonight	Patchy Fog	Low: 54 °F
2	Monday: Patchy fog before 11am. Otherwise, mo...	Monday	Breezy.Patchy Fogthen Sunny	High: 65 °F
3	Monday Night: Patchy drizzle and fog after 11p...	MondayNight	PatchyDrizzle andPatchy Fog	Low: 52 °F
4	Tuesday: Patchy drizzle and fog before 11am. ...	Tuesday	PatchyDrizzle andPatchy Fogthen Sunny	High: 63 °F
5	Tuesday Night: Patchy drizzle and fog after 11...	TuesdayNight	PatchyDrizzle andPatchy Fog	Low: 52 °F
6	Wednesday: Patchy drizzle and fog before 11am....	Wednesday	PatchyDrizzle andPatchy Fogthen Sunny	High: 60 °F
7	Wednesday Night: Patchy fog after 11pm. Other...	WednesdayNight	Patchy Fog	Low: 52 °F
8	Thursday: Patchy fog before 11am. Otherwise, ...	Thursday	Patchy Fogthen Sunny	High: 61 °F
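The columns in the output above are not in the order they were written in the dict; how dict input is ordered can depend on the pandas version. A sketch of pinning the order explicitly with the `columns` argument, on a shortened, hypothetical sample:

```python
import pandas as pd

# Shortened sample of the scraped lists
periods = ["Today", "Tonight"]
short_descs = ["Sunny", "Patchy Fog"]
temps = ["High: 69 °F", "Low: 54 °F"]

# Every column needs the same number of entries
assert len(periods) == len(short_descs) == len(temps)

weather = pd.DataFrame(
    {"period": periods, "short_desc": short_descs, "temp": temps},
    columns=["period", "temp", "short_desc"],   # explicit column order
)
print(weather)
```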

Write to Excel¶

In [ ]:

# The context manager saves and closes the file on exit
# (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('file_name.xlsx', engine='xlsxwriter') as writer:
    weather.to_excel(writer)
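If no Excel engine is installed, CSV is a dependency-free alternative. This sketch writes to an in-memory buffer so nothing touches the disk; the two-row frame is a hypothetical sample.

```python
import io

import pandas as pd

# Hypothetical two-row sample of the weather table
weather = pd.DataFrame({
    "period": ["Today", "Tonight"],
    "temp": ["High: 69 °F", "Low: 54 °F"],
})

buffer = io.StringIO()
weather.to_csv(buffer, index=False)   # index=False drops the 0, 1, ... row labels
csv_text = buffer.getvalue()

print(csv_text)
```

Passing a file path instead of the buffer writes the same text to disk.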

Analysis¶

For example, we can use regex and the Series.str.extract method to pull out the numeric temperature values

In [24]:

# Raw string so that \d is not treated as a string escape
temp_nums = weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
temp_nums

Out[24]:

0    69
1    54
2    65
3    52
4    63
5    52
6    60
7    52
8    61
Name: temp_num, dtype: object

In [25]:

weather["temp_num"].mean()

Out[25]:

58.666666666666664
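A natural next step is to separate day and night readings. This sketch assumes that a "Low" label marks a night period, and uses a four-row sample taken from the temperatures above.

```python
import pandas as pd

# Four-row sample of the temp column
weather = pd.DataFrame({
    "temp": ["High: 69 °F", "Low: 54 °F", "High: 65 °F", "Low: 52 °F"],
})
weather["temp_num"] = (
    weather["temp"].str.extract(r"(?P<temp_num>\d+)", expand=False).astype("int")
)

# Boolean mask: True where the reading is a low (assumed to be a night period)
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night

night_mean = weather.loc[is_night, "temp_num"].mean()
day_mean = weather.loc[~is_night, "temp_num"].mean()

print(night_mean, day_mean)
```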