Scraping IMDb and Wiki
Posted on Jun 22, 2018 in Notes • 12 min read
Web Scraping - IMDb, Wiki¶
This is written to collect data for my friend's translation project, in which she attempts to analyse the differences between the official movie title translations in China, Hong Kong and Taiwan.
Learnings:¶
- find() and find_all() only work on bs4.BeautifulSoup or bs4.element.Tag objects.
- A bs4.element.ResultSet can simply be treated as a list of bs4.element.Tag objects.
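A minimal illustration of those types, using a throwaway HTML snippet:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>one</li><li>two</li></ul>', 'html.parser')
items = soup.find_all('li')      # a bs4.element.ResultSet
print(type(items[0]))            # indexes like a list -> <class 'bs4.element.Tag'>
for tag in items:                # iterates like a list
    print(tag.get_text())        # one / two
# find() is defined on BeautifulSoup and Tag objects, not on ResultSet:
# items.find('li') would raise an AttributeError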
Challenges:¶
- Directing to the correct Wiki page for each movie.
- The cells containing the title translations are elements in a table with no distinguishing features.
Sol 1¶
Using try/except, first try the URL with the _(film) suffix. For titles with multiple Wikipedia entries, this directs us to the correct movie page. If there is only one entry (the film itself), the _(film) URL leads to a missing-page response, which makes the subsequent bs4 scraping code raise an exception. Handle this exception by falling back to the plain URL with just the title appended.
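The same fallback can also be written as a loop over candidate URLs. This is just a sketch of the idea; resolve_zh_link is a hypothetical helper, equivalent in logic to getLinks below:

import requests
from bs4 import BeautifulSoup

def resolve_zh_link(title):
    # Try the disambiguated '_(film)' URL first, then the bare title;
    # return the zh interlanguage link, or None if neither page has one.
    base = 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')
    for url in (base + '_(film)', base):
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        li = soup.find('li', class_='interlanguage-link interwiki-zh')
        if li is not None:
            return li.find('a')['href']
    return None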
Sol 2¶
After a few observations, the translation cells always seemed to sit in the last three slots of the table, so simple backward indexing did the trick.
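A toy example of that slice (the cell contents here are made up; real infoboxes differ):

from bs4 import BeautifulSoup

# Toy infobox: the three translated titles sit in the last three <td> cells
html = '''<table class="infobox vevent">
<tr><td>Director</td></tr><tr><td>Release date</td></tr>
<tr><td>CN title</td></tr><tr><td>HK title</td></tr><tr><td>TW title</td></tr>
</table>'''
table = BeautifulSoup(html, 'html.parser').find('table')
cn, hk, tw = [td.get_text() for td in table.find_all('td')[-3:]]
print(cn, hk, tw)   # CN title HK title TW title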
Limitations:¶
- Failed to consider movies with multiple versions, in which case the correct URL is appended with _(year_film).
- Following the first bug: since we retrieve text by simple positional indexing rather than by matching anything in bs4, irrelevant text (whatever the last three table cells happen to contain) gets appended to the results instead of the 'null' marker applied for other exceptions. This created multiple entries with erroneous translations that couldn't be removed along with the rows correctly marked 'null'.
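One way to close the first gap would be to extend the fallback chain with a year-qualified URL. A sketch, where the default year is an assumption taken from the IMDb query used below:

def candidate_urls(title, year=2016):
    # Hypothetical helper: try the most specific page first, then fall back
    base = 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')
    return [base + '_(' + str(year) + '_film)', base + '_(film)', base]

The second limitation could likewise be reduced by matching the infobox row labels (e.g. 中国大陆 / 香港 / 臺灣 on zh-wiki) instead of relying on position, at the cost of handling label variants.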
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Scrape IMDb for the list of movie titles on one search result page
def getTitles(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    header = soup.find_all('h3', class_='lister-item-header')  # one <h3> per movie
    titles = []
    for item in header:
        titles.append(item.find('a').get_text())
    return titles
# Find the ZH wiki page link for each movie title
def getLinks(titles):
    links = []
    for idx, title in enumerate(titles):
        title = title.replace(' ', '_')
        print('Now processing title number ' + str(idx))
        try:
            try:
                # First try the disambiguated '_(film)' URL
                path = 'https://en.wikipedia.org/wiki/' + title + '_(film)'
                page = requests.get(path)
                soup = BeautifulSoup(page.content, 'html.parser')
                li = soup.find('li', class_='interlanguage-link interwiki-zh')
                link = li.find('a')['href']
                links.append(link)
            except Exception:
                # Fall back to the plain title URL
                path = 'https://en.wikipedia.org/wiki/' + title
                page = requests.get(path)
                soup = BeautifulSoup(page.content, 'html.parser')
                li = soup.find('li', class_='interlanguage-link interwiki-zh')
                link = li.find('a')['href']
                links.append(link)
        except Exception:
            links.append('404')
            print('Not Found.')
    return links
# Get the three translated titles (China, Hong Kong and Taiwan)
def getTranslation(links):
    titlesCH = []
    titlesHK = []
    titlesTW = []
    for link in links:
        try:
            page = requests.get(link)
            soup = BeautifulSoup(page.content, 'html.parser')
            table = soup.find('table', class_='infobox vevent')
            td = table.find_all('td')[-3:]  # translations sit in the last three cells
            titlesCH.append(td[0].get_text())
            titlesHK.append(td[1].get_text())
            titlesTW.append(td[2].get_text())
        except Exception:
            # '404' links and pages without the infobox land here
            titlesCH.append('null')
            titlesHK.append('null')
            titlesTW.append('null')
    return titlesCH, titlesHK, titlesTW
imdbPath = 'https://www.imdb.com/search/title?year=2016&title_type=feature&sort=num_votes,desc&page=4&ref_=adv_nxt'
titles = getTitles(imdbPath)
links = getLinks(titles)
titlesCH, titlesHK, titlesTW = getTranslation(links)
results = pd.DataFrame({
    'TitlesEN': titles,
    'TitlesCH': titlesCH,
    'TitlesHK': titlesHK,
    'TitlesTW': titlesTW,
    'wikiZH': links
})
# Append the results as a new sheet to an existing Excel file
# (mode='a' requires 'final.xlsx' to exist already)
with pd.ExcelWriter('final.xlsx', engine='openpyxl', mode='a') as writer:
    results.to_excel(writer, sheet_name='2016_p4')