Scraping IMDb and Wiki

Posted on Jun 22, 2018 in Notes • 12 min read


This was written to collect data for my friend's translation project, in which she analyses how the official movie title translations differ between mainland China, Hong Kong and Taiwan.

Learnings:

  • find() and find_all() only work on bs4.BeautifulSoup or bs4.element.Tag objects
  • a bs4.element.ResultSet can simply be treated as a list of bs4.element.Tag objects (see the sketch below)
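
A quick sketch of both points, using made-up HTML:

from bs4 import BeautifulSoup

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns a bs4.element.ResultSet...
items = soup.find_all('li')

# ...which iterates like a plain list of Tags,
# and find() works on each individual Tag
for tag in items:
    print(tag.find('a').get_text())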

Challenges:

  1. Directing to the correct Wikipedia page for each movie.
  2. The table cells containing the title translations have no distinguishing features (no ids or classes to select on).

Sol 1

Using try/except, first try the URL with the _(film) suffix. For titles with multiple Wikipedia entries, this suffix directs us to the correct movie page. If there is only one entry (the film itself), the suffixed URL leads to an error page, which makes the subsequent bs4 scraping code raise an exception. Handle that exception by falling back to the plain URL with just the title appended.
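
Concretely, the "error" that signals a missing page is an AttributeError: find() returns None when the Chinese interlanguage link is absent, and calling find() on None fails. A minimal sketch of the fallback for a single hypothetical title:

import requests
from bs4 import BeautifulSoup

def zh_href(path):
    soup = BeautifulSoup(requests.get(path).content, 'html.parser')
    li = soup.find('li', class_='interlanguage-link interwiki-zh')
    return li.find('a')['href']   # AttributeError when li is None

title = 'Arrival'   # hypothetical example that shares its name with other pages
try:
    link = zh_href('https://en.wikipedia.org/wiki/' + title + '_(film)')
except AttributeError:
    link = zh_href('https://en.wikipedia.org/wiki/' + title)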

Sol 2

After inspecting a few pages, the translation cells always seemed to occupy the very last three slots of the infobox table, so simple backward indexing did the trick.
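
In isolation, the trick is just a negative slice over the cells returned by find_all(). A sketch with a made-up infobox:

from bs4 import BeautifulSoup

html = '''<table>
<tr><td>Director</td></tr>
<tr><td>Release date</td></tr>
<tr><td>China title</td></tr>
<tr><td>HK title</td></tr>
<tr><td>TW title</td></tr>
</table>'''
soup = BeautifulSoup(html, 'html.parser')

# The last three cells are assumed to hold the CN / HK / TW titles
last_three = soup.find_all('td')[-3:]
print([td.get_text() for td in last_three])
# ['China title', 'HK title', 'TW title']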

Limitations:

  1. Failed to consider movies with multiple versions, in which case the correct URL is appended with _(year_film) instead (a possible fix is sketched after this list).
  2. Following from the first bug: since plain indexing rather than a targeted bs4 lookup retrieves the text, irrelevant text (whatever the last three table cells happen to contain) is appended to the results in lieu of the 'null' marker applied for other exceptions. This created multiple entries with erroneous translations that couldn't be removed along with the rows correctly marked 'null'.
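
For the first limitation, a possible fix (not implemented here) would be to put a year-qualified URL at the front of the fallback chain; the year is already known from the IMDb search query. A sketch, with candidatePaths() as a hypothetical helper:

def candidatePaths(title, year):
    # Most specific disambiguation first, then the usual fallbacks
    base = 'https://en.wikipedia.org/wiki/' + title.replace(' ', '_')
    return [base + '_(' + str(year) + '_film)',
            base + '_(film)',
            base]

# e.g. candidatePaths('Ben-Hur', 2016) puts the '_(2016_film)' URL first

Each candidate could then go through the same zh-link lookup as in getLinks(), appending '404' only after all of them fail.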
In [ ]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
In [ ]:
# Scrape IMDb for a list of movie titles
def getTitles(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    header = soup.find_all('h3', class_='lister-item-header')

    titles = []
    for item in header:
        titles.append(item.find('a').get_text())
    return titles
In [ ]:
# Find the ZH wiki page link for each movie title
def getLinks(titles):

    def zhLink(path):
        # find() returns None when the Chinese interlanguage link is
        # missing, so the chained find() raises AttributeError
        page = requests.get(path)
        soup = BeautifulSoup(page.content, 'html.parser')
        li = soup.find('li', class_='interlanguage-link interwiki-zh')
        return li.find('a')['href']

    links = []
    for idx, title in enumerate(titles):
        title = title.replace(' ', '_')
        print('Now processing title number ' + str(idx))
        try:
            # Try the disambiguated '_(film)' URL first (Sol 1)
            links.append(zhLink('https://en.wikipedia.org/wiki/' + title + '_(film)'))
        except Exception:
            try:
                # Fall back to the plain title URL
                links.append(zhLink('https://en.wikipedia.org/wiki/' + title))
            except Exception:
                links.append('404')
                print('Not Found.')
    return links
In [ ]:
# Get the 3 translated titles for China, Hong Kong and Taiwan:
def getTranslation(links):
    titlesCH = []
    titlesHK = []
    titlesTW = []
    for link in links:
        try:
            page = requests.get(link)
            soup = BeautifulSoup(page.content, 'html.parser')
            table = soup.find('table', class_='infobox vevent')
            # Assume the last three infobox cells hold the CN/HK/TW titles (Sol 2)
            td = table.find_all('td')[-3:]
            titlesCH.append(td[0].get_text())
            titlesHK.append(td[1].get_text())
            titlesTW.append(td[2].get_text())
        except Exception:
            # Covers the '404' placeholders, missing infoboxes and short tables
            titlesCH.append('null')
            titlesHK.append('null')
            titlesTW.append('null')
    return titlesCH, titlesHK, titlesTW
In [ ]:
imdbPath = 'https://www.imdb.com/search/title?year=2016&title_type=feature&sort=num_votes,desc&page=4&ref_=adv_nxt'
titles = getTitles(imdbPath)
In [ ]:
links = getLinks(titles)
In [ ]:
titlesCH, titlesHK, titlesTW = getTranslation(links)
In [ ]:
results = pd.DataFrame({
    'TitlesEN': titles,
    'TitlesCH': titlesCH,
    'TitlesHK': titlesHK,
    'TitlesTW': titlesTW,
    'wikiZH': links
})

# Append a new sheet to the existing Excel file
# (mode='a' needs the openpyxl engine and an existing final.xlsx)
with pd.ExcelWriter('final.xlsx', engine='openpyxl', mode='a') as writer:
    results.to_excel(writer, sheet_name='2016_p4')