Financial Data Science (FIN42110)
Dr. Richard McGee
Web Scraping
Introduction
Definitions
   Web Scraping
   Using tools to gather data you can see on a webpage.
   A wide range of web scraping
   techniques and tools exists. They
   range from simple copy/paste up
   through automation tools, HTML
   parsing, APIs and programming.
Definitions
   HTTP
   HyperText Transfer Protocol
     • HTTP is the foundation of data
       communication for the World Wide
       Web, where hypertext documents
       include hyperlinks to other resources
       that the user can easily access, for
       example by a mouse click or by
       tapping the screen in a web browser.
   The protocol defines aspects of authentication, requests, status
   codes, persistent connections, client/server request/response,
   etc.
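A minimal sketch of one request/response cycle, using only the standard library so it runs offline. The handler class, the page content and the helper name fetch_once are assumptions made for illustration; they are not part of the course material.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><p>Hello, HTTP</p></body></html>"
        self.send_response(200)                        # status code
        self.send_header("Content-Type", "text/html")  # response headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                         # response body

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_once():
    """Start a local server, send one GET, return (status, content type, body)."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0 = any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/"
        # Empty ProxyHandler: talk to the local server directly
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.status, response.headers["Content-Type"], response.read()
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = fetch_once()
print(status, content_type, body.decode())
```

The client sends a request, the server answers with a status code, headers and a body; a scraper only ever plays the client role.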
Definitions
   HTML
   HyperText Markup Language
     • HyperText Markup Language
       (HTML) is the set of markup
       symbols or codes inserted into a file
       intended for display on the Internet.
       The markup tells web browsers how
       to display a web page’s words and
       images.
   Each individual piece of markup code (which falls between
   "<" and ">" characters) is referred to as a tag. A start tag, its
   content, and the matching end tag together form an element,
   though many people use the two terms interchangeably.
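To make the idea of tags concrete, here is a small sketch that walks a one-line HTML snippet and records each start tag, end tag and piece of text. It uses the standard library's html.parser; the class name TagLogger and the sample sentence are made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags, end tags and text in the order they appear."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():                       # skip whitespace-only text
            self.events.append(("text", data.strip()))


parser = TagLogger()
parser.feed("<p>Prices are <b>up</b> today.</p>")
parser.close()
for event in parser.events:
    print(event)
```

Each start/end pair brackets the content it applies to, which is what lets a scraper pick out one part of a page.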
Definitions
   XML
   Extensible Markup Language
     • Extensible Markup Language (XML)
       is a markup language and file format
       for storing, transmitting, and
       reconstructing arbitrary data.
   It defines a set of rules for encoding documents in a format that
   is both human-readable and machine-readable.
   XML is about encoding data; HTML is about display.
XML Example
    • We will check out the Books.xml example file on
      Brightspace.
    • With a new data set it can be useful to use an XML viewer
      to view the hierarchy:
    • https://www.xmlgrid.net/
XML parsing
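  from lxml.etree import fromstring

  # Read as bytes so lxml can honour any encoding declaration in the file
  with open('Books.xml', 'rb') as file:
      xml = file.read()
  root = fromstring(xml)

  # Print the title of every <book> under the <catalog> root
  for book in root.xpath("/catalog/book"):
      print(book.xpath("title")[0].text)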
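A self-contained variant that can be run without the course file. It is only a sketch: the two-book catalog below is a made-up stand-in for Books.xml, and it swaps lxml for the standard library's xml.etree.ElementTree so nothing needs installing.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-book catalog standing in for Books.xml
SAMPLE = """<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Author One</author>
    <title>First Title</title>
    <price>44.95</price>
  </book>
  <book id="bk102">
    <author>Author Two</author>
    <title>Second Title</title>
    <price>5.95</price>
  </book>
</catalog>"""

root = ET.fromstring(SAMPLE)           # root is the <catalog> element

books = [
    {
        "id": book.get("id"),                     # attribute on <book>
        "title": book.findtext("title"),          # text of child <title>
        "price": float(book.findtext("price")),   # convert text to a number
    }
    for book in root.findall("book")              # direct <book> children
]
for record in books:
    print(record)
```

Turning each element into a dictionary is a convenient intermediate step before loading the data into a table or DataFrame.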
Definitions
   JSON
   JavaScript Object Notation
     • JSON is a lightweight
       data-interchange format. It is a
       text-based, human-readable format
       for serializing simple data
       structures and associative arrays
       (called objects), and serves as an
       alternative to XML.
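A short sketch of the round trip between a Python dictionary and JSON text using the standard json module. The record and its field names are invented for illustration.

```python
import json

# Hypothetical record; the field names are illustrative only
trade = {"ticker": "NFLX", "prices": [501.2, 498.7], "adjusted": True, "note": None}

text = json.dumps(trade, indent=2)   # Python dict -> JSON text
restored = json.loads(text)          # JSON text -> Python dict

print(text)
print(restored == trade)
```

Note how Python's True and None become JSON's true and null in the text form and are converted back on loading.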
Definitions
   API
   Application Programming Interface
     • An application programming
       interface (API) is a connection
       between computers or between
       computer programs. It is a type of
       software interface, offering a service
       to other pieces of software.
Definitions
   SOAP
   Simple Object Access Protocol
     • SOAP is an XML-based messaging
       protocol commonly used to
       implement web service APIs.
Definitions
     • Parsing
         • The act of analyzing the strings and symbols to reveal
           only the data you need.
     • Crawling
         • Moving across or through a website in an attempt to
           gather data from more than one URL or page.
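The two definitions above can be sketched together: a parser that keeps only link targets, and a crawler that follows them breadth-first. To stay runnable offline, the "website" is a made-up dictionary of three pages (SITE); a real crawler would download each page instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory so the sketch runs offline
SITE = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a> <a href="/">Home</a>',
    "https://example.com/b": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Parsing step: keep only the href of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url):
    """Crawling step: breadth-first walk, visiting each URL once."""
    seen, queue, order = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(SITE.get(url, ""))        # a real crawler would download here
        for href in parser.links:
            absolute = urljoin(url, href)     # resolve relative links
            if absolute in SITE and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


print(crawl("https://example.com/"))
```

The seen set is what stops the crawler from looping forever when pages link back to each other.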
HTML Structure: div
  <html>
  <head>
  <style>
  .myDiv {
    border: 5px outset red;
    background-color: lightblue;
    text-align: center;
  }
  </style>
  </head>
  <body>
  <div class="myDiv">
    <h2>This is a heading in a div element</h2>
    <p>This is some text in a div element.</p>
  </div>
  </body>
  </html>
     • <div> defines a division or section and is used as a
       container for other HTML elements.
     • https://www.w3schools.com/Tags/tag_div.asp
HTML Structure: table/tr/td
   <table>
     <tr>
       <td>Cell A</td>
       <td>Cell B</td>
     </tr>
     <tr>
       <td>Cell C</td>
       <td>Cell D</td>
     </tr>
   </table>
      • An HTML table consists of one <table> element and one or
        more <tr> (row), <th> (header cell), and <td> (data cell)
        elements.
      • https://www.w3schools.com/Tags/tag_table.asp
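Because the structure is so regular, a table maps naturally onto a list of rows. The sketch below extracts the four cells from the table above using the standard library's html.parser; the class name TableParser is an assumption for illustration.

```python
from html.parser import HTMLParser

HTML = """
<table>
  <tr><td>Cell A</td><td>Cell B</td></tr>
  <tr><td>Cell C</td><td>Cell D</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # text may arrive in several chunks


parser = TableParser()
parser.feed(HTML)
parser.close()
rows = [[cell.strip() for cell in row] for row in parser.rows]
print(rows)
```

The same row-by-row pattern applies when a parsing library such as Beautiful Soup is used instead.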
Robots.txt
     • Instructs web robots (typically search engine robots) how
       to crawl pages on the website.
     • Example https://www.buzzfeed.com/robots.txt
     • Accessing at too high a frequency will get you blocked!
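A sketch of checking robots.txt rules before crawling, using the standard library's urllib.robotparser. The rules, the user-agent name "my-scraper" and the example.com URLs are hypothetical; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())   # parse text directly, no network needed

allowed = rules.can_fetch("my-scraper", "https://example.com/news/today")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/data")
delay = rules.crawl_delay("my-scraper")   # seconds to wait between requests

print(allowed, blocked, delay)
```

For a live site, set_url() followed by read() would load the real file; sleeping for the crawl delay between requests helps avoid being blocked.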
Useful Python Packages
    • pip install beautifulsoup4
    • pip install requests
    • pip install html5lib
    • pip install yfinance
    • pip install mplfinance
    • pip install twython
    • pip install selenium
         • install the Chrome browser
         • and a ChromeDriver version matching the browser version
Crypto Punks
Example Project: Crypto Punk Pricing
  https://www.larvalabs.com/cryptopunks
  Step 1: Specify what you are looking for. In this case:
     • a database of 10,000 crypto punks
     • their key features
     • their trade history and prices
     • looking to explain prices with features.
Example Project: Crypto Punk Pricing
  Examine the web page source:
Example Project: Crypto Punk Pricing
  Step 2: Design your database structure
     • for this project I will use a simple SQLite DB
     • https://sqlitebrowser.org/dl/
     • I will create two tables - a punk attribute table and a trade
       table.
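One possible shape for the two tables, sketched with Python's built-in sqlite3 module. The table names, column names and sample rows are all assumptions made for illustration; the actual schema used in the lecture is not shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # use a file path instead to persist the data
conn.executescript("""
CREATE TABLE punk (
    punk_id    INTEGER PRIMARY KEY,   -- punk number from the page URL
    punk_type  TEXT,                  -- hypothetical attribute column
    attributes TEXT                   -- hypothetical: comma-separated features
);
CREATE TABLE trade (
    trade_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    punk_id    INTEGER NOT NULL REFERENCES punk(punk_id),
    trade_date TEXT,                  -- ISO date string
    amount     TEXT                   -- price as shown on the page
);
""")

# Made-up rows purely to show the insert/query pattern
conn.execute("INSERT INTO punk VALUES (?, ?, ?)", (1, "ExampleType", "FeatureA,FeatureB"))
conn.executemany(
    "INSERT INTO trade (punk_id, trade_date, amount) VALUES (?, ?, ?)",
    [(1, "2021-01-01", "10"), (1, "2021-02-01", "12")],
)

rows = conn.execute(
    "SELECT p.punk_id, p.punk_type, COUNT(t.trade_id) "
    "FROM punk p LEFT JOIN trade t ON t.punk_id = p.punk_id "
    "GROUP BY p.punk_id"
).fetchall()
print(rows)
conn.close()
```

Keeping trades in their own table, linked by punk_id, lets one punk have any number of trades without repeating its attributes.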
Example Project: Crypto Punk Pricing
  Step 3: Examine the web site structure (view page source in
  browser)
     • The CryptoPunks site is nicely structured with one page per punk
       numbered 0-9,999
     • e.g. punk 1 is at
       https://www.larvalabs.com/cryptopunks/details/1
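Since every punk has its own numbered page, the list of pages to visit can be generated rather than discovered. A minimal sketch follows; the function name punk_urls and the optional delay parameter are assumptions for illustration.

```python
import time
from typing import Iterator

BASE_URL = "https://www.larvalabs.com/cryptopunks/details/"


def punk_urls(first: int, last: int, delay_seconds: float = 0.0) -> Iterator[str]:
    """Yield one detail-page URL per punk number, pausing between them."""
    for punk_no in range(first, last + 1):
        yield f"{BASE_URL}{punk_no}"
        if punk_no != last and delay_seconds > 0:
            time.sleep(delay_seconds)   # be polite: space out the requests


urls = list(punk_urls(1, 3))
print(urls)
```

When looping over thousands of pages, a non-zero delay_seconds keeps the request rate low enough to reduce the chance of being blocked.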
Example Project: Crypto Punk Pricing
     • Example: Print trade dates and amounts for one punk.
  import requests
  from bs4 import BeautifulSoup

  # Crypto Punk
  #~~~~~~~~~~~~
  BaseStr = "https://www.larvalabs.com/cryptopunks/details/"
  PunkNo = '1'
  page = requests.get(BaseStr + PunkNo, timeout=30)
  page.raise_for_status()  # stop early on an HTTP error
  soup = BeautifulSoup(page.content, 'html.parser')
  # Look for the table with CSS class "table"
  table = soup.find('table', attrs={'class': 'table'})
  rows = table.find_all('tr') if table is not None else []
  for row in rows:
      cols = row.find_all('td')  # rows without <td> cells are skipped
      if len(cols) >= 5:
          cols = [ele.text.strip() for ele in cols]
          print(cols[4] + ' : ' + cols[3])  # trade date : amount
Yahoo Finance API
Web Scraping from Yahoo Finance
  import yfinance as yf
  import mplfinance as mpf
  import numpy as np
  ticker_name = 'NFLX'
  yticker = yf.Ticker(ticker_name)
  nflx = yticker.history(period="1y") # max, 1y, 3mo
  # ....
  # Compute log returns
  nflx['Return'] = np.log(nflx['Close']/nflx['Close'].shift(1))
  https://pypi.org/project/yfinance/
  https://pypi.org/project/mplfinance/
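The log return used above is ln(P_t / P_{t-1}). A small worked sketch with made-up closing prices shows the calculation in plain Python, along with the property that log returns add up across periods.

```python
import math

# Made-up closing prices, purely for illustration
closes = [100.0, 102.0, 99.0, 105.0]

# Log return for day t: ln(P_t / P_{t-1}); the first day has no previous price
log_returns = [math.log(today / yesterday)
               for yesterday, today in zip(closes, closes[1:])]

total = sum(log_returns)
print(log_returns)
print(math.isclose(total, math.log(closes[-1] / closes[0])))
```

The sum of the daily log returns equals the log return over the whole period, which is one reason log returns are convenient for multi-period analysis.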
MPL output example
More Scraping
Web Scraping Example: House of Representatives
  https://www.house.gov/representatives
Web Scraping Example
  def main():
      from bs4 import BeautifulSoup
      import requests
      url = "https://www.house.gov/representatives"
      text = requests.get(url).text
      soup = BeautifulSoup(text, "html5lib")
      all_urls = [a['href']
                  for a in soup('a')
                  if a.has_attr('href')]
      print(len(all_urls))
  Example from Data Science from Scratch, Joel Grus
Web Scraping Example
      import re
      # Must start with http:// or https://
      # Must end with .house.gov or .house.gov/
      regex = r"^https?://.*\.house\.gov/?$"
      # Let's write some tests!
      assert re.match(regex, "http://joel.house.gov")
      # And now apply
      good_urls = [url for url in all_urls if re.match(regex, url)]
      print(len(good_urls))
      good_urls = list(set(good_urls))
  Example from Data Science from Scratch, Joel Grus.
  For regex see, e.g.: https://www.w3schools.com/python/python_regex.asp
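A quick way to build confidence in a pattern is to run it against a few hand-picked strings. The sketch below applies the same regex to some candidate URLs; apart from the joel.house.gov address used in the assert above, the candidates are invented for illustration.

```python
import re

regex = r"^https?://.*\.house\.gov/?$"

candidates = [
    "http://joel.house.gov",             # plain http, no trailing slash
    "https://joel.house.gov/",           # https with trailing slash
    "https://joel.house.gov/biography",  # has a path, so it should not match
    "joel.house.gov",                    # missing the scheme
    "https://www.house.gov.evil.com/",   # hypothetical look-alike domain
]

matches = {url: bool(re.match(regex, url)) for url in candidates}
for url, ok in matches.items():
    print(f"{ok!s:5} {url}")
```

Only the first two candidates match: the pattern is anchored at both ends, so anything after .house.gov other than a single slash is rejected.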
Web Scraping Example
  from bs4 import BeautifulSoup
  import requests
  def paragraph_mentions(text: str, keyword: str) -> bool:
      """
      Returns True if a <p> inside the text mentions {keyword}
      """
      soup = BeautifulSoup(text, 'html5lib')
      paragraphs = [p.get_text() for p in soup('p')]
      return any(keyword.lower() in paragraph.lower()
                 for paragraph in paragraphs)
  Example from Data Science from Scratch, Joel Grus
Web Scraping Example
      import random
      from typing import Dict, Set
      from urllib.parse import urljoin

      # Sample a handful of sites to keep the number of requests small
      good_urls = random.sample(good_urls, 5)
      print(f"after sampling, left with {good_urls}")

      press_releases: Dict[str, Set[str]] = {}
      for house_url in good_urls:
          html = requests.get(house_url).text
          soup = BeautifulSoup(html, 'html5lib')
          # Collect the targets of links whose text mentions press releases
          pr_links = {a['href'] for a in soup('a')
                      if a.has_attr('href') and 'press releases' in a.text.lower()}
          print(f"{house_url}: {pr_links}")
          press_releases[house_url] = pr_links

      for house_url, pr_links in press_releases.items():
          for pr_link in pr_links:
              # urljoin copes with absolute, root-relative and relative links
              url = urljoin(house_url, pr_link)
              text = requests.get(url).text
              if paragraph_mentions(text, 'data'):
                  print(f"{house_url}")
                  break # done with this house_url
  Example from Data Science from Scratch, Joel Grus
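Links collected from a page may be absolute, root-relative or relative to the current page, so combining them with a base URL needs some care. The sketch below shows how the standard library's urllib.parse.urljoin resolves each kind; the base URL and paths are hypothetical.

```python
from urllib.parse import urljoin

base = "https://joel.house.gov/"   # hypothetical site root

examples = {
    # relative path
    "media/press-releases": urljoin(base, "media/press-releases"),
    # root-relative path
    "/media/press-releases": urljoin(base, "/media/press-releases"),
    # absolute URL: the base is ignored
    "https://other.house.gov/news": urljoin(base, "https://other.house.gov/news"),
}
for href, resolved in examples.items():
    print(f"{href!r:35} -> {resolved}")
```

The first two forms resolve to the same address, while an absolute link is returned unchanged.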