practical perl

Web Automation

by Adam Turoff

Adam is a consultant who specializes in using Perl to manage big data. He is a long-time Perl Monger, a technical editor for The Perl Review, and a frequent presenter at Perl conferences.

ziggy@panix.com

Introduction

Web service protocols like XML-RPC and SOAP are great for automating common tasks on the Web. But these protocols aren’t always available. Sometimes interacting with HTML-based interfaces is still necessary. Thankfully, Perl has the tools to help you get your job done.

In my last column, I introduced Web services using XML-RPC. Web services are commonly used as a high-level RPC (remote procedure call) mechanism to allow two programs to share data. They enable programs to exchange information with each other by sending XML documents over HTTP.

There are many advantages to using Web service tools like XML-RPC and its cousin, SOAP. First, all of the low-level details of writing client and server programs are handled by reusable libraries. No longer is it necessary to master the arcana of socket programming and protocol design to implement or use a new service or daemon. Because information is exchanged as text, Web services are programming-language agnostic. You could write a service in Perl to deliver weather information, and access it with clients written in Python, Tcl, Java, or C#. Or vice versa.

Yet for all of the benefits Web services bring, they are hardly a panacea. Protocols like SOAP and XML-RPC focus on how programs interact with each other, not on how people interact with programs. For example, I cannot scribble down the address of an XML-RPC service on a napkin and expect someone to use that service easily. Nor can I send a link to an XML-RPC service in the body of an email message.

Generally speaking, in order to use a Web service, I need to write some code and have an understanding of how to use that particular service. This is why, after about five years, Web services are still a niche technology. They work great if you want to offer programmatic access to a service like, say, eBay, Google, or Amazon.com. But if you want to publish information or offer a service to the widest possible audience, you still need to build a Web site.

The HTML Problem

Before Web services, automatic processing of data from the Web usually involved fetching HTML documents and scanning them to find new or interesting bits of data. Web services offer a more robust alternative, but do not eliminate the need to sift through HTML documents and “screen scrape” data off a Web page.

Processing HTML is the worst possible solution, but it is often the only solution available. HTML is a difficult format to parse. Many documents contain invalid or otherwise broken formatting. Using regular expressions to extract information from HTML documents is a common coping strategy, but it is quite error-prone and notoriously brittle.

Nevertheless, HTML is the universal format for data on the Web. Programmers who are building systems may consider alternatives like XML-RPC or SOAP Web services. But publishers and service providers are still focused on HTML, because it is the one format that everyone with a Web browser can always use.

Automating the Web

Since the early days of the Web, people have used programs that automatically scan, monitor, mirror, and fetch information from the Web. These programs are generally called robots or spiders. Today, other kinds of programs traverse the Web, too. Spammers use email harvesters to scour Web pages for email addresses they can spam. In the Semantic Web community, “scutters” follow links to metadata files to build up databases of information about who’s who and what’s what on the Web.

There are many other mundane uses for Web automation programs. Link checkers rigorously fetch all the resources on a Web site to find and report broken links. With software development moving to the Web, testers use scripts to simulate a user session to make sure Web applications behave properly.

Fortunately, there are a great many Perl modules on CPAN to help with all of these tasks.

Most Web automation programs in Perl start with libwww-perl, more commonly known as LWP. This library of modules is Gisle Aas’s Swiss Army knife for interacting with the Web. The easiest way to get started with LWP is with the LWP::Simple module, which provides a simple interface to fetch Web resources:
 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 ## Grab a Web page, and throw the content in a Perl variable.
 my $content = get("http://www.usenix.org/publications/login/");
 ## Grab a Web page, and write the content to disk.
 getstore("http://www.usenix.org/publications/login/", "login.html");
 ## Grab a Web page, and write the content to disk if it has changed.
 mirror("http://www.usenix.org/publications/login/”, "login.html");
LWP has other interfaces that enable you to customize exactly how your program will interact with the Web sites it visits. For more
details about LWP’s capabilities, check out the documentation that comes with the module, including the lwpcook and lwptut man
pages. Sean Burke’s book Perl & LWP also provides an introduction to and overview of LWP.
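For instance, LWP::UserAgent gives you full request and response objects, so your program can identify itself, set a timeout, and check exactly why a request failed. The following is only a minimal sketch; the agent string and the 30-second timeout are arbitrary choices, not anything LWP requires:

 #!/usr/bin/perl -w
 use strict;
 use LWP::UserAgent;

 ## Identify this robot and give up on servers that are too slow.
 ## (The agent string and timeout value are arbitrary choices.)
 my $ua = LWP::UserAgent->new;
 $ua->agent("login-example/0.1");
 $ua->timeout(30);

 my $response = $ua->get("http://www.usenix.org/publications/login/");
 if ($response->is_success) {
     print length($response->content), " bytes fetched\n";
 } else {
     die "Request failed: " . $response->status_line . "\n";
 }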
Screen Scraping
Retrieving Web resources is the easy part of automating Web access. Once HTML files have been fetched, they need to be examined.
Simple Web tools like link checkers only care about the URLs for the clickable links, images, and other files embedded in a Web page.
One easy way to find these pieces of data is to use the HTML::LinkExtor module to parse an HTML document and extract only these
links. HTML::LinkExtor is another one of Gisle’s modules; it can be found in his HTML::Parser distribution.
 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use HTML::LinkExtor;
 my $content = get("http://www.usenix.org/publications/login/");
 my $extractor = new HTML::LinkExtor;
 $extractor->parse($content);
 my @links = $extractor->links();
 foreach my $link (@links) {
      ## $link is a 3-element array reference containing
      ## element name, attribute name, and URL:
      ##
       ##   e.g., <a href="http://....">   ([0]="a",   [1]="href", [2]=URL)
       ##         <img src="http://....">  ([0]="img", [1]="src",  [2]=URL)
       print "$link->[2]\n";
 }
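One detail to keep in mind is that the URLs come back exactly as they appear in the HTML, so relative links stay relative. HTML::LinkExtor accepts an optional base URL as a second constructor argument and will resolve links against it. Here is a minimal variation of the script above, assuming the base is the page being fetched:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use HTML::LinkExtor;

 my $base = "http://www.usenix.org/publications/login/";
 my $content = get($base);

 ## With a base URL, LinkExtor returns absolute URI objects
 ## instead of whatever relative paths appear in the HTML.
 my $extractor = HTML::LinkExtor->new(undef, $base);
 $extractor->parse($content);

 foreach my $link ($extractor->links()) {
     print "$link->[2]\n";
 }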
Most modern Web sites have common user interface elements that appear on every page: page headers, page footers, and navigation
columns. The actual content of each page is embedded inside these repeating interface elements. Often, a screen scraper will want to
ignore the repeated elements and focus instead on the page-specific content of each HTML page it examines.
For example, the O’Reilly book catalog (http://www.oreilly.com/catalog/) has each of these three common interface elements. The
header, footer, and navigation column on this page all contain links to ads and to other parts of the O’Reilly Web site. A program
that monitors the book links on this page is only concerned with a small portion of this Web page, the actual list of book titles.
One way to focus on the meaningful content is to examine the structure of the URLs on this page, and create a regular expression
that matches only the URLs on the list of titles. But when the URLs change, your program breaks. Another way to solve this problem
is to write a regular expression that matches the HTML content of the entire book list, and throw out the extraneous parts of this
     Web page. Both of these approaches can work, but they are error-prone. Both will fail if the page design changes in a subtle or a
     significant manner.
     Of course, this is Perl, so there’s more than one way to do it. Many Web page designs are built using a series of HTML tables. A better
     way to find the relevant content on this Web page is to parse the HTML and focus on the portion of the page that contains what we
     want to examine. This approach isn’t foolproof, but it is more robust than using a regular expression to match portions of a Web
     page and fixing your program each time the Web page you are analyzing changes.
     There are a few modules on CPAN that handle parsing HTML content. While HTML::Parser can provide a good general-purpose
     solution, I prefer Simon Drabble’s HTML::TableContentParser, which focuses on extracting the HTML tables found in a Web page.
     This technique will break if the HTML layout changes drastically, but at least it is less likely to break when insignificant changes to
     the HTML structure appear.
      #!/usr/bin/perl -w
      use strict;
      use LWP::Simple;
      use HTML::TableContentParser;
      my $content = get("http://www.oreilly.com/catalog/");
      my $parser = new HTML::TableContentParser;
      my $tables = $parser->parse($content);
       ## $tables is an array reference. Select the specific table
       ## you need and process its content directly.
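As a rough illustration of what “processing it directly” might look like, the fragment below continues from the script above and walks every cell of every table. The key names are assumptions about how HTML::TableContentParser lays out its results (a rows array per table, a cells array per row, and the cell text under data), so treat it as a sketch rather than a recipe:

       ## A rough sketch of walking the structure returned by parse().
       ## It assumes each table hash has a 'rows' array, each row a
       ## 'cells' array, and each cell its text under 'data'; check the
       ## module's documentation if your version differs.
       foreach my $table (@$tables) {
           foreach my $row (@{ $table->{rows} || [] }) {
               foreach my $cell (@{ $row->{cells} || [] }) {
                   print $cell->{data}, "\n" if defined $cell->{data};
               }
           }
       }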
     Interacting with the Web
     Most Web automation techniques, like the ones described above, focus on fetching a page and processing the result. This kind of
     shallow interaction is sufficient for simple automation tasks, like link checking or mirroring. For more complicated automation,
     scripts need to be able to do all the things a person could do with a Web browser. This means entering data into forms, clicking on
     specific links in a specific order, and using the back and reload buttons.
     This is where Andy Lester’s WWW::Mechanize comes in. Mechanize provides a simple programmatic interface to script a virtual
     user navigating through a Web site or using a Web application.
     Consider a shopping cart application. A user starts by browsing or searching for products, and periodically clicks on “Add to Shopping
     Cart.” On the shopping cart page, the user can click on the “Continue shopping” button, click on the back button, browse elsewhere
     on the Web site, or search for products.
     If you were developing this application, how would you test it? Would you write down detailed instructions for the people on your
     test team to repeat by rote? Or would you write a program to simulate a user, checking each and every intermediate result along the
     way? Mechanize is the tool you need to write your simulated user scripts. That user script might look something like this:
      #!/usr/bin/perl -w
      use strict;
      use WWW::Mechanize;
      my $mech = new WWW::Mechanize;
      ## Start with the homepage.
      $mech->get("http://localhost/myshop.cgi");
      ## Browse for a book.
      $mech->follow_link( text => "Books" );
      $mech->follow_link( text_regex => qr/Computers/ );
      $mech->follow_link( text_regex => qr/Perl/ );
      ## Put "Programming Perl" in the shopping cart.
 $mech->follow_link( text_regex => qr/Programming Perl/);
 ## Add this to the shopping cart.
 $mech->click_button( name => "AddToCart");
 ## Click the "back button."
 $mech->back();
 ## Check out.
 $mech->click_button( name => "Checkout");
 ## Fill in the shipping and billing information.
 ....
Mechanize is also an excellent module for scripting common actions. Every other week, I need to use a Web-based time-tracking
application to tally up how much time I’ve worked in the current pay period. I could fire up a browser and type in the same thing I
typed in two weeks ago. Or I could use Mechanize:
 #!/usr/bin/perl -w
 use strict;
 use WWW::Mechanize;
 my $mech = new WWW::Mechanize;
 $mech->get('...');
 ## Log in.
 $mech->set_fields(
     user => "my_username",
     pass => "my_password",
 );
 $mech->submit();
 ## Put in a standard work week.
 ## Log in manually later if this needs to be adjusted.
 ## (Timesheet is the 2nd form. Skip the calendar.)
 $mech->submit_form (
     form_number => 2,
    fields => {
       0 => 7.5,
       1 => 7.5,
       ...
       9 => 7.5,
    },
    button => "Save",
 );
 ## That's it. Run this again in two weeks.
Mechanize is also a great module for writing simple Web automation. Scripts that rely on HTML layout or specific textual artifacts
in HTML documents are prone to breaking whenever a page layout changes. For example, whenever I am reading a multi-page
article on the Web, I invariably click on the “Print” link to read the article all at once.
I could use regular expressions, or modules like HTML::LinkExtor or HTML::TableContentParser, to examine the content of a Web
page to find the printable version of an article. But these techniques are both site-specific and prone to breakage. With Mechanize, I
can analyze the text of a link — the stuff that appears underlined in blue in my Web browser. Using Mechanize, I can look for the
“Print” link and just follow it:
      #!/usr/bin/perl -w
      use strict;
      use WWW::Mechanize;
      my $mech = new WWW::Mechanize;
       my $url = shift(@ARGV);
       $mech->get($url);
       if ($mech->find_link(text_regex => qr/^(?:2|Next)$/i)) {
            ## This is a multipage document.
            ## Open the "print" version instead.
            $url = $mech->find_link(text_regex => qr/Print/)->url();
       }
       ## Open the page in a browser (on Mac OS X).
       system("open '$url'");
     Conclusion
     Perl is well known for automating away the drudgery of system administration. But Perl is also very capable of automating Web-based
     interactions. Whether you are using Web service interfaces like XML-RPC and SOAP or interacting with standard HTML-based
     interfaces, Perl has the tools to help you automate frequent, repetitive tasks.
     Perl programmers have a host of tools available to help them automate the Web. Simple automation can be accomplished quickly
     and easily with LWP::Simple and a couple of regular expressions. More intensive HTML analysis can be done using modules like
     HTML::LinkExtor, HTML::Parser, HTML::TableContentParser, or WWW::Mechanize, to name a few. Whatever you need to automate
     on the Web, there’s probably a Perl module ready to help you quickly write a robust tool to solve your problem.