Date Published: 2002-08-27
One of Perl's greatest strengths is the CPAN, an archive
of programs, scripts, snippets, and modules. These are all made available to
other programmers worldwide, usually under the same terms as Perl itself. Just
about anything that can be done with Perl is on the CPAN, or will be there
shortly. (Some writers even find inspiration for columns and articles by
watching the list of recent uploads.)
Because of (or contributing to) Perl's popularity as a language for web
development and CGI programming, several CPAN modules handle everything from
HTML formatting to CGI parameter processing. The grande dame is CGI.pm.
Written by Lincoln Stein, it has the potential to make your CGI scripts
shorter, more secure, more valid, and much easier to write. Even better, the
CGI module has shipped in the core Perl distribution for several years. Any
web host worth using will have it installed.
Unfortunately, many coders are not aware of the module's existence. Others
don't see the need, as it's possible to write CGI programs in Perl without
CGI.pm. Doing so, however, is similar to reading webpages through telnet
instead of using a web browser. This may be a good learning experience, but
it's fragile and very difficult to debug.
Two widely-used ``alternatives'' exist. One is cgi-lib.pl, an ancient Perl 4
library. The other is a copied and pasted snippet of code that originated
either in a web programming book or a free script. Both date back to the
earliest days of the CGI standard. While there are good alternatives to
CGI.pm, these two do not qualify. They appear simple and effective,
especially if they're familiar, but subtle and unsubtle bugs lurk underneath.
Except in very specific cases, all new CGI programs written in Perl should use
CGI.pm. This article explains three areas in which the module is superior to
the other two common approaches.
chromatic is a Perl hacker, author, and frequent contributor to several
popular websites (including Slashdot and Perlmonks). He is the co-author of
O'Reilly's "Running Weblogs with Slash", and occasionally annoys people by
improving Perl's core test suite. He may be the only Perl 5 porter to have
written Perl while riding a camel.
Unfortunately, a website open to the world around the clock is also open to a
small but dedicated group of mischief makers -- and worse.
- Resource exhaustion
All sites and programs have finite resources. These limitations include, but
are not limited to, available bandwidth, disk space, processing time, memory,
and the allowable number of open files. Running out of any of these can render
the site unavailable to visitors, and, worse, can cause strange behavior in any
running program. Well-written programs may degrade gracefully, but it takes
knowledge and experience to handle these situations correctly. Consequently,
common attacks seek to exploit artificial or real limitations.
For example, some attackers send huge amounts of data to websites. Large
requests eat up bandwidth and disk space, wasting processor time and memory
that could be used to serve other users. Other attacks fake large uploads,
leaving programs to expect more data than will ever arrive.
If your form processing reads from STDIN without checking the CONTENT_LENGTH
environment variable, you're probably susceptible to both attacks. CGI.pm
can limit the allowed size of POSTed content (including file uploads) to any
size you like. The means to do so is as simple as assigning to a variable, and
will be demonstrated in a future article.
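As a preview, the assignment in question might look like the sketch below. It assumes the function-oriented interface; the 100 KB ceiling and the name parameter are arbitrary, illustrative choices rather than recommendations.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

# Both settings must be in place before the first call that parses the request.
$CGI::POST_MAX        = 100 * 1024;  # refuse POST bodies over 100 KB (illustrative figure)
$CGI::DISABLE_UPLOADS = 1;           # refuse file uploads entirely

my $name = param('name');            # the first param() call parses the request

# cgi_error() reports problems met while parsing, such as an oversized POST.
if (my $error = cgi_error()) {
    print header(-status => $error, -type => 'text/plain');
    print "Request refused: $error\n";
    exit 0;
}

print header('text/plain'), "Request accepted\n";
```

Programs that do need uploads would leave $CGI::DISABLE_UPLOADS alone and rely on the size limit.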
- File uploads
Does your parsing routine handle file uploads? Decoding GET parameters can
be done (though badly) in five lines, but files are uploaded by POSTing
multipart form data. Parsing this is more difficult, especially with the
diverse behavior of popular web browsers. Good luck doing this by hand.
Even if your parser works, does it handle files securely? Does it store them
in a world-accessible directory, even temporarily? Can someone upload a
program and then have the server execute it? Even if uploads are stored
outside the web directory, could a local user secretly replace them, or hijack
sensitive information from them?
Again, CGI.pm handles these situations fairly sanely.
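One way to address those questions is sketched below, assuming a form field named attachment and a private spool directory (both hypothetical). The store_upload() helper is not part of CGI.pm; it copies the filehandle returned by upload() into a file created by File::Temp.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);
use File::Temp qw(tempfile);

# Hypothetical private directory, outside the web root.
my $UPLOAD_DIR = '/var/spool/myapp/uploads';

# Copy an open filehandle into a freshly created file under $dir.
# tempfile() picks an unpredictable name and creates the file with
# owner-only permissions, so other local users cannot prepare it in advance.
sub store_upload {
    my ($in, $dir) = @_;
    my ($out, $path) = tempfile('upload-XXXXXXXX', DIR => $dir);
    binmode $in;
    binmode $out;
    my $buffer;
    while (read($in, $buffer, 8192)) {
        print {$out} $buffer;
    }
    close $out or die "Cannot close $path: $!";
    return $path;
}

# upload() returns a filehandle for the named file field, or nothing at all
# if no file was sent under that name.
if (my $fh = upload('attachment')) {
    my $path = store_upload($fh, $UPLOAD_DIR);
    print header('text/plain'), "Stored upload\n";
}
```

Nothing in the sketch ever uses the client-supplied filename as a path, which sidesteps one common class of upload bugs.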
Security is important, but it will hopefully never be tested. Data validity
will always be tested. If one in ten client browsers interprets the
specifications in a way your program did not foresee, you could produce corrupt
data or turn away valuable users.
- Multi-valued fields
If you've ever written a program which allows users to select more than one
thing, you may have wondered how to process multiple things. For example,
consider a list of employees to assign to a project:
    <input type="checkbox" name="dev1" value="sunny" />Sunny
    <input type="checkbox" name="dev2" value="kam" />Kam
    <input type="checkbox" name="dev3" value="hannah" />Hannah
    <input type="checkbox" name="dev4" value="ann" />Ann
    <input type="checkbox" name="dev5" value="amanda" />Amanda
Note that each checkbox has a unique name. To see if Sunny will work on this
project means examining the dev1 parameter. To find all employees assigned
to the project means looping through all potential devX parameters. The
annoyance grows with the number of potential items. (chromatic industries
isn't large, but it does have an attractive and highly intelligent workforce.)
The HTML and CGI specifications do allow one slight trick to make our lives
easier, though. Parameter names can be repeated. It's legal to write:
    <input type="checkbox" name="dev" value="sunny" />Sunny
    <input type="checkbox" name="dev" value="kam" />Kam
    <input type="checkbox" name="dev" value="hannah" />Hannah
    <input type="checkbox" name="dev" value="ann" />Ann
    <input type="checkbox" name="dev" value="amanda" />Amanda
Checking ``Sunny'' and ``Kam'' produces a request similar to:
    dev=sunny
    dev=kam
If the parameter parsing code expects only one value for each parameter name,
the second dev will overwrite the first. Poor Sunny will have nothing to do.
The venerable (read, ``moldy oldy'') cgi-lib.pl code, written in the days
before references, packed the values into a single null-separated string, an
artificial C-type array. Most handwritten parsers don't even do that. Behold
the magic of CGI.pm:
    my $developer = param('dev'); # gets the first one, ie 'sunny'
    my @developers = param('dev'); # gets both of them, ie ( 'sunny', 'kam' )
This beats grepping through potential parameter patterns, testing for existence
and definedness.
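To see how the overwriting happens, here is a minimal sketch of a typical handwritten parser beside one that keeps every value. It uses plain Perl with no modules; the query string is hard-coded and URL decoding is left out for brevity.

```perl
use strict;
use warnings;

my $query = 'dev=sunny&dev=kam';

# A typical handwritten parser: one hash slot per name,
# so a later value silently replaces an earlier one.
my %naive;
for my $pair (split /[&;]/, $query) {
    my ($name, $value) = split /=/, $pair, 2;
    $naive{$name} = $value;
}
print "naive: $naive{dev}\n";         # prints "naive: kam"

# Keeping every value means storing a list for each name.
my %multi;
for my $pair (split /[&;]/, $query) {
    my ($name, $value) = split /=/, $pair, 2;
    push @{ $multi{$name} }, $value;
}
print "multi: @{ $multi{dev} }\n";    # prints "multi: sunny kam"
```

The second loop is only the beginning of what a correct parser needs, which is rather the point.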
- Validating HTML
Moving away from input issues, all HTML generated with CGI.pm will validate
against the official World Wide Web Consortium standards.
This is very important; it enables all compliant clients to see the same
information. It also protects against silly syntax typos: if you've ever
spent hours debugging a missing table tag in Netscape, you'll appreciate this.
CGI.pm's built-in shortcuts save you having to remember the gory details of
HTTP headers or nested tags, freeing you to focus on programming and not HTML
syntax. (The code needed to build the checkbox group from the last example
with CGI.pm is substantially shorter than writing the HTML by hand.)
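For comparison, the checkbox group might be generated along these lines. This is a sketch that assumes the function-oriented interface; the labels are derived from the values purely for convenience.

```perl
use strict;
use warnings;
use CGI qw(:standard);

my @developers = qw(sunny kam hannah ann amanda);

# Map each value to a capitalized label: sunny => 'Sunny', and so on.
my %labels = map { ($_ => ucfirst $_) } @developers;

# In list context, checkbox_group() returns one element per checkbox.
my @boxes = checkbox_group(
    -name   => 'dev',
    -values => \@developers,
    -labels => \%labels,
);

print join("\n", @boxes), "\n";
```

Adding a sixth developer means adding one word to the list, not another line of markup.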
- RFC-compliant encoding
Speaking of standards, which character should be used to separate parameters
in a query string within a link? If you said ``the ampersand'', you're partially
correct (but are you encoding it properly?). If you said ``the semicolon'',
you're even more correct. The ampersand has potential conflicts with character
entities, and the HTML 4.0 recommendation urges CGI implementors to accept the
semicolon in its place. (See the standard itself, if you're curious.)
Having mentioned character entities, are you forming them correctly? Have you
escaped all special URI elements? Does your program produce valid HTTP
headers, including the correct media type (say, "text/html", or
"text/xml")? While most popular web browsers will silently correct even bad
HTML, what happens when it breaks? If you don't have time to learn CGI.pm now,
will you have time to fix things in the future?
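A short sketch of how the module answers those questions, using the object interface. The parameter values and the assign.pl link target are invented for illustration.

```perl
use strict;
use warnings;
use CGI;

# Build a query object from a hash instead of a live request.
my $q = CGI->new({ dev => ['sunny', 'kam'], team => 'R&D' });

# query_string() URI-escapes each name and value, then joins the pairs
# with semicolons (the default separator in current releases).
my $query = $q->query_string;

# escapeHTML() turns markup characters into character entities.
my $label = $q->escapeHTML('R&D <urgent>');

# header() emits a complete HTTP header with the requested media type.
print $q->header(-type => 'text/html');
print qq{<a href="assign.pl?$query">$label</a>\n};
```

Because the pairs are joined with semicolons, the link needs no entity-encoded ampersands at all.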
How do you handle extensions such as cookies, if at all? Though there are
snippets of various quality to get and to set cookies, they often have similar
security and validity issues. CGI.pm, however, is regularly updated with the
latest features -- a recent release even included support for P3P cookies. You
may not have heard of them and you may never use them, but if you need them,
they're available immediately. That's more than can be said for the
alternative, form-parsing code circa 1996.
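Getting and setting an ordinary cookie takes only a few lines. In this sketch the cookie name, value, and lifetime are purely illustrative.

```perl
use strict;
use warnings;
use CGI qw(:standard);

# Read a cookie sent by the browser; this is undef if it was not sent.
my $session = cookie('session');

# Build a new cookie to send back.
my $cookie = cookie(
    -name    => 'session',
    -value   => 'abc123',
    -expires => '+1h',      # relative expiry: one hour from now
    -path    => '/',
);

# header() formats the Set-Cookie line along with the content type.
print header(-type => 'text/html', -cookie => $cookie);
```

The date arithmetic and the header formatting, the two places where handwritten cookie code tends to go wrong, never appear in the program.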
Given security and reliability benefits, the case for CGI.pm is very strong.
The module has several additional features. Two stand out to ease certain
specific programming tasks.
- Sticky Widgets
Some CGI programs display the same form multiple times during a session.
Stickiness means that selected values persist through submissions. In the
employee selection option, this means that the manager could enter her details
just once in certain form widgets before creating and submitting several
different jobs.
This is already possible manually -- you just have to provide default values
for your form widgets, or save state information on the server, passing some
unique identifier back and forth to and from the client. CGI.pm handles it
automatically, if you use its form widget generating functions. If you say:
    print textfield(-name => 'manager');
then when a manager parameter has been read from the request (or assigned via
param()), the textfield will take that value as its own. This is often
handy, and is the default behavior. (It can be disabled with the -nosticky
pragma.)
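The behavior is easy to observe without a browser. The sketch below fakes a submission by assigning the parameter directly (``Alice'' is a placeholder), then generates the same field twice, once sticky and once with the per-widget -override option.

```perl
use strict;
use warnings;
use CGI qw(:standard);

# Simulate a submitted form by assigning the parameter directly.
param('manager', 'Alice');

# Sticky by default: the generated field carries the submitted value.
my $sticky = textfield(-name => 'manager');

# -override tells this one widget to ignore the submitted value
# and use its own default instead.
my $blank = textfield(-name => 'manager', -default => '', -override => 1);

print "$sticky\n$blank\n";
```

Only the first field should come out pre-filled with the submitted name.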
- Easier debugging
Perl's rapid development cycle is a useful feature. Instead of a
compile-link-test-change loop, it's test-change. You can run a program from
the command line and immediately make changes in your editor.
Web programs can be harder to debug, especially if you lack a web server on
your development box. (It's easy to install one and possible to write one, but
that's not the point.) Besides that, web servers often shunt program errors to
hidden or obtuse logfiles.
CGI.pm has several handy debugging features. First, it allows programs to run
from the command line as well as in a web server. CGI.pm is smart enough to
tell the difference. Instead of reading from STDIN or from
$ENV{QUERY_STRING}, it reads parameters from the command line. To test the employee
program, sending two employee names, run the program as:
    ./employees.pl dev=sunny dev=kam
The param() function will work as expected.
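A hypothetical employees.pl small enough to try this with might look like the sketch below. The output format is an assumption. Recent releases of CGI.pm also provide multi_param() for this list-context use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

# In list context, param() returns every value submitted under the name.
my @developers = param('dev');

print header('text/plain');
print "Assigned to the project: @developers\n";
```

Run with the two arguments shown above, it should print an HTTP header followed by both names.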
If you're fortunate enough to be running on a web server but lack error log
access, the CGI::Carp module (included with CGI.pm) may save you time.
It can intercept fatal errors and send them to the web browser,
where you can read them immediately. It can do much more, but it is most often
enabled with one line of code:
    use CGI::Carp qw( fatalsToBrowser );
Everything else will work as you expect. (It's advisable to comment out this
line when you've finished debugging, as it can reveal things about your setup
best left hidden.)
A final nicety of CGI.pm is the ability to save parameters to a file. That is,
at any point in the program, you can save the exact data a user has submitted.
This is a quick and easy way to gather data for command-line debugging or to
log attempts to break your program. (It can also be used to implement user
persistence, but a real database is better for that.) The necessary function is
save_parameters(), and it takes a filehandle:
    {
        open(my $output, '>', 'debug.txt') or die "Cannot open debug.txt: $!";
        save_parameters($output);
        close $output or die "Cannot close debug.txt: $!";
    }
The resulting file can be edited in any text editor.
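Reading such a file back is just as short, since the constructor accepts a filehandle. The round trip below is a sketch using the object interface and a temporary file from File::Temp; the parameter values are invented.

```perl
use strict;
use warnings;
use CGI;
use File::Temp qw(tempfile);

# Stand-in for a real request: a query object built from a hash.
my $original = CGI->new({ dev => ['sunny', 'kam'] });

# Save the parameters to a temporary file.
my ($out, $path) = tempfile();
$original->save($out);
close $out or die "Cannot close $path: $!";

# Later, perhaps in a debugging session, rebuild the query from that file.
open(my $in, '<', $path) or die "Cannot open $path: $!";
my $restored = CGI->new($in);
close $in;
unlink $path;

my @developers = $restored->param('dev');
print "Restored: @developers\n";
```

The restored object behaves like one built from a live request, so the rest of the program does not need to know the difference.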
Given the fervor with which experienced CGI.pm advocates promote the module,
it's no surprise that there are popular arguments against it. By far the most
common reasons not to use this module are ``I didn't know it existed'' and ``I
don't know how.'' Both are honest and fair (but will shortly be no excuse :).
Other popular objections follow, along with debunkings.
- It's big.
``CGI.pm is a big module, and it does many things.''
This is true. Version 2.78 contains 6671 lines. Around half (3266 lines) are
documentation. That seems excessive, compared to the common ten-liner program,
but this includes HTML generation, file uploading, and the persistence
mechanisms in addition to several other features not yet discussed. It's big
because it does many things. It's big because it has many bugfixes and
workarounds for weird servers and browsers. It's big because it's robust.
- It's bloated and slow.
``CGI.pm is a big module and it takes forever to load and wastes resources.''
Larger modules do take longer to load and take up more memory.
CGI uses some clever (or obtuse) tricks to work around this. Instead of
compiling everything as it loads, the module waits until a function is first
used. This way, users only pay for what they use. This does complicate
things, but can be changed as necessary.
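The general technique can be sketched in a few lines. This is not CGI.pm's actual code, only a simplified illustration with a made-up Lazy package: function bodies are stored as strings and compiled by AUTOLOAD the first time each one is called.

```perl
package Lazy;
use strict;
use warnings;

our $AUTOLOAD;
our $compiled = 0;    # how many function bodies have been compiled so far

# Function bodies kept as plain strings; none of this is compiled at load time.
my %SOURCE = (
    greet    => q{ sub greet    { my ($name) = @_; return "Hello, $name" } },
    farewell => q{ sub farewell { my ($name) = @_; return "Goodbye, $name" } },
);

# Perl calls AUTOLOAD when an undefined function in this package is invoked.
sub AUTOLOAD {
    my ($name) = $AUTOLOAD =~ /::(\w+)\z/;
    return if $name eq 'DESTROY';
    my $source = $SOURCE{$name} or die "No such function: $name\n";
    eval "package Lazy; $source; 1" or die $@;    # compile on first use
    $compiled++;
    no strict 'refs';
    goto &{"Lazy::$name"};                        # hand over to the new sub
}

package main;

print Lazy::greet('Sunny'), "\n";     # compiles greet(), then calls it
print Lazy::greet('Kam'), "\n";       # already compiled; AUTOLOAD is skipped
print "compiled: $Lazy::compiled\n";  # prints "compiled: 1"
```

The farewell() body is never compiled in this run, which is exactly the saving the module is after.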
As for the speed issue, it's rarely a problem. Many of the programs that don't
use CGI have other bottlenecks, such as running external programs to perform
something Perl can do much faster. Besides that, CGI works with technologies
such as mod_perl and FastCGI that can increase speeds far more than the hit of
compiling the autoloading code.
- It's too complicated.
``It doesn't make sense, all the functions to use. Using a hash is much easier.''
There are two kinds of complexity involved here: to learn something and to use
it. Any process for retrieving CGI parameters has a learning curve. With
simple parameters, a hash is easy to use. For anything more complex
(multi-valued fields, sticky widgets, names without values), the simple model
breaks down. There are exceptions upon exceptions. Though CGI.pm has a longer
learning curve, it is consistent, and much easier to use.
If you plan to spend years using something, a few hours invested to learn it is
well worth the trouble.
- I don't use anything I can't understand.
``I don't want to use anything I couldn't write myself.''
This is false hubris. Though it's a good learning experience to create your
own tools, you're denying yourself the benefit of well-designed, well-debugged,
and well-tested code. The kinds of people who make this claim often make the
same mistakes, rarely learning from them and almost never improving their
code. Somehow, they don't extend this argument to Perl itself.
Good programmers could reinvent the wheel (and some do, to great effect),
but they know it's usually better to build on the work of others.
Some tools are clearly the best of their breed. If you don't have a hammer, a
big rock might do. Given the choice, the hammer is obviously better.
Similarly, it's possible to write CGI programs in Perl without CGI.pm, but good
practice recommends against it.
In a future article, we'll explain how to use the module for common tasks.
It's easy.