Using Perl and Regular Expressions to Process HTML Files - Part 1



Author: John Dixon

Like many web content authors, I've had many occasions over the past few years when I've needed to clean up a batch of HTML files generated by a word processor or publishing package. At first I cleaned up the files manually, opening each one in turn and making the same set of updates to each. This works fine when you only have a few files to fix, but when you have hundreds or even thousands, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this 'cleaning up' process.

Why write an article about Perl and regular expressions, I hear you say? Well, that's a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found, though, was that when I was trying to work out how to process HTML files, it was difficult to find tutorials that met my criteria.

I'm not saying they don't exist; I just couldn't find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn't find, though, was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.

The Goal

When converting documents into HTML, the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to spend hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well-designed cascading style sheet (CSS), can often produce perfect results. Sometimes, though, little bits of the HTML code are messy, normally because authors have not applied paragraph tags or styles correctly in the source document.

Why Perl?

Perl is such a good language for this task because it is excellent at processing text files, which, let's face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.
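To give a flavour of what 'search and replace' means here, the snippet below is a minimal sketch rather than a finished script. It assumes the messy HTML contains empty paragraphs such as <p></p> or <p>&nbsp;</p>, and the helper name strip_empty_paragraphs is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): remove empty paragraphs
# such as <p></p> or <p>&nbsp;</p> from a string of HTML.
sub strip_empty_paragraphs {
    my ($html) = @_;

    # s{pattern}{replacement} is Perl's substitution operator.
    # /g replaces every match, /i ignores case so <P> is caught too.
    # The trailing \s* also removes the whitespace after each empty paragraph.
    $html =~ s{<p>(?:\s|&nbsp;)*</p>\s*}{}gi;

    return $html;
}

# Assumed sample input, standing in for converted word processor output.
my $messy = "<p>Intro</p>\n<p>&nbsp;</p>\n<P></P>\n<p>Next</p>\n";
print strip_empty_paragraphs($messy);
```

The substitution only touches paragraphs that contain nothing but whitespace or non-breaking spaces, so paragraphs with real content pass through unchanged.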

What is Perl?

Perl (often expanded as "Practical Extraction and Report Language", although the name is not officially an acronym) is a general-purpose programming language, which means it can be used to do anything that any other programming language can do. Having said that, Perl is very good at doing certain things, and not so good at others.

Although you could do it, you wouldn't normally develop a user interface in Perl, as it would be much easier to use a language like Visual Basic. What Perl is really good at is processing text, which makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl - many languages, including JavaScript and PHP, can use them - but Perl builds them into the core syntax of the language, which makes them particularly convenient to use.
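As a small illustration of one pattern describing a whole set of strings, here is a hedged sketch. The pattern and the helper name is_heading_tag are my own assumptions for this example: the pattern is meant to match any opening heading tag from <h1> to <h6>, with or without attributes, in upper or lower case.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): returns 1 if the text
# contains an opening heading tag <h1> to <h6>, otherwise 0.
sub is_heading_tag {
    my ($text) = @_;

    # <h        literal start of the tag
    # [1-6]     exactly one digit from 1 to 6
    # \b        word boundary, so <h10> is rejected
    # [^>]*     any attributes, up to the closing bracket
    # /i        ignore case, so <H3> matches as well
    return $text =~ /<h[1-6]\b[^>]*>/i ? 1 : 0;
}

# Assumed sample tags to try the pattern against.
for my $tag ('<h1>', '<H3 class="title">', '<p>', '<hr>') {
    printf "%-20s %s\n", $tag, is_heading_tag($tag) ? 'matches' : 'no match';
}
```

The first two sample tags match and the last two do not, even though <hr> also starts with "<h".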

In part 2, we'll look at our first example Perl script.

About the Author:
John Dixon is a freelance web developer and technical author. Go to http://www.computernostalgia.net to read and submit articles and photos relating to the history of the computer. Go to http://www.dixondevelopment.co.uk to find out more about John's work.


Approved on Sunday, January 07 @ 13:58:15 CST by Shawn DesRochers
 