Version 1.3
November 8, 2004

html_scrub

HTML Editing Utility in C

Summary

html_scrub is a command-line utility that screens HTML files for unwanted or hazardous HTML tags, as specified by a configuration file.  This utility is free software, licensed without charge under the GNU General Public License (GPL).

Introduction

Some HTML tags and attributes can cause problems.  They may be browser-specific, or they may be rendered differently by different browsers, or they may interact poorly with certain website management tools.

If you edit your HTML files by hand, you can simply avoid using the kinds of HTML that you don't want.  The solution is not so simple if you use web authoring software or word processing programs to create HTML files.  Such tools may inject unwanted HTML that you then have to remove by hand.

Likewise, if you publish HTML files contributed by others, you may have to edit them to remove unwanted tags before publishing them.  Policing your HTML manually is tedious, time-consuming, and error-prone.

Why not let your computer do the grunt work?

This website offers a free tool, html_scrub, in the form of C source code.  By supplying a small configuration file, you can tell html_scrub which tags and attributes to keep, which to discard, and which to warn you about.

Synopsis

html_scrub [ -f config_file ] [ input_file ]
The -f option specifies the name of a configuration file.  If the command line does not specify a configuration file, html_scrub looks for a configuration file in several default locations, as described below.

After specifying the configuration file, if any, the command line may specify the name of the input file.  If no input file is specified, html_scrub reads standard input.

The output HTML is written to standard output.  Error messages are written to standard error.
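
The command-line rules above can be sketched in C as follows.  This is only an illustration, not the code in the download package: the Options structure and the parse_args function are hypothetical names, and the sketch assumes that "-f" and the file name are always separate arguments and that extra arguments are an error.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parsed command line. */
typedef struct {
    const char *cfg_file;    /* NULL: search the default locations */
    const char *input_file;  /* NULL: read standard input */
} Options;

/*
 * Hypothetical parser for: html_scrub [ -f config_file ] [ input_file ]
 * Returns 0 on success, -1 on a usage error.
 */
static int parse_args(int argc, char **argv, Options *opts)
{
    int i = 1;

    opts->cfg_file = NULL;
    opts->input_file = NULL;

    /* The configuration file, if any, must come first. */
    if (i < argc && strcmp(argv[i], "-f") == 0) {
        if (i + 1 >= argc)
            return -1;              /* -f without a file name */
        opts->cfg_file = argv[i + 1];
        i += 2;
    }

    /* The input file, if any, comes next. */
    if (i < argc)
        opts->input_file = argv[i++];

    /* Assumption: anything left over is a usage error. */
    return (i == argc) ? 0 : -1;
}
```

A caller would check the return value, then open opts.input_file or fall back to standard input when it is NULL.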

Configuration File

If the command line does not specify a configuration file, html_scrub looks for one in the following places:
  1. A file named by the environment variable HTML_SCRUB_CFG, if such a variable exists
  2. "html_scrub.cfg" in the current working directory
  3. "html_scrub.cfg" in the user's home directory, as defined by the environment variable HOME, if such a variable exists
  4. "html_scrub.cfg" in the /etc directory
NOTE: The default file name "html_scrub.cfg" may be overridden at compile time by defining a value for the macro DEF_CFG_FILE.
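
The search order above might be implemented along the following lines.  This is a minimal sketch, not the code in the download package.  The helper names readable, try_path, and find_default_cfg are hypothetical.  The sketch assumes that a location is skipped when its file cannot be opened, that the last location is the UNIX-style /etc directory, and that snprintf (C99) is available.

```c
#include <stdio.h>
#include <stdlib.h>

#ifndef DEF_CFG_FILE
#define DEF_CFG_FILE "html_scrub.cfg"   /* may be overridden at compile time */
#endif

/* Hypothetical helper: returns 1 if the file can be opened for reading. */
static int readable(const char *path)
{
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return 0;
    fclose(fp);
    return 1;
}

/* Hypothetical helper: builds "dir/name" (or just "name" if dir is NULL)
 * in buf and reports whether the result fits and is readable. */
static int try_path(char *buf, size_t size, const char *dir, const char *name)
{
    int n = (dir != NULL) ? snprintf(buf, size, "%s/%s", dir, name)
                          : snprintf(buf, size, "%s", name);

    return n >= 0 && (size_t)n < size && readable(buf);
}

/* Returns buf holding the first usable configuration file, or NULL. */
static const char *find_default_cfg(char *buf, size_t size)
{
    const char *env = getenv("HTML_SCRUB_CFG");
    const char *home = getenv("HOME");

    if (env != NULL && try_path(buf, size, NULL, env))
        return buf;                                 /* 1. HTML_SCRUB_CFG */
    if (try_path(buf, size, NULL, DEF_CFG_FILE))
        return buf;                                 /* 2. current directory */
    if (home != NULL && try_path(buf, size, home, DEF_CFG_FILE))
        return buf;                                 /* 3. home directory */
    if (try_path(buf, size, "/etc", DEF_CFG_FILE))
        return buf;                                 /* 4. /etc (assumed) */
    return NULL;
}
```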

Sample Configuration File

A sample configuration file is given below.  This example is not very useful in itself, but it will serve to illustrate the syntax:
    <HEAD> : keep
    <BODY> : keep
    <APPLET> : warn
    
    # FONT tags may contain browser-specific barbarisms

    <FONT>
    {
        attribute FACE : drop   # Don't require browser to have
                                # specific fonts available
        default : keep
    }
    <META>
    {
        attribute HTTP-EQUIV : warn  # be wary of Refresh
    }
    <EM>  : drop
    <SCRIPT> : drop all  # Don't want any scripts
    default : drop
    comment : warn
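Given this configuration file, html_scrub will do the following:
  1. Keep HEAD and BODY tags.
  2. Keep APPLET tags, but issue a warning for each one.
  3. Drop the FACE attribute from FONT tags, and keep any other attributes of FONT tags.
  4. Keep HTTP-EQUIV attributes in META tags, but issue a warning for each one.
  5. Drop EM tags, leaving the text between them in place.
  6. Drop SCRIPT tags and everything between them.
  7. Drop any tag not mentioned in the configuration file.
  8. Keep HTML comments, but issue a warning for each one.

A more realistic configuration file would likely include a long list of tags that html_scrub should keep, a smaller list of tags for it to warn about, and possibly some instructions to drop or warn you about certain attributes within tags.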

Syntax Rules

You can probably figure out most of what you need to know by studying the example above, but here is a more formal description of the syntax used by the configuration file.
General Rules
  1. Everything from a pound sign ('#') to the end of the line is treated as a comment, i.e. ignored.
  2. White space is not significant except as a separator between words such as DROP and ALL, and, in the case of line feeds, as a terminator for comments.  It's a good idea to keep things tidy, as in the sample above, but it isn't necessary.  You can write the whole thing on one line if you want to.
  3. Case is not significant.  My own practice is to put tag names and attributes in upper case and everything else in lower case, but your taste may vary.
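
Rules 1 and 3 are simple enough to sketch in C.  The functions below are illustrative only.  The names strip_comment and same_word are hypothetical, and nothing here is taken from the html_scrub source.

```c
#include <ctype.h>
#include <string.h>

/* Rule 1: cut the line off at the first pound sign, if there is one. */
static void strip_comment(char *line)
{
    char *hash = strchr(line, '#');

    if (hash != NULL)
        *hash = '\0';
}

/* Rule 3: compare two words without regard to case.
 * Returns 1 if they match, 0 otherwise. */
static int same_word(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;    /* both words must end at the same point */
}
```

With these helpers, "DROP", "Drop", and "drop" all compare equal, and a line such as "<EM> : drop  # no emphasis" is reduced to "<EM> : drop  " before it is split into words.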
Tag-level Actions
  1. An action for a tag is defined by the tag name in angle brackets, followed by a colon and the action, as in "<APPLET> : warn".
  2. Define the default action and the action for HTML comments in the same way, using "default" or "comment" respectively in the place of the tag name, without angle brackets.
  3. You may specify only one action for any given tag.
  4. You may specify only one default action or comment action.
  5. The default action applies to any tag for which you don't otherwise specify an action.
  6. If you don't specify a default action, the default action is "keep".
  7. If you don't specify a comment action, the comment action is "keep".
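
The way the default action fills in for unlisted tags (rules 5 and 6) can be sketched as a simple table lookup.  This is an assumption about one reasonable design, not a description of html_scrub's internals.  The Action enumeration, the TagRule structure, and the tag_action function are hypothetical, and the sketch assumes that tag names have already been folded to upper case.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ACT_KEEP, ACT_WARN, ACT_DROP, ACT_DROP_ALL } Action;

/* Hypothetical table entry: one tag name and its configured action. */
typedef struct {
    const char *tag;     /* upper-case tag name, without angle brackets */
    Action      action;
} TagRule;

/*
 * Returns the action configured for tag, or dflt if the tag is not listed.
 * The caller passes ACT_KEEP as dflt when the configuration file does not
 * define a default action.
 */
static Action tag_action(const TagRule *rules, size_t count,
                         const char *tag, Action dflt)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcmp(rules[i].tag, tag) == 0)
            return rules[i].action;
    }
    return dflt;
}
```

Loaded with the tag-level entries from the sample configuration file and a default of ACT_DROP, this lookup returns ACT_WARN for APPLET and ACT_DROP for a tag such as TABLE that the file never mentions.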
Attribute-level Actions
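  1. To define attribute actions within a given tag, code the tag name in angle brackets, followed by a list of attribute actions enclosed in braces, as in the FONT entry of the sample above.
  2. An attribute action consists of the word "attribute", the attribute name, a colon, and the action, as in "attribute FACE : drop".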
  3. Define the default attribute action in the same way, using the word "default" in the place of the attribute name, without the word "attribute".
  4. The default attribute action applies to any attribute for which you have not otherwise specified an action.
  5. If you don't define a default attribute action within a tag, the default attribute action for that tag will be "keep".
  6. Within any given tag, you can specify only one action for any attribute, and only one default action.
  7. Attribute-level actions apply only to the tags for which they are defined.
Actions
The "keep" action retains the tag or attribute unchanged.

The "warn" action also retains the tag or attribute unchanged, but it also issues a warning message to standard error, so that you can review the HTML manually.

The "drop" action, when applied to a tag, eliminates the HTML tag and the corresponding end tag.  It does not affect anything between the start and end tags.

The "drop" action, when applied to an attribute, eliminates both the attribute and the associated value.

The "drop all" action applies only to tags.  It eliminates the specified start tag, the corresponding end tag, and everything in between.
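
The difference between "drop" and "drop all" may be easier to see in code.  The sketch below is a deliberately simplified filter, not html_scrub's actual algorithm.  The names tag_is and scrub_tag are hypothetical.  It handles a single tag name, and it ignores nesting, HTML comments, and '>' characters inside quoted attribute values.

```c
#include <ctype.h>
#include <string.h>

typedef enum { ACT_KEEP, ACT_DROP, ACT_DROP_ALL } Action;

/* Returns 1 if the start or end tag at p (where p[0] == '<') is named
 * name, ignoring case; otherwise returns 0. */
static int tag_is(const char *p, const char *name)
{
    p++;                            /* skip '<' */
    if (*p == '/')
        p++;                        /* end tag */
    while (*name != '\0') {
        if (toupper((unsigned char)*p) != toupper((unsigned char)*name))
            return 0;
        p++;
        name++;
    }
    return !isalnum((unsigned char)*p);   /* reject EMBED when name is EM */
}

/* Copies in to out, applying action to the named tag.
 * out must have room for at least strlen(in) + 1 characters. */
static void scrub_tag(const char *in, char *out, const char *name,
                      Action action)
{
    int skipping = 0;   /* nonzero while inside a "drop all" element */

    while (*in != '\0') {
        if (*in == '<' && action != ACT_KEEP && tag_is(in, name)) {
            const char *end = strchr(in, '>');

            if (end == NULL)
                break;              /* unterminated tag: stop copying */
            if (action == ACT_DROP_ALL)
                skipping = (in[1] != '/');  /* on at start tag, off at end */
            in = end + 1;           /* the tag itself is never copied */
            continue;
        }
        if (!skipping)
            *out++ = *in;
        in++;
    }
    *out = '\0';
}
```

Given the input "a<EM>b</EM>c" and the tag name "EM", ACT_DROP produces "abc" and ACT_DROP_ALL produces "ac".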

Compiling

The download package contains several source files, a simple Makefile, and a copy of this HTML file.

Under Linux or UNIX, or any other system that provides a make utility, put all the files into the current working directory and enter the make command.  You may have to tinker with the Makefile a bit to change the name of the compiler, or the compiler options. You will almost certainly need root privileges to install the resulting executable into /usr/bin, /usr/local/bin, or some other directory in your path.

If your system doesn't provide a make utility, then you'll have to do whatever it takes to compile the C files and link them.  No additional libraries are needed beyond the Standard C libraries.

Bugs

If the input HTML includes a script in a language such as JavaScript, and the script is outside of an HTML comment ("<!--...-->"), and the script contains "</" (probably within a comment or string literal), html_scrub will interpret these characters as the beginning of an end tag, and get terribly confused.

This combination of characters is unlikely to occur in practice.  If you don't want to take that chance, then code "<SCRIPT> : warn" in your configuration file so that you can edit any scripts manually.  If you encounter this problem and cannot fix it by a trivial change to the script, then enclose the script within an HTML comment, where html_scrub will do a better job of interpreting the syntax.  It is good practice to enclose scripts within HTML comments anyway.

A true, proper, and correct fix will not be simple, because it will require some parsing of the scripting language, and the rules may be different for different languages. 
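
To see why a script can mislead a simple scanner, consider the following sketch.  It is not taken from the html_scrub source.  The function naive_script_end is a hypothetical stand-in for any scanner that treats the first "</" as the start of an end tag.

```c
#include <string.h>

/*
 * Hypothetical naive scanner: given the text that follows a <SCRIPT>
 * start tag, returns a pointer to what it takes to be the end tag,
 * or NULL if it finds none.
 */
static const char *naive_script_end(const char *script_body)
{
    return strstr(script_body, "</");
}
```

Given the script body document.write("</b>");</SCRIPT>, this function stops at the "</b>" inside the string literal rather than at "</SCRIPT>", so everything after that point is misread.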

Download

For Windows:
html_scrub_1_3.zip
Extract the files with WinZip or a similar utility.

For Linux or UNIX:

html_scrub_1_3.tar.gz
Extract the files with the following commands:
    gunzip html_scrub_1_3.tar.gz
    tar -xvf html_scrub_1_3.tar
These two archives contain identical source code, except that the Windows version uses carriage return/line feeds to terminate the lines, while the Linux/UNIX version uses line feeds.

hscrub.exe
This is an MS-DOS executable, compiled with an ancient creaking copy of Borland 3.0.  The name has been shortened to fit within the old MS-DOS naming limits.  If you can't compile html_scrub for yourself, or you don't want to bother, and you trust me not to plague you with viruses, then download it and run it from the DOS prompt or a .bat file.

Subscribe

To be notified of updates, subscribe to html_scrub at Freshmeat.
Mail Scott McKellar