Version 1.3
November 8, 2004
html_scrub
HTML Editing Utility in C
Summary
html_scrub is a command line utility that screens HTML files for
unwanted or hazardous HTML tags, as specified by a configuration
file. This utility is free software, licensed without charge
under the GNU General Public License (GPL).
Introduction
Some HTML tags and attributes can cause problems.  They may be
browser-specific, or they may be rendered differently by different
browsers, or they may interact poorly with certain website management
tools.
If you edit your HTML files by hand, you can simply avoid using the
kinds of HTML that you don't want. The solution is not so simple
if you use web authoring software or word processing programs to create
HTML files. Such tools may inject unwanted HTML that you then
have to remove by hand.
Likewise, if you publish HTML files contributed by others, you may
have to edit them to remove unwanted tags before publishing
them. Policing your HTML manually is tedious, time-consuming, and
error-prone.
Why not let your computer do the grunt work?
This website offers a free tool, html_scrub, in the form of C source
code. By supplying a small configuration file, you can tell
html_scrub which tags and attributes to keep, which to discard,
and which to warn you about.
Synopsis
html_scrub [ -f config_file ] [ input_file ]
The -f option specifies the name of a configuration file. If the
command line does not specify a configuration file, html_scrub
looks for a configuration file in several default locations, as
described below.
After specifying the configuration file, if any, the command line
may specify the name of the input file. If no input file is
specified, html_scrub reads standard input.
The output HTML is written to standard output. Error messages
are written to standard error.
Configuration File
If the command line does not specify a configuration file,
html_scrub looks for one in the following places:
-
A file named by the environment variable HTML_SCRUB_CFG,
if such a variable exists
-
"html_scrub.cfg" in the current working directory
-
"html_scrub.cfg" in the user's home directory, as defined by
the environment variable HOME, if such a variable exists
-
"html_scrub.cfg" in the /etc directory
NOTE: The default file name "html_scrub.cfg" may be overridden at
compile time by defining a value for the macro DEF_CFG_FILE.
Sample Configuration File
A sample configuration file is given below. This example is not
very useful in itself, but it will serve to illustrate the syntax:
<HEAD> : keep
<BODY> : keep
<APPLET> : warn
# FONT tags may contain browser-specific barbarisms
<FONT>
{
attribute FACE : drop # Don't require browser to have
# specific fonts available
default : keep
}
<META>
{
attribute HTTP-EQUIV : warn # be wary of Refresh
}
<EM> : drop
<SCRIPT> : drop all # Don't want any scripts
default : drop
comment : warn
Given this configuration file, html_scrub will do the following:
-
Retain all instances of the <HEAD> and <BODY> tags
without change, along with the ending tags </HEAD> and
</BODY>.
-
Issue a warning message to standard error for every occurrence of
the <APPLET> tag, or the ending tag </APPLET>, giving the
line number of its appearance in the input file.
-
Within any <FONT> tag, eliminate any FACE attribute, while
retaining any other attributes without change.
-
Issue a warning message to standard error for every occurrence of
the HTTP-EQUIV attribute within a <META> tag, giving the line
number of its appearance in the input file.
-
Remove all instances of the <EM> tag, along with the ending tag
</EM>. Everything between the starting and ending tags will
be retained without change.
-
Remove the <SCRIPT> tag, along with the ending </SCRIPT>
tag, and everything in between.
-
Remove any tag, and the corresponding ending tag, for which no
action has been specified. If you instead specify a default action
of "keep", these tags will be retained without change.
-
Issue a warning message for every HTML comment, giving the line
number where it occurs.
A more realistic configuration file would likely include a long
list of tags that html_scrub should keep, a smaller list of
tags for it to warn about, and possibly some instructions to drop
or warn you about certain attributes within tags.
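As a rough illustration of that shape, the sketch below keeps a handful of common tags, warns about a few riskier ones, and filters attributes on one tag. The particular tags and attributes chosen here are assumptions made for the sake of example, not a recommended policy; only the syntax is taken from the sample above.

```
# Tags to keep
<HTML>   : keep
<HEAD>   : keep
<TITLE>  : keep
<BODY>   : keep
<P>      : keep
<A>      : keep
<IMG>    : keep

# Tags to review by hand
<APPLET> : warn
<OBJECT> : warn
<IFRAME> : warn

# Attribute rules for one tag
<FONT>
{
    attribute FACE : drop
    default        : keep
}

<SCRIPT> : drop all

default : warn    # flag anything not listed above
comment : keep
```

Setting the default action to "warn" rather than "drop" means an unlisted tag is reported instead of silently removed, which may be a gentler starting point while the keep list is still being built up.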
Syntax Rules
You can probably figure out most of what you need to know by
studying the example above, but here is a more formal description
of the syntax used by the configuration file.
General Rules
-
Everything from a pound sign ('#') to the end of the line is treated
as a comment, i.e. ignored.
-
White space is not significant except as a separator between words
such as DROP and ALL, and, in the case of line feeds, as a terminator
for comments. It's a good idea to keep things tidy, as in the
sample above, but it isn't necessary. You can write the whole
thing on one line if you want to.
-
Case is not significant. My own practice is to put tag names
and attributes in upper case and everything else in lower case, but
your taste may vary.
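To illustrate these rules, each rule below should be an acceptable way to write a tag-level action, going by the description above. The rules use different tags because only one action may be given for any one tag; the choice of tags is arbitrary.

```
<EM> : drop          # the layout used in the sample
<i>:DROP             # case and spacing around the colon don't matter
<b> :
      drop           # a line break is just more white space
<U> : drop <S> : drop   # two rules on one line
```

Note that a comment runs to the end of its line, so a rule written after a '#' on the same line would be ignored.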
Tag-level Actions
-
An action for a tag is defined by:
-
The tag name, in angle brackets
-
A colon
-
One of "keep", "warn", "drop", or "drop all"
-
Define the default action and the action for HTML comments in
the same way, using "default" or "comment" respectively in the
place of the tag name, without angle brackets.
-
You may specify only one action for any given tag.
-
You may specify only one default action or comment action.
-
The default action applies to any tag for which you don't
otherwise specify an action.
-
If you don't specify a default action, the default action is "keep".
-
If you don't specify a comment action, the comment action is "keep".
Attribute-level Actions
-
To define attribute actions within a given tag, code:
-
The tag name, in angle brackets
-
A left curly brace
-
One or more attribute actions
-
A right curly brace
-
An attribute action consists of:
-
The word "attribute"
-
The attribute name
-
A colon
-
One of "keep", "warn", or "drop".
-
Define the default attribute action in the same way, using
the word "default" in the place of the attribute name, without
the word "attribute".
-
The default attribute action applies to any attribute for which
you have not otherwise specified an action.
-
If you don't define a default attribute action within a tag, the
default attribute action for that tag will be "keep".
-
Within any given tag, you can specify only one action for any
attribute, and only one default action.
-
Attribute-level actions apply only to the tags for which they are
defined.
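For instance, these rules allow an attribute whitelist: name the attributes to keep and set the default attribute action to "drop". The sketch below does this for the <IMG> tag; the tag and attribute names are chosen only for illustration.

```
<IMG>
{
    attribute SRC    : keep
    attribute ALT    : keep
    attribute WIDTH  : keep
    attribute HEIGHT : keep
    default          : drop   # any other IMG attribute is removed
}
```

Because attribute-level actions apply only to the tag that contains them, a WIDTH attribute on some other tag is unaffected by this block.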
Actions
The "keep" action retains the tag or attribute unchanged.
The "warn" action retains the tag or attribute unchanged, but it
also issues a warning message to standard error, so that you can
review the HTML manually.
The "drop" action, when applied to a tag, eliminates the HTML tag and
the corresponding end tag.  It does not affect anything between the
start and end tags.
The "drop" action, when applied to an attribute, eliminates both the
attribute and the associated value.
The "drop all" action applies only to tags. It eliminates the
specified start tag, the corresponding end tag, and everything in
between.
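The difference between the actions is easiest to see side by side. In the fragment below, each comment describes the effect the rule should have according to the descriptions above; the HTML snippets are made up for illustration, and the exact spacing of the real output is not specified here.

```
<EM>     : drop       # <EM>very</EM> important   ->  very important
<SCRIPT> : drop all   # <SCRIPT>go();</SCRIPT>    ->  (nothing)
<APPLET> : warn       # tag kept as is, warning on standard error
<P>      : keep       # tag kept as is, no message
```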
Compiling
The download package contains several source files, a simple
Makefile, and a copy of this HTML file.
Under Linux or UNIX, or any other system that provides a make
utility, put all the files into the current working directory and
enter the make command. You may have to tinker with
the Makefile a bit to change the name of the compiler, or the
compiler options. You will almost certainly need root privileges
to install the resulting executable into /usr/bin, /usr/local/bin,
or some other directory in your path.
If your system doesn't provide a make utility, then you'll have
to do whatever it takes to compile the C files and link them.
No additional libraries are needed beyond the Standard C libraries.
Bugs
If the input HTML includes a scripting language such as JavaScript,
and the script is outside of an HTML comment ("<!--...-->"),
and the script contains "</" (probably within
a comment or string literal), html_scrub will interpret these
characters as the beginning of an end tag, and get terribly confused.
This combination of characters is unlikely to occur in practice.
If you don't want to take that chance, then code
"<SCRIPT> : warn" in your configuration file so that you can
edit any scripts manually. If you encounter this problem and
cannot fix it by a trivial change to the script, then enclose the
script within an HTML comment, where html_scrub will do a better job
of interpreting the syntax. It is good practice to enclose
scripts within HTML comments anyway.
A true, proper, and correct fix will not be simple, because it will
require some parsing of the scripting language, and the rules may be
different for different languages.
Download
For Windows:
html_scrub_1_3.zip
Extract the files with WinZip or a similar utility.
For Linux or UNIX:
html_scrub_1_3.tar.gz
Extract the files with the following commands:
gunzip html_scrub_1_3.tar.gz
tar -xvf html_scrub_1_3.tar
These two archives contain identical source code, except that
the Windows version uses carriage return/line feeds to terminate
the lines, while the Linux/UNIX version uses line feeds.
hscrub.exe
This is an MS-DOS executable, compiled with an ancient creaking
copy of Borland 3.0. The name has been shortened to fit within
the old MS-DOS naming limits. If you can't compile html_scrub
for yourself, or you don't want to bother, and you trust me not to
plague you with viruses, then download it and run it from the DOS
prompt or a .bat file.
Subscribe
To be notified of updates, subscribe to html_scrub at
Freshmeat.
Scott McKellar