HTML_Cleaner Instructions

James Falkofske, December 2008 - contact jamesfalkofske@yahoo.com

Overview

The HTML_Cleaner.exe program removes non-compliant HTML code and styles from HTML files.  In particular, it removes the internal stylesheets that Microsoft Word generates, as well as style, class, and id attributes, in order to produce a "clean" HTML file.
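
To make the idea concrete, here is a minimal Python sketch of this kind of cleaning. It is not the program's actual code (HTML_Cleaner is a .NET application); the class name AttributeStripper and the function strip_presentation are hypothetical, and the sketch assumes that dropping <style> blocks plus the style, class, and id attributes is all that is wanted.

```python
from html import escape
from html.parser import HTMLParser

# Attributes dropped from every tag (assumption based on the overview).
STRIPPED_ATTRS = {"style", "class", "id"}


class AttributeStripper(HTMLParser):
    """Re-emit HTML without <style> blocks or style/class/id attributes.

    Comments and the doctype are not re-emitted in this sketch.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_style = False

    def _emit_tag(self, tag, attrs, closer=">"):
        parts = [tag]
        for name, value in attrs:
            if name in STRIPPED_ATTRS:
                continue
            # The parser unescapes attribute values, so escape them again.
            parts.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        elif not self.in_style:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        if not self.in_style:
            self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif not self.in_style:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def strip_presentation(html_text):
    """Return html_text with presentation markup removed."""
    parser = AttributeStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

For example, strip_presentation('<p class="MsoNormal">Hi</p>') returns '<p>Hi</p>'. A real cleaner would also have to decide what to do with comments, scripts, and Word-specific tags, which this sketch does not attempt.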

The program requires the .NET Framework, which must be downloaded and installed before running the HTML_Cleaner program.

The program can also create thumbnails and enforce ALT attributes on IMG tags.  If the checkbox for THUMBNAIL IMAGES is set, any image larger (in width or height) than the preset number will automatically generate a new thumbnail image that is substituted into the HTML document (and linked to the full-sized image) to help reduce bandwidth.  The checkbox also forces an ALT attribute for all images, to help ensure that images meet accessibility standards.
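
The thumbnail rule can be sketched in Python as follows. This is only an illustration, not the program's code: the function names thumbnail_size and thumbnail_markup are hypothetical, and the sketch assumes that the thumbnail keeps the original aspect ratio and that the limit applies to the longer side.

```python
from html import escape


def thumbnail_size(width, height, max_size):
    """Return (new_width, new_height), or None if no thumbnail is needed."""
    if width <= max_size and height <= max_size:
        return None
    # Scale so that the longer side equals max_size (assumed behavior).
    scale = max_size / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))


def thumbnail_markup(full_src, thumb_src, alt, size):
    """Build an IMG tag for the thumbnail, linked to the full-sized image."""
    width, height = size
    return (
        f'<a href="{escape(full_src)}">'
        f'<img src="{escape(thumb_src)}" alt="{escape(alt)}" '
        f'width="{width}" height="{height}"></a>'
    )
```

With a limit of 400, a 1600 x 1200 image would be reduced to 400 x 300, while a 300 x 200 image would be left alone (thumbnail_size returns None).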

The output encoding can be set to any of the following:

Encoding Scheme: Notes

  UTF-8: This is the standard for all web browsers and adaptive technologies, and it should be used unless there is a compelling reason to do otherwise.
  UTF-16LE: Little Endian encoding - this is the encoding scheme Microsoft Office uses; it is not recommended for web pages and may display correctly only in Microsoft Internet Explorer.
  UTF-16BE: Big Endian encoding - this is the standard 16-bit Unicode, also referred to as UTF-16.
  US-ASCII: Standard United States ASCII; tab characters and other special characters may display incorrectly.
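
The difference between these schemes is easiest to see in the bytes they produce. The short Python sketch below encodes the same sample text four ways using Python's standard codec names; the helper encode_or_none is a hypothetical name used only for this illustration.

```python
SAMPLE = "café"  # one non-ASCII character is enough to show the differences


def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


for name in ("utf-8", "utf-16-le", "utf-16-be", "us-ascii"):
    print(name, encode_or_none(SAMPLE, name))
```

UTF-8 uses one byte for each plain ASCII letter and two bytes for the accented letter; the two UTF-16 variants use two bytes per character here and differ only in byte order; US-ASCII cannot represent the accented letter at all, which is why special characters can be lost or shown incorrectly with that setting.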



Process

  1. In MS Word, FILE > SAVE AS > .htm Web Page Filtered
    [Note: the filename MUST NOT contain spaces or punctuation characters]
  2. In the HTML_Cleaner program, use FILE > OPEN to open the HTM file.
  3. In the right window, type in a TITLE for the file.  This will be saved in the <head> section as the <title></title> tag.
  4. In the HEADER/FOOTER tab, indicate any HTML code you wish to be injected as a header immediately after the <body> tag, and any HTML code you wish to be injected as a footer immediately before the </body> tag.  There are six sections: header pre-code, header title code, and header post-code, and also footer pre-code, footer title code, and footer post-code.
    [Screenshot: HTML Cleaner showing Header and Footer tab]

  5. Set the INSTRUCTOR information field.
    You can manually type in the instructor name and then press ADD to add it to the configuration file (and list).
  6. Set the COURSE information field. 
    You can manually type in the course name and then press the ADD button to add it to the configuration file (and list).
  7. You can indicate a stylesheet that should be linked into the configuration files.  Click the STYLESHEET tab.  Choose one of the predefined styles from the list.  You can also add your own custom CSS/HTML code by typing the information into the edit box for "CSS/HTML CODE" and then typing a style name in the combobox for CSS STYLE NAME, and then pressing the ADD button.  This will also save that information into the configuration file.
    [Screenshot: HTML_Cleaner Stylesheet tab]
    NOTES:
    1) Use of included CSS files.  If you wish to use the sample stylesheets as included with the ZIP folder package, you must copy the *.CSS files and the /images/ and /images/icons/ folders into the directory where your content files exist.  (For example, if your files exist at  MY DOCUMENTS > FALL 2007 > MGMT 310, then you must copy the materials above into that subdirectory).
    2) Use relative path.  If you decide to write your own CSS files, or if you are storing your content files in subdirectories within D2L, then the CSS files should be stored in the ROOT folder and they should be referenced with the RELATIVE PATH within the CSS code segment (for example, "../D2L06.css").

  8. Set the ENCODING type.  Click on the tab for ENCODING.  Under most situations, the encoding type should be set to the web-standard UTF-8. 
    UTF-16LE is offered for compatibility with some Microsoft Word HTML files, but pages saved in it may not display correctly in browsers other than Internet Explorer.
    [Screenshot: HTML_Cleaner Encoding tab]

  9. Indicate whether images from within the document need to be converted and thumbnailed.  Click on the tab for IMAGES.
    If this is the first time the file is being cleaned, you should make sure to checkmark the THUMBNAIL IMAGES control.
    If you have previously cleaned the file and thumbnailed the images (and you are trying to repair the file), then do not checkmark the THUMBNAIL IMAGES box (as this will require that new copies of
    [Screenshot: HTML_Cleaner Images tab]

  10. When the settings have been entered, you can press the PROCESS HTML CLEANING button. This begins stripping out non-compliant HTML code, including internal style, id, and class attributes, to create "clean vanilla" HTML.
  11. If the program encounters an image, the program will automatically create a thumbnail image if the original image exceeds the THUMBNAIL SIZE specified in the IMAGES tab.
    The Update Image Information dialog screen will load. 
    You can then set the NEW INFO for the IMAGE SUBFOLDER and also the FILENAME.  This allows you to specify what the image is named and where it is located. 
    You will also need to set an ALT attribute for the image, which is needed for accessibility (for blind users).  You can also enter some LONG DESCRIPTION information that will end up as a caption directly underneath the image in the cleaned HTML file.  When the information has been updated, press the UPDATE IMAGE button.
    [Screenshot: Image processing dialog]

  12. When the file processing has been completed, a dialog will appear indicating how much smaller the cleaned file is than the original file.
  13. Then use FILE > SAVE AS to save the file.  You can rename the file, but you cannot move it to a different subdirectory (or the image links will be broken).
    Normally you will save the file with the same name as the original file (which will overwrite the original file).
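
The title and header/footer settings in steps 3 and 4 amount to a few text insertions. The Python sketch below shows one way such insertions could be done; it is an illustration under assumptions, not the program's own routine. The function names set_title and inject_header_footer are hypothetical, and the header and footer strings are assumed to be already assembled from their three parts.

```python
import re
from html import escape


def set_title(html_text, title):
    """Replace the first <title> element, or add one right after <head>."""
    new_title = f"<title>{escape(title, quote=False)}</title>"
    title_re = re.compile(r"<title\b[^>]*>.*?</title>", re.IGNORECASE | re.DOTALL)
    if title_re.search(html_text):
        return title_re.sub(lambda m: new_title, html_text, count=1)
    head_re = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
    return head_re.sub(lambda m: m.group(0) + new_title, html_text, count=1)


def inject_header_footer(html_text, header, footer):
    """Insert header just after <body> and footer just before </body>."""
    open_body = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
    close_body = re.compile(r"</body\s*>", re.IGNORECASE)
    html_text = open_body.sub(lambda m: m.group(0) + header, html_text, count=1)
    return close_body.sub(lambda m: footer + m.group(0), html_text, count=1)
```

For instance, calling inject_header_footer(page, "<h1>MGMT 310</h1>", "<hr>") places the heading at the very top of the page body and the rule at the very bottom, leaving the content in between unchanged.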


NOTE 1: Converted material can also be copied to the clipboard from the code-view window.
NOTE 2: To use the dropdown lists for name and course, use the ADD button and then fill in the information for the list. Then use FILE > EXIT in order to save the configurations for the next use.

Options

Configuration / .ini File

The configuration file is a text file.
If you want to remove options from the configuration file, delete items one full line at a time.

If you delete the entire .ini file, a new one will be created when the program starts.
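
Because entries are removed one full line at a time, the same edit can be scripted. The Python sketch below is only an illustration: the function name remove_config_line is hypothetical, the file is assumed to be UTF-8 text, and the layout of a line (for example "Instructor=Jane Doe") is an assumption, since the actual format of the .ini file is not described here. Make a backup copy before editing the real file.

```python
from pathlib import Path


def remove_config_line(path, entry):
    """Delete every line equal to entry; return how many lines were removed."""
    config = Path(path)
    lines = config.read_text(encoding="utf-8").splitlines()
    kept = [line for line in lines if line.strip() != entry.strip()]
    # Write the remaining lines back, each as one full line.
    config.write_text("".join(line + "\n" for line in kept), encoding="utf-8")
    return len(lines) - len(kept)
```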

Choosing EDIT > DEFAULT VALUES restores the program's default values.  When you exit the program (and the configuration is saved), those values will be used at the next startup of the program.


History

The HTML_Cleaner was developed by James Falkofske.  Work continues on this product on the creator's personal time.
Its initial purpose was to clean Microsoft Word HTML files to improve accessibility.

' HTML Cleaner - copyright © 2005-08 James Falkofske - all rights reserved
' Version Maintenance Information
' Version 1.7.1 - added meta tags 1 and 2 for person cleaning content and owner
'   rebuilt file read to convert into Unicode default coding and converting non-compliant punctuation
' Version 1.7.2 - handled tab bug and forced conversions to UTF-8 on write of files.
'   rebuilt for 2007 with new instructions
' Version 1.7.6 - added encoding types for UTF-8, UTF-16LE, UTF-16BE, US-ASCII
' Version 1.7.7 - for Metropolitan State University - Removed date lock; modified ADD Meta button behavior to immediately save additions
' Version 1.7.9 - Date Lock reimplemented.
' Version 1.8.1 - Added FileTabs for setting attributes and options; separated Header and Footer into sets of 3 strings
'   Rewrote configuration file routines to store in initial program directory.
'   Updated CSS File routines to allow user-specified CSS/HTML code.
'   Updated Update Image routines to provide suggested new filename based upon any existing ALT tag.
' Version 1.8.6 - Autonaming of image files from ALT tags updated to reduce filename size to 25 characters or less.
' Version 1.9.0 - Fixed bug in which program stops if the file to convert is already open in another program.
' ***************************

Copyright and Usage

The HTML_Cleaner program, in its current version, is copyright 2005-2008 James C. Falkofske.
The current version is made available free-of-charge in compiled executable form to institutions of public education. 
The program does have a built-in expiration mechanism to force users to collect the most recent update of the program.

