Re: MSWord2k Cleanser Tool: Re: [ba-ohs-talk] (not) eating your own dogfood
More on this (might as well promote the ideas while the fire is hot):
> http://www.concept67.fsnet.co.uk/xml/ (01)
I have an extra perl script in there that adds XML hierarchy to HTML
according to the heading levels using <div> tags.
A minor edit of the perl would provide a different tag name if that
bivalence wasn't desired.
However, it's handy. If I serve it with one MIME type I get HTML in the
browser; with a different MIME type it's XML and behaves in the browser or
DOM accordingly. (02)
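
To illustrate the heading-level idea, here is a rough sketch in Python rather than the perl of the actual script. It is not the script from the site: the function name add_hierarchy and the tag parameter are made up for this example, and it assumes the cleaned markup is a flat run of well-formed elements with no headings buried inside tables or other containers.

```python
import re

# Matches the start of an opening heading tag, <h1 ... through <h6 ...
HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)

def add_hierarchy(body: str, tag: str = "div") -> str:
    """Wrap each heading plus the content after it in a <tag> element,
    nesting the wrappers according to heading level (hypothetical helper)."""
    out = []
    open_levels = []  # stack of heading levels that have an open wrapper
    pos = 0
    for match in HEADING.finditer(body):
        out.append(body[pos:match.start()])
        level = int(match.group(1))
        # A new heading closes every open section at the same or deeper level.
        while open_levels and open_levels[-1] >= level:
            out.append(f"</{tag}>")
            open_levels.pop()
        out.append(f"<{tag}>")
        open_levels.append(level)
        pos = match.start()
    out.append(body[pos:])
    # Close whatever sections are still open at the end of the document.
    out.extend(f"</{tag}>" for _ in open_levels)
    return "".join(out)

flat = "<h1>A</h1><p>x</p><h2>B</h2><p>y</p><h1>C</h1><p>z</p>"
nested = add_hierarchy(flat)
```

Passing tag="section" (or any other name) instead of the default covers the "different tag name" case mentioned above.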
The XSLT stylesheet is also configured to leave all the 'class="Blahstyle" '
attributes on elements.
So you can create XML from WordHTML, add hierarchy, and then add styles
using a CSS stylesheet based on the Word style names that are preserved into
If you have Word style templates in your organisation then you can be
consistent in your CSS with respect to those styles.
If you don't add the CSS some stuff doesn't look too neat, but all the
heading levels and paragraphs and tables get preserved, so it's all legible,
if dowdy. (03)
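
As a possible starting point for that CSS, the sketch below (Python, standard library only) collects the class names left on the elements and prints one empty rule per name. The names ClassCollector and css_skeleton are invented for this example, and the class values in the sample markup are only illustrative; a real document would carry whatever style names your Word templates define.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collects every distinct class name seen on any start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())

def css_skeleton(markup: str) -> str:
    """Return one empty CSS rule per class name, sorted alphabetically."""
    collector = ClassCollector()
    collector.feed(markup)
    collector.close()
    return "\n".join(f".{name} {{ }}" for name in sorted(collector.classes))

# Illustrative input only; the class names are assumptions.
sample = '<h1 class="MsoTitle">T</h1><p class="MsoNormal">a</p><p class="MsoNormal">b</p>'
skeleton = css_skeleton(sample)
```

The empty rules can then be filled in by hand to match the house style.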
Good ideas, but the tool is a gross stream of consciousness hack in the
main. It needs refinement in the HTML clean-up stage.
Using its present, poorly designed, massively wasteful algorithm it (the
beta app) will process a 100-page Word doc in around 15-17 seconds. (04)
Oh, and it's designed to batch process (but no fancy multi-threading, just
The beta app left some character refs unhandled, but I'm hoping the simple
cure in the new alpha takes care of that. (06)
Like I said, I hacked this a while back, v crudely, and subsequent to that
time I've seen one or two more coherent applications raise their heads.
Still, if you like the freedom to configure just about any part of it at the
code level, it's pretty good in that sense. (07)
----- Original Message -----
From: "Peter Jones" <email@example.com>
Sent: Monday, February 25, 2002 1:01 PM
Subject: MSWord2k Cleanser Tool: Re: [ba-ohs-talk] (not) eating your own
> I've had this sitting on my site for a while.
> It works pretty well on the whole. The new (untested) alpha script is
> probably better if it works (it should).
> I haven't been able to test it because I don't have Word on this machine so
> I can't check outputs properly.
> I wrote it precisely because Raggett's Tidy totally killed the utility of
> the files I needed to clean at the time.
> It uses perl, James Clark's SP and Michael Kay's InstantSaxon XSLT engine
> (on Windows OS). Small tweaks required to get it going on Linux.
> If you want to turn it into a Java App with greater control over whether it
> cleans MSWord HTML or other HTML, feel free.
> It's pretty well documented, but if you need extra explanations just let
> ----- Original Message -----
> From: "Murray Altheim" <firstname.lastname@example.org>
> To: <email@example.com>
> Sent: Monday, February 25, 2002 10:45 AM
> Subject: Re: [ba-ohs-talk] (not) eating your own dogfood
> > Eugene Eric Kim wrote:
> > > A fella in Finland decided to check the homepages of the W3C's 506
> > > organizations for valid HTML or XHTML. Only 18 sites validated.
> > >
> > > http://homepage.mac.com/marko/20020222.html
> > The big problem in web design is that almost nobody hand edits their
> > markup or even pays attention to it, and the GUI WYSIWYTYG (what you
> > see is what you think you get) editors in general produce some really
> > I challenge anyone to export "HTML" from MS Word and look at what it
> > creates. Amazing.
> > But I don't see that there's much to be done about this, given that
> > the emphasis from the W3C has never been much along the lines of
> > valid markup. It sometimes seems that they've done everything they
> > could to kill the use of the DTD, such that as a DTD and validation
> > advocate I often felt I was swimming upstream. While Tidy was initially
> > produced by Dave Raggett of the W3C, it itself doesn't produce valid
> > markup in many cases -- I've had to edit its output as well.
> > My guess is that those 18 sites may be managed by a validation zealot
> > like me, or had some type of company policy dictated by one. In the
> > end all one can do is produce better tools, or agitate for them, such
> > as this guy in Finland.
> > With the existence of XHTML and XML tools, it's actually pretty easy
> > to check one's markup nowadays, and even clean it up, so it's sad to
> > see so many corporate sites with poor design under the counter,
> > concentrating on flash rather than substance or interoperability.
> > But that's not unusual in business, is it?
> > Murray
> > ......................................................................
> > Murray Altheim <mailto:m.altheim @ open.ac.uk>
> > Knowledge Media Institute
> > The Open University, Milton Keynes, Bucks, MK7 6AA, UK
> > In the evening
> > The rice leaves in the garden
> > Rustle in the autumn wind
> > That blows through my reed hut. -- Minamoto no Tsunenobu