Subject: RE: Easy HTML to text function?
Author: "Malcolm Greene"
Posted: 2003/02/28 15:59:00
 


Ted,

Here are some notes from my experiences mechanically converting HTML to
readable text.

1. You have to expand HTML entities, e.g. &amp;, &lt;, &gt;, &nbsp;, plus
the numeric version of entity codes (&#224; for à). There are a ton of
named entities - check the w3.org site for an official list of entity
names.

2. You probably want to ignore content between <style> ... </style> and
<script> ... </script> tags.

3. You may want to have a strategy for preserving labels (or bullets)
associated with list elements <li>.

4. How do you plan on supporting information contained in tables? Just
stripping tags will cause your tabled content to explode into an
unreadable mess.

5. You will need a strategy for cleaning up whitespace because (a) HTML
ignores extra spaces, tabs and carriage returns/line feeds (you may want
to filter out whitespace before de-tagging) and (b) certain tags
(end tags?) imply a carriage return/line feed: list, paragraph, division
and similar tags.

6. You may want to expand <hr> tags into some sequence of dashes or
equal signs to provide the text equivalent of a horizontal rule. The
actual length will probably depend on the width of the text area and
whether or not you are displaying your text using a fixed pitch or
proportional font.
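
To make the six points above concrete, here is a rough sketch in Python
rather than VFP, using only the standard library's html.parser. The names
TextExtractor, html_to_text, rule_width, BLOCK_TAGS and SKIP_TAGS are made
up for illustration, and the table handling (cells joined with " | ") is
just one possible strategy, not a recommendation.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets; extend them to suit your own content.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self, rule_width=40):
        # convert_charrefs=True expands named and numeric entities
        # (&amp;, &nbsp;, &#224;, ...) before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self.rule_width = rule_width
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes are parsed separately, so <p class="x"> arrives as "p".
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "hr":
            self.parts.append("\n" + "-" * self.rule_width + "\n")
        elif tag == "li":
            self.parts.append("\n* ")   # keep a bullet for list items
        elif tag in ("td", "th"):
            self.parts.append(" | ")    # keep table cells apart
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Runs of whitespace in the source count as a single space.
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html, rule_width=40):
    parser = TextExtractor(rule_width)
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = [re.sub(r" {2,}", " ", line).strip() for line in text.split("\n")]
    # Blank lines are dropped entirely, so paragraph spacing is lost.
    return "\n".join(line for line in lines if line)
```

Calling html_to_text(page_source) returns one string with a line per block
element. Because the parser reports start and end tags independently, a
missing </li> or </p> does not derail it.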

Beware that when you are looking for specific tags (<p>, <li>, <h1>,
etc.) you can't depend on simple 1:1 matches, since HTML tags often
contain style and script attributes, e.g. <p font="arial"
onmouseover="..." ...>. Also, a lot of public HTML is not XHTML
compatible. This means that your source HTML content may or may not have
closing tags and/or may have closing tags without opening tags. The <p>
tag is often abused in this manner.
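
If you do go the pattern-matching route, here is a small sketch (again in
Python, with the hypothetical names P_TAG, ANY_TAG and paragraphs) of a
match that tolerates attributes and does not care whether <p> tags are
balanced. It assumes no attribute value contains a ">" character; real
pages can break that assumption.

```python
import re

# Matches <p>, <P font="arial" ...> and </p>. The \b keeps it from
# also matching longer tag names such as <pre> or <param>.
P_TAG = re.compile(r"</?p\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def paragraphs(html):
    """Treat every <p> or </p>, paired or not, as a paragraph boundary."""
    chunks = P_TAG.split(html)
    # Strip any remaining tags, then collapse whitespace in each chunk.
    cleaned = (" ".join(ANY_TAG.sub("", chunk).split()) for chunk in chunks)
    return [chunk for chunk in cleaned if chunk]
```

Splitting on both the opening and closing forms means a stray </p> or a
<p> that is never closed just produces an empty chunk, which is dropped.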

The process of converting HTML to readable text is like peeling away the
layers of an onion. Just when you think you've covered all the bases,
along comes another layer (HTML feature) that you need to handle.
Stretching the onion analogy a little more, be prepared for lots of
tears as well<g>!

Looking forward to hearing what strategy you come up with ...

Malcolm

PS: Plan on having several cases of beer on hand for "medicinal
purposes"!


<snip>
Okay, so I'm scraping HTML on a site, and I'd like to be able to present a
snippet of the web site to a viewer in an edit box. What I'd like to do is
strip out all of the HTML contained within less-than and greater-than signs,
and store it as text. I know I can go through the text with AT() and
STRTRAN() out all the junk, but is there an easier way to do it?

What I want is the opposite of STREXTRACT(), and that's actually one way to
do it: searching for "<" and ">" and StrTran the results with what you
StrExtract, but I wonder if anyone knows of an easier, built-in or Win32API
function.
</snip>




 
©2003 Malcolm Greene