Subject: RE: Easy HTML to text function?
Author: "Ted Roche"
Posted: 2003/02/28 18:06:00
 


Very good points, Malcolm. Fortunately, I'm just going for a simple
representation of the text content of a page, and speed and simplicity are
more important than precision (at least in this round; the client is always
more than welcome to pay me for more <g>!)

-----Original Message-----
From: profox-admin (AT) leafe .DO.T com [mailto:profox-admin@leafe.com]On Behalf Of
Malcolm Greene
Sent: Friday, 28 February, 2003 15:01
To: profox (AT) leafe .DO.T com
Subject: RE: Easy HTML to text function?



Ted,

Here are some notes from my experiences mechanically converting HTML to
readable text.

1. You have to expand HTML entities, e.g. &amp;, &lt;, &gt;, &nbsp;, plus the
numeric version of entity codes (&#224;). There are a ton of named entities -
check the w3.org site for an official list of entity names.

2. You probably want to ignore content between <style> ... </style> and
<script> ... </script> tags?

3. You may want to have a strategy for preserving labels (or bullets)
associated with list elements <li>.

4. How do you plan on supporting information contained in tables? Just
stripping tags will cause your tabled content to explode into an unreadable
mess.

5. You will need a strategy for cleaning up whitespace because (a) HTML
ignores extra spaces, tabs and carriage returns/line feeds (you may want to
filter out white space before de-tagging) and (b) certain tags (endtags?)
imply a carriage return/line feed: list, paragraph, division, etc. tags.

6. You may want to expand <hr> tags into some sequence of dashes or equal
signs to provide the text equivalent of a horizontal rule. The actual length
will probably depend on the width of text area and whether or not you are
displaying your text using a fixed pitch or proportional font.
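
As a rough illustration of points 1, 2, 3, 5 and 6 above, here is a minimal sketch in Python (chosen only for illustration; it is not the FoxPro code under discussion). It leans on the standard library's html.parser, and the names TextExtractor, html_to_text and rule_width are made up for this example. Tables (point 4) are not really handled: rows get a line break and cells a space, nothing more.

```python
import re
from html.parser import HTMLParser

# Tags assumed (for this sketch) to imply a line break.
BLOCK_TAGS = {"p", "div", "br", "ul", "ol", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # content of these is dropped entirely


class TextExtractor(HTMLParser):
    def __init__(self, rule_width=40):
        # convert_charrefs=True expands &amp;, &nbsp;, &#224; and friends
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0
        self.rule_width = rule_width

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "hr":
            # text stand-in for a horizontal rule
            self.parts.append("\n" + "-" * self.rule_width + "\n")
        elif tag == "li":
            self.parts.append("\n* ")  # keep a bullet for list items
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in ("td", "th"):
            self.parts.append(" ")  # stop adjacent cells fusing together
        elif tag in BLOCK_TAGS or tag == "li":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # HTML ignores runs of whitespace, so collapse them to one space
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html, rule_width=40):
    parser = TextExtractor(rule_width)
    parser.feed(html)
    parser.close()  # flushes any trailing text
    lines = "".join(parser.parts).split("\n")
    lines = [re.sub(r" {2,}", " ", line).strip() for line in lines]
    return "\n".join(line for line in lines if line)
```

Calling html_to_text(page_source) returns one non-empty line per block element. The rule width is a guess; as noted in point 6, the right length depends on the text area and font.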

Beware that when you are looking for specific tags (<p>, <li>, <h1>, etc.)
you can't depend on simple 1:1 matches, since HTML tags often contain
style and script attributes, e.g. <p font="arial" onmouseover="..." ...>.
Also, a lot of public HTML is not XHTML compatible. This means that your
source HTML content may or may not have closing tags and/or may have closing
tags without opening tags. The <p> tag is often abused in this manner.
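
One way to cope with both problems, sketched in Python for illustration only: match a tag by name while allowing any attributes, and treat every opening or closing occurrence as a break whether or not it is paired. The helper names tag_pattern and paragraph_breaks are hypothetical. A known limitation is that an attribute value containing ">" will end the match early.

```python
import re


def tag_pattern(name):
    """Regex for <name ...>, </name> or <name/> in any letter case.

    The \\b stops "p" from also matching <pre> or <param>.
    """
    return re.compile(r"</?" + re.escape(name) + r"\b[^>]*>", re.IGNORECASE)


P_TAG = tag_pattern("p")


def paragraph_breaks(html):
    """Replace every <p> or </p>, paired or not, with a newline."""
    return P_TAG.sub("\n", html)
```

Because opening and closing tags are handled identically, a stray </p> or a <p> that is never closed produces the same line break as a well-formed pair.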

The process of converting HTML to readable text is like peeling away the
layers of an onion. Just when you think you've covered all the bases, along
comes another layer (HTML feature) that you need to handle. Stretching the
onion analogy a little more, be prepared for lots of tears as well<g>!

Looking forward to hearing what strategy you come up with ...

Malcolm

PS: Plan on having several cases of beer on hand for "medicinal purposes"!


<snip>
Okay, so I'm scraping HTML on a site, and I'd like to be able to present a
snippet of the web site to a viewer in an edit box. What I'd like to do is
strip out all of the HTML contained within less-than and greater-than signs,
and store it as text. I know I can go through the text with AT() and
STRTRAN() out all the junk, but is there an easier way to do it?

What I want is the opposite of STREXTRACT(), and that's actually one way to
do it: searching for "<" and ">" and StrTran the results with what you
StrExtract, but I wonder if anyone knows of an easier, built-in or Win32API
function.
</snip>
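
For reference, the "opposite of STREXTRACT()" idea asked about here, i.e. delete everything between "<" and ">", can be sketched with a single regular expression. This is Python for illustration, not one of the FoxPro functions named above, and strip_tags is a made-up name. It does nothing about entities, script or style content, or whitespace.

```python
import re

# Anything from "<" up to the next ">" counts as a tag, including
# tags that span several lines ([^>] also matches newlines).
TAG = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove every <...> run and keep whatever text is left."""
    return TAG.sub("", html)
```

Note that a literal "<" in the page text (as in "a < b") is treated as the start of a tag, so text up to the next ">" is lost.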





 
©2003 Ted Roche