Very good points, Malcolm. Fortunately, I'm just going for a simple representation of the text content of a page, and speed and simplicity are more important than precision (at least in this round; the client is always more than welcome to pay me for more <g>!).
-----Original Message-----
From: profox-admin (AT) leafe .DO.T com [mailto:profox-admin@leafe.com] On Behalf Of Malcolm Greene
Sent: Friday, 28 February, 2003 15:01
To: profox (AT) leafe .DO.T com
Subject: RE: Easy HTML to text function?
Ted,
Here are some notes from my experiences mechanically converting HTML to readable text.
1. You have to expand HTML entities, e.g. &amp;, &lt;, &gt;, plus the numeric version of entity codes (e.g. &#224; for à). There are a ton of named entities - check the w3.org site for an official list of entity names.
2. You probably want to ignore content between <style> ... </style> and <script> ... </script> tags.
3. You may want to have a strategy for preserving labels (or bullets) associated with list elements <li>.
4. How do you plan on supporting information contained in tables? Just stripping tags will cause your tabled content to explode into an unreadable mess.
5. You will need a strategy for cleaning up whitespace because (a) HTML ignores extra spaces, tabs and carriage returns/line feeds (you may want to filter out white space before de-tagging) and (b) certain tags (endtags?) imply a carriage return/line feed: list, paragraph, division, etc. tags.
6. You may want to expand <hr> tags into some sequence of dashes or equal signs to provide the text equivalent of a horizontal rule. The actual length will probably depend on the width of text area and whether or not you are displaying your text using a fixed pitch or proportional font.
Beware that when you are looking for specific tags (<p>, <li>, <h1>, etc.), you can't depend on simple 1:1 matches, since HTML tags often carry style and script attributes, e.g. <p font="arial" onmouseover="..." ...>. Also, a lot of public HTML is not XHTML-compliant. This means that your source HTML content may or may not have closing tags, and/or may have closing tags without opening tags. The <p> tag is often abused in this manner.
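For illustration only, here is a rough sketch of how points 1 through 6 might fit together. It is written in Python rather than VFP, using the standard library's html.parser, so treat it as a statement of the idea and not as something to drop into a FoxPro app. The names TextExtractor, html_to_text, BLOCK_TAGS, SKIP_TAGS and rule_width are all made up for this example, the "* " bullet and the 40-dash rule are arbitrary choices, and tables (point 4) get nothing better than one line per row.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; extend them to taste.
BLOCK_TAGS = {"p", "div", "br", "ul", "ol", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects readable text while the parser walks the markup."""

    def __init__(self, rule_width=40):
        # convert_charrefs=True expands named and numeric entities (point 1).
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0
        self.rule_width = rule_width

    def handle_starttag(self, tag, attrs):
        # The parser matches on the tag name, so attributes such as
        # onmouseover="..." do not get in the way.
        if tag in SKIP_TAGS:
            self.skip_depth += 1                      # point 2
        elif tag == "hr":
            self.parts.append("\n" + "-" * self.rule_width + "\n")  # point 6
        elif tag == "li":
            self.parts.append("\n* ")                 # point 3
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")                   # point 5(b)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS or tag == "li":
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse runs of spaces, tabs and CR/LF (point 5(a)).
            self.parts.append(re.sub(r"\s+", " ", data))

    def text(self):
        lines = [line.strip() for line in "".join(self.parts).split("\n")]
        return "\n".join(line for line in lines if line)


def html_to_text(markup, rule_width=40):
    parser = TextExtractor(rule_width)
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Because line breaks are emitted on start tags as well as end tags, an unclosed <p> or <li> still starts a new line, which is about as much tolerance for sloppy HTML as a sketch this size can offer.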
The process of converting HTML to readable text is like peeling away the layers of an onion. Just when you think you've covered all the bases, along comes another layer (HTML feature) that you need to handle. Stretching the onion analogy a little more, be prepared for lots of tears as well<g>!
Looking forward to hearing what strategy you come up with ...
Malcolm
PS: Plan on having several cases of beer on hand for "medicinal purposes"!
<snip> Okay, so I'm scraping HTML on a site, and I'd like to be able to present a snippet of the web site to a viewer in an edit box. What I'd like to do is strip out all of the HTML contained within less-than and greater-than signs, and store it as text. I know I can go through the text with AT() and STRTRAN() out all the junk, but is there an easier way to do it?
What I want is the opposite of STREXTRACT(), and that's actually one way to do it: searching for "<" and ">" and StrTran the results with what you StrExtract, but I wonder if anyone knows of an easier, built-in or Win32API function. </snip>
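This is not the built-in or Win32 API function asked about, but for reference the brute-force "delete everything between < and >" approach can be sketched in a few lines. The example is in Python purely as an illustration; strip_tags and TAG_RE are names invented here. It assumes no ">" appears inside an attribute value, and it leaves the contents of <script> and <style> blocks in place.

```python
import html
import re

# Matches "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(markup):
    """Remove anything that looks like a tag, then expand entities."""
    without_tags = TAG_RE.sub("", markup)
    # Unescape last, so that &lt; and &gt; in the text are not mistaken for tags.
    return html.unescape(without_tags)
```

For a quick snippet in an edit box this may be enough; anything beyond that runs into the caveats in Malcolm's list.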
©2003 Ted Roche