Ted,
Here are some notes from my experiences mechanically converting HTML to readable text.
1. You have to expand HTML entities, e.g. &amp;, &lt;, &gt;, plus the numeric version of entity codes (e.g. &#224; for à). There are a ton of named entities - check the w3.org site for an official list of entity names.
2. You probably want to ignore content between <style> ... </style> and <script> ... </script> tags?
3. You may want to have a strategy for preserving labels (or bullets) associated with list elements <li>.
4. How do you plan on supporting information contained in tables? Just stripping tags will cause your tabular content to explode into an unreadable mess.
5. You will need a strategy for cleaning up whitespace because (a) HTML ignores extra spaces, tabs and carriage returns/line feeds (you may want to filter out white space before de-tagging) and (b) certain tags (endtags?) imply a carriage return/line feed: list, paragraph, division, etc. tags.
6. You may want to expand <hr> tags into some sequence of dashes or equal signs to provide the text equivalent of a horizontal rule. The actual length will probably depend on the width of text area and whether or not you are displaying your text using a fixed pitch or proportional font.
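For what it's worth, here is a rough sketch of points 1, 2, 3, 5 and 6 in Python (not VFP, purely for illustration), built on the standard library's html.parser module. The names TextExtractor, html_to_text, BLOCK_TAGS, SKIP_TAGS and rule_width are made up for this example, the "* " bullet and the 40-dash rule are arbitrary choices, and tables (point 4) get nothing better than a line break per row.

```python
import re
from html.parser import HTMLParser

# Hypothetical tag sets for this sketch; extend to taste.
BLOCK_TAGS = {"p", "div", "br", "ul", "ol", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    def __init__(self, rule_width=40):
        # convert_charrefs=True makes the parser expand named and
        # numeric entities before handle_data() sees the text.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0
        self.rule_width = rule_width

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1          # point 2: ignore script/style
        elif tag == "hr":
            self.parts.append("\n" + "-" * self.rule_width + "\n")  # point 6
        elif tag == "li":
            self.parts.append("\n* ")     # point 3: keep a bullet
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")       # point 5b: implied line break

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS or tag == "li":
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            # point 5a: collapse runs of whitespace to a single space
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html, rule_width=40):
    parser = TextExtractor(rule_width)
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)  # drop blank lines


if __name__ == "__main__":
    print(html_to_text("<p>Fish &amp;   chips</p><ul><li>one<li>two</ul><hr>"))
```

Note that this drops every blank line, so paragraphs end up separated by a single line break; keeping one blank line between paragraphs would be a reasonable alternative.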
Beware that when you are looking for specific tags (<p>, <li>, <h1>, etc.) you can't depend on simple 1:1 matches, since HTML tags often contain style and script attributes, e.g. <p font="arial" onmouseover="..." ...>. Also, a lot of public HTML is not XHTML compatible. This means that your source HTML content may or may not have closing tags and/or may have closing tags without opening tags. The <p> tag is often abused in this manner.
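To illustrate both problems, here is a small Python sketch (again, not VFP) that matches <p> tags regardless of their attributes and does not care whether closing tags are present or balanced. The names P_OPEN, P_CLOSE and paragraph_breaks are invented for this example. One known weakness: the pattern assumes no attribute value contains a literal ">", which real-world script attributes can violate.

```python
import re

# "<p" followed by a word boundary, then anything up to the next ">".
# The \b keeps this from matching <pre> or <param>.
P_OPEN = re.compile(r"<p\b[^>]*>", re.IGNORECASE)
P_CLOSE = re.compile(r"</p\s*>", re.IGNORECASE)


def paragraph_breaks(html):
    """Turn every opening <p ...> into a blank line and drop all </p> tags.

    Closing tags are simply discarded, so missing or stray ones do no harm.
    """
    html = P_CLOSE.sub("", html)
    return P_OPEN.sub("\n\n", html)


if __name__ == "__main__":
    messy = '<P font="arial" onmouseover="x()">One<p>Two</p></p><pre>kept</pre>'
    print(repr(paragraph_breaks(messy)))
```

Treating only the opening tag as significant is a design choice that sidesteps unbalanced markup; it would need rethinking for tags where the closing tag carries meaning.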
The process of converting HTML to readable text is like peeling away the layers of an onion. Just when you think you've covered all the bases, along comes another layer (HTML feature) that you need to handle. Stretching the onion analogy a little more, be prepared for lots of tears as well <g>!
Looking forward to hearing what strategy you come up with ...
Malcolm
PS: Plan on having several cases of beer on hand for "medicinal purposes"!
<snip> Okay, so I'm scraping HTML on a site, and I'd like to be able to present a snippet of the web site to a viewer in an edit box. What I'd like to do is strip out everything contained within less-than and greater-than signs, and store what's left as text. I know I can go through the text with AT() and STRTRAN() out all the junk, but is there an easier way to do it?
What I want is the opposite of STREXTRACT(), and that's actually one way to do it: search for "<" and ">", then STRTRAN() out whatever STREXTRACT() returns. But I wonder if anyone knows of an easier, built-in or Win32 API function. </snip>
©2003 Malcolm Greene