html-parsing: Permissive Parsing of HTML to SXML/xexp in Racket
License: LGPL 3
Web: http://www.neilvandyke.org/racket-html-parsing/
(require html-parsing)    package: html-parsing
1 Introduction
The html-parsing library provides a permissive HTML parser. The parser is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. html-parsing emits SXML/xexp, so that conventional HTML, including invalid HTML, may be processed with XML tools such as SXPath. Like Oleg Kiselyov’s SSAX-based HTML parser, html-parsing provides a permissive tokenizer, but html-parsing goes further by also attempting to recover syntactic structure.
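As a minimal sketch of the SXPath workflow mentioned above, the following parses a small fragment and collects its link targets. It assumes the parser is required as html-parsing and that the separate sxml package is installed to supply sxpath; the sample markup is invented for illustration.

```racket
#lang racket/base
;; Assumptions: the module path `html-parsing`, and `sxpath` taken from
;; the separately installed `sxml` package. The HTML is illustrative only.
(require html-parsing
         (only-in sxml sxpath))

(define doc
  (html->xexp
   "<p>See <a href=\"a.html\">A</a> and <a href=\"b.html\">B</a>.</p>"))

;; Select the text of every href attribute on any a element,
;; at any depth below the *TOP* node.
(define hrefs ((sxpath '(// a @ href *text*)) doc))

hrefs
```

Because the parser output is ordinary SXML/xexp, the same query style should carry over to other SXML-aware tools.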
html-parsing is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully rather than signaling a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages contain syntax errors that would defeat a strict or validating parser. The error handling is intended to generally emulate popular Web browsers’ interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse “pragmatic.”
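To illustrate the permissive behavior, here is a small sketch that feeds deliberately malformed markup to the parser. The input string is made up, and the exact tree the parser recovers is not shown here because it depends on the parser's recovery heuristics; the only point is that a result comes back instead of an error.

```racket
#lang racket/base
;; Assumption: the module path `html-parsing`.
(require html-parsing)

;; Unclosed <p> and <b> elements, an unquoted attribute value,
;; and a stray </i> end tag with no matching start tag.
(define messy "<p class=intro>one<p>two <b>bold</i></p>")

(define result (html->xexp messy))

;; The result is a list headed by the symbol *TOP*.
(car result)

;; Print the recovered structure for inspection.
result
```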
html-parsing also has some support for XHTML: XML namespace qualifiers are accepted, but they are stripped from the resulting SXML/xexp. Note that valid XHTML input might be better handled by a validating XML parser like Kiselyov’s SSAX.
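A quick way to see how namespace qualifiers are treated is to parse a qualified element and print the result. This is only a sketch: the xhtml prefix and the markup are hypothetical, and no output is shown because the exact form should be checked against the installed version.

```racket
#lang racket/base
;; Assumption: the module path `html-parsing`.
(require html-parsing)

;; An element written with a namespace qualifier (hypothetical prefix).
(define qualified "<xhtml:p>hi</xhtml:p>")

(define parsed (html->xexp qualified))

;; Print the tree to inspect how the qualified name was handled.
parsed
```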
This package obsoletes HtmlPrag.
2 Interface
procedure
(html->xexp input) → any/c
input : any/c
Permissively parse HTML from input, which is either an input port or a string, and emit an SXML/xexp equivalent or approximation. To borrow and slightly modify an example from Kiselyov’s discussion of his HTML parser:
    (html->xexp
     (string-append
      "<html><head><title></title><title>whatever</title></head><body>\n"
      "<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">\n"
      "<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>\n"
      "still &lt; bold </b></body><P> But not done yet..."))
    ==>
    (*TOP* (html (head (title) (title "whatever"))
                 (body "\n"
                       (a (@ (href "url")) "link")
                       (p (@ (align "center"))
                          (ul (@ (compact) (style "aa")) "\n"))
                       (p "BLah"
                          (*COMMENT* " comment <comment> ")
                          " "
                          (i " italic " (b " bold " (tt " ened")))
                          "\n"
                          "still < bold "))
                 (p " But not done yet...")))
Note that, in the emitted SXML/xexp, the text token "still < bold " ends up as a child of the p element rather than inside the b element, which represents an unfortunate failure to emulate all the quirks-handling behavior of some popular Web browsers.
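Since input may be either a string or an input port, the two forms can be compared directly. The sketch below assumes the module path html-parsing; the markup is illustrative, and open-input-string is the standard Racket way to wrap a string as a port.

```racket
#lang racket/base
;; Assumption: the module path `html-parsing`.
(require html-parsing)

(define markup "<ul><li>one<li>two</ul>")

;; Parse the same markup once as a string and once through a port.
(define from-string (html->xexp markup))
(define from-port (html->xexp (open-input-string markup)))

;; Compare the two parse results structurally.
(equal? from-string from-port)
```

The port form is the natural choice for files or network responses, for example inside call-with-input-file, since the content does not need to be read into a string first.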
3 History
Version 0.3 (2011-08-27), PLaneT (1 2)
    Converted test suite from Testeez to Overeasy.
Version 0.2 (2011-08-27), PLaneT (1 1)
    Fixed embarrassing bug due to code tidying. Thanks to Danny Yoo for reporting.
Version 0.1 (2011-08-21), PLaneT (1 0)
    Part of forked development from HtmlPrag.
4 Legal
Copyright (c) 2003–2011 Neil Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License (LGPL 3), or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See http://www.gnu.org/licenses/ for details. For other licenses and consulting, please contact the author.
Standard Documentation Format Note: The API signatures in this documentation are likely incorrect in some regards, such as indicating type any/c for things that are not, and not indicating when arguments are optional. This is due to an unfinished transition from the Texinfo documentation format to Scribble, which the author intends to finish someday.