Dehtml 1.8
Dehtml removes HTML constructs from documents for indexing,
spell checking and so on. My own implementation is a little smarter than the
other implementations I have seen, because it knows about certain tags
and expands entities to Latin-1 characters. It can generate
a word list for spell checking tools and can omit headers for sentence
analysis tools.
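To make the idea concrete, here is a minimal Python sketch of the same kind of processing: dropping tags, expanding entities, and producing a word list. It is not Dehtml's own code and does not reproduce its options. The names `TextExtractor`, `strip_html` and `word_list`, and the tag sets `SKIPPED_TAGS` and `BLOCK_TAGS`, are assumptions made for illustration only.

```python
import re
from html.parser import HTMLParser

# Both tag sets are illustrative assumptions, not the tags Dehtml knows about.
SKIPPED_TAGS = {"script", "style"}          # content is not prose
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "td",
              "h1", "h2", "h3", "h4", "h5", "h6"}  # act as word separators


class TextExtractor(HTMLParser):
    """Collect the text of an HTML document and drop the markup."""

    def __init__(self):
        # convert_charrefs=True expands entities such as &uuml; in text
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.chunks.append("\n")

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(document):
    """Return the plain text of an HTML string."""
    parser = TextExtractor()
    parser.feed(document)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def word_list(document):
    """Return the sorted, de-duplicated words found in an HTML string."""
    text = strip_html(document)
    # [^\W\d_]+ matches runs of letters, including accented ones
    return sorted(set(re.findall(r"[^\W\d_]+", text)))
```

Calling `strip_html("<p>Gr&uuml;&szlig;e &amp; <b>more</b></p>")` yields the text `Grüße & more` surrounded by newlines, and `word_list` turns a document into input suitable for a spell checker. Inline tags such as `<b>` do not split words, while block tags do. The result is a Python string; encoding it with `text.encode("latin-1", errors="replace")` would give Latin-1 bytes, with characters outside that range replaced.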
The current version is available as a
GNU zipped tape archive.