Dehtml 1.8

Dehtml removes HTML markup from documents so that they can be indexed, spell-checked and so on. My own implementation is a little smarter than the other implementations I have seen, because it knows about certain tags and expands entities to Latin-1 characters. It can also generate a word list for spell-checking tools and omit headers for sentence-analysis tools.
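To illustrate the general idea, here is a minimal Python sketch of tag stripping with entity expansion and word-list generation. It is not dehtml's actual code and does not mirror its options: the names `strip_html` and `word_list`, and the particular set of tags treated as word breaks or skipped entirely, are assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser

# Assumed tag sets for this sketch; dehtml's own tag knowledge may differ.
BLOCK_TAGS = {"p", "br", "div", "li", "tr", "td", "th", "title",
              "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects text, expanding entities and dropping script/style bodies."""

    def __init__(self):
        # convert_charrefs=True turns &eacute; and &#233; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")  # block tags separate words

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def word_list(html):
    """Return the sorted, lower-cased set of words found in the text."""
    words = re.findall(r"[^\W\d_]+", strip_html(html))
    return sorted({w.lower() for w in words})


if __name__ == "__main__":
    sample = "<p>Caf&eacute; <b>au</b> lait</p><script>var x=1;</script><p>ok</p>"
    print(strip_html(sample))
    print(word_list(sample))
```

Inline tags such as `<b>` add no whitespace here, so a word that is only partly emphasised stays in one piece, while block-level tags act as word breaks. That is the kind of tag awareness a naive strip-everything-between-angle-brackets approach lacks.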

The current version is available as a GNU zipped tape archive.