Source: html-text
Section: python
Priority: optional
Maintainer: Christian Marillat
Homepage: https://github.com/TeamHG-Memex/html-text
Rules-Requires-Root: no
Standards-Version: 4.6.1
Build-Depends: debhelper-compat (= 13), dh-sequence-python3, python3, python3-setuptools

Package: python3-html-text
Architecture: all
Depends: ${python3:Depends}, ${misc:Depends}
Description: extract text from HTML
 How is html_text different from .xpath('//text()') from lxml or
 .get_text() from Beautiful Soup?
 .
  * Text extracted with html_text does not contain inline styles,
    JavaScript, comments and other text that is not normally visible
    to users;
  * html_text normalizes whitespace, but in a way smarter than
    .xpath('normalize-space()'), adding spaces around inline elements
    (which are often used as block elements in HTML markup), and trying
    to avoid adding extra spaces for punctuation;
  * html_text can add newlines (e.g. after headers or paragraphs), so
    that the output text looks more like how it is rendered in browsers.