Overview
HTML email bodies are designed for visual rendering — they contain inline styles, table layouts, image tags, tracking pixels, and nested elements that make them unsuitable for direct text processing. When you receive an HTML-only email via JsonHook and need to extract its readable content, you must convert the HTML to plain text.
JsonHook includes a textBody field with the plain-text alternative if the sender included one. Many automated sending systems (marketing tools, SaaS platforms) send HTML-only emails without a plain-text alternative, resulting in textBody: null. In those cases, you must convert htmlBody to text yourself.
The goal of HTML-to-text conversion is to produce human-readable text that preserves the meaningful content of the email while discarding visual layout noise. A good converter handles: block element spacing, list formatting, link text extraction, table structure, and special HTML entities.
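To make the block-spacing point concrete, here is a toy comparison, not a replacement for a real converter. Both function names are hypothetical and the list of block tags is illustrative only. It shows how deleting tags outright glues paragraphs and list items together, while mapping block boundaries to newlines keeps the text readable.

```typescript
// Naive approach: delete every tag. Block boundaries vanish with the tags.
function stripTagsNaively(html: string): string {
  return html.replace(/<[^>]+>/g, "");
}

// Slightly better sketch: turn block-level boundaries into newlines first.
// The tag list is illustrative, not exhaustive.
function stripTagsWithBlockSpacing(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, "\n")                     // explicit line breaks
    .replace(/<li(\s[^>]*)?>/gi, "\n- ")               // list items become "- " bullets
    .replace(/<\/(p|div|h[1-6]|ul|ol|tr)>/gi, "\n")    // closing block tags end a line
    .replace(/<[^>]+>/g, "")                           // drop whatever tags remain
    .replace(/\n{3,}/g, "\n\n")                        // collapse runs of blank lines
    .trim();
}

const sample =
  "<h1>Order shipped</h1><p>Your items:</p><ul><li>Blue mug</li><li>Tea tin</li></ul>";

const naive = stripTagsNaively(sample);
const spaced = stripTagsWithBlockSpacing(sample);

console.log(JSON.stringify(naive));
console.log(JSON.stringify(spaced));
```

Even the second version ignores tables, links, and entities, which is why a dedicated library is usually the safer choice.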
Prerequisites
Requirements for HTML-to-text conversion:
- A JsonHook webhook payload with an htmlBody field
- An HTML-to-text library for your language:
  - Node.js: html-to-text, node-html-parser
  - Python: html2text, BeautifulSoup
  - Ruby: Nokogiri, html2text gem
  - Go: html-to-markdown, goquery
Step-by-Step Instructions
Convert an HTML email to usable plain text:
- Prefer textBody when available:
    const body = email.textBody ?? convertHtmlToText(email.htmlBody ?? "");
- Strip script and style tags first, since their content should never appear in plain text output:
    const cleanHtml = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "");
- Use a proper HTML-to-text library rather than a simple tag-stripping regex. Libraries handle block spacing, lists, tables, and entities correctly.
- Configure the converter to preserve links, format headings, and handle table layouts as your use case requires.
- Post-process the result — collapse multiple blank lines, trim leading/trailing whitespace.
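The steps above can be sketched as one small pipeline. This is a minimal illustration under stated assumptions: the EmailPayload shape covers only the two fields discussed here, the helper names (stripScriptAndStyle, tidyText, emailToText) are hypothetical, and the converter is passed in as a parameter so that any library can be plugged in. The placeholder converter exists only to make the sketch runnable.

```typescript
// Hypothetical payload shape; only the two fields discussed above are assumed.
interface EmailPayload {
  textBody: string | null;
  htmlBody: string | null;
}

// Step 2: drop <script> and <style> blocks, including their contents.
function stripScriptAndStyle(html: string): string {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, "");
}

// Step 5: normalise line endings, drop trailing spaces, collapse blank-line runs.
function tidyText(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")
    .replace(/[ \t]+\n/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Steps 1 to 5 combined. `convert` stands in for whichever library you choose.
function emailToText(
  email: EmailPayload,
  convert: (html: string) => string,
): string {
  if (email.textBody) return tidyText(email.textBody);
  if (!email.htmlBody) return "";
  return tidyText(convert(stripScriptAndStyle(email.htmlBody)));
}

// Placeholder converter for demonstration only: every tag becomes a newline.
const placeholderConvert = (html: string): string =>
  html.replace(/<[^>]+>/g, "\n");

const demo = emailToText(
  {
    textBody: null,
    htmlBody:
      "<style>p{color:red}</style><p>Hello</p><script>track()</script><p>World</p>",
  },
  placeholderConvert,
);
console.log(JSON.stringify(demo));
```

Passing the converter as an argument keeps the fallback and clean-up logic testable without pulling in a specific library.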
Code Example
Node.js handler using the html-to-text library:
import express from "express";
import { convert } from "html-to-text";

// Assumed setup: Express, with the raw body kept as a Buffer for signature verification
const app = express();
app.use("/webhooks/email", express.raw({ type: "application/json" }));

function getEmailText(email: any): string {
  // Prefer the pre-existing plain text version
  if (email.textBody) return email.textBody;
  if (!email.htmlBody) return "";
  // Convert HTML to plain text
  return convert(email.htmlBody, {
    wordwrap: 130,
    selectors: [
      { selector: "a", options: { hideLinkHrefIfSameAsText: true } },
      { selector: "img", format: "skip" }, // Skip images
      { selector: "table", options: { uppercaseHeaderCells: false } },
    ],
  }).trim();
}

app.post("/webhooks/email", (req, res) => {
  // ... verify signature ...
  const { email } = JSON.parse(req.body.toString());
  const text = getEmailText(email);
  // Now safe to process as plain text
  // filter(Boolean) keeps an empty body from counting as one word
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const hasOrderNumber = /order\s+#?\d+/i.test(text);
  console.log(`Email text: ${wordCount} words, hasOrder: ${hasOrderNumber}`);
  console.log("Preview:", text.slice(0, 300));
  res.sendStatus(200);
});
Common Pitfalls
HTML-to-text conversion pitfalls:
- Using a simple tag-stripping regex. A naive regex like /<[^>]+>/g produces unreadable output because it discards the spacing and line breaks that HTML block elements imply. Use a library that understands HTML structure.
- HTML entities in the output. Raw HTML parsers may leave &nbsp;, &amp;, and other entities in the text. Make sure your library decodes HTML entities, or run a decode pass after conversion.
- Tracking pixels and 1x1 images. Marketing emails contain tracking pixels (<img> tags with tiny dimensions). Configure your converter to skip img elements to avoid broken image references in the text output.
- Table-heavy marketing emails. Many HTML emails use complex table layouts. The text output from these can be hard to process programmatically. Consider using the htmlBody for display and targeting specific DOM elements with a CSS selector rather than converting the entire body to text.
- Assuming the HTML is valid. Email HTML is frequently malformed: unclosed tags, missing doctype, inline scripts mixed with content. Use a lenient parser (not an XML parser) that can handle real-world email HTML.
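For the entity pitfall, a decode pass after conversion might look like the sketch below. It is deliberately small: the decodeEntities name and the entity table are assumptions for illustration, only a handful of named entities are covered, and &nbsp; is mapped to an ordinary space on purpose so that later whitespace handling treats it like any other space. A full HTML entity decoder covers far more names.

```typescript
// A small table of common named entities; a real decoder covers many more.
// Note: &nbsp; is mapped to a regular space here, not U+00A0, by choice.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", " "],
]);

function decodeEntities(text: string): string {
  // One pass over the string, so "&amp;lt;" becomes "&lt;" and is not decoded twice.
  return text.replace(
    /&(#x[0-9a-f]+|#\d+|[a-z][a-z0-9]*);/gi,
    (match: string, body: string): string => {
      if (body[0] === "#") {
        const isHex = body[1] === "x" || body[1] === "X";
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Leave out-of-range references untouched rather than throwing.
        return codePoint > 0 && codePoint <= 0x10ffff
          ? String.fromCodePoint(codePoint)
          : match;
      }
      // Unknown names are left as-is so nothing is silently lost.
      return NAMED_ENTITIES.get(body) ?? match;
    },
  );
}

const decoded = decodeEntities("Tom &amp; Jerry&nbsp;&#169; &lt;3 &#x41; &unknown;");
console.log(decoded);
```

Leaving unrecognised references untouched makes gaps in the table easy to spot in the output.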