How to Process HTML Email to Plain Text

Many emails arrive as HTML-only. Learn how to extract readable plain text from HTML email bodies so your application can process, display, or analyze email content reliably.

Table of Contents
  1. Overview
  2. Prerequisites
  3. Step-by-Step Instructions
  4. Code Example
  5. Common Pitfalls

Overview

HTML email bodies are designed for visual rendering — they contain inline styles, table layouts, image tags, tracking pixels, and nested elements that make them unsuitable for direct text processing. When you receive an HTML-only email via JsonHook and need to extract its readable content, you must convert the HTML to plain text.

JsonHook populates the textBody field with the plain-text alternative when the sender included one. However, many automated sending systems (marketing tools, SaaS platforms) send HTML-only emails with no plain-text alternative, so textBody is null. In those cases, you must convert htmlBody to text yourself.

The goal of HTML-to-text conversion is to produce human-readable text that preserves the meaningful content of the email while discarding visual layout noise. A good converter handles: block element spacing, list formatting, link text extraction, table structure, and special HTML entities.
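
To see why block element spacing matters, here is a minimal sketch of what happens when tags are simply deleted. The sample HTML is made up for illustration and is not taken from a real JsonHook payload.

```typescript
// Illustrative input, not a real email body.
const sampleHtml =
  "<h1>Your order</h1><p>Thanks for buying.</p><ul><li>Item A</li><li>Item B</li></ul>";

// Naive approach: delete every tag and keep whatever text is left.
const naiveText = sampleHtml.replace(/<[^>]+>/g, "");

console.log(naiveText);
// prints "Your orderThanks for buying.Item AItem B"
```

The heading, paragraph, and list items run together because nothing replaces the line breaks that the block elements implied. A structure-aware converter inserts that spacing for you.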

Prerequisites

Requirements for HTML-to-text conversion:

  • A JsonHook webhook payload with an htmlBody field
  • An HTML-to-text library for your language:
    • Node.js: html-to-text, node-html-parser
    • Python: html2text, BeautifulSoup
    • Ruby: Nokogiri, html2text gem
    • Go: html-to-markdown, goquery

Step-by-Step Instructions

To turn an HTML email into usable plain text:

  1. Prefer textBody when available:
    const body = email.textBody ?? convertHtmlToText(email.htmlBody ?? "");
  2. Strip script and style tags first — their content should never appear in plain text output:
    const cleanHtml = html
      .replace(/<script[^>]*>[\s\S]*?<\/script>/gi, "")
      .replace(/<style[^>]*>[\s\S]*?<\/style>/gi, "");
  3. Use a proper HTML-to-text library rather than a simple tag-stripping regex. Libraries handle block spacing, lists, tables, and entities correctly.
  4. Configure the converter to preserve links, format headings, and handle table layouts as your use case requires.
  5. Post-process the result — collapse multiple blank lines, trim leading/trailing whitespace.
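
The post-processing in step 5 could look something like the sketch below. The helper name tidyText is hypothetical, and the exact rules (for example, how many blank lines to keep) are assumptions you may want to adjust for your pipeline.

```typescript
// Hypothetical helper: tidy the output of an HTML-to-text converter.
function tidyText(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")     // normalize CRLF and lone CR to LF
    .replace(/[ \t]+$/gm, "")    // strip trailing spaces and tabs on each line
    .replace(/\n{3,}/g, "\n\n")  // collapse runs of blank lines to a single blank line
    .trim();                     // drop leading and trailing whitespace
}

console.log(JSON.stringify(tidyText("  Hello  \r\n\r\n\r\n\r\nWorld \n")));
// prints "Hello\n\nWorld"
```

Running this after the converter rather than before keeps the rules simple, since by then there is no markup left to worry about.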

Code Example

Node.js handler using the html-to-text library:

import { convert } from "html-to-text";

function getEmailText(email: any): string {
  // Prefer the pre-existing plain text version
  if (email.textBody) return email.textBody;

  if (!email.htmlBody) return "";

  // Convert HTML to plain text
  return convert(email.htmlBody, {
    wordwrap: 130,
    selectors: [
      { selector: "a",  options: { hideLinkHrefIfSameAsText: true } },
      { selector: "img", format: "skip" }, // Skip images
      { selector: "table", options: { uppercaseHeaderCells: false } },
    ],
  }).trim();
}

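// Assumes `app` is an Express app and that this route receives the raw request
// body as a Buffer (for example via express.raw({ type: "application/json" })).
app.post("/webhooks/email", (req, res) => {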
  // ... verify signature ...
  const { email } = JSON.parse(req.body.toString());
  const text = getEmailText(email);

  // Now safe to process as plain text
  const wordCount = text.split(/\s+/).filter(Boolean).length;
  const hasOrderNumber = /order\s+#?\d+/i.test(text);

  console.log(`Email text: ${wordCount} words, hasOrder: ${hasOrderNumber}`);
  console.log("Preview:", text.slice(0, 300));

  res.sendStatus(200);
});

Common Pitfalls

HTML-to-text conversion pitfalls:

  • Using a simple tag-stripping regex. A naive regex like /<[^>]+>/g produces unreadable output because it discards the line breaks and spacing that block elements imply, so headings, paragraphs, and list items run together. Use a library that understands HTML structure.
  • HTML entities in the output. Raw HTML parsers may leave &nbsp;, &amp;, and other entities in the text. Make sure your library decodes HTML entities or run a decode pass after conversion.
  • Tracking pixels and 1x1 images. Marketing emails contain tracking pixels (<img> tags with tiny dimensions). Configure your converter to skip img elements to avoid broken image references in the text output.
  • Table-heavy marketing emails. Many HTML emails use complex table layouts. The text output from these can be hard to process programmatically. Consider using the htmlBody for display and targeting specific DOM elements with a CSS selector rather than converting the entire body to text.
  • Assuming the HTML is valid. Email HTML is frequently malformed — unclosed tags, missing doctype, inline scripts mixed with content. Use a lenient parser (not an XML parser) that can handle real-world email HTML.
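
For the entity pitfall above, a decode pass might be sketched as follows. This is a deliberately small, assumption-laden example: decodeEntities is a hypothetical helper, it only knows a handful of named entities, and it maps &nbsp; to a regular space. If your converter already decodes entities, you do not need it, and a full entity table from a dedicated library is the safer choice for production use.

```typescript
// Small set of named entities; a real decoder would cover the full HTML list.
const NAMED_ENTITIES = new Map<string, string>([
  ["amp", "&"],
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["apos", "'"],
  ["nbsp", " "], // assumption: a plain space is preferable to U+00A0 in text output
]);

// Hypothetical helper: decode numeric and a few named entities in one pass.
function decodeEntities(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (match: string, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched instead of throwing.
      return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    // Unknown names are left as-is.
    return NAMED_ENTITIES.get(body) ?? match;
  });
}

console.log(decodeEntities("Tom &amp; Jerry&nbsp;&#x41;&lt;3 &bogus;"));
// prints "Tom & Jerry A<3 &bogus;"
```

Because the replacement happens in a single pass, text such as &amp;lt; decodes to &lt; and is not decoded a second time.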

Frequently Asked Questions

Should I use textBody or convert htmlBody?

Always prefer textBody if it exists — it is the sender's canonical plain-text version and is typically cleaner and more accurate than a converted HTML body. Only convert htmlBody when textBody is null. Use the pattern email.textBody ?? convertHtmlToText(email.htmlBody ?? "") to handle both cases.

Why do many emails have textBody: null?

Marketing email tools, SaaS notification systems, and many automated senders send HTML-only emails without a plain-text alternative. This violates email best practices but is extremely common. Handling the null textBody case is a necessity for any production email processing pipeline.

What is the best library for HTML-to-text conversion in Node.js?

The html-to-text npm package is a widely used and highly configurable option for Node.js. It handles block elements, tables, links, lists, and entities, and provides selector-based customization. For simpler cases, sanitize-html configured to remove all tags is lighter weight but produces less structured output.

Can I use an LLM to extract meaningful content from complex HTML emails?

Yes, for complex marketing emails where HTML structure makes automated extraction difficult, passing the raw HTML (or converted text) to an LLM with a focused extraction prompt can produce better results than structural parsing. Just be mindful of token costs for high-volume pipelines.
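
One way to keep token costs predictable is to cap the amount of text sent per email. The sketch below is only an illustration: capForLlm is a hypothetical helper, and the figure of roughly 4 characters per token is a rule-of-thumb assumption for English text, not a property of any particular model. Use your provider's tokenizer if you need exact counts.

```typescript
// Assumption: about 4 characters per token, a rough heuristic for English text.
const APPROX_CHARS_PER_TOKEN = 4;

// Hypothetical helper: truncate text to an approximate token budget.
function capForLlm(text: string, maxTokens: number): string {
  const maxChars = maxTokens * APPROX_CHARS_PER_TOKEN;
  if (text.length <= maxChars) return text;

  // Cut at the last space at or before the limit so a word is not split.
  const cut = text.lastIndexOf(" ", maxChars);
  return text.slice(0, cut > 0 ? cut : maxChars).trimEnd() + " [truncated]";
}

console.log(capForLlm("aaaa bbbb cccc dddd", 2));
// prints "aaaa [truncated]"
```

Converting to plain text before applying the cap usually fits far more of the meaningful content into the same budget than passing raw HTML.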