Can I use an LLM to extract data from unstructured emails?

Yes. Because JsonHook delivers a clean textBody string, you can pass it directly to an LLM API (OpenAI, Anthropic, etc.) with a structured extraction prompt. This is especially useful when email formats vary between senders. Just be mindful of latency and cost for high-volume pipelines.
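
A minimal sketch of the LLM approach, assuming you wrap your provider's SDK in a function that takes a prompt string and returns the model's text reply. The `call_llm` parameter, the `FIELDS` list, and the prompt wording are all hypothetical; the sketch only shows building the prompt and defensively parsing the reply, not any particular vendor's API.

```python
import json
from typing import Callable, Optional

# Hypothetical field list; adjust to the data you actually need.
FIELDS = ["order_number", "total", "tracking"]


def build_prompt(text_body: str) -> str:
    """Ask for a JSON object containing exactly the expected keys."""
    return (
        "Extract the following fields from the email below and reply with "
        "a single JSON object and nothing else. Use null for any field "
        f"that is not present. Fields: {', '.join(FIELDS)}\n\n"
        f"EMAIL:\n{text_body}"
    )


def extract_with_llm(text_body: str, call_llm: Callable[[str], str]) -> Optional[dict]:
    """Return a dict with every key in FIELDS, or None if the reply is unusable.

    call_llm is an assumed wrapper around whichever LLM SDK you use.
    """
    reply = call_llm(build_prompt(text_body))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return None  # the model did not return valid JSON
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not an object
    # Keep only the expected keys so unexpected output cannot leak through.
    return {field: parsed.get(field) for field in FIELDS}
```

Treating an unparseable reply as a failure (returning None) lets the caller alert on it in the same way as a failed regex match.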

What if the same data field appears in different formats across senders?

Write one regex pattern per known format and try them in order, taking the first match. For truly variable formats, an LLM extraction approach scales better than maintaining an ever-growing list of patterns. You can also use sender-specific extraction functions: give each sender its own JsonHook address and route each address to its own handler.
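
The "try them in order" idea can be sketched as below. The three order-number formats are hypothetical examples, not formats any particular sender is known to use; put the most specific pattern first so a looser one does not match prematurely.

```python
import re
from typing import Optional

# Hypothetical formats for the same field, most specific first.
ORDER_PATTERNS = [
    re.compile(r"Order\s+#(\w+)", re.IGNORECASE),                   # "Order #12345"
    re.compile(r"Order\s+(?:ID|No\.?)[:\s]+(\w+)", re.IGNORECASE),  # "Order ID: 12345"
    re.compile(r"Ref(?:erence)?[:\s]+(\w+)", re.IGNORECASE),        # "Reference: 12345"
]


def first_match(text: str, patterns) -> Optional[str]:
    """Return group 1 of the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None
```

Calling `first_match(text_body, ORDER_PATTERNS)` returns None when no format matches, which is the signal to alert rather than continue.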

How do I extract data from HTML email bodies?

Use an HTML parsing library appropriate for your language: cheerio in Node.js, BeautifulSoup in Python, or golang.org/x/net/html in Go. Parse the htmlBody field and select the elements that hold your data, by CSS selector where the library supports it (cheerio, BeautifulSoup) or by walking the parsed node tree (golang.org/x/net/html).
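
As an illustration, here is a sketch that reads label/value pairs out of table rows in htmlBody. To stay dependency-free it uses Python's standard-library `html.parser` rather than BeautifulSoup, and it assumes the email lays out data as two-column rows; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse internal whitespace.
            text = " ".join("".join(self._cell).split())
            if not self.rows:
                self.rows.append([])  # tolerate cells outside a <tr>
            self.rows[-1].append(text)
            self._cell = None


def table_to_dict(html_body: str) -> dict:
    """Map two-column rows such as <td>Total</td><td>$19.99</td> to a dict."""
    parser = TableCellParser()
    parser.feed(html_body)
    parser.close()
    return {row[0].rstrip(":"): row[1] for row in parser.rows if len(row) == 2}
```

Rows that do not have exactly two cells are skipped, and nested tables are not handled; a selector-based library is the better fit for more complex layouts.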

Are custom email headers available for extraction?

Yes. All email headers are available under email.headers as a map with lowercase keys. Many automated notification systems include custom headers like X-Order-Id, X-Customer-Id, or X-Event-Type, which are more reliable extraction targets than the body text.
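
A short sketch of reading those headers from the payload, assuming the lowercase-key map described above. The output key names are hypothetical, and the handling of list values is an assumption about how a repeated header might be represented, not documented behavior.

```python
from typing import Optional


def header_value(payload: dict, name: str) -> Optional[str]:
    """Look up a header by name, case-insensitively, from email.headers."""
    headers = payload.get("email", {}).get("headers") or {}
    value = headers.get(name.lower())
    # Assumption: a repeated header may arrive as a list; take the first value.
    if isinstance(value, list):
        value = value[0] if value else None
    return value.strip() if isinstance(value, str) else None


def extract_from_headers(payload: dict) -> dict:
    """Pull the custom headers named above; missing ones come back as None."""
    return {
        "order_id": header_value(payload, "X-Order-Id"),
        "customer_id": header_value(payload, "X-Customer-Id"),
        "event_type": header_value(payload, "X-Event-Type"),
    }
```

Checking headers first and falling back to body parsing only when they are absent keeps the fragile regex path for the senders that need it.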

How to Extract Data from Emails Automatically

2025-01-22 · 4 min read

Overview

Data extraction from email is the process of identifying and pulling specific pieces of information out of an email message — an order number from a confirmation email, a lead's phone number from a notification, a tracking code from a shipping alert, or an invoice total from an automated billing email.

Traditional approaches involve connecting to an IMAP mailbox, downloading messages, parsing MIME, and running regex patterns — a fragile, polling-based pipeline. JsonHook modernizes this: email arrives at your inbound address, gets parsed into clean JSON, and is delivered to your webhook within seconds. Your handler then applies whatever extraction logic it needs to the ready-to-use JSON fields.

Because JsonHook handles all MIME parsing, encoding normalization, and delivery logistics, your extraction code is purely business logic — simple string operations, regex, or even an NLP/LLM call — applied to a clean textBody string.

Prerequisites

You need:

A JsonHook inbound address configured to receive the emails you want to extract from
A webhook handler that can run extraction logic (Node.js, Python, Go, or any language)
Knowledge of the email format you are extracting from: look at sample emails to identify consistent patterns

Tip: Before writing extraction code, send 5-10 representative emails to a test JsonHook address pointed at webhook.site. Examine the textBody and htmlBody fields to understand the consistent patterns you can target.

Extract Data from Every Inbound Email

JsonHook delivers the full parsed payload. Your code focuses on extraction logic, not MIME.

Step-by-Step Instructions

Follow this approach to extract structured data from emails reliably:

1. Profile your email format. Collect sample emails and identify the fields you want to extract. Note patterns: do order numbers always appear as "Order #12345"? Are prices in "$XX.XX" format?
2. Decide on an extraction strategy:
   - Regex on textBody: best for structured notification emails with consistent formats
   - HTML parsing on htmlBody: better for HTML emails where data sits in table cells or specific elements
   - Header extraction: some senders include custom headers like X-Order-ID, so check email.headers first
   - LLM extraction: for highly variable formats, pass textBody to an LLM with a structured extraction prompt
3. Write and test your extraction logic against sample payloads before connecting to live email.
4. Implement your webhook handler, which calls the extraction function and stores or forwards the result.
5. Add validation. Validate extracted fields before acting on them; a regex match failure should trigger an alert rather than silently produce bad data.
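
The validation step can be sketched as a function that returns a list of problems instead of raising, so the handler can decide whether to alert or store. The specific rules here (a 3 to 32 character order number, a non-negative total) are illustrative assumptions, not requirements of JsonHook.

```python
import re
from decimal import Decimal, InvalidOperation


def validate_order(data: dict) -> list:
    """Return a list of problems; an empty list means the record is usable."""
    errors = []

    order_number = data.get("order_number")
    if not order_number or not re.fullmatch(r"\w{3,32}", order_number):
        errors.append("order_number missing or malformed")

    total = data.get("total")
    if total is None:
        errors.append("total missing")
    else:
        try:
            # Strip thousands separators before parsing, e.g. "1,234.50".
            if Decimal(total.replace(",", "")) < 0:
                errors.append("total is negative")
        except InvalidOperation:
            errors.append(f"total is not a number: {total!r}")

    return errors
```

A handler would call `validate_order(data)` after extraction and send an alert whenever the returned list is non-empty.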

Code Example

This Python example extracts an order number, total, and tracking number from a structured notification email:

import hashlib
import hmac
import json
import re

from flask import Flask, request, abort

app = Flask(__name__)

def extract_order_data(text_body: str) -> dict:
    # Each pattern captures the value of interest in group 1.
    patterns = {
        "order_number": r"Order\s+#?(\w+)",
        "total":        r"Total[:\s]+\$?([\d,]+\.\d{2})",
        "tracking":     r"Tracking[:\s]+([A-Z0-9]{10,30})",
        "eta":          r"Estimated delivery[:\s]+([A-Za-z]+\s+\d+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text_body, re.IGNORECASE)
        # A field that does not match is recorded as None, not omitted.
        result[key] = match.group(1) if match else None
    return result

def send_alert(message: str) -> None:
    # Placeholder: replace with your own alerting integration.
    app.logger.error(message)

def save_order(data: dict) -> None:
    # Placeholder: replace with your own persistence layer.
    app.logger.info("Order saved: %s", data)

@app.route("/webhooks/extract", methods=["POST"])
def handle():
    # Verify the HMAC-SHA256 signature over the raw request body.
    sig = request.headers.get("X-JsonHook-Signature", "")
    expected = hmac.new(
        b"your_secret", request.data, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(sig, expected):
        abort(401)

    payload = json.loads(request.data)
    text = payload["email"].get("textBody") or ""
    data = extract_order_data(text)

    if not data["order_number"]:
        # Alert the ops team: extraction failed
        send_alert(f"Extraction failed for {payload['deliveryId']}")
    else:
        save_order(data)

    return "", 200

Common Pitfalls

Data extraction from email is fragile by nature — here is how to make it robust:

Senders change their email templates. Any change to the sending application's email template can break your regex. Monitor extraction failure rates and alert on anomalies.
Whitespace and line break variations. Email bodies rendered differently by different mail clients may have extra whitespace, \r\n vs \n line endings, or HTML entities. Normalize before applying patterns.
Extracting from HTML instead of text. HTML bodies are harder to regex reliably because of inline styles, tag attributes, and varying whitespace. Prefer textBody when both are available.
Silent failures. If your extraction returns null for a required field, log it as a failure and alert rather than silently continuing with incomplete data.
Not handling international formats. Prices, dates, and phone numbers vary by locale. Make your patterns flexible enough to handle common international variations if your email sources are global.
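
For the whitespace and line-break pitfall, a normalization pass along these lines can run before any pattern matching. It is a sketch using only the standard library, and the exact rules (collapsing runs of spaces, allowing at most one blank line) are choices you may want to adjust.

```python
import html
import re


def normalize_body(text: str) -> str:
    """Make a text body predictable before running regex patterns on it."""
    text = html.unescape(text)                             # "&amp;" -> "&", "&nbsp;" -> U+00A0
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = text.replace("\u00a0", " ")                     # non-breaking space -> plain space
    text = re.sub(r"[ \t]+", " ", text)                    # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                   # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # allow at most one blank line
    return text.strip()
```

Running extraction on `normalize_body(text_body)` instead of the raw string means patterns only need to handle single spaces and \n line endings.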
