Overview
Data extraction from email is the process of identifying and pulling specific pieces of information out of an email message — an order number from a confirmation email, a lead's phone number from a notification, a tracking code from a shipping alert, or an invoice total from an automated billing email.
Traditional approaches involve connecting to an IMAP mailbox, downloading messages, parsing MIME, and running regex patterns — a fragile, polling-based pipeline. JsonHook modernizes this: email arrives at your inbound address, gets parsed into clean JSON, and is delivered to your webhook within seconds. Your handler then applies whatever extraction logic it needs to the ready-to-use JSON fields.
Because JsonHook handles all MIME parsing, encoding normalization, and delivery logistics, your extraction code is purely business logic — simple string operations, regex, or even an NLP/LLM call — applied to a clean textBody string.
Prerequisites
You need:
- A JsonHook inbound address configured to receive the emails you want to extract from
- A webhook handler that can run extraction logic (Node.js, Python, Go, or any language)
- Knowledge of the email format you are extracting from: look at sample emails to identify consistent patterns
Tip: Before writing extraction code, send 5-10 representative emails to a test JsonHook address pointed at webhook.site. Examine the textBody and htmlBody fields to understand the consistent patterns you can target.
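One way to act on this tip is to save a few captured payloads and summarise them before writing any patterns. The sketch below is only illustrative: `profile_payload` is a hypothetical helper, and it assumes the payload nests `textBody` and `htmlBody` under an `email` key, as the handler later in this guide does.

```python
import json

def profile_payload(payload: dict) -> dict:
    """Summarise the parts of one captured payload that extraction will rely on."""
    email = payload.get("email", {})  # assumed payload shape
    text = email.get("textBody") or ""
    html_body = email.get("htmlBody") or ""
    return {
        "has_text": bool(text.strip()),
        "has_html": bool(html_body.strip()),
        "uses_crlf": "\r\n" in text,
        # First few non-empty lines, useful for spotting stable labels
        "first_lines": [ln.strip() for ln in text.splitlines() if ln.strip()][:5],
    }

# Stand-in for a payload captured from a test delivery
sample = {"email": {"textBody": "Order #12345\r\n\r\nTotal: $49.99\r\n", "htmlBody": ""}}
print(json.dumps(profile_payload(sample), indent=2))
```

Running this over each sample makes it easier to see whether labels, line endings, and the presence of a text body are stable across messages.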
Step-by-Step Instructions
Follow this approach to extract structured data from emails reliably:
- Profile your email format. Collect sample emails and identify the fields you want to extract. Note patterns: do order numbers always appear as "Order #12345"? Are prices in "$XX.XX" format?
- Decide on extraction strategy:
  - Regex on textBody: Best for structured notification emails with consistent formats
  - HTML parsing on htmlBody: Better for HTML emails where data is in table cells or specific elements
  - Header extraction: Some senders include custom headers like X-Order-ID; check email.headers first
  - LLM extraction: For highly variable formats, pass textBody to an LLM with a structured extraction prompt
- Write and test your extraction logic against sample payloads before connecting to live email.
- Implement your webhook handler that calls the extraction function and stores or forwards the result.
- Add validation. Validate extracted fields before acting on them — a regex match failure should trigger an alert rather than silently producing bad data.
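For the HTML parsing strategy listed above, a minimal sketch using only Python's standard `html.parser` module might look like the following. It assumes the email lays out data as two-cell table rows (label, value). The `CellCollector` class and `extract_table_fields` function are hypothetical names, and nested tables are not handled.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join chunks and collapse internal whitespace
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

def extract_table_fields(html_body: str) -> dict:
    """Map 'Label:' / value rows to a dict keyed by the lowercased label."""
    parser = CellCollector()
    parser.feed(html_body)
    parser.close()
    fields = {}
    for row in parser.rows:
        if len(row) == 2 and row[0]:
            fields[row[0].rstrip(":").lower()] = row[1]
    return fields

html_body = """
<table>
  <tr><td>Order Number:</td><td>12345</td></tr>
  <tr><td>Total:</td><td>$49.99</td></tr>
</table>
"""
print(extract_table_fields(html_body))
# → {'order number': '12345', 'total': '$49.99'}
```

A real parser keeps the extraction independent of inline styles and attribute order, which is where regex on HTML tends to break.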
Code Example
This Python example extracts an order number, total, and tracking number from a structured notification email:
import re
import json
import hmac
import hashlib
from flask import Flask, request, abort

app = Flask(__name__)

def extract_order_data(text_body: str) -> dict:
    # Each pattern captures the value that follows a fixed label.
    patterns = {
        "order_number": r"Order\s+#?(\w+)",
        "total": r"Total[:\s]+\$?([\d,]+\.\d{2})",
        "tracking": r"Tracking[:\s]+([A-Z0-9]{10,30})",
        "eta": r"Estimated delivery[:\s]+([A-Za-z]+\s+\d+)",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text_body, re.IGNORECASE)
        result[key] = match.group(1) if match else None
    return result

def send_alert(message: str) -> None:
    # Placeholder: replace with your own alerting integration.
    app.logger.error(message)

def save_order(data: dict) -> None:
    # Placeholder: replace with your own persistence layer.
    app.logger.info("Saving order: %s", data)

@app.route("/webhooks/extract", methods=["POST"])
def handle():
    # Verify the HMAC-SHA256 signature before trusting the payload.
    sig = request.headers.get("X-JsonHook-Signature", "")
    expected = hmac.new(
        b"your_secret", request.data, hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(sig, expected):
        abort(401)

    payload = json.loads(request.data)
    text = payload["email"].get("textBody") or ""
    data = extract_order_data(text)

    if not data["order_number"]:
        # Alert ops team: extraction failed
        send_alert(f"Extraction failed for {payload['deliveryId']}")
    else:
        save_order(data)
    return "", 200

Common Pitfalls
Data extraction from email is fragile by nature — here is how to make it robust:
- Senders change their email templates. Any change to the sending application's email template can break your regex. Monitor extraction failure rates and alert on anomalies.
- Whitespace and line break variations. Email bodies rendered differently by different mail clients may have extra whitespace, \r\n vs \n line endings, or HTML entities. Normalize before applying patterns.
- Extracting from HTML instead of text. HTML bodies are harder to regex reliably because of inline styles, tag attributes, and varying whitespace. Prefer textBody when both are available.
- Silent failures. If your extraction returns null for a required field, log it as a failure and alert rather than silently continuing with incomplete data.
- Not handling international formats. Prices, dates, and phone numbers vary by locale. Make your patterns flexible enough to handle common international variations if your email sources are global.
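Two of the pitfalls above, whitespace variations and locale-specific number formats, can be reduced with a small preprocessing step. The following is a sketch under stated assumptions, not a complete solution. `normalize_body` and `parse_amount` are hypothetical helpers, and `parse_amount` assumes amounts carry either exactly two decimal digits or none.

```python
import html
import re
from decimal import Decimal, InvalidOperation
from typing import Optional

def normalize_body(text: str) -> str:
    """Decode HTML entities, unify line endings, and collapse spaces."""
    text = html.unescape(text)                        # e.g. &amp; -> &, &nbsp; -> \xa0
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\xa0", " ")                  # non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)               # collapse runs of spaces/tabs
    return "\n".join(line.strip() for line in text.split("\n"))

def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse amounts such as '$1,234.56' or '1.234,56 EUR'; None if unparseable."""
    digits = re.sub(r"[^\d.,]", "", raw)
    # Assumption: a separator followed by exactly two trailing digits is the decimal point
    m = re.fullmatch(r"([\d.,]*?)[.,](\d{2})", digits)
    if m:
        whole = re.sub(r"[.,]", "", m.group(1)) or "0"
        candidate = f"{whole}.{m.group(2)}"
    else:
        candidate = re.sub(r"[.,]", "", digits)       # only grouping separators, if any
    try:
        return Decimal(candidate)
    except InvalidOperation:
        return None

print(repr(normalize_body("Total:&nbsp;&nbsp;$49.99\r\nOrder  #12345 \r\n")))
print(parse_amount("$1,234.56"), parse_amount("1.234,56 EUR"), parse_amount("n/a"))
```

Running `normalize_body` before the regex patterns means the patterns only need to handle single spaces and `\n` line endings. Returning `None` from `parse_amount` keeps failures visible to the validation step instead of producing a wrong number.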