PII Protection Comparison

Pryv vs OpenAI site: what is redacted and what still reaches OpenAI

This page compares what PII exists around an AI request, how Pryv redacts it, and what reaches OpenAI when using Pryv versus using the OpenAI website directly.

Audience

Security reviewers, privacy engineering, product teams.

Purpose

Document what PII exists around a request, what Pryv redacts, and what reaches OpenAI when using Pryv vs sending a request directly to the OpenAI website.

Scope notes

Pryv behavior is derived from backend code (app.py, ai_functions.py, utils.py, pii_recognizers.py, secure_http.py, secure_openai_client.py, metadata_pii.readme). The OpenAI site column describes typical web telemetry; actual collection may vary and is outside this repo.

Jump to

Overview
Comparison at a glance
Redaction pipelines
Custom recognizers
Entity coverage
What reaches OpenAI
Fingerprinting signals
Configuration overrides
Residual risks
Code references

Overview

Data path and PII surfaces

Pryv path

User app or browser -> Pryv TEE (/prompt) -> server-side PII redaction -> OpenAI API (or Anthropic/Gemini/xAI)

OpenAI site path

User browser -> OpenAI website -> OpenAI API (no Pryv redaction in the path)

PII around a request (high level)

Content you submit: text prompt, attachments, images, PDFs, URLs, tool payloads.
Request metadata: IP address, headers (User-Agent, Accept-Language, Referer), cookies, request IDs, timestamps.
Device and browser fingerprinting: canvas/WebGL, fonts, screen size, audio, and more.
Network-layer fingerprints: TLS ClientHello (JA3/JA4), HTTP/2/3 patterns, TCP/IP stack.
Storage and state: cookies, localStorage, cache revalidation, bounce tracking, and more.

Comparison

Comparison at a glance

Surface	Pryv - to OpenAI	OpenAI site direct
Text prompt	Placeholderized via Presidio	Raw text
JSON payloads (if JSON mode)	String values anonymized	Raw JSON
Images	OCR redacted + optional face blur	Raw images
PDFs	Text-layer redaction + metadata scrub; raster fallback	Raw PDFs
Attachment IDs	Forwarded as-is	Not applicable (site handles uploads directly)
URLs provided for tools/contexts	Forwarded as-is	Raw URLs
Headers to OpenAI	UA replaced, Accept-Language blank, cookies stripped, DNT/GPC set	Browser headers and cookies
IP to OpenAI	Pryv TEE IP only	User IP

Redaction

What Pryv redacts inside a request

1) Text redaction (default path)

Code: ai_functions.py -> anonymize_prompt().

Uses Presidio AnalyzerEngine with default recognizers (no custom recognizers added).
Replaces detected PII with deterministic placeholders like <EMAIL_ADDRESS_1>.
Stores placeholder mapping per chat in pii_parameters for reinsertion.

Default entity allowlist

CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE, IP_ADDRESS, NRP, LOCATION, PERSON, PHONE_NUMBER, MEDICAL_LICENSE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, US_SSN, US_EIN, UK_NHS, UK_NINO, ADDRESS, CREDIT_CARD_LAST4, RECEIPT_NUMBER, DATE

Notes: PII_ENTITIES does not affect this text path; only the pii_entities request override does.

PII_SCORE_THRESHOLD is not applied here; Presidio default threshold is used.

Custom entity types (ADDRESS, CREDIT_CARD_LAST4, RECEIPT_NUMBER, DATE, US_EIN) are listed, but custom recognizers are not attached in this path, so detection for those types depends on Presidio built-ins (if any).

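Input: "Bill to John Doe at john@example.com, card ending 1234"
Output: "Bill to <PERSON_1> at <EMAIL_ADDRESS_1>, card ending 1234"
The trailing "1234" is left in place because CREDIT_CARD_LAST4 is only in the allowlist; no recognizer for it is attached in the text path.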
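
To make the placeholder and reinsertion steps concrete, here is a minimal sketch. It does not call Presidio: detect() is a stand-in that only finds e-mail addresses, and placeholderize() and reinsert() are hypothetical helper names, not functions from ai_functions.py. Reusing one placeholder for repeated values is an assumption about what "deterministic" means here.

```python
import re
from collections import defaultdict

# Stand-in detector (assumption): the real path uses Presidio's AnalyzerEngine.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect(text):
    """Return (start, end, entity_type) spans for e-mail addresses only."""
    return [(m.start(), m.end(), "EMAIL_ADDRESS") for m in EMAIL_RE.finditer(text)]


def placeholderize(text, spans):
    """Replace non-overlapping spans with <TYPE_N>; repeated values share one placeholder."""
    counters = defaultdict(int)
    by_value = {}  # (entity_type, original value) -> placeholder
    mapping = {}   # placeholder -> original value, kept for reinsertion
    out, cursor = [], 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        key = (entity_type, value)
        if key not in by_value:
            counters[entity_type] += 1
            by_value[key] = f"<{entity_type}_{counters[entity_type]}>"
            mapping[by_value[key]] = value
        out.append(text[cursor:start])
        out.append(by_value[key])
        cursor = end
    out.append(text[cursor:])
    return "".join(out), mapping


def reinsert(text, mapping):
    """Swap placeholders in a model response back to the original values."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "Mail john@example.com, cc amy@example.org, bcc john@example.com"
redacted, pii_parameters = placeholderize(prompt, detect(prompt))
print(redacted)
print(reinsert(redacted, pii_parameters) == prompt)
```

Only the redacted string would be sent upstream; the mapping stays on the Pryv side so placeholders in the response can be restored.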

2) JSON mode (optional)

Code: app.py -> anonymize_json_strings() (in utils.py).

Enabled by PII_JSON_MODE=1 or pii_config.json_mode.enabled.
Parses the message as JSON and anonymizes string values only.
Uses Presidio AnonymizerEngine with label-style replacements ([NAME], [EMAIL], [PII]).
Does not produce pii_parameters mapping (no reinsertion).
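
The "string values only" rule can be sketched as a recursive walk over the parsed JSON. This is an illustration, not the code in utils.py: redact_string() is a stand-in that only handles e-mail addresses, and anonymize_values() is a hypothetical name.

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_string(value):
    """Stand-in for the Presidio-backed replacement; only handles e-mails."""
    return EMAIL_RE.sub("[EMAIL]", value)


def anonymize_values(node):
    """Recursively anonymize string values; keys and non-strings are untouched."""
    if isinstance(node, str):
        return redact_string(node)
    if isinstance(node, list):
        return [anonymize_values(item) for item in node]
    if isinstance(node, dict):
        return {key: anonymize_values(value) for key, value in node.items()}
    return node  # numbers, booleans, None


message = '{"user": {"email": "john@example.com", "age": 41}, "tags": ["vip", "a@b.io"]}'
result = json.dumps(anonymize_values(json.loads(message)))
print(result)
```

Because the replacement is a fixed label rather than a numbered placeholder, nothing is recorded that would allow the original value to be reinserted later.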

Entities and thresholds

Uses pii_entities if provided on the request.
Else uses PII_ENTITIES env (comma list).
Else uses Presidio defaults (all built-in entities).
Score threshold uses pii_config.json_mode.score_threshold or PII_SCORE_THRESHOLD (default 0.35).
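
The precedence above can be written down as a small resolver. This is a sketch under the rules stated in this section; resolve_json_mode_settings() is a hypothetical name and the request is assumed to be an already-parsed dict.

```python
import os


def resolve_json_mode_settings(request, env=os.environ):
    """Return (entities, score_threshold) for JSON mode.

    entities is None when neither source is set, meaning "all built-ins".
    """
    entities = request.get("pii_entities")
    if not entities:
        raw = env.get("PII_ENTITIES", "")
        entities = [name.strip() for name in raw.split(",") if name.strip()] or None

    json_cfg = request.get("pii_config", {}).get("json_mode", {})
    threshold = json_cfg.get("score_threshold")
    if threshold is None:
        threshold = float(env.get("PII_SCORE_THRESHOLD", "0.35"))
    return entities, threshold


print(resolve_json_mode_settings({}, env={}))
print(resolve_json_mode_settings({"pii_entities": ["PERSON"]}, env={"PII_ENTITIES": "US_SSN"}))
```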

3) Image redaction

Code: app.py -> redact_image_with_style() -> redact_base64_image() (in utils.py).

Applies OCR (Tesseract) + Presidio + custom recognizers.
Draws black rectangles over detected PII (irreversible).
Optional face blur (OpenCV or face_recognition) with request-level tuning.
Returns PNG (lossless); xAI images may be JPEG-compressed for size.
Default entity allowlist when PII_ENTITIES is not set: the aggressive list from pii_recognizers.get_aggressive_entities() (see below).
Score threshold: uses pii_config.image.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.35).
Custom recognizers are enabled for images.
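
One step the list above implies is mapping detected PII, which is found as character spans in the OCR text, back to pixel rectangles. The sketch below shows that mapping on plain Python data, assuming OCR yields words with (left, top, width, height) boxes and that the analyzed text is the words joined by single spaces. The function names and the list-of-lists image are illustrative; the real pipeline works on actual image data.

```python
def boxes_for_spans(words, spans):
    """words: list of (text, (left, top, width, height)) in reading order.
    spans: list of (start, end) character offsets into " ".join(texts).
    Returns the boxes of every word that overlaps a PII span.
    """
    hits, offset = [], 0
    for text, box in words:
        word_start, word_end = offset, offset + len(text)
        if any(start < word_end and end > word_start for start, end in spans):
            hits.append(box)
        offset = word_end + 1  # account for the joining space
    return hits


def black_out(pixels, boxes):
    """Overwrite box areas with 0 (black) in a row-major grid of pixel values."""
    for left, top, width, height in boxes:
        for y in range(top, min(top + height, len(pixels))):
            for x in range(left, min(left + width, len(pixels[y]))):
                pixels[y][x] = 0
    return pixels


words = [("Call", (0, 0, 2, 1)), ("555-0100", (3, 0, 4, 1))]
spans = [(5, 13)]  # "555-0100" inside "Call 555-0100"
pixels = [[255] * 8 for _ in range(2)]
boxes = boxes_for_spans(words, spans)
black_out(pixels, boxes)
print(boxes)
print(pixels[0])
```

Overwriting pixels, rather than storing a mapping, is what makes this path irreversible.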

4) PDF redaction

Code: app.py -> redact_pdf_base64() (in utils.py).

Two-tier approach

Tier 1: text-layer redaction using PyMuPDF line extraction + Presidio AnalyzerEngine with spaCy. Uses the same entity allowlist as images by default. Custom recognizers are not attached in this tier.
Tier 2: raster fallback renders pages to images, applies the image pipeline (OCR + custom recognizers), and rebuilds a PDF from redacted images.
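
The control flow between the two tiers might look like the sketch below. The trigger for the fallback is an assumption: this page does not state when tier 2 is used, so the sketch falls back when any page has no extractable text or when tier 1 raises. The two redactors are passed in as callables so the example stays self-contained; redact_pdf_two_tier() is a hypothetical name.

```python
def redact_pdf_two_tier(page_texts, redact_text_layer, redact_raster):
    """page_texts: extracted text per page ("" for a page without a text layer).

    Returns (tier_used, redacted_result).
    """
    has_text_layer = bool(page_texts) and all(t.strip() for t in page_texts)
    if has_text_layer:
        try:
            return "text-layer", redact_text_layer(page_texts)
        except Exception:
            pass  # assumption: a tier 1 error also triggers the raster tier
    # Tier 2: render pages to images and run the image pipeline on each.
    return "raster", redact_raster(page_texts)


tier, _ = redact_pdf_two_tier(
    ["", "scanned page"],
    redact_text_layer=lambda pages: pages,
    redact_raster=lambda pages: ["<image>"] * len(pages),
)
print(tier)
```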

Security scrubbing in both tiers

Metadata removed (XMP and document metadata).
Embedded files removed.
Non-redaction annotations removed.
JavaScript removed.
Form fields cleared.
Bookmarks/outline removed.

Thresholds and limits

Score threshold uses pii_config.pdf.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.45 for PDFs).
Max pages: default 20; configurable via pii_config.pdf.max_pages.

Presidio

Custom recognizers (pattern-based)

Defined in pii_recognizers.py and added to the image pipeline:

Enhanced phone formats (7 patterns, international and common layouts).
Address patterns (street, PO Box, ZIP, city/state/ZIP).
Card last-4 formats ("Visa - 1234", "****1234").
Receipt/invoice numbers and alphanumeric codes.
Context-aware names (labels like "Bill to", "Name", "From").
Date formats (MM/DD/YYYY, Month DD, YYYY, ISO).
US EIN / Tax ID formats.

Active in image redaction and PDF raster fallback. They are not attached to the default text analyzer.
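
As an illustration of what a pattern-based recognizer does, the sketch below matches the two card last-4 layouts quoted above with plain regular expressions. The expressions, the brand list, and find_card_last4() are illustrative guesses; the actual patterns in pii_recognizers.py are not shown on this page and may differ.

```python
import re

# Illustrative patterns only (assumption), not copied from pii_recognizers.py.
CARD_LAST4_PATTERNS = [
    # Brand name, optional separator, four digits, e.g. "Visa - 1234".
    re.compile(r"\b(?:Visa|Mastercard|Amex|Discover)\s*[-:]?\s*(\d{4})\b", re.IGNORECASE),
    # Masked prefix followed by four digits, e.g. "****1234" or "xxxx 1234".
    re.compile(r"(?:[*xX]{2,}[\s-]*)+(\d{4})\b"),
]


def find_card_last4(text):
    """Return (start, end, digits) for each card-last-4 style match."""
    found = []
    for pattern in CARD_LAST4_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(1), match.end(1), match.group(1)))
    return sorted(set(found))


for sample in ("Paid with Visa - 1234", "Card ****5678", "Order total 1234"):
    print(sample, "->", [digits for _, _, digits in find_card_last4(sample)])
```

A bare four-digit number such as an order total is not matched, which is the point of anchoring on the brand name or mask.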

Coverage

Entity coverage by pipeline (default)

Standard = Presidio built-in; Custom = pii_recognizers; Requested = allowlist only

Entity	Text	JSON	Image OCR	PDF text layer	PDF raster
EMAIL_ADDRESS	Standard	Standard	Standard	Standard	Standard
PHONE_NUMBER	Standard	Standard	Standard + Custom	Standard	Standard + Custom
CREDIT_CARD	Standard	Standard	Standard	Standard	Standard
US_SSN	Standard	Standard	Standard	Standard	Standard
IBAN_CODE	Standard	Standard	Standard	Standard	Standard
IP_ADDRESS	Standard	Standard	Standard	Standard	Standard
LOCATION	Standard	Standard	Standard	Standard	Standard
NRP	Standard	Standard	Standard	Standard	Standard
PERSON	Standard	Standard	Standard + Custom	Standard	Standard + Custom
DATE_TIME	Standard	Standard	Standard	Standard	Standard
URL	Standard	Standard	Standard	Standard	Standard
US_BANK_NUMBER	Standard	Standard	Standard	Standard	Standard
US_DRIVER_LICENSE	Standard	Standard	Standard	Standard	Standard
US_ITIN	Standard	Standard	Standard	Standard	Standard
US_PASSPORT	Standard	Standard	Standard	Standard	Standard
UK_NHS	Standard	Standard	Standard	Standard	Standard
UK_NINO	Standard	Standard	Standard	Standard	Standard
CRYPTO	Standard	Standard	Standard	Standard	Standard
MEDICAL_LICENSE	Standard	Standard	Standard	Standard	Standard
US_EIN	Requested	Requested	Custom	Requested	Custom
ADDRESS	Requested	Requested	Custom	Requested	Custom
CREDIT_CARD_LAST4	Requested	Requested	Custom	Requested	Custom
RECEIPT_NUMBER	Requested	Requested	Custom	Requested	Custom
DATE	Requested	Requested	Custom	Requested	Custom

The table shows default behavior with no pii_entities override. For JSON mode, entity coverage depends on pii_entities or PII_ENTITIES. If not set, Presidio defaults (all built-ins) are used, without custom recognizers.

Transmission

What reaches OpenAI

Pryv - redaction enabled

Text prompt: placeholderized or JSON-anonymized.
Images: redacted base64 data URLs (PNG by default; JPEG when compressed for xAI).
PDFs: redacted base64 (text-layer redacted, metadata scrubbed; raster fallback if needed).
Tools, context, URLs, attachment IDs: forwarded as provided.

Pryv - redaction disabled

skip_pii_redaction (boolean or object with text, image, pdf) can bypass redaction per format.
no_pii_redaction bypasses all redaction (intended for internal use).
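
Since skip_pii_redaction accepts either a boolean or an object, the server has to normalize it into one flag per format. A minimal sketch of that normalization follows; resolve_skip_flags() is a hypothetical name, and treating missing or malformed values as "keep redacting" is an assumption chosen to err on the safe side.

```python
FORMATS = ("text", "image", "pdf")


def resolve_skip_flags(request):
    """Return {"text": bool, "image": bool, "pdf": bool}; True means skip redaction."""
    if request.get("no_pii_redaction") is True:
        return {fmt: True for fmt in FORMATS}
    skip = request.get("skip_pii_redaction", False)
    if isinstance(skip, bool):
        return {fmt: skip for fmt in FORMATS}
    if isinstance(skip, dict):
        # Only an explicit True disables redaction; missing keys stay redacted.
        return {fmt: skip.get(fmt) is True for fmt in FORMATS}
    return {fmt: False for fmt in FORMATS}  # unknown shape: keep redaction on


print(resolve_skip_flags({"skip_pii_redaction": {"image": True}}))
```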

Failure behavior

Text redaction errors fall back to the raw prompt.
Image redaction uses a best-effort wrapper; on failure it returns the original image data.
PDF redaction failures raise PDF_REDACTION_FAILED and block the request.
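
The difference between the fail-open and fail-closed paths can be sketched as one wrapper. This is an illustration of the policy described above, not the repo's code: redact_with_policy() and the PdfRedactionFailed exception class are hypothetical; only the PDF_REDACTION_FAILED code comes from this page.

```python
import logging

log = logging.getLogger("pii")


class PdfRedactionFailed(Exception):
    """Hypothetical exception type carrying the PDF_REDACTION_FAILED code."""


def redact_with_policy(kind, payload, redactor):
    """Apply redactor; text and image fail open, PDF fails closed."""
    try:
        return redactor(payload)
    except Exception as exc:
        if kind == "pdf":
            # Fail closed: the request is blocked instead of forwarding raw data.
            raise PdfRedactionFailed("PDF_REDACTION_FAILED") from exc
        # Fail open: the unredacted payload continues to the provider.
        log.warning("%s redaction failed; forwarding original payload", kind)
        return payload


def always_fails(_payload):
    raise RuntimeError("analyzer unavailable")


print(redact_with_policy("text", "call me at 555-0100", always_fails))
```

In the fail-open branches the raw prompt or image reaches the provider, which is why this behavior also appears under residual risks.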

What the OpenAI site receives (direct use)

Raw prompt text and uploads (no Pryv placeholders).
Full browser and device metadata (IP, User-Agent, Accept-Language, cookies).
Web-exposed fingerprinting signals if the site chooses to access them.
Network-layer identifiers from your device (TLS/HTTP/2/3 fingerprints).
Pryv cannot affect this path.

Fingerprinting

Fingerprinting and metadata signals

A. HTTP headers and browser-exposed settings

IP address and coarse geolocation. Pryv: used transiently for rate limiting, not stored, not forwarded; OpenAI site: sees client IP directly.
User-Agent reduction and Client Hints (UA-CH). Pryv: outbound UA is "Mozilla/5.0 (TEEProxy)" and Accept-Language is empty; OpenAI site: receives browser UA/CH if requested.
Accept-Language and locale list. Pryv: sent empty on outbound requests; OpenAI site: receives browser preferences.
Time zone and locale via JS Intl. Pryv: backend does not collect; OpenAI site: accessible to browser JS.
Device memory and Client Hints. Pryv: not collected/forwarded; OpenAI site: accessible via UA-CH in supporting browsers.
Network info (wifi, cellular) APIs. Pryv: not collected/forwarded; OpenAI site: accessible in supporting browsers.
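
The outbound header handling described in this section can be sketched as a pure function over a header dict. The fixed User-Agent string, the blank Accept-Language, the cookie removal, and the DNT/GPC signals come from this page. The function name and the extra dropped headers (Referer, X-Forwarded-For) are assumptions for illustration and are not confirmed by secure_http.py or secure_openai_client.py.

```python
FIXED_USER_AGENT = "Mozilla/5.0 (TEEProxy)"

# Assumption: only "cookie" is confirmed by this page; the rest are illustrative.
DROPPED = {"cookie", "referer", "x-forwarded-for"}
OVERRIDDEN = {"user-agent", "accept-language", "dnt", "sec-gpc"}


def scrub_outbound_headers(headers):
    """Return a new header dict for the upstream provider request."""
    clean = {
        name: value
        for name, value in headers.items()
        if name.lower() not in DROPPED and name.lower() not in OVERRIDDEN
    }
    clean["User-Agent"] = FIXED_USER_AGENT
    clean["Accept-Language"] = ""  # sent blank rather than echoing the user locale
    clean["DNT"] = "1"             # Do Not Track
    clean["Sec-GPC"] = "1"         # Global Privacy Control
    return clean


inbound = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0",
    "Cookie": "sid=abc123",
    "Accept-Language": "de-DE,de;q=0.9",
    "Content-Type": "application/json",
}
print(scrub_outbound_headers(inbound))
```

Header names are compared case-insensitively so a lowercase user-agent from the client cannot slip through next to the fixed one.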

B. Rendering and device-side signals (high entropy)

Canvas fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
WebGL/GPU fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
Audio fingerprinting via Web Audio. Pryv: not applicable server-side; OpenAI site: accessible via JS.
Screen/viewport size and devicePixelRatio. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
Fonts and font metrics. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
Gamepad or device enumeration. Pryv: not collected/forwarded; OpenAI site: accessible via JS where allowed.

C. Storage, cache, and state-based identifiers

ETag/Last-Modified respawning. Pryv: outbound requests strip cookies and do not rely on browser cache; OpenAI site: browser cache can be used for respawn tracking.
Favicon cache "supercookie". Pryv: not applicable server-side; OpenAI site: browser cache can be used.
TLS session resumption as a tracker. Pryv: only between Pryv and OpenAI; OpenAI site: between user and OpenAI.
Bounce tracking and link decoration. Pryv: not applicable; OpenAI site: browser can be impacted by redirects.
CNAME cloaking. Pryv: not applicable; OpenAI site: potential in web context.
History sniffing and CSS side channels. Pryv: not applicable; OpenAI site: possible if site uses such techniques.
Cross-site cache probing. Pryv: not applicable; OpenAI site: browser cache partitioning applies.

D. Network-layer and protocol fingerprints

TLS ClientHello fingerprints (JA3/JA4). Pryv: OpenAI sees the Pryv server's fingerprint, which the proxy hardening playbook can help mitigate; OpenAI site: sees the user device's fingerprint.
HTTP/2 client fingerprints. Pryv: OpenAI sees Pryv server; OpenAI site: sees user device.
HTTP/3/QUIC fingerprints. Pryv: outbound uses HTTP/1.1 by default, and the playbook suggests blocking QUIC; OpenAI site: browser may use HTTP/3.
WebRTC IP discovery and mDNS. Pryv: not applicable server-side; OpenAI site: possible in browser if enabled.
Passive OS/TCP/IP fingerprinting. Pryv: OpenAI sees Pryv server stack; OpenAI site: sees user OS/network stack.
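
To show why a TLS fingerprint identifies the connecting client rather than the end user, here is a sketch of how a JA3 value is derived: five ClientHello fields are written as decimal numbers, joined with "-" inside each field and "," between fields, with GREASE values dropped, and the resulting string is MD5-hashed. The sample field values below are made up for illustration and do not describe Pryv's actual TLS stack.

```python
import hashlib


def is_grease(value):
    """GREASE values (RFC 8701) have the form 0x?A?A with both bytes equal."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the comma-separated JA3 input string from ClientHello fields."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join(
        "-".join(str(v) for v in field if not is_grease(v)) for field in fields
    )


def ja3_hash(*fields):
    return hashlib.md5(ja3_string(*fields).encode("ascii")).hexdigest()


# Illustrative ClientHello values (assumption), including one GREASE cipher.
hello = (771, [0x1A1A, 4865, 4866], [0, 10, 11], [29, 23], [0])
print(ja3_string(*hello))
print(ja3_hash(*hello))
```

Every user routed through the same proxy process presents the same fields, so the provider sees one shared fingerprint instead of one per device.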

E. Platform and ecosystem realities

Device fingerprinting vendors (fraud/bot defense). Pryv: not part of backend; OpenAI site: may run vendor scripts.
iOS ATT and anti-fingerprinting rules. Pryv: not applicable to server; OpenAI site: applies to iOS Safari/app context.
Android Privacy Sandbox. Pryv: not applicable to server; OpenAI site: applies to Android browser/app context.
Policy pressure and probabilistic tracking. Pryv: not applicable to server; OpenAI site: depends on site practices.
User-Agent reduction status and ecosystem adaptations. Pryv: outbound UA is fixed; OpenAI site: browser behavior applies.
Browser-level defenses (Firefox RFP, Tor letterboxing, Safari ITP, Chrome Privacy Sandbox). Pryv: not applicable to server; OpenAI site: applies to user browser.

Configuration

Configuration and request-level overrides

pii_entities: request field to override the entity allowlist (text, image, pdf, json mode).
skip_pii_redaction: boolean or object with text, image, pdf.
no_pii_redaction: boolean to bypass all redaction (internal only).
PII_ENTITIES: env allowlist used by JSON mode and media redaction.
PII_SCORE_THRESHOLD: env default for JSON mode and media redaction.
PII_JSON_MODE: env default for JSON mode.
PII_IMAGE_REDACTION: env toggle for image redaction.
pii_config.image.faces: face detection/blur controls.
pii_config.pdf.max_pages and pii_config.pdf.score_threshold: PDF controls.
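
For orientation, a request body that combines these overrides might look like the sketch below. The override field names are the ones listed above; the "message" key is a hypothetical name for the prompt field, and the contents of pii_config.image.faces are left empty because its sub-keys are not documented on this page.

```python
import json

request_body = {
    "message": "Summarize the attached invoice.",  # hypothetical field name
    "pii_entities": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    "skip_pii_redaction": {"text": False, "image": False, "pdf": False},
    "pii_config": {
        "json_mode": {"enabled": False, "score_threshold": 0.35},
        "image": {"score_threshold": 0.35, "faces": {}},  # face sub-keys not documented here
        "pdf": {"max_pages": 20, "score_threshold": 0.45},
    },
}

payload = json.dumps(request_body, indent=2)
print(payload)
```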

Risks

Residual risks and non-redacted data

PII inside context, tools, URLs, or attachment_ids is forwarded as-is.
Text redaction uses a fixed allowlist and Presidio defaults; custom recognizers are not attached to the text pipeline.
Obfuscated PII (spelled-out numbers, Unicode lookalikes) can bypass detection.
Image OCR may miss low-quality or handwritten text.
If image or text redaction fails, the current behavior is pass-through (not fail-closed); PDF failures are blocked.
OpenAI and other providers still see Pryv server metadata (IP, TLS fingerprint) even after header scrubbing.

References

Code references

Text redaction: ai_functions.py (anonymize_prompt)
JSON mode: app.py (json_mode path) and utils.py (anonymize_json_strings)
Image redaction: app.py + utils.py (redact_base64_image)
PDF redaction: utils.py (redact_pdf_base64)
Custom recognizers: pii_recognizers.py
Outbound header scrubbing: secure_openai_client.py, secure_http.py
Metadata handling: metadata_pii.readme
Proxy hardening: docs/guides/network-hardening-playbook.md