PII Protection Comparison

Pryv vs OpenAI site: what is redacted and what still reaches OpenAI

This page compares what PII exists around an AI request, how Pryv redacts it, and what reaches OpenAI when using Pryv versus using the OpenAI website directly.

Audience

Security reviewers, privacy engineers, and product teams.

Purpose

Document what PII exists around a request, what Pryv redacts, and what reaches OpenAI when using Pryv vs sending a request directly to the OpenAI website.

Scope notes

Pryv behavior is derived from backend code (app.py, ai_functions.py, utils.py, pii_recognizers.py, secure_http.py, secure_openai_client.py, metadata_pii.readme). The OpenAI site column describes typical web telemetry; actual collection may vary and is outside this repo.

Overview

Data path and PII surfaces

Pryv path

User app or browser -> Pryv TEE (/prompt) -> server-side PII redaction -> OpenAI API (or Anthropic/Gemini/xAI)

OpenAI site path

User browser -> OpenAI website -> OpenAI API (no Pryv redaction in the path)

PII around a request (high level)

  1. Content you submit: text prompt, attachments, images, PDFs, URLs, tool payloads.
  2. Request metadata: IP address, headers (User-Agent, Accept-Language, Referer), cookies, request IDs, timestamps.
  3. Device and browser fingerprinting: canvas/WebGL, fonts, screen size, audio, and more.
  4. Network-layer fingerprints: TLS ClientHello (JA3/JA4), HTTP/2/3 patterns, TCP/IP stack.
  5. Storage and state: cookies, localStorage, cache revalidation, bounce tracking, and more.

Comparison

Comparison at a glance

Surface | Pryv - to OpenAI | OpenAI site direct
------- | ---------------- | ------------------
Text prompt | Placeholderized via Presidio | Raw text
JSON payloads (if JSON mode) | String values anonymized | Raw JSON
Images | OCR redacted + optional face blur | Raw images
PDFs | Text-layer redaction + metadata scrub; raster fallback | Raw PDFs
Attachment IDs | Forwarded as-is | Not applicable (site handles uploads directly)
URLs provided for tools/contexts | Forwarded as-is | Raw URLs
Headers to OpenAI | UA replaced, Accept-Language blank, cookies stripped, DNT/GPC set | Browser headers and cookies
IP to OpenAI | Pryv TEE IP only | User IP

Redaction

What Pryv redacts inside a request

1) Text redaction (default path)

Code: ai_functions.py -> anonymize_prompt().

  • Uses Presidio AnalyzerEngine with default recognizers (no custom recognizers added).
  • Replaces detected PII with deterministic placeholders like <EMAIL_ADDRESS_1>.
  • Stores placeholder mapping per chat in pii_parameters for reinsertion.

Default entity allowlist

CREDIT_CARD, CRYPTO, DATE_TIME, EMAIL_ADDRESS, IBAN_CODE, IP_ADDRESS, NRP, LOCATION, PERSON, PHONE_NUMBER, MEDICAL_LICENSE, URL, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, US_SSN, US_EIN, UK_NHS, UK_NINO, ADDRESS, CREDIT_CARD_LAST4, RECEIPT_NUMBER, DATE

Notes: PII_ENTITIES does not affect this text path; only the pii_entities request override does.

PII_SCORE_THRESHOLD is not applied here; Presidio's default threshold is used.

Custom entity types (ADDRESS, CREDIT_CARD_LAST4, RECEIPT_NUMBER, DATE, US_EIN) are listed, but custom recognizers are not attached in this path, so detection for those types depends on Presidio built-ins (if any).

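Input: "Bill to John Doe at john@example.com, card ending 1234"
Output: "Bill to <PERSON_1> at <EMAIL_ADDRESS_1>, card ending 1234"

The trailing "1234" is left in place: CREDIT_CARD_LAST4 is on the allowlist, but no recognizer for it is attached in the text path.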
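The placeholder scheme can be illustrated with a minimal sketch. It is not the code in anonymize_prompt(): the function names placeholderize and reinsert are hypothetical, the detected spans are passed in by hand instead of coming from Presidio, and the assumption that a repeated value reuses its placeholder is ours.

```python
from collections import defaultdict


def placeholderize(text, spans):
    """Replace detected spans with numbered placeholders.

    spans: list of (start, end, entity_type), assumed non-overlapping.
    Returns (redacted_text, mapping) with mapping = placeholder -> original.
    """
    counters = defaultdict(int)
    seen = {}      # (entity_type, original value) -> placeholder
    mapping = {}   # placeholder -> original value
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        value = text[start:end]
        key = (entity_type, value)
        if key not in seen:
            # Assumption: the same value always gets the same placeholder.
            counters[entity_type] += 1
            seen[key] = f"<{entity_type}_{counters[entity_type]}>"
            mapping[seen[key]] = value
        pieces.append(text[cursor:start])
        pieces.append(seen[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def reinsert(text, mapping):
    """Put the original values back into a model response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


text = "Bill to John Doe at john@example.com"
name_at = text.index("John Doe")
mail_at = text.index("john@example.com")
spans = [(name_at, name_at + 8, "PERSON"), (mail_at, mail_at + 16, "EMAIL_ADDRESS")]
redacted, mapping = placeholderize(text, spans)
```

The mapping returned here plays the role that pii_parameters plays per chat: it stays on the Pryv side and is only used to restore values in the response.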

2) JSON mode (optional)

Code: app.py -> anonymize_json_strings() (in utils.py).

  • Enabled by PII_JSON_MODE=1 or pii_config.json_mode.enabled.
  • Parses the message as JSON and anonymizes string values only.
  • Uses Presidio AnonymizerEngine with label-style replacements ([NAME], [EMAIL], [PII]).
  • Does not produce pii_parameters mapping (no reinsertion).
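A rough sketch of "anonymize string values only" follows. It is not the anonymize_json_strings() implementation: the names are hypothetical, a single email regex stands in for Presidio, and leaving object keys untouched is an assumption based on the wording above.

```python
import json
import re

# Stand-in detector (assumption): the real path uses Presidio, not this regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_string(value):
    """Replace detected PII with a label-style token."""
    return EMAIL_RE.sub("[EMAIL]", value)


def anonymize_strings(node):
    """Walk parsed JSON and rewrite string values; leave everything else alone."""
    if isinstance(node, str):
        return redact_string(node)
    if isinstance(node, list):
        return [anonymize_strings(item) for item in node]
    if isinstance(node, dict):
        # Keys are kept as-is (assumption); only values are visited.
        return {key: anonymize_strings(val) for key, val in node.items()}
    return node  # numbers, booleans, null


def anonymize_json_message(message):
    """Return the anonymized JSON text, or None if the message is not JSON."""
    try:
        parsed = json.loads(message)
    except json.JSONDecodeError:
        return None
    return json.dumps(anonymize_strings(parsed))
```

Because the replacement is a fixed label rather than a numbered placeholder, nothing is recorded that could restore the original value, which matches the "no reinsertion" note above.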

Entities and thresholds

  • Uses pii_entities if provided on the request.
  • Else uses PII_ENTITIES env (comma list).
  • Else uses Presidio defaults (all built-in entities).
  • Score threshold uses pii_config.json_mode.score_threshold or PII_SCORE_THRESHOLD (default 0.35).
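The precedence described above can be sketched as two small resolvers. This is an illustration under assumptions: the function names are hypothetical, the request is modeled as a plain dict with pii_entities and pii_config at the top level, and None is used to mean "no allowlist, let Presidio use all built-in entities".

```python
import os

DEFAULT_SCORE_THRESHOLD = 0.35  # default quoted in this document


def resolve_entities(request, env=None):
    """Request override first, then PII_ENTITIES env, then Presidio defaults."""
    env = os.environ if env is None else env
    requested = request.get("pii_entities")
    if requested:
        return list(requested)
    raw = env.get("PII_ENTITIES", "")
    from_env = [name.strip() for name in raw.split(",") if name.strip()]
    return from_env or None  # None = no allowlist (all built-in entities)


def resolve_json_threshold(request, env=None):
    """pii_config.json_mode.score_threshold wins over PII_SCORE_THRESHOLD."""
    env = os.environ if env is None else env
    json_cfg = request.get("pii_config", {}).get("json_mode", {})
    override = json_cfg.get("score_threshold")
    if override is not None:
        return float(override)
    return float(env.get("PII_SCORE_THRESHOLD", DEFAULT_SCORE_THRESHOLD))
```

Passing env explicitly keeps the sketch testable without touching the process environment.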

3) Image redaction

Code: app.py -> redact_image_with_style() -> redact_base64_image() (in utils.py).

  • Applies OCR (Tesseract) + Presidio + custom recognizers.
  • Draws black rectangles over detected PII (irreversible).
  • Optional face blur (OpenCV or face_recognition) with request-level tuning.
  • Returns PNG (lossless); xAI images may be JPEG-compressed for size.
  • Default entity allowlist when PII_ENTITIES is not set: the aggressive list from pii_recognizers.get_aggressive_entities() (see below).
  • Score threshold: uses pii_config.image.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.35).
  • Custom recognizers are enabled for images.
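To show why the black-rectangle step is irreversible, here is a toy sketch on a pixel grid. It is not the redact_base64_image() code: the image is modeled as nested lists of RGB tuples rather than a decoded PNG, and the function name and box format (left, top, width, height) are assumptions.

```python
def black_out(pixels, boxes):
    """Return a copy of the image with every box filled with black.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    boxes:  list of (left, top, width, height) in pixel units.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    redacted = [list(row) for row in pixels]
    for left, top, box_w, box_h in boxes:
        # Clamp the box to the image so out-of-range OCR boxes cannot crash.
        for y in range(max(top, 0), min(top + box_h, height)):
            for x in range(max(left, 0), min(left + box_w, width)):
                redacted[y][x] = (0, 0, 0)
    return redacted


white = (255, 255, 255)
image = [[white] * 4 for _ in range(3)]
result = black_out(image, [(1, 1, 2, 5)])
```

The original pixel values inside the box are overwritten in the output, so unlike the text placeholders there is no mapping that could bring them back.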

4) PDF redaction

Code: app.py -> redact_pdf_base64() (in utils.py).

Two-tier approach

  • Tier 1: text-layer redaction using PyMuPDF line extraction + Presidio AnalyzerEngine with spaCy. Uses the same entity allowlist as images by default. Custom recognizers are not attached in this tier.
  • Tier 2: raster fallback renders pages to images, applies the image pipeline (OCR + custom recognizers), and rebuilds a PDF from redacted images.
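The two-tier control flow might look roughly like the sketch below. The condition that triggers the raster fallback is an assumption (tier 1 raising, or returning None when there is no usable text layer); this document does not state the exact trigger. The function and class names are hypothetical, and the two tiers are passed in as callables so the sketch stays independent of PyMuPDF.

```python
class PdfRedactionFailed(Exception):
    """Raised when neither tier can produce a redacted PDF."""
    code = "PDF_REDACTION_FAILED"


def redact_pdf_two_tier(pdf_bytes, text_layer_redact, raster_redact):
    """Try text-layer redaction first, then the raster fallback.

    Returns (redacted_bytes, tier_name).
    """
    try:
        result = text_layer_redact(pdf_bytes)
        if result is not None:  # assumption: None means "no usable text layer"
            return result, "text_layer"
    except Exception:
        pass  # assumption: a tier 1 error falls through to tier 2
    try:
        return raster_redact(pdf_bytes), "raster"
    except Exception as exc:
        # Fail closed: the caller blocks the request instead of forwarding.
        raise PdfRedactionFailed(str(exc)) from exc
```

Returning the tier name alongside the bytes makes it easy to log which path handled a document without inspecting the output.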

Security scrubbing in both tiers

  • Metadata removed (XMP and document metadata).
  • Embedded files removed.
  • Non-redaction annotations removed.
  • JavaScript removed.
  • Form fields cleared.
  • Bookmarks/outline removed.

Thresholds and limits

  • Score threshold uses pii_config.pdf.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.45 for PDFs).
  • Max pages: default 20; configurable via pii_config.pdf.max_pages.

Presidio

Custom recognizers (pattern-based)

Defined in pii_recognizers.py and added to the image pipeline:

  • Enhanced phone formats (7 patterns, international and common layouts).
  • Address patterns (street, PO Box, ZIP, city/state/ZIP).
  • Card last-4 formats ("Visa - 1234", "****1234").
  • Receipt/invoice numbers and alphanumeric codes.
  • Context-aware names (labels like "Bill to", "Name", "From").
  • Date formats (MM/DD/YYYY, Month DD, YYYY, ISO).
  • US EIN / Tax ID formats.

Active in image redaction and PDF raster fallback. They are not attached to the default text analyzer.

Coverage

Entity coverage by pipeline (default)

Legend: Standard = Presidio built-in; Custom = pii_recognizers; Requested = allowlist only

Entity | Text | JSON | Image OCR | PDF text layer | PDF raster
------ | ---- | ---- | --------- | -------------- | ----------
EMAIL_ADDRESS | Standard | Standard | Standard | Standard | Standard
PHONE_NUMBER | Standard | Standard | Standard + Custom | Standard | Standard + Custom
CREDIT_CARD | Standard | Standard | Standard | Standard | Standard
US_SSN | Standard | Standard | Standard | Standard | Standard
IBAN_CODE | Standard | Standard | Standard | Standard | Standard
IP_ADDRESS | Standard | Standard | Standard | Standard | Standard
LOCATION | Standard | Standard | Standard | Standard | Standard
NRP | Standard | Standard | Standard | Standard | Standard
PERSON | Standard | Standard | Standard + Custom | Standard | Standard + Custom
DATE_TIME | Standard | Standard | Standard | Standard | Standard
URL | Standard | Standard | Standard | Standard | Standard
US_BANK_NUMBER | Standard | Standard | Standard | Standard | Standard
US_DRIVER_LICENSE | Standard | Standard | Standard | Standard | Standard
US_ITIN | Standard | Standard | Standard | Standard | Standard
US_PASSPORT | Standard | Standard | Standard | Standard | Standard
UK_NHS | Standard | Standard | Standard | Standard | Standard
UK_NINO | Standard | Standard | Standard | Standard | Standard
CRYPTO | Standard | Standard | Standard | Standard | Standard
MEDICAL_LICENSE | Standard | Standard | Standard | Standard | Standard
US_EIN | Requested | Requested | Custom | Requested | Custom
ADDRESS | Requested | Requested | Custom | Requested | Custom
CREDIT_CARD_LAST4 | Requested | Requested | Custom | Requested | Custom
RECEIPT_NUMBER | Requested | Requested | Custom | Requested | Custom
DATE | Requested | Requested | Custom | Requested | Custom

The table shows default behavior with no pii_entities override. For JSON mode, entity coverage depends on pii_entities or PII_ENTITIES. If not set, Presidio defaults (all built-ins) are used, without custom recognizers.

Transmission

What reaches OpenAI

Pryv - redaction enabled

  • Text prompt: placeholderized or JSON-anonymized.
  • Images: redacted base64 data URLs (PNG; JPEG for xAI compression).
  • PDFs: redacted base64 (text-layer redacted, metadata scrubbed; raster fallback if needed).
  • Tools, context, URLs, attachment IDs: forwarded as provided.

Pryv - redaction disabled

  • skip_pii_redaction (boolean or object with text, image, pdf) can bypass redaction per format.
  • no_pii_redaction bypasses all redaction (intended for internal use).
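Since skip_pii_redaction accepts either a boolean or an object, it helps to normalize it into one flag per format. The sketch below is an illustration, not the backend logic: the function name is hypothetical, and the choices that a bare true skips all three formats, that missing keys mean "do not skip", and that unexpected types fall back to redacting are all assumptions.

```python
FORMATS = ("text", "image", "pdf")


def resolve_skip_flags(request):
    """Return {format: bool} saying which redaction steps are bypassed."""
    if request.get("no_pii_redaction") is True:
        return {fmt: True for fmt in FORMATS}
    skip = request.get("skip_pii_redaction", False)
    if isinstance(skip, bool):
        # Assumption: a bare boolean applies to every format.
        return {fmt: skip for fmt in FORMATS}
    if isinstance(skip, dict):
        # Assumption: formats not named in the object are still redacted.
        return {fmt: bool(skip.get(fmt, False)) for fmt in FORMATS}
    # Unexpected type: redact everything rather than guess.
    return {fmt: False for fmt in FORMATS}
```

Defaulting to "redact" for anything unrecognized keeps a malformed request from silently disabling protection.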

Failure behavior

  • Text redaction errors fall back to the raw prompt.
  • Image redaction uses a best-effort wrapper; on failure it returns the original image data.
  • PDF redaction failures raise PDF_REDACTION_FAILED and block the request.
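The difference between the pass-through behavior (text, images) and the blocking behavior (PDFs) comes down to what the wrapper does in its except branch. A minimal sketch, with hypothetical wrapper names and a generic RuntimeError standing in for whatever error type the backend actually raises:

```python
import logging

logger = logging.getLogger("redaction_sketch")


def fail_open(redact, payload):
    """Pass-through on error: the ORIGINAL payload is forwarded."""
    try:
        return redact(payload)
    except Exception:
        logger.warning("redaction failed; forwarding original payload")
        return payload


def fail_closed(redact, payload, error_code):
    """Block on error: nothing is forwarded, the caller sees error_code."""
    try:
        return redact(payload)
    except Exception as exc:
        raise RuntimeError(error_code) from exc
```

With fail_open, a redactor bug means raw PII reaches the provider; with fail_closed, the same bug means the request is rejected.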

What the OpenAI site receives (direct use)

  • Raw prompt text and uploads (no Pryv placeholders).
  • Full browser and device metadata (IP, User-Agent, Accept-Language, cookies).
  • Web-exposed fingerprinting signals if the site chooses to access them.
  • Network-layer identifiers from your device (TLS/HTTP/2/3 fingerprints).
  • Pryv cannot affect this path.

Fingerprinting

Fingerprinting and metadata signals

A. HTTP headers and browser-exposed settings

  • IP address and coarse geolocation. Pryv: used transiently for rate limiting, not stored, not forwarded; OpenAI site: sees client IP directly.
  • User-Agent reduction and Client Hints (UA-CH). Pryv: outbound UA is "Mozilla/5.0 (TEEProxy)" and Accept-Language is empty; OpenAI site: receives browser UA/CH if requested.
  • Accept-Language and locale list. Pryv: stripped on outbound; OpenAI site: receives browser preferences.
  • Time zone and locale via JS Intl. Pryv: backend does not collect; OpenAI site: accessible to browser JS.
  • Device memory and Client Hints. Pryv: not collected/forwarded; OpenAI site: accessible via UA-CH in supporting browsers.
  • Network info (wifi, cellular) APIs. Pryv: not collected/forwarded; OpenAI site: accessible in supporting browsers.

B. Rendering and device-side signals (high entropy)

  • Canvas fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
  • WebGL/GPU fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
  • Audio fingerprinting via Web Audio. Pryv: not applicable server-side; OpenAI site: accessible via JS.
  • Screen/viewport size and devicePixelRatio. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
  • Fonts and font metrics. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
  • Gamepad or device enumeration. Pryv: not collected/forwarded; OpenAI site: accessible via JS where allowed.

C. Storage, cache, and state-based identifiers

  • ETag/Last-Modified respawning. Pryv: outbound requests strip cookies and do not rely on browser cache; OpenAI site: browser cache can be used for respawn tracking.
  • Favicon cache "supercookie". Pryv: not applicable server-side; OpenAI site: browser cache can be used.
  • TLS session resumption as a tracker. Pryv: only between Pryv and OpenAI; OpenAI site: between user and OpenAI.
  • Bounce tracking and link decoration. Pryv: not applicable; OpenAI site: browser can be impacted by redirects.
  • CNAME cloaking. Pryv: not applicable; OpenAI site: potential in web context.
  • History sniffing and CSS side channels. Pryv: not applicable; OpenAI site: possible if site uses such techniques.
  • Cross-site cache probing. Pryv: not applicable; OpenAI site: browser cache partitioning applies.

D. Network-layer and protocol fingerprints

  • TLS ClientHello fingerprints (JA3/JA4). Pryv: OpenAI sees the Pryv server fingerprint, which can be mitigated by following the proxy hardening playbook; OpenAI site: sees the user device fingerprint.
  • HTTP/2 client fingerprints. Pryv: OpenAI sees Pryv server; OpenAI site: sees user device.
  • HTTP/3/QUIC fingerprints. Pryv: outbound uses HTTP/1.1 by default; playbook suggests blocking QUIC; OpenAI site: browser may use HTTP/3.
  • WebRTC IP discovery and mDNS. Pryv: not applicable server-side; OpenAI site: possible in browser if enabled.
  • Passive OS/TCP/IP fingerprinting. Pryv: OpenAI sees Pryv server stack; OpenAI site: sees user OS/network stack.

E. Platform and ecosystem realities

  • Device fingerprinting vendors (fraud/bot defense). Pryv: not part of backend; OpenAI site: may run vendor scripts.
  • iOS ATT and anti-fingerprinting rules. Pryv: not applicable to server; OpenAI site: applies to iOS Safari/app context.
  • Android Privacy Sandbox. Pryv: not applicable to server; OpenAI site: applies to Android browser/app context.
  • Policy pressure and probabilistic tracking. Pryv: not applicable to server; OpenAI site: depends on site practices.
  • User-Agent reduction status and ecosystem adaptations. Pryv: outbound UA is fixed; OpenAI site: browser behavior applies.
  • Browser-level defenses (Firefox RFP, Tor letterboxing, Safari ITP, Chrome Privacy Sandbox). Pryv: not applicable to server; OpenAI site: applies to user browser.

Configuration

Configuration and request-level overrides

  • pii_entities: request field to override the entity allowlist (text, image, pdf, json mode).
  • skip_pii_redaction: boolean or object with text, image, pdf.
  • no_pii_redaction: boolean to bypass all redaction (internal only).
  • PII_ENTITIES: env allowlist used by JSON mode and media redaction.
  • PII_SCORE_THRESHOLD: env default for JSON mode and media redaction.
  • PII_JSON_MODE: env default for JSON mode.
  • PII_IMAGE_REDACTION: env toggle for image redaction.
  • pii_config.image.faces: face detection/blur controls.
  • pii_config.pdf.max_pages and pii_config.pdf.score_threshold: PDF controls.

Risks

Residual risks and non-redacted data

  • PII inside context, tools, URLs, or attachment_ids is forwarded as-is.
  • Text redaction uses a fixed allowlist and Presidio defaults; custom recognizers are not attached to the text pipeline.
  • Obfuscated PII (spelled-out numbers, Unicode lookalikes) can bypass detection.
  • Image OCR may miss low-quality or handwritten text.
  • If image or text redaction fails, the current behavior is pass-through (not fail-closed); PDF failures are blocked.
  • OpenAI and other providers still see Pryv server metadata (IP, TLS fingerprint) even after header scrubbing.
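The obfuscation risk above is easy to reproduce with any pattern-based detector. The sketch below uses a deliberately simple ASCII-only regex as a stand-in; it is not Presidio's US_SSN recognizer, which may behave differently, and the helper name is hypothetical.

```python
import re

# Stand-in detector (assumption): ASCII digits in NNN-NN-NNNN layout.
SSN_RE = re.compile(r"\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b")

# Map ASCII digits to their fullwidth Unicode lookalikes (U+FF10..U+FF19).
FULLWIDTH_DIGITS = {ord(str(d)): 0xFF10 + d for d in range(10)}


def looks_like_ssn(text):
    """True if the stand-in pattern finds a match in the text."""
    return SSN_RE.search(text) is not None


plain = "SSN 078-05-1120"
lookalike = plain.translate(FULLWIDTH_DIGITS)
spelled = "SSN zero seven eight, zero five, one one two zero"

detected = [looks_like_ssn(s) for s in (plain, lookalike, spelled)]
```

Only the plain form is caught; the lookalike and spelled-out forms carry the same information to a human or a model but pass the pattern untouched.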

References

Code references

  • Text redaction: ai_functions.py (anonymize_prompt)
  • JSON mode: app.py (json_mode path) and utils.py (anonymize_json_strings)
  • Image redaction: app.py + utils.py (redact_base64_image)
  • PDF redaction: utils.py (redact_pdf_base64)
  • Custom recognizers: pii_recognizers.py
  • Outbound header scrubbing: secure_openai_client.py, secure_http.py
  • Metadata handling: metadata_pii.readme
  • Proxy hardening: docs/guides/network-hardening-playbook.md