PII Protection Comparison
Pryv vs OpenAI site: what is redacted and what still reaches OpenAI
This page compares what PII exists around an AI request, how Pryv redacts it, and what reaches OpenAI when using Pryv versus using the OpenAI website directly.
Audience
Security reviewers, privacy engineering, product teams.
Purpose
Document what PII exists around a request, what Pryv redacts, and what reaches OpenAI when using Pryv vs sending a request directly to the OpenAI website.
Scope notes
Pryv behavior is derived from backend code (app.py, ai_functions.py, utils.py, pii_recognizers.py, secure_http.py, secure_openai_client.py, metadata_pii.readme). The OpenAI site column describes typical web telemetry; actual collection may vary and is outside this repo.
Overview
Data path and PII surfaces
Pryv path
User app or browser -> Pryv TEE (/prompt) -> server-side PII redaction -> OpenAI API (or Anthropic/Gemini/xAI)
OpenAI site path
User browser -> OpenAI website -> OpenAI API (no Pryv redaction in the path)
PII around a request (high level)
- Content you submit: text prompt, attachments, images, PDFs, URLs, tool payloads.
- Request metadata: IP address, headers (User-Agent, Accept-Language, Referer), cookies, request IDs, timestamps.
- Device and browser fingerprinting: canvas/WebGL, fonts, screen size, audio, and more.
- Network-layer fingerprints: TLS ClientHello (JA3/JA4), HTTP/2/3 patterns, TCP/IP stack.
- Storage and state: cookies, localStorage, cache revalidation, bounce tracking, and more.
Comparison
Comparison at a glance
| Surface | Pryv - to OpenAI | OpenAI site direct |
|---|---|---|
| Text prompt | Placeholderized via Presidio | Raw text |
| JSON payloads (if JSON mode) | String values anonymized | Raw JSON |
| Images | OCR redacted + optional face blur | Raw images |
| PDFs | Text-layer redaction + metadata scrub; raster fallback | Raw PDFs |
| Attachment IDs | Forwarded as-is | Not applicable (site handles uploads directly) |
| URLs provided for tools/contexts | Forwarded as-is | Raw URLs |
| Headers to OpenAI | UA replaced, Accept-Language blank, cookies stripped, DNT/GPC set | Browser headers and cookies |
| IP to OpenAI | Pryv TEE IP only | User IP |
Redaction
What Pryv redacts inside a request
1) Text redaction (default path)
Code: ai_functions.py -> anonymize_prompt().
- Uses Presidio AnalyzerEngine with default recognizers (no custom recognizers added).
- Replaces detected PII with deterministic placeholders like <EMAIL_ADDRESS_1>.
- Stores the placeholder mapping per chat in pii_parameters for reinsertion.
Default entity allowlist
Notes: PII_ENTITIES does not affect this text path; only the pii_entities request override does.
PII_SCORE_THRESHOLD is not applied here; Presidio default threshold is used.
Custom entity types (ADDRESS, CREDIT_CARD_LAST4, RECEIPT_NUMBER, DATE, US_EIN) are listed, but custom recognizers are not attached in this path, so detection for those types depends on Presidio built-ins (if any).
Input: "Bill to John Doe at john@example.com, card ending 1234"
Output: "Bill to <PERSON_1> at <EMAIL_ADDRESS_1>, card ending <CREDIT_CARD_LAST4_1>"
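The placeholder scheme described above can be sketched in a few lines. This is a minimal illustration, not the code in anonymize_prompt(): the names span_of, placeholderize, and reinsert are hypothetical, the spans are supplied by hand in place of Presidio analyzer results, and the returned dict only stands in for the per-chat pii_parameters mapping.

```python
from typing import Dict, List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end) as character offsets


def span_of(text: str, needle: str, entity: str) -> Span:
    """Hypothetical helper: build a span by locating a known substring."""
    start = text.index(needle)
    return (entity, start, start + len(needle))


def placeholderize(text: str, spans: List[Span]) -> Tuple[str, Dict[str, str]]:
    """Replace each span with <ENTITY_n>; a repeated value reuses its placeholder."""
    counters: Dict[str, int] = {}
    seen: Dict[Tuple[str, str], str] = {}
    mapping: Dict[str, str] = {}  # placeholder -> original value
    parts: List[str] = []
    cursor = 0
    for entity, start, end in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            continue  # skip a span that overlaps one already replaced
        value = text[start:end]
        key = (entity, value)
        if key not in seen:
            counters[entity] = counters.get(entity, 0) + 1
            seen[key] = f"<{entity}_{counters[entity]}>"
            mapping[seen[key]] = value
        parts.append(text[cursor:start])
        parts.append(seen[key])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), mapping


def reinsert(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into a model response."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


prompt = "Bill to John Doe at john@example.com"
spans = [
    span_of(prompt, "John Doe", "PERSON"),
    span_of(prompt, "john@example.com", "EMAIL_ADDRESS"),
]
redacted, mapping = placeholderize(prompt, spans)
print(redacted)  # Bill to <PERSON_1> at <EMAIL_ADDRESS_1>
```

Because the same value always maps to the same placeholder within one mapping, the model can refer to <PERSON_1> consistently and reinsert() can restore the original text afterwards.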
2) JSON mode (optional)
Code: app.py -> anonymize_json_strings() (in utils.py).
- Enabled by PII_JSON_MODE=1 or pii_config.json_mode.enabled.
- Parses the message as JSON and anonymizes string values only.
- Uses Presidio AnonymizerEngine with label-style replacements ([NAME], [EMAIL], [PII]).
- Does not produce a pii_parameters mapping (no reinsertion).
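The "string values only" behavior can be pictured as a recursive walk over the parsed JSON. The sketch below is an assumption-laden stand-in, not anonymize_json_strings() itself: it uses a simple email regex in place of Presidio, the function names are hypothetical, and returning None for non-JSON input is an illustrative choice that the source does not describe.

```python
import json
import re
from typing import Any, Optional

# Stand-in detector (assumption): the real path uses Presidio, not this regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_string(value: str) -> str:
    """Swap detected emails for a fixed label; no mapping is kept."""
    return EMAIL_RE.sub("[EMAIL]", value)


def anonymize_values(node: Any) -> Any:
    """Walk the JSON tree; only string values change, keys and numbers do not."""
    if isinstance(node, str):
        return redact_string(node)
    if isinstance(node, list):
        return [anonymize_values(item) for item in node]
    if isinstance(node, dict):
        return {key: anonymize_values(value) for key, value in node.items()}
    return node


def anonymize_json_message(message: str) -> Optional[str]:
    """Return the anonymized JSON text, or None if the message is not JSON."""
    try:
        parsed = json.loads(message)
    except json.JSONDecodeError:
        return None  # hypothetical signal for "not JSON"
    return json.dumps(anonymize_values(parsed))


message = '{"contact": "john@example.com", "count": 3, "cc": ["a@b.io"]}'
print(anonymize_json_message(message))
```

Since labels such as [EMAIL] carry no index, two different addresses become indistinguishable after redaction, which is why this mode cannot support reinsertion.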
Entities and thresholds
- Uses pii_entities if provided on the request.
- Else uses the PII_ENTITIES env (comma list).
- Else uses Presidio defaults (all built-in entities).
- Score threshold uses pii_config.json_mode.score_threshold or PII_SCORE_THRESHOLD (default 0.35).
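The precedence rules above can be written as two small resolver functions. This is a sketch under assumptions: resolve_entities and resolve_threshold are hypothetical names, pii_config is treated as a field on the request, and the request-level threshold is assumed to win over the env value.

```python
from typing import Any, List, Mapping, Optional


def resolve_entities(request: Mapping[str, Any], env: Mapping[str, str]) -> Optional[List[str]]:
    """Request override first, then PII_ENTITIES, else None (all built-ins)."""
    override = request.get("pii_entities")
    if override:
        return list(override)
    names = [n.strip() for n in env.get("PII_ENTITIES", "").split(",") if n.strip()]
    return names or None


def resolve_threshold(
    request: Mapping[str, Any],
    env: Mapping[str, str],
    section: str = "json_mode",
    default: float = 0.35,
) -> float:
    """pii_config.<section>.score_threshold, then PII_SCORE_THRESHOLD, then default."""
    section_cfg = (request.get("pii_config") or {}).get(section) or {}
    value = section_cfg.get("score_threshold")
    if value is not None:
        return float(value)
    raw = env.get("PII_SCORE_THRESHOLD", "").strip()
    return float(raw) if raw else default


env = {"PII_ENTITIES": "EMAIL_ADDRESS, PHONE_NUMBER", "PII_SCORE_THRESHOLD": "0.5"}
print(resolve_entities({"pii_entities": ["PERSON"]}, env))  # request override wins
print(resolve_entities({}, env))                            # falls back to env list
print(resolve_threshold({}, {}))                            # falls back to default
```

Passing os.environ as env gives the runtime behavior; passing a plain dict, as here, keeps the precedence easy to test.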
3) Image redaction
Code: app.py -> redact_image_with_style() -> redact_base64_image() (in utils.py).
- Applies OCR (Tesseract) + Presidio + custom recognizers.
- Draws black rectangles over detected PII (irreversible).
- Optional face blur (OpenCV or face_recognition) with request-level tuning.
- Returns PNG (lossless); xAI images may be JPEG-compressed for size.
- Default entity allowlist when PII_ENTITIES is not set: the aggressive list from pii_recognizers.get_aggressive_entities() (see below).
- Score threshold: uses pii_config.image.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.35).
- Custom recognizers are enabled for images.
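The core step in OCR-based redaction is mapping detected character spans back to the pixel boxes of the OCR words that contain them. The sketch below shows only that mapping, under stated assumptions: OcrWord and boxes_to_black_out are hypothetical names, the word boxes are hand-written rather than produced by Tesseract, and an email regex stands in for Presidio plus the custom recognizers.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

# Stand-in detector (assumption): the real pipeline uses Presidio.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class OcrWord:
    """One OCR token with its bounding box (hypothetical structure)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def boxes_to_black_out(words: List[OcrWord]) -> List[Box]:
    """Return the boxes of every OCR word that overlaps a detected span."""
    joined = " ".join(word.text for word in words)
    spans = [match.span() for match in EMAIL_RE.finditer(joined)]
    boxes: List[Box] = []
    offset = 0
    for word in words:
        start, end = offset, offset + len(word.text)
        if any(s < end and start < e for s, e in spans):
            boxes.append((word.left, word.top,
                          word.left + word.width, word.top + word.height))
        offset = end + 1  # account for the single space used when joining
    return boxes


words = [
    OcrWord("Email:", 10, 10, 50, 20),
    OcrWord("john@example.com", 70, 10, 160, 20),
    OcrWord("thanks", 240, 10, 60, 20),
]
print(boxes_to_black_out(words))  # [(70, 10, 230, 30)]
```

Each returned box would then be filled with black pixels. Because the pixels are overwritten rather than tagged, nothing can be restored afterwards.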
4) PDF redaction
Code: app.py -> redact_pdf_base64() (in utils.py).
Two-tier approach
- Tier 1: text-layer redaction using PyMuPDF line extraction + Presidio AnalyzerEngine with spaCy. Uses the same entity allowlist as images by default. Custom recognizers are not attached in this tier.
- Tier 2: raster fallback renders pages to images, applies the image pipeline (OCR + custom recognizers), and rebuilds a PDF from redacted images.
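The two tiers can be read as a simple fallback chain. The sketch below is illustrative only: redact_pdf and both tier callables are hypothetical, and the trigger for falling back (tier 1 returning None) is an assumption, since this page does not state when the raster path is chosen.

```python
from typing import Callable, Optional, Tuple


def redact_pdf(
    pdf_bytes: bytes,
    text_layer_tier: Callable[[bytes], Optional[bytes]],
    raster_tier: Callable[[bytes], bytes],
) -> Tuple[bytes, str]:
    """Try the text-layer tier first; fall back to rasterizing the pages."""
    result = text_layer_tier(pdf_bytes)
    if result is not None:
        return result, "text_layer"
    # Assumed trigger: tier 1 signals "cannot handle this document" with None.
    return raster_tier(pdf_bytes), "raster"


# Stub tiers for demonstration; real tiers would use PyMuPDF and the image pipeline.
def no_text_layer(_: bytes) -> Optional[bytes]:
    return None


def fake_raster(_: bytes) -> bytes:
    return b"rasterized-and-redacted"


print(redact_pdf(b"%PDF-scan", no_text_layer, fake_raster))
```

Returning the tier name alongside the bytes makes it easy to log which path handled a document. This matters because only the raster path applies the custom recognizers.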
Security scrubbing in both tiers
- Metadata removed (XMP and document metadata).
- Embedded files removed.
- Non-redaction annotations removed.
- JavaScript removed.
- Form fields cleared.
- Bookmarks/outline removed.
Thresholds and limits
- Score threshold uses pii_config.pdf.score_threshold if provided, else PII_SCORE_THRESHOLD env (default 0.45 for PDFs).
- Max pages: default 20; configurable via pii_config.pdf.max_pages.
Presidio
Custom recognizers (pattern-based)
Defined in pii_recognizers.py and added to the image pipeline:
- Enhanced phone formats (7 patterns, international and common layouts).
- Address patterns (street, PO Box, ZIP, city/state/ZIP).
- Card last-4 formats ("Visa - 1234", "****1234").
- Receipt/invoice numbers and alphanumeric codes.
- Context-aware names (labels like "Bill to", "Name", "From").
- Date formats (MM/DD/YYYY, Month DD, YYYY, ISO).
- US EIN / Tax ID formats.
Active in image redaction and PDF raster fallback. They are not attached to the default text analyzer.
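To make "pattern-based" concrete, here is a small sketch of three such patterns using only the standard library. These regexes are illustrative guesses at the formats listed above, not the patterns defined in pii_recognizers.py, and find_entities is a hypothetical helper.

```python
import re
from typing import List, Tuple

# Illustrative patterns only (assumption): the real recognizers may differ.
ILLUSTRATIVE_PATTERNS = {
    # "Visa - 1234" or "****1234"
    "CREDIT_CARD_LAST4": re.compile(
        r"(?:\b(?:Visa|Mastercard|Amex|Discover)\s*-\s*|\*{4}\s?)\d{4}\b"
    ),
    # US EIN layout: two digits, hyphen, seven digits
    "US_EIN": re.compile(r"\b\d{2}-\d{7}\b"),
    # ISO date: YYYY-MM-DD
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_entities(text: str) -> List[Tuple[str, str]]:
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity, pattern in ILLUSTRATIVE_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), entity, match.group(0)))
    return [(entity, value) for _, entity, value in sorted(hits)]


receipt = "Paid with Visa - 1234 on 2024-01-15, EIN 12-3456789"
for entity, value in find_entities(receipt):
    print(entity, "->", value)
```

Patterns like these are precise on structured receipts but have no notion of context, which is one reason they are paired with Presidio's built-in recognizers rather than used alone.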
Coverage
Entity coverage by pipeline (default)
| Entity | Text | JSON | Image OCR | PDF text layer | PDF raster |
|---|---|---|---|---|---|
| EMAIL_ADDRESS | Standard | Standard | Standard | Standard | Standard |
| PHONE_NUMBER | Standard | Standard | Standard + Custom | Standard | Standard + Custom |
| CREDIT_CARD | Standard | Standard | Standard | Standard | Standard |
| US_SSN | Standard | Standard | Standard | Standard | Standard |
| IBAN_CODE | Standard | Standard | Standard | Standard | Standard |
| IP_ADDRESS | Standard | Standard | Standard | Standard | Standard |
| LOCATION | Standard | Standard | Standard | Standard | Standard |
| NRP | Standard | Standard | Standard | Standard | Standard |
| PERSON | Standard | Standard | Standard + Custom | Standard | Standard + Custom |
| DATE_TIME | Standard | Standard | Standard | Standard | Standard |
| URL | Standard | Standard | Standard | Standard | Standard |
| US_BANK_NUMBER | Standard | Standard | Standard | Standard | Standard |
| US_DRIVER_LICENSE | Standard | Standard | Standard | Standard | Standard |
| US_ITIN | Standard | Standard | Standard | Standard | Standard |
| US_PASSPORT | Standard | Standard | Standard | Standard | Standard |
| UK_NHS | Standard | Standard | Standard | Standard | Standard |
| UK_NINO | Standard | Standard | Standard | Standard | Standard |
| CRYPTO | Standard | Standard | Standard | Standard | Standard |
| MEDICAL_LICENSE | Standard | Standard | Standard | Standard | Standard |
| US_EIN | Requested | Requested | Custom | Requested | Custom |
| ADDRESS | Requested | Requested | Custom | Requested | Custom |
| CREDIT_CARD_LAST4 | Requested | Requested | Custom | Requested | Custom |
| RECEIPT_NUMBER | Requested | Requested | Custom | Requested | Custom |
| DATE | Requested | Requested | Custom | Requested | Custom |
The table shows default behavior with no pii_entities override. For JSON mode, entity coverage depends on pii_entities or PII_ENTITIES. If not set, Presidio defaults (all built-ins) are used, without custom recognizers.
Transmission
What reaches OpenAI
Pryv - redaction enabled
- Text prompt: placeholderized or JSON-anonymized.
- Images: redacted base64 data URLs (PNG; JPEG for xAI compression).
- PDFs: redacted base64 (text-layer redacted, metadata scrubbed; raster fallback if needed).
- Tools, context, URLs, attachment IDs: forwarded as provided.
Pryv - redaction disabled
- skip_pii_redaction (boolean or object with text, image, pdf) can bypass redaction per format.
- no_pii_redaction bypasses all redaction (intended for internal use).
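Because skip_pii_redaction accepts either a boolean or an object, it helps to normalize both flags into one per-format decision. The following is a sketch, assuming a hypothetical redaction_plan helper and assuming that a format missing from the object form means "do not skip".

```python
from typing import Any, Dict, Mapping

FORMATS = ("text", "image", "pdf")


def redaction_plan(request: Mapping[str, Any]) -> Dict[str, bool]:
    """Return {format: True if redaction should run for that format}."""
    if request.get("no_pii_redaction") is True:
        return {fmt: False for fmt in FORMATS}  # global bypass
    skip = request.get("skip_pii_redaction", False)
    if isinstance(skip, dict):
        # Object form: missing keys are assumed to mean "do not skip".
        return {fmt: not bool(skip.get(fmt, False)) for fmt in FORMATS}
    return {fmt: not bool(skip) for fmt in FORMATS}  # boolean form


print(redaction_plan({}))
print(redaction_plan({"skip_pii_redaction": {"image": True}}))
print(redaction_plan({"no_pii_redaction": True}))
```

Computing the plan once, before any content is processed, keeps the bypass logic in one place and makes it straightforward to audit which requests skipped redaction.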
Failure behavior
- Text redaction errors fall back to the raw prompt.
- Image redaction uses a best-effort wrapper; on failure it returns the original image data.
- PDF redaction failures raise PDF_REDACTION_FAILED and block the request.
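The difference between the pass-through behavior for text and images and the blocking behavior for PDFs can be sketched as a single wrapper. Names here are hypothetical (apply_redaction, PdfRedactionFailed, broken_redactor); only the PDF_REDACTION_FAILED code comes from the source.

```python
from typing import Callable


class PdfRedactionFailed(Exception):
    """Hypothetical exception carrying the PDF_REDACTION_FAILED code."""
    code = "PDF_REDACTION_FAILED"


def apply_redaction(kind: str, payload: bytes,
                    redactor: Callable[[bytes], bytes]) -> bytes:
    """Run a redactor; PDFs fail closed, text and images fall back to the input."""
    try:
        return redactor(payload)
    except Exception as exc:
        if kind == "pdf":
            # Fail closed: the request is blocked rather than sent unredacted.
            raise PdfRedactionFailed(PdfRedactionFailed.code) from exc
        # Fail open: the original, unredacted payload is forwarded.
        return payload


def broken_redactor(payload: bytes) -> bytes:
    raise RuntimeError("simulated redactor crash")


print(apply_redaction("text", b"raw prompt", broken_redactor))  # b'raw prompt'
try:
    apply_redaction("pdf", b"%PDF", broken_redactor)
except PdfRedactionFailed as err:
    print("blocked:", err)
```

The fail-open branch is the one to scrutinize in a security review: a crash in the redactor silently downgrades the request to unredacted pass-through.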
What the OpenAI site receives (direct use)
- Raw prompt text and uploads (no Pryv placeholders).
- Full browser and device metadata (IP, User-Agent, Accept-Language, cookies).
- Web-exposed fingerprinting signals if the site chooses to access them.
- Network-layer identifiers from your device (TLS/HTTP/2/3 fingerprints).
- Pryv cannot affect this path.
Fingerprinting
Fingerprinting and metadata signals
A. HTTP headers and browser-exposed settings
- IP address and coarse geolocation. Pryv: used transiently for rate limiting, not stored, not forwarded; OpenAI site: sees client IP directly.
- User-Agent reduction and Client Hints (UA-CH). Pryv: outbound UA is "Mozilla/5.0 (TEEProxy)" and Accept-Language is empty; OpenAI site: receives browser UA/CH if requested.
- Accept-Language and locale list. Pryv: stripped on outbound; OpenAI site: receives browser preferences.
- Time zone and locale via JS Intl. Pryv: backend does not collect; OpenAI site: accessible to browser JS.
- Device memory and Client Hints. Pryv: not collected/forwarded; OpenAI site: accessible via UA-CH in supporting browsers.
- Network info (wifi, cellular) APIs. Pryv: not collected/forwarded; OpenAI site: accessible in supporting browsers.
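The outbound header treatment described in this section can be sketched as a pure function over a header dict. This is not the code in secure_http.py or secure_openai_client.py: scrub_outbound_headers is a hypothetical name, the "1" values for DNT and Sec-GPC are assumed, and passing other headers through unchanged is a simplification that this page does not confirm.

```python
from typing import Dict, Mapping

FIXED_USER_AGENT = "Mozilla/5.0 (TEEProxy)"  # value stated on this page

# Headers that are removed or overwritten, compared case-insensitively.
_CONTROLLED = {"cookie", "user-agent", "accept-language", "dnt", "sec-gpc"}


def scrub_outbound_headers(inbound: Mapping[str, str]) -> Dict[str, str]:
    """Drop cookies and replace identifying headers with fixed values."""
    # Assumption: headers outside _CONTROLLED pass through unchanged.
    outbound = {
        name: value
        for name, value in inbound.items()
        if name.lower() not in _CONTROLLED
    }
    outbound.update({
        "User-Agent": FIXED_USER_AGENT,
        "Accept-Language": "",  # sent empty
        "DNT": "1",             # assumed value
        "Sec-GPC": "1",         # assumed value
    })
    return outbound


browser_headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Cookie": "session=abc",
    "Accept-Language": "en-US,en;q=0.9",
    "Content-Type": "application/json",
}
print(scrub_outbound_headers(browser_headers))
```

Header scrubbing only affects application-layer metadata. The provider still sees the proxy's own IP address and TLS fingerprint.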
B. Rendering and device-side signals (high entropy)
- Canvas fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
- WebGL/GPU fingerprinting. Pryv: not applicable server-side; OpenAI site: accessible via JS.
- Audio fingerprinting via Web Audio. Pryv: not applicable server-side; OpenAI site: accessible via JS.
- Screen/viewport size and devicePixelRatio. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
- Fonts and font metrics. Pryv: not collected/forwarded; OpenAI site: accessible via JS/CSS.
- Gamepad or device enumeration. Pryv: not collected/forwarded; OpenAI site: accessible via JS where allowed.
C. Storage, cache, and state-based identifiers
- ETag/Last-Modified respawning. Pryv: outbound requests strip cookies and do not rely on browser cache; OpenAI site: browser cache can be used for respawn tracking.
- Favicon cache "supercookie". Pryv: not applicable server-side; OpenAI site: browser cache can be used.
- TLS session resumption as a tracker. Pryv: only between Pryv and OpenAI; OpenAI site: between user and OpenAI.
- Bounce tracking and link decoration. Pryv: not applicable; OpenAI site: browser can be impacted by redirects.
- CNAME cloaking. Pryv: not applicable; OpenAI site: potential in web context.
- History sniffing and CSS side channels. Pryv: not applicable; OpenAI site: possible if site uses such techniques.
- Cross-site cache probing. Pryv: not applicable; OpenAI site: browser cache partitioning applies.
D. Network-layer and protocol fingerprints
- TLS ClientHello fingerprints (JA3/JA4). Pryv: OpenAI sees the Pryv server fingerprint, which can be mitigated by following the proxy hardening playbook; OpenAI site: sees user device fingerprint.
- HTTP/2 client fingerprints. Pryv: OpenAI sees Pryv server; OpenAI site: sees user device.
- HTTP/3/QUIC fingerprints. Pryv: outbound uses HTTP/1.1 by default; playbook suggests blocking QUIC; OpenAI site: browser may use HTTP/3.
- WebRTC IP discovery and mDNS. Pryv: not applicable server-side; OpenAI site: possible in browser if enabled.
- Passive OS/TCP/IP fingerprinting. Pryv: OpenAI sees Pryv server stack; OpenAI site: sees user OS/network stack.
E. Platform and ecosystem realities
- Device fingerprinting vendors (fraud/bot defense). Pryv: not part of backend; OpenAI site: may run vendor scripts.
- iOS ATT and anti-fingerprinting rules. Pryv: not applicable to server; OpenAI site: applies to iOS Safari/app context.
- Android Privacy Sandbox. Pryv: not applicable to server; OpenAI site: applies to Android browser/app context.
- Policy pressure and probabilistic tracking. Pryv: not applicable to server; OpenAI site: depends on site practices.
- User-Agent reduction status and ecosystem adaptations. Pryv: outbound UA is fixed; OpenAI site: browser behavior applies.
- Browser-level defenses (Firefox RFP, Tor letterboxing, Safari ITP, Chrome Privacy Sandbox). Pryv: not applicable to server; OpenAI site: applies to user browser.
Configuration
Configuration and request-level overrides
- pii_entities: request field to override the entity allowlist (text, image, pdf, json mode).
- skip_pii_redaction: boolean or object with text, image, pdf.
- no_pii_redaction: boolean to bypass all redaction (internal only).
- PII_ENTITIES: env allowlist used by JSON mode and media redaction.
- PII_SCORE_THRESHOLD: env default for JSON mode and media redaction.
- PII_JSON_MODE: env default for JSON mode.
- PII_IMAGE_REDACTION: env toggle for image redaction.
- pii_config.image.faces: face detection/blur controls.
- pii_config.pdf.max_pages and pii_config.pdf.score_threshold: PDF controls.
Risks
Residual risks and non-redacted data
- PII inside context, tools, URLs, or attachment_ids is forwarded as-is.
- Text redaction uses a fixed allowlist and Presidio defaults; custom recognizers are not attached to the text pipeline.
- Obfuscated PII (spelled-out numbers, Unicode lookalikes) can bypass detection.
- Image OCR may miss low-quality or handwritten text.
- If image or text redaction fails, the current behavior is pass-through (not fail-closed); PDF failures are blocked.
- OpenAI and other providers still see Pryv server metadata (IP, TLS fingerprint) even after header scrubbing.
References
Code references
- Text redaction: ai_functions.py (anonymize_prompt)
- JSON mode: app.py (json_mode path) and utils.py (anonymize_json_strings)
- Image redaction: app.py + utils.py (redact_base64_image)
- PDF redaction: utils.py (redact_pdf_base64)
- Custom recognizers: pii_recognizers.py
- Outbound header scrubbing: secure_openai_client.py, secure_http.py
- Metadata handling: metadata_pii.readme
- Proxy hardening: docs/guides/network-hardening-playbook.md