Marketplace Model: Personally Identifiable Information (PII) Masking Standard
From Minibase
Purpose: Identify and mask personally identifiable information (PII) in text with low latency across multiple languages.
Training: ~100k labeled examples focused on PII detection and redaction patterns.
Primary value: High-accuracy masking with tunable precision/recall and entity-level controls suitable for real-time pipelines.
⸻
Intended Use
• Use cases: real-time chat redaction, log scrubbing, customer-support transcripts, analytics pipelines, ETL/ELT preprocessing, data sharing/anonymization.
• Users: platform engineers, data engineers, privacy/compliance teams, researchers preparing shareable corpora.
• Input: UTF-8 text strings (plain text).
• Output: Redacted text plus optional entity spans and labels.
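The exact response schema is not documented here; as a minimal sketch, the "redacted text plus optional entity spans and labels" output might look like the structure below. Field names (`redacted_text`, `entities`, `label`, `start`, `end`) are illustrative assumptions, not the model's actual schema.

```python
# Hypothetical shape of a redaction result: masked text plus entity spans.
# Spans index into the ORIGINAL text, not the redacted one.
original = "Contact Jane Doe at jane@x.io."
redaction = {
    "redacted_text": "Contact [NAME] at [EMAIL].",
    "entities": [
        {"label": "NAME",  "start": 8,  "end": 16},
        {"label": "EMAIL", "start": 20, "end": 29},
    ],
}

# Recover each detected span from the original text:
for ent in redaction["entities"]:
    print(ent["label"], "->", original[ent["start"]:ent["end"]])
```

Keeping spans anchored to the original text lets downstream consumers audit or re-mask without re-running detection.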
Out-of-Scope
• Re-identification or linkage of anonymized text.
• Legal guarantees of anonymization (use this as a technical control, not a regulatory determination).
• Imaging/OCR inputs (unless text has been reliably extracted).
⸻
Model Details
• Task: Named entity detection of PII + deterministic/templated redaction.
• Languages: Multilingual (training targeted multi-language coverage). Expect strongest performance on high-resource languages (e.g., English, Spanish, French, German, Portuguese, Italian) with graceful degradation on others.
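The "deterministic/templated redaction" step can be sketched generically: once entity spans are detected, each span is replaced with a fixed `[LABEL]` placeholder. This is a downstream illustration under assumed span/label inputs, not the model's internal logic.

```python
def apply_templates(text, entities):
    """Deterministically replace detected spans with [LABEL] placeholders.

    `entities` is a list of (start, end, label) tuples over `text`.
    Spans are applied right-to-left so earlier offsets stay valid.
    """
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

masked = apply_templates(
    "Call Maria on +1-555-0100.",
    [(5, 10, "NAME"), (14, 25, "PHONE")],
)
print(masked)  # Call [NAME] on [PHONE].
```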
⸻
Basic Information
Base Model: Standard Base
Created by: Michaelminibase
Times imported: 625
Released: Sep 26, 2025
Model Size: 368 MB
Model Type: Causal Language Model
Format: HIGH
Technical Details
Hidden Size: 960
Hidden Layers: 32
Attention Heads: 15
Vocabulary Size: 49,152
Max Context Length: 8,192 tokens
Precision: BFloat16 (BF16)
Learning Rate: 0.000050
Training Epochs: 3
Effective Batch Size: 16
Optimizer: AdamW
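As a rough back-of-the-envelope check, the listed model size and precision imply a parameter count on the order of 190M. This assumes the 368 MB figure is dominated by BF16 weights at 2 bytes each; any quantization implied by the "Format: HIGH" setting would change the arithmetic.

```python
# Rough parameter estimate from the listed fields:
#   Model Size: 368 MB, Precision: BFloat16 (2 bytes per parameter).
size_bytes = 368 * 1024**2
bytes_per_param = 2  # BF16
approx_params = size_bytes // bytes_per_param
print(f"~{approx_params / 1e6:.0f}M parameters")  # ~193M parameters
```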
Training Datasets
| Name | Type | Examples | Size |
|---|---|---|---|
| Multilingual PII Masking (Part 1) | SFT | 10,000 | 4.9 MB |
| Multilingual PII Masking (Part 2) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 3) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 4) | SFT | 10,000 | 4.9 MB |
| Multilingual PII Masking (Part 5) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 6) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 7) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 8) | SFT | 10,000 | 4.9 MB |
| Multilingual PII Masking (Part 9) | SFT | 10,000 | 4.8 MB |
| Multilingual PII Masking (Part 10) | SFT | 10,000 | 4.8 MB |
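The table totals are easy to verify against the "~100k labeled examples" figure above: ten SFT parts of 10,000 examples each, about 48 MB of training data in all.

```python
# Sanity-check the dataset table: ten SFT parts of 10,000 examples each.
parts = 10
examples_per_part = 10_000
sizes_mb = [4.9, 4.8, 4.8, 4.9, 4.8, 4.8, 4.8, 4.9, 4.8, 4.8]

total_examples = parts * examples_per_part
total_mb = round(sum(sizes_mb), 1)
print(total_examples, total_mb)  # 100000 48.3
```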