AI4Privacy (ai4privacy) Collection

PII Masking 2M European Release

The largest open-source collection of synthetic PII datasets for European languages. 2M+ examples across 32 locales and 98 entity types, purpose-built for training privacy-preserving NLP models.

Tags: Token Classification, NER, Text Generation, Multilingual, Synthetic, GDPR, EU AI Act

2M+ Examples
32 European Locales
98 Entity Types
10M+ Annotations

Collection Contents (8 component datasets)

OpenPII 1M

pii-masking-openpii-1m

Open

The core open-source component with 1.4M examples across 23 European languages and 19 PII entity types.

1.43M rows, 23 languages, 19 entities, CC-BY-4.0

Health / PHI

pii-masking-health-phi-200k

Enterprise

Personal Health Information with 24 medical-specific labels including diagnoses, medications, test results, and allergies.

200K rows, 24 entities, Commercial License

Financial / PFI

pii-masking-financial-pfi-200k

Enterprise

Personal Financial Information covering finance and insurance-specific PII entities for banking and fintech applications.

200K rows, Commercial License

Digital / PDI

pii-masking-digital-pdi-200k

Enterprise

Personal Digital Information for tech platforms, covering digital identifiers, usernames, IPs, and online activity data.

200K rows, Commercial License

Work / PWI

pii-masking-work-pwi-200k

Enterprise

Personal Work Information for HR and employment, including employee IDs, salary data, performance reviews, and contracts.

200K rows, Commercial License

Location / PLI

pii-masking-location-pli-200k

Enterprise

Personal Location Information with fine-grained geographic and address entities across all 32 European locales.

200K rows, Commercial License

OpenPII Mini 10K

openpii-masking-mini-10k

Benchmark

Standardized micro-benchmark for evaluating PII detection across 23 languages and 29 regions. Enables rapid evaluation of NER models and commercial APIs.

10K rows, 23 languages, 19 entities, CC-BY-4.0

OpenPII Nano 1K

openpii-masking-nano-1k

Benchmark

The fast PII detection benchmark. 1K samples for rapid iteration, CI/CD pipelines, and quick provider comparisons across 23 languages.

1K rows, 23 languages, 19 entities, CC-BY-4.0
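One way to score a detector against these benchmarks is exact-match span scoring. The sketch below assumes gold and predicted annotations are lists of dicts with `start`, `end`, and `label` keys, as in the `privacy_mask` field of the schema. The collection's official scoring method is not stated here, so treat both the metric and the hypothetical helper `span_prf` as illustrative assumptions.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall, and F1 over (start, end, label) triples."""
    gold = {(s["start"], s["end"], s["label"]) for s in gold_spans}
    pred = {(s["start"], s["end"], s["label"]) for s in predicted_spans}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative data only: the second prediction has the right offsets
# but the wrong label, so it counts as a miss under exact matching.
gold = [
    {"start": 5, "end": 9, "label": "GIVENNAME"},
    {"start": 10, "end": 15, "label": "SURNAME"},
]
predicted = [
    {"start": 5, "end": 9, "label": "GIVENNAME"},
    {"start": 10, "end": 15, "label": "GIVENNAME"},
]
precision, recall, f1 = span_prf(gold, predicted)
print(precision, recall, f1)
```

Exact matching is strict: a prediction with a boundary off by one character scores as both a false positive and a false negative. A partial-overlap metric may be more appropriate when comparing providers that tokenize differently.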

Data Structure (Schema)

example.json
{
  "source_text": "Dear John Smith, your appointment at St. Mary's Hospital...",
  "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment at [HOSPITALNAME_1]...",
  "privacy_mask": [
    {
      "value": "[REDACTED]",
      "start": 5,
      "end": 9,
      "label": "GIVENNAME",
      "label_index": 1
    }
  ],
  "language": "en",
  "region": "GB",
  "mbert_token_classes": ["O", "B-GIVENNAME", "B-SURNAME", "O", ...]
}
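The relationship between `source_text`, `privacy_mask`, and `masked_text` can be sketched as follows. This is a minimal illustration, assuming `start` and `end` are 0-based character offsets with an exclusive `end` (consistent with the example above, where offsets 5 to 9 of the source cover the given name). `apply_privacy_mask` is a hypothetical helper, not part of any released tooling, and the sample text is shortened for illustration.

```python
def apply_privacy_mask(source_text, privacy_mask):
    """Replace each annotated span with a [LABEL_index] placeholder."""
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(privacy_mask, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"
        masked = masked[:span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Illustrative record; the "value" field is not needed for masking.
example_text = "Dear John Smith, your appointment is confirmed."
example_mask = [
    {"start": 5, "end": 9, "label": "GIVENNAME", "label_index": 1},
    {"start": 10, "end": 15, "label": "SURNAME", "label_index": 1},
]
masked_example = apply_privacy_mask(example_text, example_mask)
print(masked_example)
```

Processing spans from the end of the string backwards avoids recomputing offsets after each substitution. The sketch assumes spans do not overlap.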

Entity Types (19 core + 79 industry-specific)

Core PII Labels (Open)

DATE, GIVENNAME, SURNAME, EMAIL, CITY, TITLE, TELEPHONENUM, AGE, STREET, BUILDINGNUM, ZIPCODE, IDCARDNUM, CREDITCARDNUMBER, DRIVERLICENSENUM, GENDER, TAXNUM, SEX, SOCIALNUM, PASSPORTNUM
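For token classification, the 19 core labels are usually expanded into a BIO tag set before training. The sketch below assumes the standard scheme of one `O` tag plus a `B-` and `I-` tag per entity, in line with values such as `B-GIVENNAME` in `mbert_token_classes`. The id ordering is an arbitrary choice made here and may differ from the one used in the released data; `build_bio_label_maps` is a hypothetical helper.

```python
CORE_LABELS = [
    "DATE", "GIVENNAME", "SURNAME", "EMAIL", "CITY", "TITLE",
    "TELEPHONENUM", "AGE", "STREET", "BUILDINGNUM", "ZIPCODE",
    "IDCARDNUM", "CREDITCARDNUMBER", "DRIVERLICENSENUM", "GENDER",
    "TAXNUM", "SEX", "SOCIALNUM", "PASSPORTNUM",
]


def build_bio_label_maps(entity_labels):
    """Return (label2id, id2label) for an O/B-/I- tag set."""
    tags = ["O"]
    for label in entity_labels:
        tags.extend([f"B-{label}", f"I-{label}"])
    label2id = {tag: index for index, tag in enumerate(tags)}
    id2label = {index: tag for tag, index in label2id.items()}
    return label2id, id2label


label2id, id2label = build_bio_label_maps(CORE_LABELS)
print(len(label2id))  # 1 + 2 * 19 = 39 classes
```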

Industry-Specific Labels (Enterprise)

DIAGNOSES, MEDICATION, TESTRESULTS, ALLERGIES, HOSPITALNAME, ACCOUNTNUM, IBAN, EMPLOYEEID, IPADDRESS, USERNAME, and 69 more

Language Coverage (32 locales)

[Figure: European language coverage map showing 32 locales]
[Figure: Language distribution across the dataset]

Label Distribution (annotations)

[Figure: Distribution of PII entity labels across the dataset]

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
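To show how character-level annotations relate to BIO labels, here is a simplified sketch that tags whitespace-separated tokens. It is an assumption-laden illustration: the released labels are aligned to mBERT tokens rather than whitespace tokens, and `spans_to_bio` is a hypothetical helper that treats `end` offsets as exclusive.

```python
import re


def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels derived from character spans."""
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = []
    for _, tok_start, tok_end in tokens:
        tag = "O"
        for span in spans:
            # A token belongs to a span if the two character ranges overlap.
            if tok_start < span["end"] and tok_end > span["start"]:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B" if tok_start <= span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tags.append(tag)
    return [token for token, _, _ in tokens], tags


tokens, tags = spans_to_bio(
    "Dear John Smith, your appointment is confirmed.",
    [
        {"start": 5, "end": 9, "label": "GIVENNAME"},
        {"start": 10, "end": 15, "label": "SURNAME"},
    ],
)
print(list(zip(tokens, tags)))
```

With a subword tokenizer the same overlap test applies, using the tokenizer's character offsets for each piece instead of whitespace boundaries.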

Data Anonymization

Build production-grade anonymization pipelines compliant with GDPR and the EU AI Act.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction tasks.
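For redaction-style fine-tuning, each row can be turned into an input/output pair by pairing `source_text` with `masked_text`. The sketch below is one possible format, not a prescribed one: the instruction wording, the `prompt` and `completion` keys, and the helper `to_redaction_record` are all assumptions chosen for illustration.

```python
import json

# Hypothetical instruction text; adjust to the target model's prompt format.
INSTRUCTION = "Replace all personal information in the text with placeholders."


def to_redaction_record(example):
    """Build a prompt/completion pair from one dataset row."""
    return {
        "prompt": f"{INSTRUCTION}\n\n{example['source_text']}",
        "completion": example["masked_text"],
    }


# Illustrative row, shortened from the schema example.
row = {
    "source_text": "Dear John Smith, your appointment is confirmed.",
    "masked_text": "Dear [GIVENNAME_1] [SURNAME_1], your appointment is confirmed.",
}
record = to_redaction_record(row)
print(json.dumps(record, ensure_ascii=False))
```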

Need enterprise-grade data?

Get access to the full 2M dataset including all industry-specific components with commercial licensing for your organization.