AI4Privacy Collection · New release

PII Masking 3M Asia-Pacific Release

The world's largest open multilingual PII masking corpus: 3M+ synthetic examples across 30 languages spanning Europe, the Americas, and Asia-Pacific, purpose-built for training privacy-preserving NLP models for use across regions.

Token Classification NER Text Generation Multilingual Asia-Pacific Synthetic GDPR PDPA APPI PIPA EU AI Act
3M+ Examples · 30 Languages · 3 World Regions · 5 Industries
Asia-Pacific regional partner

In partnership with VNCyberS

The Asia-Pacific expansion is delivered together with VNCyberS, a pioneer in data protection and cybersecurity in Vietnam, which brings regional language expertise and on-the-ground privacy compliance experience to the release.

Collection Contents (6 component datasets)

Global Coverage (30 languages across 3 regions)

Asia-Pacific

7 new languages

Japanese Korean Chinese Vietnamese Indonesian Malay Tagalog

Europe

23 locales

English German French Spanish Italian Dutch Polish Swedish

Americas

North & South

English (US) Spanish (LatAm) Portuguese (BR) French (CA)

Data Structure & Examples (schema · real samples in every language)

{
  "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
  "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
  "privacy_mask": [ { "value": "射水市 円池", "label": "CITY" }, { "value": "昼場", "label": "STREET" }, { "value": "6-28-20", "label": "BUILDINGNUM" }, { "value": "520-2111", "label": "ZIPCODE" } ],
  "language": "ja", "region": "JP", "script": "Jpan"
}
language: ja · region: JP · script: Jpan
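
The relationship between source_text, privacy_mask, and masked_text in the record above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used to build the dataset: the function name apply_mask is hypothetical, and it assumes that mask entries are listed in order of appearance and that placeholders are numbered per label starting at 1, which is what the sample suggests but is not stated in the schema.

```python
from collections import defaultdict

def apply_mask(source_text, privacy_mask):
    """Rebuild a masked string from (value, label) entries.

    Assumption: entries appear in the same order as in the text, and each
    label gets its own counter ([CITY_1], [CITY_2], ...).
    """
    counters = defaultdict(int)
    pieces = []
    cursor = 0
    for entry in privacy_mask:
        start = source_text.find(entry["value"], cursor)
        if start == -1:
            raise ValueError(f"value not found in text: {entry['value']!r}")
        counters[entry["label"]] += 1
        pieces.append(source_text[cursor:start])  # untouched text before the entity
        pieces.append(f"[{entry['label']}_{counters[entry['label']]}]")
        cursor = start + len(entry["value"])
    pieces.append(source_text[cursor:])  # trailing text after the last entity
    return "".join(pieces)

# The Japanese sample record shown above.
record = {
    "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
    "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
    "privacy_mask": [
        {"value": "射水市 円池", "label": "CITY"},
        {"value": "昼場", "label": "STREET"},
        {"value": "6-28-20", "label": "BUILDINGNUM"},
        {"value": "520-2111", "label": "ZIPCODE"},
    ],
}

rebuilt = apply_mask(record["source_text"], record["privacy_mask"])
print(rebuilt == record["masked_text"])
```

A check like this is a cheap sanity test when loading records: if the rebuilt string differs from masked_text, either the ordering assumption or the numbering assumption does not hold for that record.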

Entity Types (20 core + 61 industry-specific)

Core PII Labels (Open)

DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM URL
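
For token classification, the 20 core labels are typically expanded into a BIO tag set. The sketch below shows one way to build the label-to-id maps; the helper name build_bio_label_maps and the id ordering are assumptions for illustration and may not match the ids shipped with the dataset's pre-computed labels.

```python
# The 20 core labels, copied from the list above.
CORE_LABELS = [
    "DATE", "GIVENNAME", "SURNAME", "EMAIL", "CITY", "TITLE", "TELEPHONENUM",
    "AGE", "STREET", "BUILDINGNUM", "ZIPCODE", "IDCARDNUM", "CREDITCARDNUMBER",
    "DRIVERLICENSENUM", "GENDER", "TAXNUM", "SEX", "SOCIALNUM", "PASSPORTNUM",
    "URL",
]

def build_bio_label_maps(entity_labels):
    """Return (label2id, id2label) for an O + B-/I- tag set.

    Assumption: "O" is id 0 and each entity contributes B- then I-.
    """
    names = ["O"]
    for label in entity_labels:
        names.append(f"B-{label}")
        names.append(f"I-{label}")
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

label2id, id2label = build_bio_label_maps(CORE_LABELS)
print(len(label2id))  # 1 "O" tag + 2 tags per core label
```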

Industry-Specific Labels (Enterprise)

DIAGNOSES MEDICATION TESTRESULTS HOSPITALNAME IBAN ACCOUNTNUM BIC SALARY JOBTITLE ORGANISATION APIKEY PASSWORD MACADDRESS GEOCOORD VEHICLEVIN + 46 more

Use Cases (tasks supported)

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
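
To make the BIO scheme concrete, the following sketch derives BIO tags from (value, label) entries. It is an illustration only: a simple regex tokenizer stands in for mBERT's WordPiece tokenizer, the function names are hypothetical, the English sentence is made up for the example, and mask entries are assumed to be listed in order of appearance.

```python
import re

def locate_spans(text, mask):
    """Turn (value, label) entries into (start, end, label) character spans."""
    spans = []
    cursor = 0
    for entry in mask:
        start = text.find(entry["value"], cursor)
        if start == -1:
            raise ValueError(f"value not found in text: {entry['value']!r}")
        end = start + len(entry["value"])
        spans.append((start, end, entry["label"]))
        cursor = end
    return spans

def bio_tags(text, mask):
    """Tag each token B-<label>, I-<label>, or O based on character overlap."""
    spans = locate_spans(text, mask)
    tokens, tags = [], []
    # Stand-in tokenizer: runs of word characters, or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

# Made-up example sentence using labels from the core set.
text = "Maria Lopez lives on Rose Lane in Lisbon."
mask = [
    {"value": "Maria", "label": "GIVENNAME"},
    {"value": "Lopez", "label": "SURNAME"},
    {"value": "Rose Lane", "label": "STREET"},
    {"value": "Lisbon", "label": "CITY"},
]
tokens, tags = bio_tags(text, mask)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With a subword tokenizer the same overlap test applies to each subword's character offsets; how continuation pieces are labeled (I- tags or an ignore index) is a training-time choice this sketch does not cover.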

Data Anonymization

Build production-grade anonymization pipelines that support compliance with GDPR, the EU AI Act, PDPA, APPI, and PIPA.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction across languages.
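
One simple way to use the records for redaction fine-tuning is to turn each one into a prompt/completion pair. The sketch below assumes the field names from the schema shown earlier (source_text, masked_text, language); the instruction wording and the function name to_sft_example are hypothetical, not a format prescribed by the dataset.

```python
import json

DEFAULT_INSTRUCTION = (
    "Replace every piece of personal information in the text "
    "with a placeholder tag such as [CITY_1]."
)

def to_sft_example(record, instruction=DEFAULT_INSTRUCTION):
    """Build a prompt/completion pair: raw text in, masked text out."""
    prompt = (
        f"{instruction}\n\n"
        f"Language: {record['language']}\n"
        f"Text: {record['source_text']}"
    )
    return {"prompt": prompt, "completion": record["masked_text"]}

# Fields taken from the Japanese sample record.
record = {
    "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
    "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
    "language": "ja",
}

example = to_sft_example(record)
print(json.dumps(example, ensure_ascii=False, indent=2))
```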

Need enterprise-grade data?

Get access to the full 3M dataset, including all industry-specific components and Asia-Pacific coverage, with commercial licensing for your organization.