Contact us

Company

A Swiss-based company
Founded in Switzerland.
Artificial Intelligence Suisse SA, PO 280, Delemont, Switzerland.

AI4Privacy

ai4privacy Collection · New release

PII Masking 3M Asia-Pacific Release

The world's largest open multilingual PII masking corpus. 3M+ synthetic examples across 30 languages spanning Europe, the Americas, and Asia-Pacific, purpose-built for training privacy-preserving NLP models in a truly global setting.

Token Classification · NER · Text Generation · Multilingual · Asia-Pacific · Synthetic · GDPR · PDPA · APPI · PIPA · EU AI Act

3M+ examples · 30 languages · 3 world regions · 5 industries

Asia-Pacific regional partner

In partnership with VNCyberS

The Asia-Pacific expansion is delivered together with VNCyberS, a pioneer in data protection and cybersecurity in Vietnam, which brings regional language expertise and on-the-ground privacy compliance experience to the release.

Collection Contents

6 core datasets + 5 benchmark slices

OpenPII 1.5M

pii-masking-openpii-1.5m

The expanded open-source core, now covering 30 languages across Asia-Pacific, Europe, and the Americas.

1.64M rows · 30 languages · 20 entities · CC-BY-4.0

Work / PWI

pii-masking-work-pwi-400k

Work & HR Information (PWI): job titles, organisations, salaries, document numbers, and employment identifiers.

400K rows · Commercial License

Financial / PFI

pii-masking-financial-pfi-400k

Financial Information (PFI): IBAN, account & card details, balances, crypto wallet addresses, and insurance policy numbers.

400K rows · Commercial License

Location / PLI

pii-masking-location-pli-400k

Location & Travel Information (PLI): geo-coordinates, addresses, airport & station codes, and vehicle and travel identifiers.

400K rows · Commercial License

Health / PHI

pii-masking-health-phi-400k

Health & Medical Information (PHI): diagnoses, medications, test results, allergies, hospital names, and medical record numbers.

400K rows · Commercial License

Digital / PDI

pii-masking-digital-pdi-350k

Digital Information (PDI): usernames, passwords, API keys, MAC addresses, device IMEIs, OTPs, and user agents.

350K rows · Commercial License

Benchmarks & samples

5 open benchmark slices · CC-BY-4.0

OpenPII Micro

openpii-masking-micro-100k

100K open-core samples for evaluating PII detection across all 30 languages.

100K rows · 30 languages · CC-BY-4.0

OpenPII Nano

openpii-masking-nano-1k

1K open-core samples for rapid iteration, CI/CD pipelines, and quick provider comparisons.

1K rows · 30 languages · CC-BY-4.0

PII Masking Micro

pii-masking-micro-100k

100K full-taxonomy samples: the largest full-taxonomy benchmark slice, covering all 30 languages.

100K rows · 30 languages · CC-BY-4.0

PII Masking Mini

pii-masking-mini-10k

10K full-taxonomy samples for quick model and commercial-API evaluation.

10K rows · 30 languages · CC-BY-4.0

PII Masking Nano

pii-masking-nano-1k

1K full-taxonomy samples for the fastest smoke tests across all 30 languages.

1K rows · 30 languages · CC-BY-4.0
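The benchmark slices above are intended for evaluating PII detection and comparing providers. One common way to score such a comparison is entity-level precision, recall, and F1 over (value, label) pairs. The sketch below is a minimal illustration under that assumption, not the collection's official metric; `entity_prf` and the `predicted` list are hypothetical, and the gold entities are taken from the Japanese sample record shown on this page.

```python
from collections import Counter


def entity_prf(gold, predicted):
    """Entity-level precision, recall, and F1 over exact (value, label) matches."""
    gold_counts = Counter((e["value"], e["label"]) for e in gold)
    pred_counts = Counter((e["value"], e["label"]) for e in predicted)
    # Multiset intersection counts each matching entity at most once.
    true_pos = sum((gold_counts & pred_counts).values())
    n_pred = sum(pred_counts.values())
    n_gold = sum(gold_counts.values())
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold annotations from the sample record on this page.
gold = [
    {"value": "射水市 円池", "label": "CITY"},
    {"value": "昼場", "label": "STREET"},
    {"value": "6-28-20", "label": "BUILDINGNUM"},
    {"value": "520-2111", "label": "ZIPCODE"},
]

# Hypothetical model output: two correct entities, one wrong label, one miss.
predicted = [
    {"value": "射水市 円池", "label": "CITY"},
    {"value": "6-28-20", "label": "TELEPHONENUM"},
    {"value": "520-2111", "label": "ZIPCODE"},
]

precision, recall, f1 = entity_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Exact matching is strict: a prediction with the right span but the wrong label counts as both a false positive and a false negative. A partial-overlap metric would be more forgiving.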

Global Coverage

30 languages across 3 regions

Asia-Pacific

7 new languages

Japanese (日本語), Korean (한국어), Chinese (中文), Vietnamese (Tiếng Việt), Indonesian (Bahasa Indonesia), Malay (Bahasa Melayu), Tagalog (Filipino)

Europe

23 locales

English, German (Deutsch), French (Français), Spanish (Español), Italian (Italiano), Dutch (Nederlands), Polish (Polski), Swedish (Svenska)

Americas

North & South

English (US), Spanish (LatAm), Portuguese (BR), French (CA)

Data Structure & Examples

Schema · real samples in every language

{
  "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
  "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
  "privacy_mask": [ { "value": "射水市 円池", "label": "CITY" }, { "value": "昼場", "label": "STREET" }, { "value": "6-28-20", "label": "BUILDINGNUM" }, { "value": "520-2111", "label": "ZIPCODE" } ],
  "language": "ja", "region": "JP", "script": "Jpan"
}

language: ja · region: JP · script: Jpan
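The record above suggests how `masked_text` relates to `source_text` and `privacy_mask`. The sketch below rebuilds the masked string from the other two fields. It assumes entities are listed in order of appearance and that placeholders are numbered per label, both inferred from this single sample rather than confirmed by the dataset documentation. `apply_mask` is a hypothetical helper name.

```python
from collections import defaultdict


def apply_mask(source_text, privacy_mask):
    """Replace each entity value with a [LABEL_n] placeholder, left to right."""
    counters = defaultdict(int)  # per-label running index (assumed convention)
    pieces = []
    cursor = 0
    for entity in privacy_mask:
        value, label = entity["value"], entity["label"]
        # Search only after the previous entity so earlier text is never re-matched.
        start = source_text.find(value, cursor)
        if start == -1:
            raise ValueError(f"entity value not found in order: {value!r}")
        counters[label] += 1
        pieces.append(source_text[cursor:start])
        pieces.append(f"[{label}_{counters[label]}]")
        cursor = start + len(value)
    pieces.append(source_text[cursor:])
    return "".join(pieces)


# The sample record from this page.
record = {
    "source_text": "本日の集合場所は 射水市 円池 の 昼場 6-28-20、郵便番号は 520-2111 です。",
    "masked_text": "本日の集合場所は [CITY_1] の [STREET_1] [BUILDINGNUM_1]、郵便番号は [ZIPCODE_1] です。",
    "privacy_mask": [
        {"value": "射水市 円池", "label": "CITY"},
        {"value": "昼場", "label": "STREET"},
        {"value": "6-28-20", "label": "BUILDINGNUM"},
        {"value": "520-2111", "label": "ZIPCODE"},
    ],
}

rebuilt = apply_mask(record["source_text"], record["privacy_mask"])
print(rebuilt == record["masked_text"])
```

If the full dataset also exposes character offsets for each entity, slicing by offset would be more robust than searching by value.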

Entity Types

20 core + 61 industry-specific

Core PII Labels (Open)

DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM URL
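For token classification, a label list like this is usually expanded into a tag vocabulary. A minimal sketch, assuming a standard BIO scheme with one `B-` and one `I-` tag per label plus a single `O` tag; the dataset's own label-to-id mapping may be ordered differently, so `label2id` here is illustrative only.

```python
# The 20 core labels listed above.
CORE_LABELS = (
    "DATE GIVENNAME SURNAME EMAIL CITY TITLE TELEPHONENUM AGE STREET "
    "BUILDINGNUM ZIPCODE IDCARDNUM CREDITCARDNUMBER DRIVERLICENSENUM "
    "GENDER TAXNUM SEX SOCIALNUM PASSPORTNUM URL"
).split()

# "O" marks tokens outside any entity; every label gets a B- and an I- tag.
bio_tags = ["O"] + [
    f"{prefix}-{label}" for label in CORE_LABELS for prefix in ("B", "I")
]

# Hypothetical id assignment: position in the tag list.
label2id = {tag: i for i, tag in enumerate(bio_tags)}
id2label = {i: tag for tag, i in label2id.items()}

print(len(CORE_LABELS), len(bio_tags))
```

Under this scheme the 20 core labels yield 2 × 20 + 1 = 41 output classes for a classification head.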

Industry-Specific Labels (Enterprise)

DIAGNOSES MEDICATION TESTRESULTS HOSPITALNAME IBAN ACCOUNTNUM BIC SALARY JOBTITLE ORGANISATION APIKEY PASSWORD MACADDRESS GEOCOORD VEHICLEVIN + 46 more

Use Cases

Tasks supported

Named Entity Recognition

Train NER models to detect and classify PII entities with pre-computed mBERT-compatible BIO labels.
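To illustrate what BIO labels encode, the sketch below converts character-level entity spans into per-token tags. It uses whitespace tokenization as a stand-in, which is an assumption for readability: the dataset's pre-computed labels target mBERT's subword tokenizer, and whitespace splitting would not work for languages such as Japanese. The example sentence and the helper name `spans_to_bio` are hypothetical.

```python
def spans_to_bio(text, entities):
    """Tag whitespace tokens with BIO labels.

    entities: list of (start, end, label) character spans, end exclusive.
    A token is tagged only if it lies entirely inside a span.
    """
    tokens, tags = [], []
    pos = 0
    for token in text.split():
        start = text.index(token, pos)  # locate this token at or after pos
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in entities:
            if start >= ent_start and end <= ent_end:
                # First token of the span gets B-, later tokens get I-.
                tag = ("B-" if start == ent_start else "I-") + label
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags


# Hypothetical example sentence, not taken from the dataset.
text = "Maria lives on Rue de Lausanne"
street_start = text.index("Rue de Lausanne")
entities = [
    (0, len("Maria"), "GIVENNAME"),
    (street_start, street_start + len("Rue de Lausanne"), "STREET"),
]

tokens, tags = spans_to_bio(text, entities)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

With a subword tokenizer the same idea applies per subword piece, typically using the tokenizer's offset mapping instead of `str.index`.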

Data Anonymization

Build production-grade anonymization pipelines compliant with GDPR, the EU AI Act, PDPA, APPI, and PIPA.

LLM Fine-tuning

Fine-tune large language models for privacy-aware text generation and redaction across languages.

Need enterprise-grade data?

Get access to the full 3M dataset including all industry-specific components and Asia-Pacific coverage, with commercial licensing for your organization.
