Authors: Rutvik Acharya, Nitin Agarwal
Personally Identifiable Information (PII) removal is a critical task in data privacy and security, requiring the identification and redaction of sensitive entities such as names, addresses, and social security numbers from unstructured text. Traditional Named Entity Recognition (NER) models used for PII removal are limited to predefined entity types, necessitating retraining for each new PII category. This paper presents zero-shot NER architectures that enable the efficient removal of any type of PII without extensive retraining.
We leverage two advanced architectures for zero-shot NER in the context of PII removal: bi-encoder and poly-encoder models. The bi-encoder architecture separates the encoding of input text and PII entity types into distinct transformer models, allowing for efficient and scalable processing. PII entity type encodings can be pre-computed and reused across different input texts, reducing computational overhead. The poly-encoder architecture enhances the bi-encoder approach by incorporating a post-fusion step to model interactions between input text and PII entity representations explicitly, addressing the lack of inter-entity understanding in standalone bi-encoder models.
To evaluate the effectiveness of these architectures for PII removal, we conduct experiments using a diverse, high-quality dataset containing various types of PII. We compare the performance of our proposed models with existing zero-shot NER approaches, such as GLiNER, in terms of precision, recall, and F1 score. The results demonstrate that our bi-encoder model outperforms GLiNER in identifying and removing PII entities, setting a new benchmark for zero-shot NER in the context of data privacy and security.
These architectures offer several advantages for PII removal, including the ability to recognize an unlimited number of PII entities simultaneously, faster inference with preprocessed PII entity embeddings, and better generalization to unseen PII categories.
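The reuse of pre-computed entity type encodings described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_encode` is a hashed character-trigram stand-in for the two transformer encoders (it carries no semantic knowledge), and `LabelIndex`, `score_spans`, and `DIM` are hypothetical names, not taken from the paper. Only the data flow is meant to carry over: label embeddings are computed once, then compared against span embeddings from any number of input texts.

```python
import math
import zlib
from typing import Dict, List, Tuple

DIM = 64  # hypothetical embedding size for the toy encoder


def toy_encode(text: str) -> List[float]:
    """Stand-in for a transformer encoder: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Dot product of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


class LabelIndex:
    """Encodes PII entity type names once so they can be reused across texts."""

    def __init__(self, labels: List[str]) -> None:
        # Pre-computation step: one encoder pass per label, done up front.
        self.embeddings: Dict[str, List[float]] = {
            label: toy_encode(label) for label in labels
        }

    def score_spans(self, text: str, max_span_len: int = 3) -> List[Tuple[str, str, float]]:
        """Return (span, best-matching label, similarity) for every candidate span."""
        tokens = text.split()
        results = []
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_span_len, len(tokens)) + 1):
                span = " ".join(tokens[start:end])
                span_vec = toy_encode(span)  # text side is encoded per input
                label, score = max(
                    ((name, cosine(span_vec, emb)) for name, emb in self.embeddings.items()),
                    key=lambda pair: pair[1],
                )
                results.append((span, label, score))
        return results


index = LabelIndex(["person name", "email address"])
for text in ["contact alice today", "mail bob@example.com now"]:
    # The same pre-computed label embeddings serve both texts.
    print(index.score_spans(text, max_span_len=2))
```

Adding a new PII category in this arrangement means encoding one more label string; the text-side encoder is untouched. A real system would replace `toy_encode` with trained encoders and apply a score threshold before redacting a span.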
These advancements enable the development of efficient and scalable PII removal systems capable of handling diverse and evolving PII requirements, ensuring compliance with data privacy regulations and protecting sensitive information.
In this paper, we present an adaptive approach to PII detection that dynamically selects between GLiNER and Presidio models based on contextual analysis. Our methodology first analyzes input text for regional markers, script patterns, and format variations to determine the most suitable model for PII detection. GLiNER is prioritized for Western contexts and standardized formats, while Presidio handles region-specific and non-standard patterns. This context-aware selection is complemented by a robust validation framework that includes both primary and secondary validation layers, confidence scoring, and enhanced processing for ambiguous cases. Experimental results demonstrate a 12%-14% improvement in overall accuracy compared to single-model approaches, with particularly strong performance in handling diverse regional formats and multi-script environments, while maintaining acceptable processing overhead.
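The context-aware selection step can be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the function names, the 0.2 script threshold, and the two regular expressions (a 4-4-4 grouped 12-digit identifier and a `+91` phone prefix) are hypothetical examples of "regional markers" and "format variations". The sketch only returns the name of the detector to use and does not call GLiNER or Presidio.

```python
import re
import unicodedata

# Hypothetical examples of region-specific formats; a real system would need
# a curated, much larger set.
REGIONAL_PATTERNS = [
    re.compile(r"\b\d{4}\s\d{4}\s\d{4}\b"),  # 12-digit ID written in 4-4-4 groups
    re.compile(r"\+91[\s-]?\d{10}\b"),       # phone number with a +91 prefix
]


def non_latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name is not LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = sum(
        1 for ch in letters if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters)


def choose_detector(text: str, script_threshold: float = 0.2) -> str:
    """Return "presidio" for non-Latin or region-specific input, else "gliner"."""
    if non_latin_ratio(text) > script_threshold:
        return "presidio"  # script pattern check
    if any(pattern.search(text) for pattern in REGIONAL_PATTERNS):
        return "presidio"  # regional marker / format variation check
    return "gliner"        # default for Western, standardized formats


print(choose_detector("John Smith lives at 12 Main Street, Boston."))
print(choose_detector("Call me at +91 9876543210"))
```

The validation layers and confidence scoring mentioned in the abstract would sit after this routing step; they are not modelled here because the abstract does not specify them.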
Comments: 5 Pages.
[v1] 2026-02-08 17:33:59
Unique-IP document downloads: 298 times
Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.