Artificial Intelligence

   

Enhancing Neural Language Models: A Comprehensive Approach with Tensorized Transformer and Over-Parameterization

Authors: Pratham Taneja, Keshav Chandra, Daamini Batra, Akshita Gupta, Rahul Kumar, Bhaumik Tyagi

Abstract: This paper introduces strategies to improve the performance and efficiency of neural language models, addressing scalability and deployment in resource-limited settings. First, it presents multi-linear attention with Block-Term Tensor Decomposition (BTD), a self-attention model that combines tensor decomposition with parameter sharing. This approach achieves significant parameter compression while improving performance on language modeling tasks, as shown by comparative evaluations against traditional Transformer models. Second, TensorCoder employs a dimension-wise attention mechanism to address the quadratic complexity, in sequence length, of scaled dot-product attention in Transformers, making it suitable for long-sequence tasks. It is validated on masked language modeling and neural machine translation, where it substantially reduces computational complexity while matching or surpassing the performance of the original Transformer. Third, the paper optimizes the fine-tuning of pre-trained language models (PLMs). To overcome the computational challenges associated with large PLMs, it uses a matrix product operator (MPO) to over-parameterize the model during fine-tuning. Efficient decomposition methods factorize parameter matrices into higher-order tensors, and static and dynamic strategies select the important parameter matrices. Extensive experiments demonstrate that this approach significantly improves the fine-tuning performance of small PLMs, enabling them to outperform larger counterparts with three times as many parameters. The results open avenues for scaling language models efficiently without increasing inference latency, and show the potential of over-parameterization to make large PLMs more applicable in real-world systems.
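The abstract does not spell out how a parameter matrix is factorized with a matrix product operator, so the following is only a minimal sketch of one common way to do it (sequential SVD), not the authors' implementation. The function names `mpo_decompose` and `mpo_reconstruct`, the `max_rank` argument, and the example factor sizes are all assumptions made for illustration.

```python
import numpy as np


def mpo_decompose(weight, in_dims, out_dims, max_rank=None):
    """Factorize a 2-D weight matrix into a chain of 4-D cores via sequential SVD.

    Hypothetical helper: `in_dims` and `out_dims` are factorizations of the
    row and column sizes, e.g. 12 = 2 * 3 * 2. Core k has shape
    (r_{k-1}, in_dims[k], out_dims[k], r_k) with r_0 = r_n = 1.
    """
    n = len(in_dims)
    assert len(out_dims) == n
    assert weight.shape == (int(np.prod(in_dims)), int(np.prod(out_dims)))

    # (I, J) -> (i_1..i_n, j_1..j_n) -> (i_1, j_1, ..., i_n, j_n)
    tensor = weight.reshape(*in_dims, *out_dims)
    order = [axis for k in range(n) for axis in (k, n + k)]
    rest = tensor.transpose(order)

    cores, rank = [], 1
    for k in range(n - 1):
        # Split off the k-th (i_k, j_k) pair from everything to its right.
        rest = rest.reshape(rank * in_dims[k] * out_dims[k], -1)
        u, s, vt = np.linalg.svd(rest, full_matrices=False)
        new_rank = len(s) if max_rank is None else min(max_rank, len(s))
        cores.append(u[:, :new_rank].reshape(rank, in_dims[k], out_dims[k], new_rank))
        # Carry the singular values into the remainder of the chain.
        rest = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
    cores.append(rest.reshape(rank, in_dims[-1], out_dims[-1], 1))
    return cores


def mpo_reconstruct(cores):
    """Contract the cores back into a single 2-D matrix."""
    n = len(cores)
    result = cores[0]
    for core in cores[1:]:
        # Contract the trailing bond of `result` with the leading bond of `core`.
        result = np.tensordot(result, core, axes=1)
    result = result.reshape(result.shape[1:-1])  # drop the two size-1 boundary bonds
    in_axes = list(range(0, 2 * n, 2))
    out_axes = list(range(1, 2 * n, 2))
    result = result.transpose(in_axes + out_axes)
    in_size = int(np.prod([c.shape[1] for c in cores]))
    out_size = int(np.prod([c.shape[2] for c in cores]))
    return result.reshape(in_size, out_size)


rng = np.random.default_rng(0)
weight = rng.standard_normal((12, 12))
cores = mpo_decompose(weight, in_dims=(2, 3, 2), out_dims=(2, 2, 3))
restored = mpo_reconstruct(cores)
n_core_params = sum(core.size for core in cores)
```

Without rank truncation the cores reproduce the original matrix up to floating-point error while holding more trainable entries than the matrix itself, which is the sense in which such a factorization can over-parameterize a layer. Because the cores can be contracted back into one matrix after training, the layer's shape at inference time is unchanged.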

Comments: 10 Pages.

Submission history

[v1] 2024-02-12 22:57:57

Vixra.org is a pre-print repository rather than a journal. Articles hosted may not yet have been verified by peer-review and should be treated as preliminary. In particular, anything that appears to include financial or legal advice or proposed medical treatments should be treated with due caution. Vixra.org will not be responsible for any consequences of actions that result from any form of use of any documents on this website.
