Large language models (LLMs), usually described as AI systems trained on massive amounts of data to predict the next token (a word or a piece of a word), are now being evaluated from a new angle.
According to a new research paper from DeepMind, Google’s AI subsidiary, LLMs can be viewed as powerful data compressors. The study shows that, with minor modifications, LLMs can compress information as well as, if not better than, commonly used compression algorithms. This point of view offers new insights into how LLMs are developed and evaluated.
LLMs as data compression devices
Google DeepMind researchers repurposed open-source large language models (LLMs) to perform arithmetic coding, a lossless compression algorithm. LLMs are trained with log-loss (cross-entropy), which raises the probability of natural text sequences and lowers the probability of all others. The result is a probability distribution over sequences, and there is a one-to-one equivalence between such a distribution and a compressor: the more probable the model finds a sequence, the fewer bits are needed to encode it. Lossless compression, as in gzip, refers to a class of methods that can rebuild the original data exactly from the compressed data, with no loss of information. Although the link between compression, learning, and intelligence has long been recognized, many researchers are unaware of this equivalence.
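The link between predicted probabilities and compressed size can be illustrated with a minimal sketch. It does not use an LLM or implement a full arithmetic coder. Instead, it assumes a toy adaptive byte-frequency model (a hypothetical stand-in for the language model) and adds up -log2 of the probability the model assigns to each byte, which is the code length an arithmetic coder would approach to within a few bits. The function name and the sample data are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(byte | bytes seen so far) under a toy adaptive model.

    The model starts uniform over all 256 byte values (add-one smoothing)
    and updates its counts after each byte, so a decoder that sees the same
    history could reproduce the same probabilities.
    """
    counts = Counter()
    total_bits = 0.0
    for seen, byte in enumerate(data):
        # Probability of this byte given the counts observed so far.
        p = (counts[byte] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        counts[byte] += 1
    return total_bits


# Hypothetical sample: repetitive data that the model learns to predict.
sample = b"abracadabra " * 50
bits = ideal_code_length_bits(sample)
ratio = bits / (8 * len(sample))
print(f"raw size: {8 * len(sample)} bits")
print(f"ideal code length: {bits:.0f} bits ({ratio:.1%} of raw)")
```

A model that predicts the data better assigns higher probabilities, so the sum shrinks. This is why minimizing log-loss during training and minimizing compressed size are the same objective.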
LLMs versus standard compression techniques
The researchers tested the compression capabilities of LLMs on text, image, and audio data, using vanilla transformers and Chinchilla models. LLMs excelled at text compression: the 70-billion-parameter Chinchilla model compressed data to 8.3% of its original size, outperforming gzip and LZMA2. Notably, their compression rates on image and audio data also surpassed those of domain-specific compression algorithms. The Chinchilla models achieve this compression performance through in-context learning, which conditions a meta-trained model on a particular task. In other words, LLM compressors can act as predictors for unexpected modalities, such as images and audio. However, because of differences in size and speed, LLMs are not practical tools for data compression when compared with existing methods.
Classic compressors such as gzip remain the more practical choice: they are small and fast, whereas LLMs are large and slow to run on consumer devices.
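For reference, the classic baselines are easy to measure. The sketch below uses Python's standard gzip and lzma modules (the latter writes the .xz format, which is based on LZMA2) to compute a compression rate, defined here as compressed size divided by raw size. The sample text is a made-up, highly repetitive string, so the rates it prints are not comparable to the figures reported in the study.

```python
import gzip
import lzma


def compression_rate(raw: bytes, compressed: bytes) -> float:
    """Compressed size as a fraction of the raw size (lower is better)."""
    return len(compressed) / len(raw)


# Hypothetical sample data; real benchmarks use large, varied corpora.
text = ("Large language models assign probabilities to text. " * 200).encode("utf-8")

gzip_blob = gzip.compress(text)
lzma_blob = lzma.compress(text)  # default container is .xz (LZMA2)

rates = {
    "gzip": compression_rate(text, gzip_blob),
    "lzma": compression_rate(text, lzma_blob),
}

# Lossless: decompressing must give back exactly the original bytes.
assert gzip.decompress(gzip_blob) == text
assert lzma.decompress(lzma_blob) == text

for name, rate in rates.items():
    print(f"{name}: {rate:.2%} of original size")
```

Both compressors run in milliseconds with a few kilobytes of code, which is the size and speed advantage that keeps them practical.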
Taking a fresh look at LLMs
Viewing large language models (LLMs) through the lens of compression also provides insight into how scale influences their performance. Bigger models achieve better compression rates on larger datasets, but their performance deteriorates on smaller ones. The researchers found that model size reaches a critical point beyond which the adjusted compression rate, which counts the size of the model along with the compressed data, gets worse because the number of parameters is too large relative to the size of the dataset. This implies that a larger model is not necessarily better for every task. The compression ratio thus offers a principled way to reason about scale.
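The effect of model size on the adjusted rate can be sketched with simple arithmetic. The example below assumes the adjusted rate is (compressed size + model size) / raw size, with each parameter stored in 2 bytes. The 1 GB dataset size and the figures for the small model are illustrative assumptions. Only the 70 billion parameters and the 8.3% raw rate come from the article.

```python
def adjusted_compression_rate(
    raw_bytes: int,
    compressed_bytes: int,
    n_params: int,
    bytes_per_param: int = 2,  # assumption: 16-bit parameters
) -> float:
    """(compressed size + model size) / raw size; lower is better."""
    model_bytes = n_params * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes


RAW = 1_000_000_000  # assumed dataset size: 1 GB

# Large model: 70 billion parameters, data compressed to 8.3% of raw size.
large = adjusted_compression_rate(RAW, 83_000_000, 70_000_000_000)

# Hypothetical small model: 200,000 parameters, data compressed to 30%.
small = adjusted_compression_rate(RAW, 300_000_000, 200_000)

print(f"large model adjusted rate: {large:.1%}")
print(f"small model adjusted rate: {small:.1%}")
```

Under these assumptions the large model's own size dwarfs the data it compresses, so its adjusted rate is far above 100%, while the small model stays well below it. The large model would only pay off on a dataset much bigger than its parameters.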
Conclusion
The work also indicates that test set contamination, an important challenge in machine learning (ML) training, is not a serious concern when models are evaluated with compression approaches that account for model complexity, known as Minimum Description Length (MDL). The contamination problem itself is ill-defined, which can lead to misleading evaluation results. MDL, by contrast, takes model complexity into account and penalizes pure memorizers, so the researchers believe it should be used more often for assessing models. This has substantial implications for future LLM evaluation.