Recently, I was reflecting on Large Language Models (LLMs) and how they contain much of the knowledge of the internet in a far smaller storage space. For example, the GPT-4 model is reportedly about 570 GB, a size almost anyone could fit on their own computer. The internet, by contrast, is enormous, and growing by an astonishing 400 million TB each day.
It got me thinking: could LLMs be thought of as sophisticated, lossy compression algorithms, like JPEGs? After all, they distill enormous amounts of data, like the entire internet, into a much smaller form while preserving what matters most about it.
The Fundamentals of Data Compression
To appreciate the analogy, it helps to review the basics of data compression, which I’ve outlined below:
Lossless Compression
Lossless algorithms (e.g., ZIP, PNG) reduce file size by eliminating redundancy without sacrificing any original data. They ensure that every bit of the source information can be recovered exactly during decompression.
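The lossless round trip can be sketched with Python's built-in zlib module: redundant input shrinks dramatically, and decompression recovers it byte for byte.

```python
import zlib

# Redundant data compresses well because lossless algorithms
# exploit repetition; this string has obvious structure.
original = b"the quick brown fox " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed is far smaller
assert restored == original            # every byte recovered exactly
```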

Lossy Compression
Lossy methods (e.g., JPEG, MP3) achieve higher compression ratios by approximating and discarding data deemed less critical. While some details are lost, the result is a significantly smaller file that retains the essential content.
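A minimal sketch of the lossy trade-off, using made-up sample values: storing each float as an 8-bit level discards information, so decompression can only approximate the original, but the approximation stays within half a quantization step.

```python
# A toy lossy scheme: store each [0, 1] sample as one of 256 levels.
samples = [0.1234, 0.5678, 0.9012, 0.3456]

def compress(xs, levels=256):
    # Information is discarded here: many floats map to the same level.
    return [round(x * (levels - 1)) for x in xs]

def decompress(qs, levels=256):
    return [q / (levels - 1) for q in qs]

restored = decompress(compress(samples))
# restored differs from samples, but by at most half a level (~0.002)
errors = [abs(a - b) for a, b in zip(samples, restored)]
print(max(errors))
```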

In both cases, the goal is to capture the “essence” of the original information, though through different trade-offs between accuracy and space savings.
When Compression Is “Just Right”
It’s a common concern that lossy compression might omit important details. However, in practice, efficient lossy techniques are designed to keep critical information intact.
For example, the most well-known lossy compression algorithm, JPEG, maintains the appearance of a photograph by leveraging shortcomings of human visual perception. It discards information that the human eye is less sensitive to, such as subtle color variations and fine details, while preserving more critical elements like brightness and contrast. By transforming the image into the frequency domain using a process called Discrete Cosine Transform (DCT), JPEG identifies and removes high-frequency components that contribute less to the overall visual experience. This selective reduction allows for significant file size savings without noticeably compromising image quality, demonstrating how lossy compression can effectively retain essential information.
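The frequency-domain idea can be sketched in miniature with a one-dimensional DCT over a single 8-sample block (real JPEG works on 8×8 pixel blocks and quantizes coefficients rather than zeroing them outright, so this is a simplification): discarding the high-frequency coefficients still reconstructs something close to the original.

```python
import math

def dct(xs):
    """Unnormalized DCT-II: express samples as cosine-frequency coefficients."""
    N = len(xs)
    return [sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n, x in enumerate(xs))
            for k in range(N)]

def idct(Xs):
    """Matching inverse (DCT-III with 1/N scaling)."""
    N = len(Xs)
    return [Xs[0] / N + (2 / N) * sum(
                X * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, X in enumerate(Xs) if k > 0)
            for n in range(N)]

# A smooth 8-sample block, like one row of a gently varying photo region
block = [52, 55, 61, 66, 70, 61, 64, 73]

coeffs = dct(block)
# Discard the four highest-frequency coefficients, loosely mimicking
# JPEG quantization of detail the eye barely notices.
truncated = coeffs[:4] + [0.0] * 4
approx = idct(truncated)

print([round(v, 1) for v in approx])  # close to the original block
```

With all coefficients kept, `idct(dct(block))` reproduces the block exactly; the loss comes only from what we choose to throw away.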
LLMs do something similar with the information on the internet. Consider the example of those immutable physics equations E = mc², F = ma, and V = IR. Though the LLM is much, much smaller than the portion of the internet it was trained on, these key pieces of information are preserved and can be retrieved accurately because they represent the fundamental structure of physical laws.
These formulas are analogous to the “core content” in language data that LLMs compress. Even though the process might discard myriad less essential details, the model retains these vital building blocks—proving that the compression is both efficient and sufficiently precise for practical applications.
How LLMs Mirror Lossy Compression
While an LLM can’t tell you everyone who has ever written about the physics equations above, or every nuance of what has been said about them, the formulas themselves are important enough that they are stored in the model and “understood” by it. The same applies to billions of other facts, all stored in a file about the size of four 8K feature films. Incredible when you think about it that way!
Large Language Models work by learning the underlying patterns in language. In doing so, they essentially perform a form of lossy compression with the following characteristics:
- Selective Retention:
LLMs prioritize retaining patterns that define the structure and meaning of language, much as JPEG compression discards minor color details while preserving image clarity.
- Generalization Over Memorization:
Instead of storing every word or sentence, LLMs internalize the “rules” of language. The fact that they can reproduce core factual elements, such as our physics equations, shows that the process focuses on what’s important.
- Balancing Loss and Utility:
Just as well-tuned lossy compression ensures that an image still looks good after data reduction, LLMs strike a balance between compressing data and preserving the factual and semantic integrity of the content.
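The "generalization over memorization" point can be sketched with a toy bigram model, a deliberately crude stand-in for a real LLM. It stores only word-pair counts from a hypothetical corpus, far less than the text itself, yet the oft-repeated "core fact" survives the compression while the incidental filler carries little weight.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: one "core fact" repeated often, plus some filler.
corpus = ("energy equals mass times the speed of light squared . " * 50 +
          "the weather was nice today . " * 3).split()

# "Train" a bigram model: keep only word-pair counts, not the text itself.
model = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    model[prev][word] += 1

def predict(prev):
    # The most frequent continuation wins, like a greedy decode.
    return model[prev].most_common(1)[0][0]

# The model is far smaller than the corpus, yet reproduces the
# dominant pattern: the repeated fact survives compression.
print(predict("energy"))  # -> "equals"
print(predict("mass"))    # -> "times"
```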
Evaluating the Effectiveness of Compression
In traditional compression algorithms, engineers evaluate performance using key metrics such as compression ratio, fidelity and reconstruction quality, and speed and efficiency.
- Compression Ratio:
Measures the reduction in size from raw data to its compressed form. LLMs knock this one out of the park, compressing much of the internet into a very manageable file size.
- Fidelity and Reconstruction Quality:
In traditional lossy systems, metrics such as Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) gauge quality. For LLMs, we consider how accurately the model reproduces core facts, like the scientific formulas I referenced, as evidence that critical information is not lost.
- Speed and Efficiency:
Both LLMs and conventional compression must meet performance standards in real-time applications. Currently, training an LLM is extremely time-consuming and expensive in both energy and processing power. In this one area, LLMs aren’t as efficient as other compression algorithms, but since LLMs are written infrequently and read frequently, this is not a huge problem at the moment.
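As a back-of-the-envelope sketch of the ratio metric: taking the ~570 GB model size mentioned earlier against an assumed ~45 TB of raw training text (a figure in the ballpark reported for GPT-3-class crawls, used here purely for illustration) gives a ratio of roughly 79:1.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Size of the raw data divided by the size of its compressed form."""
    return original_bytes / compressed_bytes

model_size = 570e9        # bytes; the ~570 GB figure cited above
training_corpus = 45e12   # bytes; assumed ~45 TB of raw text (illustrative)

print(f"{compression_ratio(training_corpus, model_size):.0f}x")  # -> 79x
```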
The fact that a model can output accurate equations like E = mc², F = ma, and V = IR confirms that despite operating with a “lossy” approach, the compression is intelligently crafted to preserve the essence of factual knowledge.
Conclusion
As I’ve shown, LLMs demonstrate a compelling parallel to lossy compression algorithms. While they streamline vast amounts of data, they ingeniously preserve essential details. This observation may not change much except to give you a different perspective on how these models work and what they can accomplish.


