Recently, I was reflecting on Large Language Models (LLMs) and how they contain much of the knowledge of the internet in a far smaller storage space. For example, the GPT-4 model is reportedly about 570 GB, a size almost anyone could fit on their own computer. The internet, by contrast, is enormous, and growing by an astonishing 400 million TB each day.
It got me thinking: could LLMs be thought of as sophisticated, lossy compression algorithms, like JPEGs? After all, they distill enormous amounts of data, like the entire internet, into a much smaller form while preserving what matters most about it.
The Fundamentals of Data Compression
To appreciate the analogy, it helps to review the basics of data compression, which I’ve outlined below:
Lossless Compression
Lossless algorithms (e.g., ZIP, PNG) reduce file size by eliminating redundancy without sacrificing any original data. They ensure that every bit of the source information can be recovered exactly during decompression.
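The lossless round trip can be sketched with Python's built-in zlib module: redundant input shrinks dramatically, and decompression recovers it byte for byte.

```python
import zlib

# Redundant data compresses well because lossless algorithms
# exploit repetition; this string has obvious structure.
original = b"the quick brown fox " * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed is far smaller
assert restored == original            # every byte recovered exactly
```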

Lossy Compression
Lossy methods (e.g., JPEG, MP3) achieve higher compression ratios by approximating and discarding data deemed less critical. While some details are lost, the result is a significantly smaller file that retains the essential content.
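A minimal sketch of the lossy trade-off, using made-up sample values: storing each float as an 8-bit level discards information, so decompression can only approximate the original, but the approximation stays within half a quantization step.

```python
# A toy lossy scheme: store each [0, 1] sample as one of 256 levels.
samples = [0.1234, 0.5678, 0.9012, 0.3456]

def compress(xs, levels=256):
    # Information is discarded here: many floats map to the same level.
    return [round(x * (levels - 1)) for x in xs]

def decompress(qs, levels=256):
    return [q / (levels - 1) for q in qs]

restored = decompress(compress(samples))
# restored differs from samples, but by at most half a level (~0.002)
errors = [abs(a - b) for a, b in zip(samples, restored)]
print(max(errors))
```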

In both cases, the goal is to capture the “essence” of the original information, though through different trade-offs between accuracy and space savings.
When Compression Is “Just Right”
It’s a common concern that lossy compression might omit important details. However, in practice, efficient lossy techniques are designed to keep critical information intact.
For example, the most well-known lossy compression algorithm, JPEG, maintains the appearance of a photograph by leveraging shortcomings of human visual perception. It discards information that the human eye is less sensitive to, such as subtle color variations and fine details, while preserving more critical elements like brightness and contrast. By transforming the image into the frequency domain using a process called Discrete Cosine Transform (DCT), JPEG identifies and removes high-frequency components that contribute less to the overall visual experience. This selective reduction allows for significant file size savings without noticeably compromising image quality, demonstrating how lossy compression can effectively retain essential information.
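The frequency-domain idea can be sketched in miniature with a one-dimensional DCT over a single 8-sample block (real JPEG works on 8×8 pixel blocks and quantizes coefficients rather than zeroing them outright, so this is a simplification): discarding the high-frequency coefficients still reconstructs something close to the original.

```python
import math

def dct(xs):
    """Unnormalized DCT-II: express samples as cosine-frequency coefficients."""
    N = len(xs)
    return [sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n, x in enumerate(xs))
            for k in range(N)]

def idct(Xs):
    """Matching inverse (DCT-III with 1/N scaling)."""
    N = len(Xs)
    return [Xs[0] / N + (2 / N) * sum(
                X * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for k, X in enumerate(Xs) if k > 0)
            for n in range(N)]

# A smooth 8-sample block, like one row of a gently varying photo region
block = [52, 55, 61, 66, 70, 61, 64, 73]

coeffs = dct(block)
# Discard the four highest-frequency coefficients, loosely mimicking
# JPEG quantization of detail the eye barely notices.
truncated = coeffs[:4] + [0.0] * 4
approx = idct(truncated)

print([round(v, 1) for v in approx])  # close to the original block
```

With all coefficients kept, `idct(dct(block))` reproduces the block exactly; the loss comes only from what we choose to throw away.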
LLMs do something similar with the information on the internet. Consider the example of those immutable physics equations E = mc², F = ma, and V = IR. Though the LLM is much, much smaller than the portion of the internet it was trained on, these key pieces of information are preserved and can be retrieved accurately because they represent the fundamental structure of physical laws.
These formulas are analogous to the “core content” in language data that LLMs compress. Even though the process might discard myriad less essential details, the model retains these vital building blocks—proving that the compression is both efficient and sufficiently precise for practical applications.
How LLMs Mirror Lossy Compression
While an LLM can’t tell you everyone who has ever written about the physics equations above, or every nuance of what has been said about them, the formulas themselves are important enough that they are stored in the model and “understood” by it. The same applies to billions of other facts, all stored in a file about the size of four 8K feature films. Incredible when you think about it that way!
Large Language Models work by learning the underlying patterns in language. In doing so, they essentially perform a form of lossy compression with the following characteristics:
- Selective Retention:
LLMs prioritize retaining patterns that define the structure and meaning of language, much as JPEG compression discards minor color details while preserving image clarity.
- Generalization Over Memorization:
Instead of storing every word or sentence, LLMs internalize the “rules” of language. The fact that they can reproduce core factual elements, such as our physics equations, shows that the process focuses on what’s important.
- Balancing Loss and Utility:
Just as well-tuned lossy compression ensures that an image still looks good after data reduction, LLMs strike a balance between compressing data and preserving the factual and semantic integrity of the content.
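The "generalization over memorization" point can be sketched with a toy bigram model, a deliberately crude stand-in for a real LLM. It stores only word-pair counts from a hypothetical corpus, far less than the text itself, yet the oft-repeated "core fact" survives the compression while the incidental filler carries little weight.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: one "core fact" repeated often, plus some filler.
corpus = ("energy equals mass times the speed of light squared . " * 50 +
          "the weather was nice today . " * 3).split()

# "Train" a bigram model: keep only word-pair counts, not the text itself.
model = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    model[prev][word] += 1

def predict(prev):
    # The most frequent continuation wins, like a greedy decode.
    return model[prev].most_common(1)[0][0]

# The model is far smaller than the corpus, yet reproduces the
# dominant pattern: the repeated fact survives compression.
print(predict("energy"))  # -> "equals"
print(predict("mass"))    # -> "times"
```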
Evaluating the Effectiveness of Compression
In traditional compression algorithms, engineers evaluate performance using key metrics such as compression ratio, fidelity and reconstruction quality, and speed and efficiency.
- Compression Ratio:
Measures the reduction in size from raw data to its compressed form. LLMs knock this one out of the park, compressing much of the internet into a very manageable file size.
- Fidelity and Reconstruction Quality:
In traditional lossy systems, metrics such as Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) gauge quality. For LLMs, we consider how accurately the model reproduces core facts, like the scientific formulas I referenced, as evidence that critical information is not lost.
- Speed and Efficiency:
Both LLMs and conventional compression must meet performance standards in real-time applications. Currently, training an LLM is extremely time-consuming and expensive in both energy and processing power. In this one area, LLMs aren’t as efficient as other compression algorithms, but since LLMs are written infrequently and read frequently, this is not a huge problem at the moment.
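As a back-of-the-envelope sketch of the ratio metric: taking the ~570 GB model size mentioned earlier against an assumed ~45 TB of raw training text (a figure in the ballpark reported for GPT-3-class crawls, used here purely for illustration) gives a ratio of roughly 79:1.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Size of the raw data divided by the size of its compressed form."""
    return original_bytes / compressed_bytes

model_size = 570e9        # bytes; the ~570 GB figure cited above
training_corpus = 45e12   # bytes; assumed ~45 TB of raw text (illustrative)

print(f"{compression_ratio(training_corpus, model_size):.0f}x")  # -> 79x
```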
The fact that a model can output accurate equations like E = mc², F = ma, and V = IR confirms that despite operating with a “lossy” approach, the compression is intelligently crafted to preserve the essence of factual knowledge.
Conclusion
As I’ve shown, LLMs demonstrate a compelling parallel to lossy compression algorithms. While they streamline vast amounts of data, they ingeniously preserve essential details. This observation may not change much except to give you a different perspective on how these models work and what they can accomplish.


