DeepSeek OCR: Beyond Just Text Recognition

Recently, DeepSeek released a paper introducing a new approach called Contexts Optical Compression, which stores text as vision tokens to help AI process long contexts more efficiently. 
I have broken down their research in an easy, beginner-friendly manner so you can understand exactly what DeepSeek discovered and why the AI world is talking about it.

Before going into depth, let's define a few key terms:

OCR (Optical Character Recognition) - Reading text out of an image. 
Token - A token is a small piece of a word. Example - "Kartik" may be split into "Kar" + "tik" or "Kartik".
LLM (Large Language Model) - A model trained on massive datasets that answers based on the information it has learned.
Context Window - The working memory an AI uses to recall the previous conversation.
Vision Token - A small piece of an image that AI can read and understand.

DeepSeek-OCR: Contexts Optical Compression

Goal - The idea is to encode the equivalent of a thousand words in a single image and have the model read them back. This approach could change how we think about AI memory and long-context processing, and its stated aim is to make very long contexts practical, possibly reaching 10 million tokens or beyond.

Why does AI need this?

Currently, Large Language Models (LLMs) like DeepSeek, ChatGPT, and Gemini talk to us using tokens, and they struggle to process long text because their context memory is limited. Because of this, they forget earlier parts of the conversation.
Right now:
  • 1 word ≈ 1 token (often a bit more)
  • More words → more tokens → more memory needed
There is always a context limit, whether you are on a free or a premium plan. The longer you chat with an AI model, the more tokens it uses, and once the limit is full, it starts forgetting the older parts of the conversation. This is the big problem with today's models.
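
To see the word-to-token relationship in practice, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer (used purely for illustration; DeepSeek models use their own tokenizer, but the principle is the same):

```python
# pip install tiktoken
import tiktoken

# Load a common BPE tokenizer (illustrative; not DeepSeek's own tokenizer).
enc = tiktoken.get_encoding("cl100k_base")

text = "Kartik is learning how large language models count tokens."
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
# Every extra token eats into the model's limited context window.
```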

What is DeepSeek's new Idea?

What if, instead of storing text as text, we stored it in an image, broke that image down into vision tokens, and let the AI read them back? This idea is called Contexts Optical Compression (COC).
Breakdown of COC:
  • Context - memory or conversation history 
  • Optical - using images
  • Compression - making things smaller

Why images? - Because a single image can represent a large amount of text using far fewer tokens. Imagine taking a picture of the classroom board instead of writing everything in your notebook: the photo captures everything in less space.
The same applies to AI, and that's why DeepSeek wants to use images as AI memory.

How Good Is DeepSeek’s Compression?

DeepSeek-OCR can convert text into vision tokens and then decode those vision tokens back into text with high accuracy. Here is how well it performs in the paper's benchmarks:

According to the reported numbers, about 100 vision tokens can recover roughly 1,000 text tokens with close to 97% precision, which is around 10 times fewer tokens than storing the text the normal way. Push the compression to about 20x and accuracy drops to roughly 60%.
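
A quick back-of-the-envelope calculation shows what this compression could mean for context length (the numbers are illustrative, based on the roughly 10x figure above):

```python
# Rough arithmetic for optical compression (illustrative numbers only).
text_tokens = 1_000_000        # a very long conversation stored as text tokens
compression_ratio = 10         # ~10x compression at near-lossless accuracy

vision_tokens = text_tokens // compression_ratio
print(f"{text_tokens:,} text tokens -> {vision_tokens:,} vision tokens")
# 1,000,000 text tokens -> 100,000 vision tokens
# The same context budget could therefore hold ~10x more remembered text.
```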

Can This Change the Future of AI Memory?

Yes. Today's models can handle roughly 128K to 1M tokens in long chats. With DeepSeek's compression idea, that could stretch to 10M-20M tokens. 
Benefits: the AI can remember more, respond faster, and compute more cheaply.

How Do Images Become Tokens?

DeepSeek uses a Vision Transformer (ViT) to read images. In simple terms, a ViT cuts an image into small patches, and each patch becomes a token that the AI can understand.
Example:
A patch of 16x16 pixels has 256 pixels.
Each pixel has 3 colors (red, green, blue).
So 256 x 3 = 768 numbers = embedding for that patch. This lets the AI understand the image in small parts.
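
Here is a minimal NumPy sketch of that patch math. It shows the standard ViT "patchify" step on a dummy image; it is an illustration of the general technique, not DeepSeek's actual code:

```python
import numpy as np

# A dummy RGB image: 224 pixels high, 224 wide, 3 color channels.
image = np.random.rand(224, 224, 3)
patch_size = 16

# Cut the image into non-overlapping 16x16 patches and flatten each one.
patches = []
for y in range(0, image.shape[0], patch_size):
    for x in range(0, image.shape[1], patch_size):
        patch = image[y:y + patch_size, x:x + patch_size, :]
        patches.append(patch.reshape(-1))   # 16 * 16 * 3 = 768 numbers

patches = np.stack(patches)
print(patches.shape)   # (196, 768): 196 patch tokens, each a vector of 768 numbers
```

A real ViT then multiplies each 768-number vector by a learned projection to get the final patch embedding, but the counting above is where the 768 comes from.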

DeepSeek’s Secret Ingredient: DeepEncoder

The big issue with this process is that images can produce too many vision tokens, which again increases memory use. So DeepSeek built a dedicated encoder called DeepEncoder.
DeepEncoder helps with:
  • Reducing the number of vision tokens
  • Keeping only the important parts
  • Handling high-resolution images efficiently

It works in 2 stages:

Stage 1: 
SAM (Segment Anything Model) - SAM looks at which parts of the image matter most. 
Example - If the image is a page of text on a plain background, SAM focuses on the text, not the blank space.

Stage 2: 
Compression + CLIP - Once SAM has processed the image:

A small convolutional module compresses the patch tokens into far fewer vision tokens (roughly 16x fewer)
A CLIP-style ViT then turns those compressed tokens into rich embeddings (meaningful picture pieces)

Together, SAM, the compressor, and CLIP make up the DeepEncoder. Finally, it sends the compressed vision tokens to DeepSeek-3B MoE (see the toy sketch of the compression step below).
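
To make the compression step concrete, here is a toy NumPy sketch of the core idea: shrinking many patch tokens into far fewer vision tokens. The real DeepEncoder uses a learned convolutional module for this, not the simple average pooling shown here:

```python
import numpy as np

def compress_tokens(patch_tokens: np.ndarray, ratio: int = 16) -> np.ndarray:
    """Toy stand-in for DeepEncoder's ~16x token compression (not real DeepSeek code)."""
    n, dim = patch_tokens.shape
    n_keep = max(1, n // ratio)
    # Merge groups of neighbouring tokens by averaging them.
    trimmed = patch_tokens[: n_keep * ratio].reshape(n_keep, ratio, dim)
    return trimmed.mean(axis=1)

patch_tokens = np.random.rand(196, 768)          # e.g. 196 ViT patches from one page
vision_tokens = compress_tokens(patch_tokens)    # far fewer tokens reach the decoder
print(patch_tokens.shape, "->", vision_tokens.shape)   # (196, 768) -> (12, 768)
```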

What Is DeepSeek-3B MoE (Mixture of Experts)?

It is a decoder model that routes each token to the expert modules best suited for the job. It has about 3B total parameters, but only around 570M are active at a time, which makes it fast and efficient. This decoder reads the vision tokens and converts them back into text.
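
The core Mixture-of-Experts idea fits in a few lines. This is a generic top-k routing sketch in NumPy; the expert count and sizes are illustrative and not DeepSeek's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, dim, top_k = 8, 16, 2                        # illustrative sizes only

token = rng.standard_normal(dim)                          # one incoming token embedding
router = rng.standard_normal((num_experts, dim))          # router (gating) weights
experts = rng.standard_normal((num_experts, dim, dim))    # one weight matrix per expert

scores = router @ token                                   # how well each expert fits this token
chosen = np.argsort(scores)[-top_k:]                      # keep only the top-k experts
weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()

# Only the chosen experts actually run, so most parameters stay inactive per token.
output = sum(w * (experts[i] @ token) for w, i in zip(weights, chosen))
print(f"used {top_k}/{num_experts} experts, output shape {output.shape}")
```

This is why a 3B-parameter MoE can behave like a much smaller model at inference time: each token only touches a fraction of the weights.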

Different Modes - DeepSeek-OCR offers different modes depending on how much detail is needed:

  • Tiny - 512x512 input, about 64 vision tokens
  • Small - 640x640 input, about 100 vision tokens
  • Base - 1024x1024 input, about 256 vision tokens
  • Large - 1280x1280 input, about 400 vision tokens
  • Gundam - a dynamic mode that tiles very large or detailed pages

Why does this matter so much?

DeepSeek is not just improving Optical Character Recognition; it is rethinking how AI stores, compresses, and remembers information. If it scales as hoped, it could meaningfully change how LLMs handle long context.
This research could lead to:

  • AI systems with huge memory to remember more context
  • Better knowledge storage 
  • Faster processing
  • Cheaper AI costs

This could become a new type of AI memory.

Where Is It Available?

Research paper is released - https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
Code is on GitHub - https://github.com/deepseek-ai/DeepSeek-OCR

Happy Exploring! Happy Learning!   


