Recently, DeepSeek released a paper introducing a new approach called Context Optical Compression, which stores text as vision tokens so AI models can process long context more efficiently.
I have broken down the whole research in an easy, beginner-friendly way so you can understand exactly what DeepSeek discovered and why the whole AI world is talking about it.
Before going into depth, let's define a few terms:
- LLM (Large Language Model) - A model trained on massive datasets that answers questions based on the information it has learned.
- Context Window - The memory an AI uses to recall the earlier parts of a conversation.
- Vision Token - A small piece of an image that the AI can read and understand.
- Roughly, 1 word = 1 token, so more words → more tokens → more memory needed.
- Context = memory or conversation history; Optical = using images; Compression = making things smaller.
How Good Is DeepSeek’s Compression?
DeepSeek-OCR can convert text into vision tokens and decode those vision tokens back into text with high accuracy. Here is how well it performs on the benchmark: according to the paper, around 100 vision tokens can store roughly 1,000 words' worth of text with almost perfect accuracy. That is about 10 times smaller than storing the same text the normal way.
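The arithmetic behind that "10 times smaller" claim can be sketched in a few lines. This is an illustration of the ratio only, not DeepSeek's code; the token counts are the approximate figures quoted above.

```python
# Rough arithmetic behind the compression claim (illustrative only).
# Assume ~1000 words of text would normally need ~1000 text tokens,
# while the same content stored optically needs ~100 vision tokens.

text_tokens = 1000      # tokens for ~1000 words the normal way
vision_tokens = 100     # vision tokens for the same content

compression_ratio = text_tokens / vision_tokens
print(compression_ratio)  # -> 10.0
```

So every page stored optically frees up roughly 90% of the memory it would otherwise occupy in the context window.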
Can This Change the Future of AI Memory?
Potential benefits: AI that remembers more, faster responses, and cheaper computation.
How Do Images Become Tokens?
A 16x16-pixel patch contains 256 pixels.
Each pixel has 3 color values (red, green, blue).
So 256 x 3 = 768 numbers, which form the embedding for that patch. This lets the AI understand the image in small parts.
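The patch math above can be verified with a toy example. This is a minimal sketch (not DeepSeek's code) using numpy and an assumed 224x224 image size:

```python
import numpy as np

# Toy illustration: split an image into 16x16 patches and flatten each
# patch's RGB values into a 768-number vector (its "embedding").
image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

patch = image[:16, :16, :]        # one 16x16 patch, 3 color channels
embedding = patch.reshape(-1)     # 16 * 16 * 3 = 768 numbers
print(embedding.shape)            # (768,)

# How many patches does the whole 224x224 image produce?
num_patches = (224 // 16) * (224 // 16)
print(num_patches)                # 196
```

Note that a real vision encoder would then pass each 768-number vector through learned layers; the flattening here just shows where the number 768 comes from.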
DeepSeek’s Secret Ingredient: DeepEncoder
DeepEncoder helps with:
- Reducing the number of vision tokens
- Keeping only the important parts
- Handling high-resolution images better
It works in 2 stages:
- SAM (Segment Anything Model) - SAM looks at which parts of the image matter the most.
- CLIP ViT - Once SAM has processed the image, CLIP ViT creates embeddings (understandable picture pieces), and DeepEncoder's compressor squeezes these pieces into far fewer tokens. Finally, the compressed vision tokens are sent to the DeepSeek-3B MoE decoder.
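The two-stage flow can be sketched as a toy pipeline. Everything here is illustrative: the function names, token counts, and the averaging-based compressor are assumptions for demonstration, not DeepSeek's real architecture, but they show the key idea that tokens get merged *before* the expensive global stage.

```python
import numpy as np

# Hedged sketch of the two-stage DeepEncoder idea:
#   stage 1: produce many patch tokens (SAM-style local processing)
#   compressor: merge tokens 16x before the costly global stage
#   stage 2: global encoding (CLIP-style) runs on far fewer tokens

def stage1_local_features(num_patches=4096, dim=768):
    # pretend the SAM stage turned the image into 4096 patch tokens
    return np.random.randn(num_patches, dim)

def compress_tokens(tokens, factor=16):
    # toy compressor: average every `factor` consecutive tokens
    n, d = tokens.shape
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

tokens = stage1_local_features()       # (4096, 768) - many cheap tokens
compressed = compress_tokens(tokens)   # (256, 768)  - few tokens for CLIP
print(tokens.shape, compressed.shape)
```

The design choice this illustrates: local attention over many tokens is cheap, global attention is expensive, so compressing in between keeps memory use low without throwing away the picture.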
What Is DeepSeek-3B MoE (Mixture of Experts)?
It is a decoder model that routes each job to the expert modules best suited for it. It has 3B total parameters, but only about 570M are active at a time, which makes it fast and efficient. This decoder reads the vision tokens and converts them back into text.
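Here is a toy sketch of how Mixture-of-Experts routing keeps most parameters inactive. The expert count, top-k value, and dimensions are made-up illustration values, not DeepSeek-3B's real configuration:

```python
import numpy as np

# Toy MoE routing (illustrative only): a router scores all experts for a
# token, and only the top-k experts actually run, so most of the model's
# parameters stay inactive on any given token.

num_experts, top_k, dim = 64, 6, 32
experts = [np.random.randn(dim, dim) for _ in range(num_experts)]
router = np.random.randn(dim, num_experts)

def moe_layer(token):
    scores = token @ router                    # one score per expert
    chosen = np.argsort(scores)[-top_k:]       # keep only the top-k experts
    weights = np.exp(scores[chosen] - scores[chosen].max())
    weights /= weights.sum()                   # softmax over chosen experts
    # only the chosen experts do any work
    return sum(w * (token @ experts[i]) for i, w in zip(chosen, weights))

out = moe_layer(np.random.randn(dim))
print(out.shape)  # (32,)
```

With 6 of 64 experts active per token, less than 10% of the expert parameters run on each step, which is the same reason only ~570M of DeepSeek-3B's parameters are active at a time.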
Different Modes - DeepSeek-OCR has different modes depending on how much detail is needed: the paper describes Tiny, Small, Base, and Large modes (using roughly 64 to 400 vision tokens per image), plus a dynamic "Gundam" mode for complex pages.
What could this unlock?
- AI systems with huge memory to remember more context
- Better knowledge storage
- Faster processing
- Cheaper AI costs
This could become a new type of AI memory.
Where Is It Available?
The research paper is released - https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
The code is on GitHub - https://github.com/deepseek-ai/DeepSeek-OCR
Happy Exploring! Happy Learning!