Understanding DeepSeek

Tags: GenAI · DeepSeek · OpenAI · AI · LLM
Shivam Kharje, Senior Software Engineer @ Infocusp
19 min read · 18 February 2025


Introduction

Contributors: Utkarsh Pandya, Sumanth Vullamparthi

"I have cities but no houses, mountains but no trees, and water but no fish. What am I?"

An AI model would likely get this right (it’s a map, by the way). But does solving riddles prove intelligence? If an AI beats humans on every standardized test, does that mean it truly understands language, reasoning, or the world?

The rapid advancement of generative AI has led to the emergence of increasingly sophisticated language models, with each new iteration outperforming its predecessors on widely accepted benchmarks. These standardized tests, designed to measure model proficiency in areas like reasoning, comprehension, and problem-solving, provide a structured way to compare different AI systems. At first glance, these benchmarks appear to be a reliable indicator of progress, helping researchers and users gauge which models perform best in various scenarios.

Yet, as AI models become more powerful, a fundamental question arises: do benchmark scores truly reflect a model’s real-world utility? Or have language models, much like students who train for standardized exams, become adept at optimizing for specific tests while lacking broader adaptability?

DeepSeek, a state-of-the-art large language model (LLM), has demonstrated impressive results on numerous benchmarks. But while high scores suggest competence, they don't necessarily translate to effectiveness in practical applications. The true measure of an AI model isn't just its ability to perform well on pre-defined tasks, but how well it adapts to real-world challenges, understands user intent, and generates meaningful, contextually appropriate responses.

In this blog, we take a critical look at DeepSeek's capabilities, going beyond traditional benchmarks to explore its practical strengths and limitations. We'll examine how it performs in real-world scenarios, comparing it with other models on factors like adaptability, reasoning, and usability.

The Limitations of Traditional Benchmarks

To understand why benchmark scores don’t always reflect real-world effectiveness, let’s break down a few of the key limitations of traditional evaluation methods.

1. Data Leakage and Inflated Scores: Publicly available benchmarks are often incorporated, knowingly or unknowingly, into the training datasets of modern LLMs. This results in a situation where models "memorize" answers rather than demonstrating genuine problem-solving abilities. A model scoring exceptionally well on a benchmark may simply be recalling information it has seen before, rather than showcasing real reasoning or comprehension skills (a minimal sketch of one way to probe for such contamination follows this list).

2. Static and Repetitive Testing: Most benchmarks are static, meaning they don’t evolve alongside AI advancements. Since models are continuously trained on the same set of evaluation tasks, developers often optimize performance for those specific benchmarks rather than improving the model’s general reasoning abilities. Over time, this leads to diminishing returns, as newer models keep refining performance on outdated tests without necessarily becoming better at handling real-world queries.

3. Lack of Task-Specific Relevance: Benchmarks are designed to be broad and generalizable, but they often fail to capture the nuanced needs of individual users or specific industries. A model might perform well on general language tasks but struggle with domain-specific applications such as legal analysis, medical diagnostics, or financial forecasting. This disconnect between benchmark success and practical application raises concerns about how much weight we should place on benchmark scores when evaluating AI models.

4. Bias and Subjectivity in Evaluations: AI-generated evaluations introduce an inherent risk of bias. When LLMs themselves are used to assess other models, they can amplify existing biases rather than providing neutral, objective evaluations. Furthermore, human evaluators involved in benchmark design may unintentionally introduce subjective biases, shaping models in ways that reflect limited perspectives rather than a truly comprehensive understanding of language and reasoning.
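As a rough illustration of point 1, one common proxy for contamination is checking whether long word n-grams from a benchmark item also appear in the training corpus. The sketch below is a simplified, hypothetical version of that idea; real pipelines operate on tokenized corpora at far larger scale, and the function names and toy data here are ours, not from any benchmark suite.

```python
# Minimal sketch of an n-gram overlap check, a common proxy for benchmark
# contamination. Toy data and helper names are illustrative only.

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the training docs."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / max(len(benchmark_items), 1)

# Toy usage: the riddle from the introduction "leaks" into the training corpus.
print(contamination_rate(
    ["I have cities but no houses, mountains but no trees, and water but no fish."],
    ["riddles dataset: I have cities but no houses, mountains but no trees, and water but no fish. A map."],
))  # 1.0 -> every benchmark item overlaps with the training data
```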

DeepSeek-R1 benchmark performance

DeepSeek-R1 demonstrates significant improvements over its predecessor, DeepSeek-V3, across a range of benchmarks, primarily due to large-scale reinforcement learning (RL). It excels in STEM-related questions on education-focused benchmarks like MMLU, MMLU-Pro, and GPQA Diamond. Its document analysis capabilities are evident in its strong performance on the long-context QA task, FRAMES.

DeepSeek-R1 also shows impressive results on IF-Eval (instruction following), AlpacaEval2.0 (writing), and ArenaHard (open-domain QA), highlighting the benefits of instruction-following data in SFT and RL training. Its concise summary lengths on these benchmarks suggest it avoids length bias. In math and coding, DeepSeek-R1 performs on par with or surpasses other models, particularly on reasoning-focused tasks like LiveCodeBench and Codeforces. While OpenAI-o1-1217 outperforms it on the engineering-oriented coding task Aider, their performance is comparable on SWE Verified. The authors anticipate further improvements in DeepSeek-R1's engineering coding abilities with more targeted RL training data. Overall, the large-scale RL has boosted DeepSeek-R1's reasoning and general performance across diverse domains.

What it means for DeepSeek

DeepSeek's strong benchmark results are impressive, but do they truly reflect its capabilities in real-world scenarios? To answer this, we put DeepSeek to the test across practical, mathematical, logical, and philosophical tasks, comparing its reasoning and adaptability against OpenAI's o3-mini.

Ultimately, the goal isn’t just to determine whether DeepSeek performs well on standardized tests—it’s to understand whether it can truly think, adapt, and solve problems in meaningful ways. In the next sections, we’ll dive deeper into these evaluations, exploring what makes a GenAI model truly exceptional beyond just its benchmark scores.

Mathematical evaluation

Experiment 1

Objective

As humans, we approach problems in different ways: some people immediately look for a formula that can solve the equation, while others first check whether a logical insight or intuition can crack the problem before reaching for the formula.
The purpose of this experiment is to analyze the reasoning process each model follows to arrive at the correct answer. The mathematical question presented has two possible approaches:

  1. Applying a trigonometric formula to derive the solution.
  2. Recognizing a key logical insight to reach the answer quickly.

Setup

Calculate the value of the following mathematical expression. Note that all the numerical values in the equation are values of angles in degrees.

cos 0 * cos 2 * cos 4 * cos 6 * cos 8 * ... * cos 100

Observations

| DeepSeek-R1 approach | OpenAI o3-mini approach |
| --- | --- |
| As a first step, R1 recognized the sequence: angles from 0 to 100, all multiples of 2. | o3-mini likewise recognized the sequence: angles from 0 to 100, all multiples of 2. |
| In the second reasoning step, R1 spotted the catch in the expression: cos 90 is 0, so the whole product must be 0. | In the second reasoning step, o3-mini searched for a trigonometric formula that could solve the expression. Its first answer was therefore a value very close to 0, since it tried to compute the product numerically rather than reason about it mathematically. |
| After solving it logically, R1 also reasoned that computing the product numerically would be a challenge, since cos 90 evaluated in floating point is not exactly 0 but a value very close to it. | After a few more reasoning steps, it saw the catch in the question: the cos 90 term makes the whole product 0. |
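The floating-point subtlety R1 flagged is easy to reproduce. The short sketch below (our own, not taken from either model's output) evaluates the product numerically: cos 90° computed via radians in double precision is roughly 6e-17 rather than exactly 0, yet the product still collapses to an effectively zero value, matching the analytical answer.

```python
import math

# Product cos(0°) * cos(2°) * ... * cos(100°), stepping by 2 degrees.
product = 1.0
for deg in range(0, 101, 2):
    product *= math.cos(math.radians(deg))

print(math.cos(math.radians(90)))  # ~6.12e-17: not exactly 0 in floating point
print(product)                     # a vanishingly small number: effectively 0,
                                   # because the cos(90°) factor zeroes the product
```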

Experiment 2

Objective

Understand how the models reason about a basic numerical comparison.

Setup

Which of these numbers is greater?  
   
10.1 or 10.11

Observations

| DeepSeek-R1 | OpenAI o3-mini |
| --- | --- |
| Gives the correct answer: 10.11 is greater than 10.1. | Gives the correct answer: 10.11 is greater than 10.1. |
| Offers multiple reasons and analogies to help the user see why 10.11 is greater, including a currency example: "Also, maybe they're thinking about money. Like, $10.10 versus $10.11. The latter is 10 cents versus 11 cents, so 10.11 is more. Using a real-life example could help solidify the concept." | Gives only a mathematical justification for why 10.11 is greater, with little further explanation of the reasoning. |
| R1 does not just treat the problem mathematically; it extends its reasoning into other domains to double-check that the answer is correct. | o3-mini does not reason in multiple directions; it stays concise and to the point. |
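For completeness, the place-value argument both models are making can be stated in one line. The snippet below is simply our own restatement of it, using Python's decimal module to keep the comparison exact.

```python
from decimal import Decimal

# 10.1 is 10.10, and 10.11 has one extra hundredth, so it is larger by 0.01.
print(Decimal("10.11") > Decimal("10.10"))   # True
print(Decimal("10.11") - Decimal("10.10"))   # 0.01
```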

Experiment 3

Objective

Check whether the models pick up on the audience's context and tailor their explanation of a mathematical concept accordingly.

Setup

Prompt 1:

I am a 4th grade student. Please explain to me how Pythagoras theorem works.

Prompt 2:

I am a Mathematics major. Please explain to me how Pythagoras theorem works.

Observations

| DeepSeek-R1 | OpenAI o3-mini |
| --- | --- |
| Understood the context of whom it was explaining the concept to. | Understood the context of whom it was explaining the concept to. |
| Used visualization to explain the concept to the 4th-grade student. | Used visualization to explain the concept to the 4th-grade student. |
| Used only geometric intuition to explain the theorem to the mathematics major, though it went into depth on the theorem's limitations. | Used geometric intuition, a proof by similarity, and coordinate geometry to explain the theorem to the mathematics major. |
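As a quick reference for the coordinate-geometry angle mentioned above (our own illustration, not taken from either model's answer), the theorem is exactly the 2-D distance formula:

```python
import math

# Pythagorean theorem via coordinate geometry: the distance from (0, 0) to
# (a, b) is sqrt(a^2 + b^2), i.e. the hypotenuse of a right triangle with legs a and b.
a, b = 3, 4
hypotenuse = math.hypot(a, b)                     # sqrt(3^2 + 4^2) = 5.0
print(hypotenuse)
print(math.isclose(a**2 + b**2, hypotenuse**2))   # True: a^2 + b^2 = c^2
```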

Weather event impact analysis

Objective

In this experiment, we evaluate both models on the task of analyzing a weather forecast and providing event-specific advice, focusing on their reasoning capabilities. The task involves determining how weather conditions impact user-scheduled events, suggesting precautions, and proposing alternative times in case of significant disruptions.

Setup

We provided both models with the following input:

  • Weather Forecast for 7 days: Includes details such as temperature, precipitation, wind speed, humidity, etc.
  • User Calendar Events for 7 days: A list of events, including the event name, time, and duration.

Both models are tasked with evaluating how the weather affects each event and generating a response that includes:

  1. An analysis of potential impacts based on weather conditions.
  2. Suggestions for precautionary measures (e.g., carrying an umbrella, adjusting plans).
  3. If necessary, recommendations for rescheduling based on weather conditions.

Prompt:

You are an intelligent personal meteorologist and planner, assisting users by analyzing how upcoming weather forecasts will impact their scheduled events.
You will receive:
1. Weather forecast data for the next 7 days, including temperature, precipitation, wind speed, humidity, and other relevant details.
2. The user's calendar events for the next 7 days, including the event name, time, and duration.
For each event, you must:
1. Analyze the weather conditions during that time slot.
2. Determine if the weather will significantly affect the event (e.g., heavy rain disrupting an outdoor activity, extreme temperatures affecting safety, strong winds making cycling difficult).
3. Provide a brief event update that explains any potential impact.
Suggest precautionary measures (e.g., carrying an umbrella, dressing appropriately, rescheduling) if necessary.
4. If the impact is severe, suggest an alternative time slot in the user's calendar that has more favorable weather conditions.

Weather forecast:
{weather_forecast}

Calendar events:
{calendar_events}
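For readers who want to reproduce this setup, here is a minimal sketch of how the same prompt could be sent to both models. It assumes the OpenAI Python SDK and DeepSeek's OpenAI-compatible endpoint; the keys, model names, and variable names are placeholders rather than our exact harness.

```python
# Minimal sketch (assumed setup): the same planner prompt sent to two
# OpenAI-compatible chat endpoints. Keys, model names, and data are placeholders.
from openai import OpenAI

PROMPT_TEMPLATE = """You are an intelligent personal meteorologist and planner...

Weather forecast:
{weather_forecast}

Calendar events:
{calendar_events}"""

def ask(client: OpenAI, model: str, weather_forecast: str, calendar_events: str) -> str:
    prompt = PROMPT_TEMPLATE.format(
        weather_forecast=weather_forecast, calendar_events=calendar_events
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# DeepSeek exposes an OpenAI-compatible API, so the same helper works for both
# models by switching base_url and model name.
deepseek = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com")
openai_client = OpenAI(api_key="<OPENAI_API_KEY>")
# ask(deepseek, "deepseek-reasoner", forecast_text, events_text)
# ask(openai_client, "o3-mini", forecast_text, events_text)
```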

Model Response

**Approach**

  • DeepSeek-R1: Took a structured approach to the problem. It methodically reviewed each event against the weather forecast and provided a thorough analysis based on the type of event and the severity of the weather. For each event, it evaluated whether the event would be impacted by factors like rain, wind, or temperature, then provided mitigation strategies or alternative times if required.
  • OpenAI o3-mini: Relied on a more streamlined approach to the event-weather impact analysis. The model identified the key weather conditions likely to affect outdoor activities (like rain and extreme temperatures) and provided simpler, actionable advice. Unlike DeepSeek-R1, it leaned on broad generalizations rather than diving deeply into each event's unique details.

**Example analysis**

  • DeepSeek-R1: Monday, Consulting Call at 09:00 AM: since this is a remote event, DeepSeek correctly identified that the sunny weather (30°C, 5 mph wind) would not impact it. Tuesday, Morning Walk at 09:00 AM: the model suggested hydration and rescheduling to an earlier time due to the high temperature (32°C). It also considered the possibility of discomfort, providing practical advice to keep the user safe.
  • OpenAI o3-mini: Monday, Consulting Call at 09:00 AM: o3-mini correctly identified that the sunny weather (30°C) would have no impact on the remote event. Tuesday, Morning Walk at 09:00 AM: the model recognized the high temperature (32°C) and suggested staying hydrated, but did not emphasize the possibility of discomfort as strongly as DeepSeek-R1.

**Accuracy of impact analysis**

  • DeepSeek-R1: Demonstrated a more nuanced understanding of weather impacts. It evaluated events based on precise weather conditions (e.g., light rain, thunderstorms, high wind speeds) and the nature of each event, which led to more personalized recommendations, especially for indoor vs. outdoor events.
  • OpenAI o3-mini: Provided accurate general advice but lacked depth in analyzing subtler weather patterns and the specific context of each event. It was less likely to suggest alternative time slots unless there was a significant weather issue, such as thunderstorms.

**Responsiveness and practical suggestions**

  • DeepSeek-R1: Offered detailed suggestions and alternatives, such as rescheduling events to avoid bad weather or adjusting clothing. However, it struggled with forecasting gaps and missing weather information.
  • OpenAI o3-mini: Was more efficient and concise in its advice but lacked the proactive, event-specific suggestions that DeepSeek-R1 provided. Its recommendations were more one-size-fits-all, which might not suit every event.

Premise order and logical reasoning

Experiment 1

Objective

In this experiment, we test the models' ability to answer a question when the premises are given out of order. The experiment has two parts:

  1. The first test checks whether the model can follow unordered mathematical facts to answer what is essentially a grade-school math word problem.
  2. The second test provides facts, a set of rules, and a hypothesis; the model must use the unordered premises/rules to reach a logical conclusion about whether the hypothesis holds. Models sometimes go out of their way to prove a hypothesis, so this test also checks whether the model hallucinates any rules or facts.

Setup

We provided the following information to the models:

  1. Logical ordered/unordered premises: each premise states the quantity of one kind of pet relative to another.

Prompts:

Ordered Premise:

Larry loves taking care of animals. He has 3 cats. He has 3 times as many dogs as cats. He has 2 fewer rabbits than dogs. **He has a fish tank with three times the number of fish as rabbits. He also has a collection of gerbils that's 1/3 the number of fish he has.** How many pets does Larry have?

Unordered Premise:

Larry loves taking care of animals. He has 3 cats. He has 3 times as many dogs as cats. He has 2 fewer rabbits than dogs. **He also has a collection of gerbils that's 1/3 the number of fish he has. He has a fish tank with three times the number of fish as rabbits.** How many pets does Larry have?
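For reference, resolving the premises in dependency order gives 47 pets; here is our own quick check:

```python
# Resolve the premises in dependency order (the facts are the same regardless
# of the order in which they are stated in the prompt).
cats = 3
dogs = 3 * cats        # 3 times as many dogs as cats    -> 9
rabbits = dogs - 2     # 2 fewer rabbits than dogs       -> 7
fish = 3 * rabbits     # 3 times as many fish as rabbits -> 21
gerbils = fish // 3    # 1/3 the number of fish          -> 7
print(cats + dogs + rabbits + fish + gerbils)  # 47
```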

Model Response

| Premise order | DeepSeek-R1 | OpenAI o3-mini |
| --- | --- | --- |
| Ordered premise | Does not list the number of cats first; it directly breaks down the relations between the quantities and answers correctly. | Starts by listing the number of cats, then breaks down the relations between the quantities and answers correctly. |
| Unordered premise | Does not list the number of cats first; it breaks down the relations in exactly the order in which the information connects, and answers correctly. | Starts by listing the number of cats, then breaks down the relations in the order in which the information connects, and answers correctly. |

Note: one can try a longer chain of premises to see how far each model can follow the unordered premise trail.

Experiment 2

We provided the following information to the models:

  1. Rules, facts, and a hypothesis: here, the premises comprise facts that serve as the starting point for proving the hypothesis, plus a set of rules the model can use, not necessarily all of them and not necessarily in the order provided.

Setup

Rules:
If rebe and chade, then jite.
If chade, then pan.
If riff and chade and chise, then pan.
If pan and whiss and jite, then bobe.
If rebe, then hev.
If pan and whiss, then chade.
If chefe, then jite.
If ap and pum, then chefe.
If ap, then vope.
If chade and vope, then chise.
If shime and chefe and bobe, then ap.
If chefe and chade, then pan.
If hev, then vope.
If jite, then chade.
If hev and whiss, then un.
If chade, then shime.
If chise, then chefe.
Facts:
Alice is whiss.
Alice is chefe.
Query: Is Alice vope? In each step, indicate whether you are using a fact, or a rule. End with a line with {"type": "answer", "content": "The answer is ", followed by True or False}
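Before looking at the models' traces, it helps to verify the ground truth ourselves. A small forward-chaining sketch (our own, not part of the prompt) confirms that Alice is indeed "vope":

```python
# Forward chaining over the rules above: keep firing rules until no new fact appears.
rules = [
    ({"rebe", "chade"}, "jite"), ({"chade"}, "pan"),
    ({"riff", "chade", "chise"}, "pan"), ({"pan", "whiss", "jite"}, "bobe"),
    ({"rebe"}, "hev"), ({"pan", "whiss"}, "chade"),
    ({"chefe"}, "jite"), ({"ap", "pum"}, "chefe"),
    ({"ap"}, "vope"), ({"chade", "vope"}, "chise"),
    ({"shime", "chefe", "bobe"}, "ap"), ({"chefe", "chade"}, "pan"),
    ({"hev"}, "vope"), ({"jite"}, "chade"),
    ({"hev", "whiss"}, "un"), ({"chade"}, "shime"),
    ({"chise"}, "chefe"),
]
facts = {"whiss", "chefe"}

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

# Derivation: chefe -> jite -> chade -> pan, shime; pan + whiss + jite -> bobe;
# shime + chefe + bobe -> ap; ap -> vope.
print("vope" in facts)  # True
```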

Observations

| DeepSeek-R1 | OpenAI o3-mini |
| --- | --- |
| Deduces step by step whether Alice can be "vope", but when stating rules it cannot use yet, adds reasoning such as "But we do not know if Alice is xyz yet, so let's deduce that first". It also notes whether each step uses a rule or a fact, and does not hallucinate any rule or fact. | Deduces step by step whether Alice can be "vope" and, as instructed, explains at each step whether it is using a rule, a fact, or something derived earlier. Does not hallucinate any rule or fact. |

Philosophical evaluation

Objective

The objective of this experiment was to evaluate how DeepSeek-R1 and OpenAI o3-mini approach open-ended philosophical questions. Since there is no right or definitive answer to a philosophical question, it is interesting to compare the two models' reasoning processes and how they handle complex, open-ended prompts.
The core question posed to the models explored whether art serves as a way for individuals to fill a void by seeking attention, and, for artists who avoid fame, what their objective might be. The question also touched on the nature of art itself: whether it is truly an act of creation or simply curation.

Setup

Do you think art in general is a way for people/Individuals to fill void by gaining attention, has there been an artist who never sought fame, if so what was his objective? Art itself claims to be a selfless act of creation, but is it really creation or mere curation?

Model Response

**Reasoning**

  • DeepSeek-R1: Approaches the question by first breaking down the concept of art as a tool for seeking attention. It acknowledges that many individuals create art as a form of validation or recognition, but offers a nuanced perspective by considering that not all artists are motivated by fame. The model cites artists such as Emily Dickinson and Henry Darger, who did not seek fame in their lifetimes; for them, DeepSeek suggests, the objective was personal expression, a deeply introspective and self-fulfilling act rather than a public pursuit of attention. It then elaborates on the selfless nature of creation versus curation, acknowledging that art can be both the creation of something entirely new and a curation of pre-existing influences, and emphasizes that motivation determines whether art is selfless: "If an artist creates solely for personal satisfaction, the act could be deemed selfless. However, if fame or external rewards drive the artist, the act is seen as less selfless."
  • OpenAI o3-mini: Takes a more general approach. It starts by acknowledging the common view that many artists seek attention and validation, aligning with the notion that art can fill a void, but it also questions this assumption by considering art as purely self-expressive. It suggests that some artists, like Vincent van Gogh, were primarily motivated by self-expression and by expressing their pain rather than by fame or recognition. o3-mini frames art as a blend of creation and curation: artists draw from past influences, yet a component of personal expression shapes the final product. It explores this balance through well-known artists like Claude Monet, who focused on their craft rather than fame, and emphasizes that art is a way to explore emotions, experiences, and life itself, making it more than a pursuit of external validation.

Conclusions

  • DeepSeek-R1's reasoning is more detailed, thorough, context-aware, and personalized, while OpenAI o3-mini's approach is more efficient and faster but somewhat lacking in depth and more generalized, as seen in the weather impact analysis and the mathematical evaluations.
  • DeepSeek-R1 leaned more on specific historical examples and framed the question of selflessness in terms of an artist's motivations, while o3-mini approached the question more holistically, focusing on the emotional and self-expressive aspects of art.
  • When stating its steps in deductive reasoning, o3-mini takes a more pragmatic and grounded approach, while DeepSeek-R1 produces more human-aligned reasoning statements, as seen in the premise order experiments.

These models' papers show that they are already on par with humans on many widely used benchmarks that have existed for a while. However, some newer benchmarks claim to be "contamination-free", i.e. free of overlap with these models' training data (e.g., LiveCodeBench). Until a standardized framework for contamination-free evaluation and training comes along, one needs to watch for such nuances when gauging a model's capabilities.

As LLMs are increasingly used as the engine powering AI agents, we also need to know how effectively they perform in that setting. It is therefore crucial to assess them not only on traditional benchmarks, but also on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory (e.g., Minerva), and on performing basic operations when inputs are structured into distinct blocks that simulate real-world data. This would help the community understand how LLMs behave when used as engines for AI agent infrastructure. It is essential to establish a comprehensive and dynamic benchmark that tests each capability in isolation.