Large Language Models: A Comprehensive Survey of its Applications, Challenges, Limitations, And Future Prospects (Updated 2025)
The Large Language Models Survey repository is a comprehensive compendium dedicated to the exploration and understanding of Large Language Models (LLMs). It houses an assortment of resources, including research papers, blog posts, tutorials, and code examples, that provide an in-depth look at the progression, methodologies, and applications of LLMs. This repo is an invaluable resource for AI researchers, data scientists, and enthusiasts interested in the advancements and inner workings of LLMs. We encourage contributions from the wider community to promote collaborative learning and continue pushing the boundaries of LLM research.
The year 2024 was transformative for the LLM landscape, with multiple breakthrough releases that established new benchmarks and capabilities:
OpenAI's Major Releases: GPT-4o, launched in May 2024, brought true multimodal capabilities with audio response times as low as 232 ms, while o1 and o1-mini in September introduced reasoning models that spend more time "thinking" through problems, scoring 83% on a qualifying exam for the International Mathematics Olympiad (AIME) compared to GPT-4o's 13%.
Anthropic's Claude 3 Family: The Claude 3 series (Haiku, Sonnet, Opus), launched in March 2024, comprised the first models to challenge GPT-4's dominance on leaderboards. Claude 3.5 Sonnet followed in June, and an upgraded Claude 3.5 Sonnet in October became particularly popular for coding tasks (Claude 3.7 Sonnet arrived later, in February 2025).
Google's Gemini Evolution: Gemini 1.5 Pro debuted in February 2024 with a context window that launched at 1M tokens and later expanded to 2M, followed by Gemini 1.5 Flash in May for faster, cheaper inference, and Gemini 2.0 Flash in December 2024.
Meta's Llama Progression: Llama 3 (8B, 70B) launched in April 2024, followed in July by the groundbreaking Llama 3.1 series, including the massive 405B-parameter model, the largest open-source model at the time. Llama 3.2 brought multimodal capabilities in September, and Llama 3.3 concluded the year in December.
Microsoft's Phi Revolution: Microsoft's Phi-3 family proved that smaller models could punch above their weight, with Phi-3 Mini (3.8B parameters) matching much larger models on benchmarks. The series expanded with Phi-3 Small (7B), Phi-3 Medium (14B), and Phi-3.5 Mini throughout 2024.
Enterprise-Focused Models: IBM Granite 3.0, launched in October 2024, focused on enterprise use cases, while Cohere's Command R and Command R+ models excelled in retrieval-augmented generation (RAG) tasks.
Google's Open Models: Gemma 2 (9B and 27B parameters), launched in June 2024, became highly popular in the open-source community, consistently ranking high in community evaluations.
The year 2025 has been marked by several breakthrough releases in the LLM landscape. Grok 3, launched by xAI in February 2025, introduced a 1 million token context window and achieved a record-breaking Elo score of 1402 in the Chatbot Arena, making it the first model to surpass the 1400 milestone. It was reportedly trained on 12.8 trillion tokens using roughly 10x the compute of its predecessor.
Meta's Llama 4 family represents a major leap forward with the introduction of a Mixture-of-Experts (MoE) architecture. Llama 4 Scout features an unprecedented 10 million token context window, while Llama 4 Maverick achieved an Elo score of 1417 on the LMSYS Chatbot Arena, outperforming GPT-4o and Gemini 2.0 Flash.
DeepSeek-R1 emerged as the first major open-source reasoning model. Its precursor, DeepSeek-R1-Zero, was trained purely through reinforcement learning without supervised fine-tuning, while R1 itself adds only a small amount of cold-start data before RL. The model demonstrates performance comparable to OpenAI's o1 across math, code, and reasoning tasks while being completely open-source under the MIT license.
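To make that recipe concrete, below is a minimal, illustrative sketch of the rule-based reward and group-relative advantage computation described in the DeepSeek-R1 report (GRPO-style). The function names, reward weights, and the `<think>`/`\boxed{}` answer format here are assumptions for illustration, not DeepSeek's actual implementation.

```python
# Illustrative sketch only: rule-based rewards plus GRPO-style
# group-relative advantages, loosely following the R1-Zero recipe.
import re
import statistics

def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Reward = format bonus + answer correctness; no learned reward model."""
    reward = 0.0
    # Format check (assumed): reasoning wrapped in <think>...</think> tags.
    if re.search(r"<think>.*</think>", completion, flags=re.DOTALL):
        reward += 0.1
    # Correctness check (assumed): final \boxed{...} answer vs. the reference.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == reference_answer.strip():
        reward += 1.0
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sample group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled completions for one prompt, two of them correct.
group = [
    "<think>19 is prime...</think> \\boxed{8}",
    "<think>guess</think> \\boxed{9}",
    "\\boxed{8}",
    "no answer given",
]
rewards = [rule_based_reward(c, "8") for c in group]
print(rewards)                             # [1.1, 0.1, 1.0, 0.0]
print(group_relative_advantages(rewards))  # positive for correct completions
```

The key design choice is that rewards come from simple verifiable rules (answer matching and format checks) rather than a learned reward model, which is what makes the pure-RL approach tractable.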
Qwen 3, released by Alibaba in April 2025, features a family of "hybrid" reasoning models ranging from 0.6B to 235B parameters, supporting 119 languages and trained on over 36 trillion tokens. The models seamlessly integrate thinking and non-thinking modes, offering users flexibility to control the thinking budget.
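As a usage sketch, the Qwen 3 model cards show thinking mode being toggled through the chat template in Hugging Face transformers; the model ID and generation settings below are illustrative defaults.

```python
# Sketch: switching Qwen 3 between thinking and non-thinking modes via
# the chat template's enable_thinking flag (per the Qwen 3 model cards).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-0.6B"  # smallest member of the family
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "How many primes are below 20?"}]

# enable_thinking=True lets the model emit a <think>...</think> block
# before its answer; set it to False for a direct reply.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Qwen 3 also documents a soft switch: appending /think or /no_think to a user message overrides the mode for that turn.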
OpenAI continued its reasoning model series with o3 and o4-mini in April 2025, while Anthropic launched Claude 4 (Opus 4 and Sonnet 4) in May 2025, setting new standards for coding and advanced reasoning with extended thinking capabilities and tool use.
Google's Gemini 2.5 Pro debuted as a thinking model with a 1 million token context window, leading on LMArena leaderboards and excelling in coding, math, and multimodal understanding tasks.
- Reasoning Models: The emergence of models that "think" through problems step by step, with extended reasoning capabilities becoming standard.
- Massive Context Windows: Models now support context windows ranging from 1M to 10M tokens, enabling processing of entire codebases and documents.
- Mixture-of-Experts (MoE) Architecture: More efficient model architectures that activate only a subset of parameters for each token during inference (see the routing sketch after this list).
- Open-Source Reasoning: DeepSeek-R1's success has democratized access to reasoning capabilities previously available only in proprietary models.
- Multimodal Integration: Native multimodality becoming standard, with models trained on text, images, audio, and video from the ground up.
- Tool Use and Agentic Capabilities: Enhanced ability to use tools, execute code, and perform complex multi-step tasks autonomously.
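To make the MoE idea concrete, here is a minimal top-k routing sketch in PyTorch. The dimensions, expert count, and top-2 routing are illustrative defaults, not the configuration of any specific model.

```python
# Minimal sketch of top-k Mixture-of-Experts routing: a gating network
# scores all experts per token, but only the top-k experts actually run,
# so most parameters stay inactive on any given forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        top_w, top_i = scores.topk(self.k, dim=-1)  # per-token expert choice
        top_w = F.softmax(top_w, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, slot] == e          # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

The efficiency win is that each token pays the compute cost of only k experts (here 2 of 8), while total parameter count, and hence model capacity, scales with all experts.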
Mathematical reasoning (AIME 2025, as reported by each model's provider):
- Grok 3: 93.3%
- DeepSeek-R1-0528: 87.5%
- Gemini 2.5 Pro: 86.7%
- o3-mini: 86.5%
Coding (SWE-bench Verified, as reported):
- Claude Opus 4: 72.5%
- Claude Sonnet 4: 72.7%
- OpenAI Codex 1: 72.1%
- Llama 4 Maverick: ~70%
Largest context windows:
- Llama 4 Scout: 10M tokens
- Grok 3: 1M tokens
- Gemini 2.5 Pro: 1M tokens
- Llama 4 Maverick: 1M tokens
2022:
- ChatGPT revolutionized conversational AI
- InstructGPT introduced instruction following
- Large proprietary models dominated (GPT-3, PaLM, Chinchilla)

2023:
- LLaMA sparked the open-source revolution
- Claude introduced constitutional AI
- Specialized coding models emerged (Code Llama, StarCoder)
- Model sizes optimized for efficiency (Phi, Mistral)

2024:
- GPT-4o achieved true multimodality
- o1 introduced step-by-step reasoning
- Claude 3 challenged GPT-4's dominance
- Llama 3.1 405B became the largest open model
- Gemini 1.5 pushed context limits to 2M tokens

2025:
- Grok 3 achieved record Arena scores at launch
- DeepSeek-R1 democratized reasoning capabilities
- Llama 4 introduced 10M-token contexts
- Claude 4 set new coding standards
- Qwen 3 pioneered hybrid reasoning modes
If you find our survey useful for your research, please cite the following paper:
@article{hadi2024large,
  title={Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects},
  author={Hadi, Muhammad Usman and Al Tashi, Qasem and Shah, Abbas and Qureshi, Rizwan and Muneer, Amgad and Irfan, Muhammad and Zafar, Anas and Shaikh, Muhammad Bilal and Akhtar, Naveed and Wu, Jia and others},
  journal={Authorea Preprints},
  year={2024},
  publisher={Authorea}
}
🔴 Proprietary Models:
- OpenAI: GPT-4, GPT-4.5, GPT-4o, o1, o3, o4-mini, ChatGPT, InstructGPT
- Anthropic: Claude 3 Family, Claude 3.5, Claude 3.7, Claude 4, Anthropic LM
- Google/DeepMind: Gemini 2.5, Gemini 2.0, Gemini 1.5, PaLM 2, Bard, T5, UL2, Chinchilla, Sparrow, Gopher, GLaM, Minerva
- xAI: Grok 3, Grok 3 Mini
- AI21 Labs: Jurassic-1, Jurassic-2
- Mistral AI: Mistral 7B, Mistral Large 2, Mistral Medium
🟢 Open Source Models:
- Meta: Llama 4, Llama 3.x, Llama 2, OPT, Code Llama, Galactica
- Alibaba: Qwen 3, Qwen 2.5, QwQ-32B
- DeepSeek: DeepSeek-R1, DeepSeek-V3
- Microsoft: Phi-3 Family, Phi-2
- IBM: Granite 3.0, Granite 3.1
- Google: Gemma 2
- Cohere: Command R, Command R+
- BigScience: BLOOM
- EleutherAI: GPT-J, GPT-NeoX, Pythia
- BigCode: StarCoder, StarChat, SantaCoder
- Salesforce: CodeGen2, CodeT5+, XGen
- TIIUAE: Falcon
- Upstage: SOLAR
🎓 Academic/Research:
- LMSYS: Vicuna, FastChat-T5
- Stanford: Alpaca
- UC Berkeley: Koala
- LAION: Open Assistant
- OpenLM Research: OpenLLaMA
- MLFoundations: OpenLM
🏢 Other Companies:
- Yandex: YaLM
- Replit: Replit Code
- H2O.ai: h2oGPT
- Databricks: Dolly
- Together: RedPajama-INCITE
- MosaicML: MPT Family
- Stability AI: StableLM
- Nous Research: OpenHermes
- Cerebras: Cerebras-GPT
- Deci AI: DeciCoder
- AI Squared: DLite
- BlinkDL: RWKV
🧠 Reasoning Models (2024-2025):
- OpenAI: o1, o1-mini, o3, o3-mini, o4-mini
- DeepSeek: DeepSeek-R1 Family
- Alibaba: QwQ-32B, Qwen 3 (hybrid reasoning)
- Google: Gemini 2.5 (thinking models)
💬 Conversational Models:
- OpenAI: ChatGPT, GPT-4o
- Anthropic: Claude 3/4 Family
- Google: Bard, Gemini
- xAI: Grok 3
💻 Code-Specialized:
- Meta: Code Llama
- BigCode: StarCoder, SantaCoder
- Salesforce: CodeGen2, CodeT5+
- Replit: Replit Code
- Deci AI: DeciCoder
🌐 Multimodal:
- OpenAI: GPT-4o
- Google: Gemini 2.0/2.5
- Meta: Llama 4, Llama 3.2
⚡ Efficient/Small:
- Microsoft: Phi-3 Family, Phi-2
- Google: Gemma 2
- AI Squared: DLite
- Upstage: SOLAR
Last updated: July 2025
Original paper: https://www.techrxiv.org/doi/full/10.36227/techrxiv.23589741.v3