Latest Papers
Llama 2: Open Foundation and Fine-Tuned Chat Models
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.
Low energy resolvent expansions in dimension two
The behavior of the resolvent at low energies has implications for many kinds of asymptotics, including for the scattering matrix and phase, for the Dirichlet-to-Neumann map, and for wave evolution. In this paper we present a robust method, based in part on resolvent identity arguments following Vodev and boundary pairing arguments following Melrose, for deriving such expansions, and implement it in detail for compactly supported perturbations of the Laplacian on $\mathbb R^2$. We obtain precise results for general self-adjoint black box perturbations, in the sense of Sjöstrand–Zworski, and also for some non-self-adjoint ones. The most important terms are the most singular ones, and we compute them in detail, relating them to spaces of zero eigenvalues and resonances.
Anomalous diffusion via iterative quantitative homogenization: an overview of the main ideas
Anomalous diffusion is the fundamental ansatz of phenomenological theories of passive scalar turbulence, and has been confirmed numerically and experimentally to an extraordinary extent. The purpose of this survey is to discuss our recent result, in which we construct a class of incompressible vector fields that have many of the properties observed in a fully turbulent velocity field, and for which the associated scalar advection-diffusion equation generically displays anomalous diffusion. Our main contribution is to propose an analytical framework in which to study anomalous diffusion via a backward cascade of renormalized eddy viscosities.
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
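The attention operation the abstract builds on is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that single formula; the toy shapes and inputs are illustrative, and this is not the paper's reference implementation (which adds multi-head projections, masking, positional encodings, and more).

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)      # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                    # weighted mix of value vectors

# Toy example: 4 positions, model width 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)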
Dynamic Routing Between Capsules
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.
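A minimal sketch of the routing-by-agreement loop described above, under illustrative dimensions: routing logits are softmaxed into coupling coefficients, each higher-level capsule's output is a squashed weighted sum of predictions, and a logit grows when a prediction's scalar product with that output is large. Function and variable names here are ours, not the paper's, and the toy sizes are assumptions.

import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink vectors so their length lies in [0, 1) and can act as an existence probability."""
    norm2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + eps)

def route(u_hat, iterations=3):
    """u_hat: predictions from lower capsules, shape (num_lower, num_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                            # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                  # weighted sum per upper capsule
        v = squash(s)                                           # upper-capsule output vectors
        b += (u_hat * v[None]).sum(axis=-1)                     # agreement = scalar product
    return v

rng = np.random.default_rng(1)
v = route(rng.normal(size=(6, 3, 8)) * 0.1)
print(np.linalg.norm(v, axis=-1))    # lengths in [0, 1), read as existence probabilities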
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits or voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO) that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
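The group-relative step that distinguishes GRPO from PPO, as summarized above, is computing advantages from a group of sampled completions for the same prompt rather than from a learned value network. A minimal sketch of that normalization follows; names and shapes are illustrative, and the full objective also involves the PPO-style clipped probability ratio and a KL penalty.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: shape (num_prompts, group_size) -- one row per prompt,
    one reward per sampled completion.  Each reward is normalized against
    its own group's mean and standard deviation, so no value network is needed."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each (e.g. 1.0 = correct answer).
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.2, 0.9, 0.4, 0.5]])
print(group_relative_advantages(rewards))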
The heat kernel on curvilinear polygonal domains in surfaces
We construct the heat kernel on curvilinear polygonal domains in arbitrary surfaces for Dirichlet, Neumann, and Robin boundary conditions as well as mixed problems, including those of Zaremba type. We compute the short time asymptotic expansion of the heat trace and apply this expansion to demonstrate a collection of results showing that corners are spectral invariants.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
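The preprocessing implied by the title is splitting an image into fixed-size patches, flattening them, and linearly projecting each one into a token embedding for the transformer. A minimal sketch with an assumed 16x16 patch size and a random projection standing in for the learned one (class token and position embeddings omitted):

import numpy as np

def image_to_patch_tokens(img, patch=16, d_model=64, rng=np.random.default_rng(0)):
    """img: (H, W, C) array with H and W divisible by `patch`.
    Returns (num_patches, d_model) token embeddings."""
    H, W, C = img.shape
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))        # flatten each patch
    W_proj = rng.normal(scale=0.02, size=(patch * patch * C, d_model))
    return patches @ W_proj                               # linear projection to tokens

tokens = image_to_patch_tokens(np.zeros((224, 224, 3)))
print(tokens.shape)    # (196, 64): a 14x14 grid of patch tokens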
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
Language Models are Few-Shot Learners
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
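Few-shot use "purely via text interaction" amounts to concatenating an instruction, a handful of solved demonstrations, and the new query into one prompt, with no gradient updates; the model is simply asked to continue the text. A small illustrative prompt builder is sketched below; the format and the unscrambling task are assumptions for illustration, not the paper's exact evaluation template.

def few_shot_prompt(instruction, demonstrations, query):
    """Build a few-shot prompt: instruction, then solved examples, then the new input."""
    lines = [instruction, ""]
    for x, y in demonstrations:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Unscramble the letters to form an English word.",
    [("pplea", "apple"), ("nanaba", "banana")],
    "rgaep",
)
print(prompt)    # the language model would be asked to continue this text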
A Review of DeepSeek Models' Key Innovative Techniques
DeepSeek-V3 and DeepSeek-R1 are leading open-source Large Language Models (LLMs) for general-purpose tasks and reasoning, achieving performance comparable to state-of-the-art closed-source models from companies like OpenAI and Anthropic, while requiring only a fraction of their training costs. Understanding the key innovative techniques behind DeepSeek's success is crucial for advancing LLM research. In this paper, we review the core techniques driving the remarkable effectiveness and efficiency of these models, including refinements to the transformer architecture; innovations such as Multi-Head Latent Attention and Mixture of Experts; Multi-Token Prediction; the co-design of algorithms, frameworks, and hardware; the Group Relative Policy Optimization algorithm; and post-training with pure reinforcement learning and iterative training that alternates between supervised fine-tuning and reinforcement learning. Additionally, we identify several open questions and highlight potential research opportunities in this rapidly advancing field.
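Of the techniques listed, Mixture of Experts routing is the easiest to illustrate compactly: a router scores each token, only the top-k experts are evaluated, and their outputs are combined with renormalized gate weights. The sketch below shows generic top-k gating only; it is not DeepSeek's specific MoE design, and all sizes and names are illustrative.

import numpy as np

def moe_layer(x, experts, router_W, k=2):
    """x: (d,) token vector; experts: list of callables; router_W: (d, num_experts)."""
    logits = x @ router_W
    top = np.argsort(logits)[-k:]                 # indices of the k highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                  # renormalized gate weights
    return sum(wi * experts[i](x) for wi, i in zip(w, top))   # only k experts are run

rng = np.random.default_rng(2)
d, n_exp = 16, 4
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W) for _ in range(n_exp)]
print(moe_layer(rng.normal(size=d), experts, rng.normal(size=(d, n_exp))).shape)   # (16,)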
A Nash-Kuiper theorem for isometric immersions beyond Borisov's exponent
For any given short immersion from an $n$-dimensional bounded domain into $(n+1)$-dimensional Euclidean space and any Hölder exponent $\alpha<(1+n^2-n)^{-1}$, we construct a $C^{1, \alpha}$ isometric immersion within any $C^0$ neighbourhood of the given short immersion using convex integration. This refines the classical Nash–Kuiper theorem and extends the flexibility of $C^{1, \alpha}$ isometric immersions beyond Borisov's exponent. In particular, the regularity threshold aligns with the Onsager exponent $\frac13$ for the incompressible Euler equations in the case $n=2$. The convex integration scheme relies on a new corrugation ansatz, which allows the cancellation of leading-order error terms by a novel "integration by parts" technique.
Mask R-CNN
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron
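The parallel mask branch described above ships ready-made in torchvision; a minimal usage sketch follows. This is torchvision's re-implementation rather than the Detectron code linked in the abstract, weights here are left random to keep the example download-free, and with random weights the number of detections returned is arbitrary.

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN = Faster R-CNN (box branch) + a parallel per-instance mask branch.
model = maskrcnn_resnet50_fpn(num_classes=91)    # COCO-style class count; random weights
model.eval()

with torch.no_grad():
    image = torch.rand(3, 480, 640)              # dummy RGB image with values in [0, 1]
    (pred,) = model([image])                     # list of images in, one dict per image out

# Boxes come from the detection branch, masks from the added mask branch.
print(pred["boxes"].shape, pred["masks"].shape)  # (N, 4) and (N, 1, 480, 640)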
Asymptotic expansions for semilinear waves on asymptotically flat spacetimes
We establish precise asymptotic expansions for solutions to semilinear wave equations with power-type nonlinearities on asymptotically flat spacetimes. For cubic nonlinearities $a(t,x)\phi^3$, we prove $\phi(t, x) = 2c t^{-2} + O(t^{-3+})$ in compact spatial regions, with $c$ computable. For $a(t,x)\phi^p$ with $p \geq 4$, we show $\phi(t, x) = d t^{-3} + O(t^{-4+})$, extending Price's law to the nonlinear setting. Our approach combines radiation field analysis with a generalized low-energy resolvent expansion, providing a bridge between spectral and physical space methods. These results sharpen previous decay estimates and yield complete asymptotics across the entire spacetime, including black hole backgrounds.
The Erdős discrepancy problem
We show that for any sequence $f: {\bf N} \to \{-1,+1\}$, the discrepancy $$ \sup_{n,d \in {\bf N}} \left|\sum_{j=1}^n f(jd)\right| $$ of $f$ is infinite. This answers a question of Erdős. In fact the argument also applies to sequences $f$ taking values in the unit sphere of a real or complex Hilbert space. The argument uses three ingredients. The first is a Fourier-analytic reduction, obtained as part of the Polymath5 project on this problem, which reduces the problem to the case when $f$ is replaced by a (stochastic) completely multiplicative function ${\bf g}$. The second is a logarithmically averaged version of the Elliott conjecture, established recently by the author, which effectively reduces to the case when ${\bf g}$ usually pretends to be a modulated Dirichlet character. The final ingredient is (an extension of) a further argument obtained by the Polymath5 project which shows unbounded discrepancy in this case.
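The displayed quantity is easy to probe on a finite truncation: for every homogeneous arithmetic progression d, 2d, ..., nd with nd ≤ N, take the largest absolute partial sum of f along it. The theorem says this finite maximum is unbounded as N → ∞ for every ±1 sequence; the sketch below just evaluates it for an illustrative pseudorandom sequence (any sequence could be substituted).

import numpy as np

def finite_discrepancy(f, N):
    """max over n, d with n*d <= N of |sum_{j=1}^n f(j*d)| for a +-1 sequence f(1..N)."""
    best = 0
    for d in range(1, N + 1):
        partial = 0
        for j in range(1, N // d + 1):      # walk the progression d, 2d, 3d, ...
            partial += f(j * d)
            best = max(best, abs(partial))
    return best

# Illustrative sequence: pseudorandom +-1 values.
rng = np.random.default_rng(3)
signs = rng.choice([-1, 1], size=10_001)
for N in (100, 1_000, 10_000):
    print(N, finite_discrepancy(lambda n: int(signs[n]), N))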