Ofir Press

How to Build Good Language Modeling Benchmarks

2024-08-07T00:00:00+00:00

Building benchmarks is important because they shine a spotlight on the weaknesses of existing language models and so can guide the community on how to improve them.

I’ve spent a lot of my career both on building benchmarks and on building systems that push forward the state-of-the-art on a given benchmark, and I believe that building good benchmarks is just as important as building new systems.

Designing a good benchmark is challenging and I’ve spent a lot of time recently thinking about what makes for a good benchmark. I’ve distilled it down to three main properties:

1. Natural:

Try to build a benchmark that has natural questions that some category of humans ask on a frequent basis. For example, the questions in our SWE-bench are made up of real bugs that users reported in popular GitHub repos. The task is to take the reported bug and the repo (as it was at the time the bug was reported) and try to fix the bug. That’s a very natural task that many people do on a daily basis (and even get paid for). Other natural tasks that we’ve recently turned into benchmarks include answering questions such as “What yoga studio near me has vinyasa classes before 8 AM on weekdays?” (see AssistantBench) and “Which paper first showed that transformer language models can’t extrapolate to long sequences?” (see CiteME).

Sometimes I see new benchmarks come out that don’t fulfill the naturalness criterion, and they often have a hard time generating excitement in the community. I don’t find benchmarks that contain IQ test-like questions, where you have to identify patterns in diagrams, very exciting. The same goes for ‘common sense’-like benchmarks that have questions like ‘Bob threw an egg at Alice’s face. Is Alice happy, sad, or ambivalent?’ These types of benchmarks might have been interesting in the past, when our LMs were still struggling with basic tasks, but now that our language models are becoming more capable, we need to challenge them with tougher and more realistic tasks.

Another way to think about whether a benchmark fulfills the naturalness criterion is to evaluate whether it fulfills what I term the usefulness criterion: would a system that got better-than-baseline accuracy on this benchmark be useful to humans? Would it make anyone more productive? A system that can autonomously fix bugs would save lots of time for developers, even if it only managed to fix the easiest ten percent of bugs. A system that can quickly find me a yoga class that meets my needs would save me time.

I’ve also noticed that there are two simple indicators for a benchmark being unnatural, and so I try to avoid building benchmarks that have these properties:

A. The question set-up is unrealistic: For example, if a benchmark contains multiple choice questions, I believe it is unnatural. When I go to the doctor, I never say “Doctor doctor, my elbow hurts, and it is definitely happening because of one of these four options…”. Always think of your question set-up and if it seems unrealistic, try to modify it.

B. The questions are made up and not taken from actual questions asked by actual humans: If you work for Google and you’re tasked with building a challenging question-answering benchmark, a really suboptimal thing to do would be to sit around by yourself in a room and just try to think of questions. You’d probably come up with weird questions that no real user would ever ask. A really smart thing to do would be to look at the Google Search logs and try to filter them to find questions that users entered and did not find a good answer to (for example, this might be indicated by the user going to the second results page or spending more than five minutes on the initial results page).
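The log-filtering idea above could look roughly like the sketch below. The record fields (`max_result_page`, `seconds_on_first_page`), the helper names and the example queries are all made up for illustration and are not taken from any real search log format; the two heuristics are the ones mentioned in the paragraph above.

```python
from dataclasses import dataclass


@dataclass
class SearchLogEntry:
    # Hypothetical log record; real logs will look different.
    query: str
    max_result_page: int          # deepest results page opened (1 = first page)
    seconds_on_first_page: float  # time spent on the initial results page


def looks_unanswered(entry, dwell_threshold_s=300.0):
    """Heuristic: the user paged past the first results page, or lingered on it."""
    return entry.max_result_page >= 2 or entry.seconds_on_first_page > dwell_threshold_s


def candidate_questions(log):
    """Keep only the queries that look like they did not get a good answer."""
    return [entry.query for entry in log if looks_unanswered(entry)]


example_log = [
    SearchLogEntry("weather tomorrow", 1, 8.0),
    SearchLogEntry("yoga studio with vinyasa classes before 8 AM", 3, 40.0),
    SearchLogEntry("first paper showing transformers can't extrapolate", 1, 420.0),
]
print(candidate_questions(example_log))
```

A filter like this only produces candidates; someone would still have to read them and write down reference answers.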

SWE-bench contains real bug reports filed by real users on real GitHub repos. I think this makes the benchmark much more exciting to the community. Using questions that actual users asked implies that by building systems that get higher scores on the benchmark, we would be fulfilling a real-world need.

2. Automatically Evaluatable:

In a benchmark, given a model-generated answer to a question, we need to determine if the model was right or wrong. Sometimes this is easy, but depending on the question type, this could be hard or impossible. Validating the correctness of code could be a challenging task, since there are many different ways to program a given function, and that’s why benchmarks such as HumanEval and SWE-bench use unit tests to automatically validate code.
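As a rough sketch of why unit tests help here: two differently written solutions can both be accepted, and a wrong one rejected, without ever comparing source code. The function name `passes_unit_tests` and the toy `add` task are assumptions for illustration. This is not how HumanEval or SWE-bench are implemented, and a real harness would run untrusted model code in a sandbox rather than with `exec`.

```python
def passes_unit_tests(candidate_source, function_name, test_cases):
    """Return True if the candidate defines function_name and passes every test case."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # illustration only: never do this unsandboxed
        func = namespace[function_name]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions and crashes all count as failures.
        return False


tests = [((1, 2), 3), ((-1, 1), 0)]
version_a = "def add(a, b):\n    return a + b\n"
version_b = "def add(a, b):\n    return sum([a, b])\n"
buggy = "def add(a, b):\n    return a - b\n"

for name, source in [("version_a", version_a), ("version_b", version_b), ("buggy", buggy)]:
    print(name, passes_unit_tests(source, "add", tests))
```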

Summarization is a task that I think could be super useful for humans (“Write a 500 word summary of this patient’s medical file”) but we’ve seen very little development of new benchmarks in this space because evaluation is just so hard. There are many different ways to correctly summarize a given text, which makes it hard to automatically decide whether any particular summary is right. Some have proposed using an LM to evaluate LM outputs but I don’t think that’s the right way to go. We should either use an LM to solve a task, or use it to judge outputs, but if we use it as both the solver and the evaluator that leads to problems.

3. Challenging

If you launch an automatically evaluatable and natural benchmark, but the accuracy of the best LM at launch is 80%, people will see your benchmark as already being solved and won’t want to try and build models to improve performance on it. I think making your benchmark challenging is critical. I think that at launch, a good benchmark should have the top LMs achieving between 1% and 35% accuracy on it.

Edit, January 2025: Due to the extremely fast development of LMs these days, I currently recommend that benchmark builders launch their benchmarks with the top accuracy being between 0.1% and 9%. Anything higher probably means that the benchmark is too easy.

Edit, May 2025: I had to make another edit. Due to the speed of development of AI, I’m now asking my collaborators not to think of benchmarks that would have AI systems achieving 0% at launch, but to think of benchmarks that would have systems achieving “-200%” at launch. Find questions that are so hard that even if the models improve 3x they’ll still get zero. Just building a benchmark where models get 0% today might not be enough anymore. You have to look at how the models have been improving over the past 3-6 months, try to predict where they’ll be in 6-12 months and build benchmarks that would not only make current models fail, but benchmarks that would make the models of next year fail as well. Anything easier than that might get saturated much more quickly than you expect.

If you find a benchmark idea and build it out and it’s natural and automatically evaluatable but you build a baseline and it gets 70% right, one thing you might want to consider doing is to use that baseline to filter out the easier instances in the benchmark. For example, our Bamboogle benchmark had tough-to-answer 2-hop questions, and we built the dataset by filtering out all questions that Google Search answered correctly. For CiteME, we filtered out all questions that GPT-4o managed to answer correctly in a prompting-only setting (i.e. non-agentic). I think that building benchmarks by finding tasks that a strong existing approach can’t solve is a great way to go.
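In code, this kind of baseline filtering is just a loop over the candidate instances. The sketch below is a minimal illustration and not the actual Bamboogle or CiteME pipeline; `baseline_answer` stands in for whatever strong system you filter with, and the exact-match check and toy questions are assumptions.

```python
def filter_hard_instances(instances, baseline_answer, is_correct):
    """Keep only the instances that the baseline system gets wrong."""
    kept = []
    for example in instances:
        prediction = baseline_answer(example["question"])
        if not is_correct(prediction, example["answer"]):
            kept.append(example)
    return kept


def exact_match(prediction, reference):
    # Deliberately simple correctness check, for illustration only.
    return prediction.strip().lower() == reference.strip().lower()


# Toy stand-in for a strong baseline: a lookup table of things it "knows".
baseline_knowledge = {"What is the capital of France?": "Paris"}


def toy_baseline(question):
    return baseline_knowledge.get(question, "I don't know")


candidates = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "A made-up 2-hop question the baseline cannot answer", "answer": "42"},
]
hard_subset = filter_hard_instances(candidates, toy_baseline, exact_match)
print(len(candidates), "->", len(hard_subset))
```

By construction the filtering baseline scores 0% on what remains, so it is worth also reporting results for systems that were not used for filtering.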

Beware: researchers are humans and humans have emotions. If, at launch, the top model’s accuracy is less than 10%, that might seem very intimidating for most researchers, and they might not want to work on your benchmark at all. Try to plan for that. For example, when we launched SWE-bench the top model’s accuracy was 1.96%. Almost everyone I talked to at the time was intimidated by this and didn’t want to approach it. I wasn’t worried, because we immediately started working on SWE-agent after releasing SWE-bench. I remember telling the team that if we got anywhere near 10% accuracy, the community would see that SWE-bench isn’t as impossible as it seemed, and that would get the ball rolling. Eventually we launched SWE-agent at around 13% accuracy and soon afterwards a barrage of other models appeared, each getting better accuracy than the previous one.

Bonus Property:

Building a benchmark that would be hard to leak into the training data is something that I think about all the time. Could we build a benchmark such that even if the benchmark itself leaks into an LM’s training data, it won’t really help that LM in getting a good score on the benchmark? In SciCode, we had PhDs write very tough programming challenges related to their field of study. Each instance in the dataset is a description of a function and the unit tests to validate whether the model programmed it correctly or not. We intentionally do not release any of the answers to these programming challenges, to make sure these answers are never inserted into any LM’s training data. This way, even if our benchmark fully leaks into an LM’s training set, it still won’t be able to produce the right answers to the questions. Achieving this property is extremely difficult, and so it’s not something I try to do with every single benchmark I build.

Other guidelines:

Have one number for your benchmark. One metric that people go for. “We get 87% on HumanEval” is the vibe you are going for. Don’t have three metrics, like accuracy, precision, and recall, have just one. Don’t divide accuracy by category, have just an overall accuracy. This is really important. You want to make using your benchmark as easy as possible. You want people to get it right away. If you start having seventeen metrics and nineteen categories it’s going to be complicated for people to understand what you’re trying to do, and that will lower the chances of your benchmark catching on.

When you write the analysis section in your paper about your benchmark, it’s totally fine to present other metrics for each model, or to break down performance by category, but you should only do that there, and not have the categories or other metrics when you generally talk about the benchmark.

When you write a paper, always include very strong baselines, both based on strong proprietary models and on leading open source models. You should never try to make your benchmark look more impressive than it actually is by including only weak baselines, or baseline systems that use old or outdated models like GPT 3.5 or Llama 3 7B.

How many tasks should your benchmark have? I think a good minimum is to have at least 150 tasks in a benchmark. But defining a ‘task’ is hard: a benchmark that has 50 ‘tasks’ which each have 10 subtasks could also work. Aim for 300-500 tasks so that your benchmark has more statistical resolution. In my experience, having more tasks than that doesn’t really help. I would also use the 500 task point as an upper limit, because having more tasks means it could take a very long time to run your benchmark, which could be a big downside for adoption.

Benchmarks typically get saturated within a year. And so when you build a benchmark, I don’t think it’s important to worry about questions like “Will this still be an interesting question in five years?”. Deep learning moves incredibly quickly, and there’s no way to predict where we’ll be in more than a year. For this reason, it’s also OK to have benchmarks that ask questions that will probably have totally different answers in a year or two, such as “Which yoga studios near Bushwick in NYC have a Vinyasa class before 7AM?”.
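To get a feel for what “statistical resolution” means in the task-count guideline above, here is a back-of-the-envelope sketch. It assumes every task is an independent pass/fail trial and uses the normal approximation to the binomial; both are simplifying assumptions of mine, and the function name is made up.

```python
import math


def accuracy_ci_half_width(num_tasks, accuracy, z=1.96):
    """Approximate 95% confidence half-width for an accuracy measured on num_tasks tasks."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / num_tasks)
    return z * standard_error


for n in (50, 150, 300, 500, 2000):
    half_width = accuracy_ci_half_width(n, accuracy=0.5)
    print(f"{n:>5} tasks: +/- {100 * half_width:.1f} accuracy points")
```

The half-width shrinks with the square root of the number of tasks, so quadrupling the benchmark size only halves the uncertainty, while the cost of running the benchmark grows linearly.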

Benchmarks are great because they provide a lot of room for creativity, and they can be super impactful in guiding the community towards the future. I hope this post helps you in building the next big LM benchmark. And always remember: rules are meant to be broken. I do not think that a benchmark that does not follow all of these rules is bad. I just think that these guidelines are a good indicator for whether you’re on the right path or not.

Concluding Thoughts

Here are some questions to think about while you’re designing a new benchmark:

A benchmark is a collection of tasks, where each task is a 4-tuple of a request, an environment, stopping criteria, and a scorer.
     A. The request is what you want the model to actually do, e.g. in SWE-bench it would be “Fix this issue ” + issue_text.
     B. The environment is a total description of the environment that the agent will act in while solving your request. Is internet access allowed? What dependencies are installed and which ones are not? Are there any special tools you will be providing the agent with?
     C. The stopping criteria are how you decide when to end an agent’s run. For some tasks the agent will probably issue a ‘submit’ command and exit but you need to decide how to act when that never happens. Are you going to have a turn limit per task? A cost limit? A walltime limit? A combination of these? All answers are viable, you just need to decide.
     D. The scorer takes the environment as it was when the agent exited and scores it. Will you build a binary pass/fail benchmark, like we did in SWE-bench with the fail2pass and pass2pass tests? Or will you build a benchmark with a continuous score, like we did in AlgoTune, where we ask agents to speed up computer programs, and the score per task is the total runtime of the agent’s code divided by our baseline’s total runtime? Or will you use Elo like we did in CodeClash? There are many possibilities here.
What is the baseline scaffolding that you will use and how similar is it to the best scaffolding in common use right now? For example, if you’re asking coding questions, and your scaffolding doesn’t allow for code execution, that’s not a very good representation of reality. If you’re asking knowledge questions and don’t allow access to the internet, that’s not realistic. Try to make your scaffolding as good as you can. This frequently doesn’t take as much effort as people think. mini-SWE-agent is able to get scores that are very competitive with (and sometimes even surpass) those of Claude Code these days, even though it is orders of magnitude simpler. I talk a lot about how much easier it is to sell a benchmark that is realistic, and part of that is making the tasks realistic, but you should also make your baseline scaffolding realistic, otherwise people will mistrust your results.
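One way to keep the four parts of a task straight is to write them down as a small data structure. The sketch below is only an illustration: the class names, field names, limits and the `final_state` layout are all invented here and do not correspond to the SWE-bench, AlgoTune or CodeClash code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoppingCriteria:
    max_turns: Optional[int] = None
    max_cost_usd: Optional[float] = None
    max_walltime_s: Optional[float] = None

    def should_stop(self, turns, cost_usd, walltime_s, submitted):
        """Stop on an explicit submit, or as soon as any configured limit is hit."""
        if submitted:
            return True
        if self.max_turns is not None and turns >= self.max_turns:
            return True
        if self.max_cost_usd is not None and cost_usd >= self.max_cost_usd:
            return True
        if self.max_walltime_s is not None and walltime_s >= self.max_walltime_s:
            return True
        return False


@dataclass
class Task:
    request: str                     # what the model is asked to do
    environment: dict                # e.g. internet access, dependencies, tools
    stopping: StoppingCriteria       # when to end the agent's run
    scorer: Callable[[dict], float]  # maps the final environment state to a score


def binary_test_scorer(final_state):
    """Pass/fail: every fail2pass and every pass2pass test must pass."""
    results = final_state["fail2pass"] + final_state["pass2pass"]
    return 1.0 if all(results) else 0.0


task = Task(
    request="Fix this issue: " + "<issue_text>",
    environment={"internet": False, "tools": ["bash"]},
    stopping=StoppingCriteria(max_turns=50, max_cost_usd=3.0),
    scorer=binary_test_scorer,
)
print(task.scorer({"fail2pass": [True, True], "pass2pass": [True, False]}))
```

A continuous scorer would have the same signature and simply return a ratio or rating instead of 0.0 or 1.0.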

Benchmarks are what moves the frontier of AI forward, and there isn’t anything more important than building good new benchmarks. Good luck!

5 Tips for Finding Research Topics

2024-02-13T00:00:00+00:00

I’ve spent the past eight years doing research on neural language models. While my top-level goal of making language models more useful to humans has remained stable, my path to achieving it has changed over time. I’ve learned that in research, you not only have to learn how to execute on a given idea, but you also have to learn how to pick which direction to direct your efforts at.

This post contains five patterns that I’ve noticed that both my peers and I use to think of new ideas for research projects. It’s important to note that this advice might only apply to the research sub-field that I’m in and might only apply to people who want to perform research in a manner that is similar to the one I use. There are many ways to succeed, some of which are orthogonal to the guidelines that I use.

1. Focus first on finding a problem, not a solution.

A common mistake I see people make is starting a project by thinking about a model they want to build, without thinking about whether it’s even necessary. They’ll get really excited about a certain method and try to just build something based on that. “Let’s build a retrieval-augmented language model!”. “I want to build an LM agent that uses reinforcement learning!”. But I don’t really think that research can be done with such a vague goal. In order to ground your research, I believe that it’s incredibly useful to find a specific issue that the community would care about, and then work from there. So for example “I’ve noticed that GPT has a really hard time answering questions that have a spatial component” or “I’ve noticed that GPT has a really hard time programming solutions to programming puzzles about graphs” are good starting points. First define an issue, then try to figure out if it would interest the community, then build a dataset of example problems and then try to build a model that solves them. While building a solution, if you figure out that you do need RL or retrieval, then integrate that into your solution. But you’ll always be grounded in the results you’ll get from running your various baselines and new models on the dataset that you built. That’ll tell you whether those components are actually improving your model or not. I used to think that the majority of the work for researchers in deep learning was in building solutions, but I’m now pretty convinced that the majority of the work is in finding good problems!

2. Once you find a good problem to work on, it’s better to iterate and experiment through many potential solutions that are maybe good rather than working through one or two solutions that are “definitely” correct.

The king of deep learning is empirical results. Once you’ve decided which problem to focus on, the only thing that you can do if you want to test a certain solution is to run it and see what happens. Therefore, for me it has been very beneficial to iterate through the idea-brainstorming-to-implementation-to-results cycle many times per project. I sometimes see people get stuck in endless pontification and over-analysis of ideas without ever opening an IDE. They’ll think of an idea and then analyze it on a whiteboard for weeks before they even consider sitting down to program it. In my view, that hypothetical analysis is not very useful. Since there’s not really a theory of deep learning yet, there is little utility in analyzing potential solutions on a whiteboard for more than a few hours. As ML practitioners, our source of truth is the results we get after running an experiment. Yes, you shouldn’t just program every single idea that comes into your head, but also, once you think of something and spend an hour or two thinking it through, just implement it and see how it does.

Only many empirical experiments will build an intuition for what works and what doesn’t and will help you define the path forward. Quick iteration through a trajectory of research ideas leads to progress in deep learning. I’ve also observed that one of the most important factors is the number of iterations on research ideas and not the magnitude of each idea. So I generally recommend thinking of and working on solutions that are as simple as possible, so that you can go through the idea-brainstorming-to-implementation-to-results cycle as many times as possible per project. Understanding what ideas to implement and which ones not to, how to prioritize different possible directions, and how to reject ideas before even implementing them are skills that you will learn as you spend more time doing research.

In the second year of my PhD I had an idea for a new type of retrieval-augmented LM. This model had two components, a retriever and an LM. The LM was pretty much off-the-shelf, and the retriever was the component that I wanted to innovate on. The full model was quite complex and so the first prototype took me about three months to build. I then ran it and it didn’t improve performance. So I then thought of a second version, which was also quite complex and ended up taking me two months to program. That model didn’t work either.

I then realized that I should first verify if this idea could even work by using an oracle retriever: basically cheating at the retrieval stage to make your retriever as good as it could be. This oracle-based model took a few days to program. When I ran it, it also didn’t work, which made my confidence in the overall idea tank. If the model couldn’t work with an impossibly-powerful (enabled by cheating) retriever, I didn’t think it could work with a less powerful, but possible, retriever. In hindsight, I wish I had noticed, when I had the original idea, that the initial prototype would take three months to program. Instead of starting from it, it would’ve been much smarter to start from the much smaller oracle prototype and build my way up to the three-month prototype step by step, and only if the earlier prototypes showed promise. That would’ve saved me a lot of time.
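The oracle check is cheap largely because the retriever becomes a one-line function. The following toy sketch shows the shape of such an experiment; the dataset, the `gold_passage` field and the trivial `toy_answer_fn` are invented for illustration and have nothing to do with the actual model from my PhD.

```python
def accuracy(answer_fn, retriever, dataset):
    """Fraction of examples answered correctly when using the given retriever."""
    correct = 0
    for example in dataset:
        context = retriever(example)
        if answer_fn(example["question"], context) == example["answer"]:
            correct += 1
    return correct / len(dataset)


def no_retriever(example):
    return ""


def oracle_retriever(example):
    # Cheats on purpose: reads the gold passage straight from the labeled data.
    return example["gold_passage"]


def toy_answer_fn(question, context):
    # Stand-in for the LM: answers with the last word of the context, if there is one.
    words = context.split()
    return words[-1] if words else "unknown"


dataset = [
    {"question": "Where is the Technion?", "gold_passage": "The Technion is in Haifa", "answer": "Haifa"},
    {"question": "Where is Thessaloniki?", "gold_passage": "Thessaloniki is in Greece", "answer": "Greece"},
]
baseline = accuracy(toy_answer_fn, no_retriever, dataset)
upper_bound = accuracy(toy_answer_fn, oracle_retriever, dataset)
print(f"no retrieval: {baseline:.2f}, oracle retrieval: {upper_bound:.2f}")
```

If the oracle number is not clearly above the no-retrieval number, a realistic retriever is unlikely to help either, and you find that out in days rather than months.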

3. Write a paper that many people would want to read:

Your paper should either teach us something new about an existing system, method or benchmark, or achieve better performance on an existing benchmark, or present a new benchmark. Good papers sometimes do two of those things.

What does it mean to achieve better performance?

If I have a model that can get 70% accuracy on SWE-bench, and you develop a new one that achieves 72% accuracy but is three times slower, then that’s probably not an interesting new system. Improving performance doesn’t just mean improving accuracy and ignoring all other factors. When we look at a system, we have to consider not only accuracy but also training time, inference time, memory usage, latency and disk space usage. For a new system to be better than an old one, it should improve on one of these metrics while keeping the other ones at the same level or better. If the improvement is substantial, it’s fine to disregard this rule and, for example, present a system that achieves 99% accuracy on SWE-bench while being 30% slower than the current state-of-the-art. But in that case, make sure to also compare your method to the baseline when the baseline is given 30% more time. You should always be comparing your methods to the strongest baselines you can find, and you should strive to make that comparison as fair as possible.
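The “improve one metric, keep the others at the same level or better” rule above can be written as a simple dominance check. This is just a sketch of that rule; the metric names and numbers are illustrative, and the function does not try to capture the “substantial improvement” exception.

```python
def is_clear_improvement(new, old, higher_is_better=("accuracy",)):
    """True if `new` is at least as good as `old` on every metric and better on one."""
    better_somewhere = False
    for metric, old_value in old.items():
        new_value = new[metric]
        if metric not in higher_is_better:
            # For costs such as time or memory, lower is better, so flip the sign.
            new_value, old_value = -new_value, -old_value
        if new_value < old_value:
            return False
        if new_value > old_value:
            better_somewhere = True
    return better_somewhere


old_system = {"accuracy": 0.70, "inference_time": 1.0, "memory_gb": 16.0}
slower_system = {"accuracy": 0.72, "inference_time": 3.0, "memory_gb": 16.0}
same_cost_system = {"accuracy": 0.72, "inference_time": 1.0, "memory_gb": 16.0}

print(is_clear_improvement(slower_system, old_system))
print(is_clear_improvement(same_cost_system, old_system))
```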

Why is community excitement important?

You should definitely work on research that excites you, but you should also try to find a topic that would also excite a large number of the other researchers in the field. Papers only reach a small percentage of their target audience. If you’re writing a paper about a niche topic that only has 50 researchers working on it, maybe 15 of them would actually hear about your work, 5 of them would read it and likely none of them would perform follow-up work. I therefore find it incredibly useful and rewarding to work on topics that have a wide interest in the community.

4. Write a paper that would be an interesting part of the research discourse that will be happening in nine months.

In my experience, for my style of work, research projects take nine months on average. That means that when you start thinking of ideas you shouldn’t think of ideas that would be interesting if they were published today or tomorrow, because it’s going to take you much longer than that to write a paper. You should definitely make sure that you’re not writing a paper that would have been relevant one or two years ago. One simple rule of thumb is to make sure that at least some of the related works that you’re citing are from the past year or two. If your latest citation is from five years ago, there’s a big chance that you may be writing a paper that would be irrelevant to the current research discourse.

The most important skill to learn here is observing the trajectories that occur within your research sub-community and being able to predict where they will go. As you do research for more time you’ll get better at predicting where your research field is heading and how to do work that will fit into that puzzle as well as possible. It sounds impossible at first but I believe that, at least in deep learning, it’s possible to predict with high accuracy where the field will be in 9 months. It’s totally impossible to predict where the field will be in 18 months or longer, and that’s why I recommend not working on projects that would take that long.

5. Keep it simple.

And what do I mean by ‘it’? Everything. Try to work on problems that are easy to describe. Try to find simple solutions for those problems. Try to describe your solution in your paper in as simple of a way as possible. Try to write the code for your method in a simple way, and make it easy for others to run and extend your code.

Think of the counterexample here. If you worked on a problem that took two pages to describe in a paper, would any reader stick around for that whole description, let alone would anyone stick around to read about your solution? I also feel like most of the most important problems in our field can be stated in a sentence, so if it takes you six paragraphs to describe the problem you’re working on, that might be a hint that you’re working on a problem that is too niche or contrived.

It makes me happy when lots of people get excited by the work I release. Simplicity is one of the main driving factors in finding ideas that lots of people might get excited by. If you work on a super complex topic, there’s a high chance that very few people would even understand the problem you’re trying to solve, let alone your solution.

As for keeping your solution simple: the ML community has proven time and time again that the best methods that have the most long-lasting impact are always the simplest ones.

As with all rules, in some cases it does make sense to violate this rule. Tim Dettmers’ bitsandbytes is a super popular efficiency library. It’s made up of a lot of very non-simple CUDA code. But everything else about this library is extremely simple: it’s really easy to use, and the motivation behind it is very clear (“bitsandbytes helps you run big models on small GPUs”).

As a researcher that has to publish, it may be tempting to find complex solutions to complex problems, since reviewers are frequently impressed when they see many equations and proofs in a paper. But in my experience, while those types of papers may initially get accepted, over time, complex solutions do not have much lasting impact. Complex papers are harder to read, and their code is usually harder to extend; these properties substantially harm impact.

Closing note:

The strength of the research community is in its large size and diversity. There are many ways to do good research, some of which align with the tips above and some of which don’t. I hope that by sharing these lessons that I’ve learned over the years I’ve helped you improve your ability to do the best research you can.

If you enjoyed this post, you might also enjoy my post on tips for junior researchers, which focuses on how to do research and how to work with a mentor.

Thank you to Nelson Liu, Will Merrill, Samuel Ainsworth, Shunyu Yao and Naomi Saphra for comments on previous drafts of this post.

PhD Thesis Acknowledgments and Dedication

2023-07-06T00:00:00+00:00

The only important parts of a PhD thesis are the acknowledgments and dedication sections 😉 so I’ve uploaded the ones from my thesis here.

Acknowledgments

When Noah A. Smith accepted me into the Ph.D. program, he invited me into an environment that eventually produced a new and improved me: a version of me that is more open to new ideas, better at executing, smarter, and more patient. I will be forever grateful to Noah for believing in me and fully trusting me from day one and for always treating me with respect, patience, and love. Noah not only taught me how to do science, he also significantly improved my storytelling abilities, my ability to hold onto a reader’s attention, and my ability to frame my stories in an appealing way.

Although formally Mike Lewis wasn’t on my committee, he was on all of the papers in my thesis, and I consider him to have been my de facto co-advisor. Our ability to bounce ideas off of each other while improving them at each iteration is incredible. Mike’s advising complemented my thinking style in a way that immensely improved the level of the work I did during the Ph.D. Working with Mike has shown me how much easier it is to tackle complex problems when you approach them with the right partner.

The end of my first academic year was rough for me, both academically and personally. It was then that Omer Levy invited me to be his intern at Facebook AI Research (FAIR). This totally changed the course of my Ph.D., and I am very grateful to Omer for that opportunity. Omer taught me many important skills for empirical machine learning research that I still carry to this day.

After spending half a year at FAIR as Omer’s intern, Luke Zettlemoyer invited me to stay at FAIR, as a visiting researcher, for two additional years. That allowed me to work with Mike and do resource-intensive research that would not have been possible without an industry affiliation. Throughout this time, Luke gave me total freedom to explore and do research on whatever I wanted to and always provided support. Being at FAIR for two and a half years made my research much stronger than it would have been if I hadn’t been there.

I’m grateful to Jonathan Frankle for inviting me to join MosaicML as a visiting researcher for six months and for always supporting my research.

I’m grateful to Kyunghyun Cho for hosting me at the wonderful Center for Data Science at NYU where I spent the last academic year of my PhD.

I’m grateful for my first mentee, Muru Zhang, for showing me how rewarding it can be to advise.

I’m grateful for the help of Elise deGoede Dorough and Sandy Kaplan from the Allen School at the University of Washington.

I’m grateful for all the other collaborators I’ve had during my Ph.D., who I learned so much from: Adi Haviv, Ori Ram, Peter Izsak, Sewon Min, Ludwig Schmidt, Will Merrill, Alisa Liu, and members of the BigScience project.

I’m grateful for my friend Samuel K. Ainsworth. I met Sam at visit days before starting graduate school where we bonded over our love for free food. Five years later we’re still eating free food together, hopefully for many more years.

I’m grateful for my friend Ivan Evtimov. We met during the beginning of the first year of school since Ivan would always show up to the board-game nights I organized, and we have been friends ever since.

I’m grateful for my friendship with Tim Dettmers and Gabriel Ilharco and for their smart comments and strong support for my work.

I’m grateful to my academic twin brother Jungo Kasai for five years of conversations, comments on papers, and reminders about administrative tasks that I had to complete.

I’m grateful to Ian Covert, Edward Misback, and Nathan Hatch for our countless days together playing tennis.

I’m grateful for my friends back home Ofek Doitch, Dean Stephansky, Andrey Shulika, Nimrod Fiat, Gili Yablonka, Nir Aviv, Tzvika Geft, Shai Kazaz, Gil Levi, Gregory Axler, and Lior Uzan.

I’m grateful to Ben K., Sammy, Danny, Kyra, Dasha, Ben A., Jamie, Paul, Jason, Lana, Carolyn, Shirley, Ivan, Pearly, Adi, Orville, Joe, Amitai, and everyone else who taught me how to stand on the shoulders of giants.

I’m grateful for James, Trixie, Andrea, Raven, Grini, Ian, Za, and everyone else who went to Brazil with me.

I’m grateful for my therapist who has taught me how to better understand myself and others.

I’m grateful for all the love and support I’ve received from Saba, Safta, Ima, Ori, Avia, Yossi, Liat, Idan, Ronnie, and Bara, without which I would not have been able to complete my Ph.D.

Dedication

Dedicated to my grandfather, Haim Yehiel (born in 1926 in Thessaloniki, Greece, died in 2022 in Ramat Hasharon, Israel), for inspiring me to be a nerd.

This is him at the Technion in Haifa in 1945 during an undergraduate class on welding.

The Use Case for Relative Position Embeddings

2022-11-08T00:00:00+00:00

We’re in 2022 but many of our most popular causal language models (LMs), including GPT-3, still use absolute positional embeddings. I believe we should stop using those and move to relative positional embeddings such as ALiBi. DeepMind’s Gopher and BigScience’s BLOOM already use relative positioning methods, and I’ve heard that multiple upcoming models also will, and so hopefully this post will help in encouraging the remaining holdouts to follow suit.

Imagine you’re building the next version of a causal code prediction model like Codex. When we train an LM like this, due to GPU memory limitations, we must pick a finite sequence length, say 4,000 tokens, to train the model on. If at inference time, users only want to make predictions in code files shorter than 4,000 tokens, we’re good. But if a user wants to make a prediction for token 4,001, it would be impossible with absolute position embeddings. If you use learned position embeddings, feeding 4,001 tokens to your LM will simply throw a runtime exception (there is no 4,001 position representation). If you use sinusoidal position embeddings, the model will run given 4,001 inputs, but as we show in the ALiBi paper, it will produce really bad predictions (for any token beyond the first 4,000).
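To make the learned-embedding failure concrete, here is a minimal sketch. It assumes the learned position embeddings are stored as a plain lookup table with one row per training position; the table name, the toy embedding size, and the helper function are all hypothetical.

```python
import numpy as np

TRAIN_LEN = 4000  # maximum sequence length seen during training
DIM = 8           # toy embedding size (hypothetical)

rng = np.random.default_rng(0)
# One learned vector per position 0..TRAIN_LEN-1; nothing exists beyond that.
learned_pos_emb = rng.normal(size=(TRAIN_LEN, DIM))

def lookup_positions(seq_len):
    """Return the position vectors for a sequence of length seq_len."""
    return learned_pos_emb[np.arange(seq_len)]

print(lookup_positions(4000).shape)  # all 4,000 positions are available

try:
    lookup_positions(4001)
except IndexError as err:
    # The 4,001st token (index 4000) has no row in the table.
    print("lookup failed:", err)
```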

Relative positional methods like ALiBi solve this. The T5 bias is another good option, although personally I prefer ALiBi because our paper shows it obtains better results and also it’s faster and doesn’t require any trainable parameters.
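As a rough illustration of why ALiBi needs no trainable parameters, the sketch below builds its attention bias with NumPy. It follows my reading of the ALiBi paper: each head gets a fixed slope from a geometric sequence (this version assumes the number of heads is a power of two), and the bias is the negative slope times the query-key distance. The function names are hypothetical and this is not the reference implementation.

```python
import numpy as np

def alibi_slopes(n_heads):
    """Geometric sequence of per-head slopes; assumes n_heads is a power of two."""
    start = 2.0 ** (-8.0 / n_heads)
    return np.array([start ** (h + 1) for h in range(n_heads)])

def alibi_bias(n_heads, seq_len):
    """Bias of shape (n_heads, seq_len, seq_len) that is added to attention scores."""
    positions = np.arange(seq_len)
    # distance[i, j] = i - j: how far key j is behind query i.
    distance = positions[:, None] - positions[None, :]
    bias = -alibi_slopes(n_heads)[:, None, None] * distance[None, :, :]
    # Causal mask: a query may not attend to keys that come after it.
    return np.where(distance[None, :, :] < 0, -np.inf, bias)

bias = alibi_bias(n_heads=8, seq_len=5)
print(bias[0])
```

Because the bias depends only on the distance between tokens, the same function can be called with any `seq_len` at inference time, including lengths longer than anything seen in training.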

The rotary method has shown some strong results when evaluating sequences that are shorter than or equal to train length, but in our paper we show that it is not able to extrapolate to longer sequences. In addition, it is slower than ALiBi and the absolute positional methods. Lastly, while some people consider it a relative position embedding method, in my opinion, that’s incorrect. Rotary simply element-wise multiplies position representations by the word representations (instead of adding position reps to word reps, as is done in the absolute methods). This means that Rotary still employs position embeddings, which in my view makes it an absolute position method, not a relative one. This thread has more details on why I believe absolute position methods are not the way to go.

Can’t I just use an absolute position method and a sliding window to extrapolate? Short answer: Depending on how you implement this, it either won’t work or will be very very inefficient.

Details: Absolute position embeddings are battle tested and so when engineers want to build LMs that can handle longer sequences, one of the first ideas they have is to use a sliding window with an absolute position embedding method. So if we go back to our Codex example from before, we would train the same 4,000 token LM, but at inference, we would limit the attention sublayer to only attend to the last 4,000 tokens. So when we input 4,001 tokens we would only attend to tokens 2 to 4,001 and when we input 4,002 tokens we would only attend to tokens 3 to 4,002 and so on.

There are two ways to do this:

The simpler approach is just to re-encode everything at every timestep. So in the first feedforward pass of the LM we encode the first 4,000 words, and then in the second feedforward pass when we’re looking at words 2 to 4,001, we discard everything from before and re-encode everything even though there’s an overlap of 3,999 of the words between the two runs. This will work, but is very inefficient. You have to re-encode everything during each forward pass beyond the first 4,000.
In the second approach, we don’t re-encode previously encoded tokens. This will just lead to really bad predictions (I’ve tested this). This is because the same token will be assigned to different positions during subsequent inference runs, which means that its cached representation is invalid. See visualization below.

In both of these approaches, when we’re not extrapolating (i.e. when we’re doing inference on tokens 1 to 4,000), we do normal LM inference. So for example, when token 600 comes in, we have already computed representations for tokens 1 to 599, so we attend to those representations and only have to construct new outputs for token 600. As mentioned above, if we don’t use a relative position method like ALiBi, continuing this inference beyond the first 4,000 tokens will either just produce really bad predictions or it could work but very slowly, if we re-encode the past tokens. Using ALiBi means we get to continue doing inference much beyond token 4,000 without needing to re-encode anything.
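To get a feel for how slow the re-encoding approach is, here is a back-of-the-envelope sketch. It only counts how many token representations get computed and ignores the cost of attention itself, so it is a simplification; the function name is hypothetical.

```python
WINDOW = 4000  # training context length from the example

def tokens_encoded(total_tokens, window, reencode):
    """Count how many token representations are computed to process total_tokens."""
    count = 0
    for t in range(1, total_tokens + 1):
        if reencode and t > window:
            count += window  # rebuild the whole window from scratch
        else:
            count += 1       # reuse the cache and encode only the new token
    return count

n = 5000
print(tokens_encoded(n, WINDOW, reencode=True))   # sliding window with re-encoding
print(tokens_encoded(n, WINDOW, reencode=False))  # cached inference, no re-encoding
```

For 5,000 tokens, the re-encoding variant computes 4,004,000 token representations, while cached inference computes 5,000.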

To visualize approach 2 from above, I have a toy input sentence here with a toy language model, whose context size is 4 tokens. We see two subsequent inference passes.

Let’s look at the token ‘jumped’ in these two subsequent forward passes with the sliding window + absolute embeddings approach. ‘Jumped’ was assigned position 4 in the initial forward pass, and then if we use this sliding window approach we would have to treat ‘jumped’ as position 3 in the next forward pass, even though we need to attend to the old cached representation in which it had position 4. This weirdness that the model definitely didn’t experience during training just leads it to produce very bad predictions. This approach does not work.
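The position mismatch can be reproduced in a few lines. The sentence below is a made-up stand-in for the toy input, chosen only so that ‘jumped’ is the fourth token; the helper function is hypothetical.

```python
CONTEXT = 4  # toy context size, as in the example

# Hypothetical toy sentence; only the position of "jumped" matters.
tokens = ["the", "quick", "fox", "jumped", "over"]

def window_positions(tokens, end, context):
    """Map each token in the window ending at `end` to its 1-based absolute position."""
    window = tokens[max(0, end - context):end]
    return {tok: pos for pos, tok in enumerate(window, start=1)}

first_pass = window_positions(tokens, end=4, context=CONTEXT)
second_pass = window_positions(tokens, end=5, context=CONTEXT)

print(first_pass["jumped"])   # position used when its representation was cached
print(second_pass["jumped"])  # position the sliding window now implies
```

The two printed positions differ (4, then 3), which is exactly the inconsistency that makes the cached representation invalid.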

Edit: Shortly after I wrote this blog post I was made aware of this new paper which reveals new evidence showing the weakness of absolute position embeddings.

To learn more about ALiBi and relative position embeddings in general, watch my lecture here:

7 Tips for Junior Researchers

2022-11-01T00:00:00+00:00

Research:

Research is hard and involves a tremendous amount of failure. It’s totally normal to feel like you didn’t progress in the past week, month, or year. When you read a paper, the finding is frequently presented as an obvious step that was the result of a seemingly short exploration, but when you actually do research you realize that good results never just fall into your hands. They’re the results of many months if not years of failed expeditions and many experiments that did not work. Failure seems extremely demoralizing at first but experienced researchers will attest that failure isn’t that bad- you learn a lot from failing (you learn what not to do, or what doesn’t work) and after enough failure, you achieve enough understanding to have a good enough idea for something that will work. Even as a senior Ph.D. candidate, I will still sometimes “fail” for months on end, running experiments and exploring ideas that lead to nothing interesting. It’s still tough for me to go through this and while I’m in that zone of trying things and constantly failing it feels depressing. But at some point things start working and it makes all that failure worthwhile.
Don’t ever lie, make up results or sweep negative findings under the rug. These things might help you in the short term but in the long term they never will, and good research is all about the long term. For example, if you run your new model with 4 different random seeds, and in one of those runs the improvement is 10%, and in the other three runs the improvements are -2%, 1% and -9%, that means that the 10% run was a fluke. If you wanted to, you could submit a paper where you don’t mention the other runs, but that would be deceitful. Sure, that paper might get accepted, but eventually someone will try your idea, and they’ll probably run it with a few different random seeds, and notice that your improvement is not statistically significant. Not only will they then not use your method, they’ll also be wary of your future papers. Junior researchers are anxious to get an initial publishable result and might put aside annoying things like statistical significance, but in the long run that will hurt them. There’s no rush- good research takes time, and it’s better to take one year to write a solid paper than it is to write four low-quality papers that each took three months to write.
Focusing is important, especially in the beginning. I’ve noticed that some junior researchers are afraid to pick one project and stick to it, instead preferring to simultaneously work on multiple projects. Their explanation is that they’re not sure which of the projects will pan out, so they want to try multiple projects at once so they have a better chance of one of the projects making it big. In my experience this logic is flawed. Good research requires intense focus and if you work on three projects at once you’re not really going to focus on any of them and so the outcome will probably not be as good as it could’ve been. Instead, I recommend picking the one direction that you’re most excited about and focusing on that. You’ll see that as you keep working on it, it will evolve, pivot, and change many times into many different directions. You may even get exhausted at some point and want to take a small break to work on something else, before coming back to the original direction. That’s ok. Just try to, at least for your first year in grad school, work on one direction at a time. Once you’ve submitted your first paper, you’ll be much wiser and have a much better understanding of the academic idea-to-submission cycle, and then if you feel towards your later years that you’d like to work on two directions at once, I believe that you’ll understand what that means and be able to handle that.
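To make the four-seed example from the second tip concrete, here is a quick check of those numbers using Python’s standard library. It is only a sanity check, not a full significance test.

```python
import statistics

# Improvements (in %) from the four hypothetical runs in the example.
improvements = [10, -2, 1, -9]

mean = statistics.mean(improvements)
stdev = statistics.stdev(improvements)  # sample standard deviation

print(f"mean improvement: {mean:.1f}%")
print(f"std across seeds: {stdev:.1f}%")
```

The mean improvement is 0% and the spread across seeds is close to 8%, so the single 10% run tells you very little on its own.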

Working with a mentor:

It’s ok (and even recommended) to say “I don’t understand you”. When we start doing research we usually do it with an advisor who is much more senior than us, a professor if you’re in grad school or a PhD student if you’re an undergrad. Sometimes the advisor will say something super complicated that is totally incomprehensible to the junior person. When this happened to me as a junior researcher I sometimes was too afraid to say “I don’t understand”, since I thought that the senior person would think that I’m dumb. Now I know that we all think in different ways and something that’s obvious to us might be hard to explain to a different person. Good researchers understand this and are very open to explaining and re-explaining and re-re-explaining their thoughts. Good senior researchers also know that explaining new ideas to different people helps us to better frame our thoughts and understand how to write them in a paper.
It’s ok (and even recommended) to say “I don’t know”. We’re all doing research since we don’t know a lot of things and we’re trying to figure them out together. Saying “I don’t know” when you don’t know something doesn’t make you seem stupid – it makes you seem honest. If you just start pretending to know things you don’t and have answers for the topics you don’t have answers to, it’s not going to be very constructive.
It’s ok (and even recommended) to say “I don’t agree with you”. Progress in research is partially driven by disagreements. At any given moment there are many different paths being explored to solve each issue, and that’s how we progress towards the solution. If everyone worked on the same ideas that would be horrible. So disagreement (even within the same research group or mentor-mentee pair, or even with yourself, over time) is totally acceptable in the research world. Just be nice about it! Never say something like “that’s a dumb idea” or anything even close to that. But if you disagree with a certain direction or idea, find a respectful way to voice your concern. Doing this will let the other person try to convince you why they believe their idea is good, which is helpful for both you (now you understand what they want to do and why they believe in it) and them (you might have uncovered a potential weakness which they can now try to remedy).
Your advisor doesn’t have all the answers. Doing research is a multi-faceted endeavor, with many different questions to answer: What problems are currently relevant and exciting to the community? Which of these is the best fit for me? What is the high-level plan for solving this problem? How can I best execute that plan (implementing the model, figuring out what/where to find hardware, and so on…)? Once I’ve found a solution, what is the best framing for it? How can I best market my paper? Your advisor is going to help you with some of these questions, but might not be able to give you all the answers. Some of them will have to come from you- but one thing that might significantly help is finding another senior collaborator. If you look at papers coming out of the UW NLP (and many other ML/NLP/Vision) groups you’ll notice that a lot of them have both a professor co-author and a co-author who is a postdoc or research scientist (in addition to the main author, who is usually a PhD student). I’ve found this advising style to be incredibly useful: the professor provides high-level feedback on the story and framing, while the postdoc/research scientist can provide more low-level feedback about code and other implementation details.

Closing note

After reading an earlier draft of this post one of my friends said that I didn’t mention the most important part of succeeding in research- being lucky enough to find good mentors to work with. I’m incredibly fortunate to have been able to work with some of the nicest and smartest people in the world and unfortunately not everyone is this lucky. Some mentors won’t let you say “I don’t know”, won’t respect your opinions when brainstorming, won’t give you the freedom to work on things that interest and spark joy in you and will pressure you to meet made-up deadlines. I hope that anyone with a toxic mentor can read this post, realize that there are better alternatives, and try to seek one that would work better for them.

Thank you Noah A. Smith, Gregory Axler, Gabriel Ilharco, Mitchell Wortsman, Samuel Ainsworth, Ori Press, and Gabriel Stanovsky for comments on previous drafts of this post.

The Bamboogle Dataset

2022-10-18T00:00:00+00:00

Bamboogle is a dataset that we constructed, made up only of questions that Google answers incorrectly. The leaderboard for it is here.

In our Compositionality Gap paper, we show that language models also struggle with these questions and that our self-ask prompting method substantially improves the ability of language models to answer these questions (better than Chain-of-Thought).

For more details, check out the video above.

Bamboogle was introduced in our Compositionality Gap paper which can be found here, and the dataset itself is here.

The Compositionality Gap and the Compositional Celebrities Dataset

2022-10-17T00:00:00+00:00

As language models grow in size they know more, but do they get better at reasoning? To test GPT-3, we generated lots of questions such as “What is the calling code of the birthplace of Adele?”. We show that as GPT-3 size grows, it does not improve its reasoning abilities on these types of questions.

Compositional Celebrities

To test the reasoning abilities of LMs we built Compositional Celebrities, a dataset of 8.6k questions in 17 different categories.

These questions all require first retrieving 2 facts and then conducting some basic reasoning about them. For example, to answer “What is the calling code of the birthplace of Adele?” a model must first know that Adele was born in the UK and then should figure out that it needs to return the calling code of the UK: +44.

The reasoning required to answer them is simple, and the basic facts are commonly appearing (since they are either related to celebrities or their birth country or year), but we believe that these 2-hop questions have never previously appeared in any text that could be in the training set of an LM.

The Compositionality Gap

We can check the accuracy on these compositional questions (blue) or the accuracy for each pair of sub-questions separately (i.e. “What is the birthplace of Adele?” & “What is the calling code of the U.K.?”).

The surprising result we uncovered is that the compositionality gap doesn’t narrow with scale!

The compositionality gap is the fraction of compositional questions that GPT-3 can’t answer even though it can separately answer the two sub-questions that make up the compositional question.
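Here is a minimal sketch of how this fraction could be computed. It assumes the denominator is the set of compositional questions for which both sub-questions were answered correctly; the function name and the toy outcomes are made up for illustration and are not results from the paper.

```python
def compositionality_gap(results):
    """results: list of (sub1_correct, sub2_correct, compositional_correct) booleans."""
    # Keep only questions where the model knows both underlying facts.
    both_known = [r for r in results if r[0] and r[1]]
    if not both_known:
        return 0.0
    # Of those, count the ones where the compositional question was still missed.
    failed = sum(1 for r in both_known if not r[2])
    return failed / len(both_known)

# Made-up outcomes for five questions, for illustration only.
toy_results = [
    (True, True, True),
    (True, True, False),
    (True, False, False),
    (True, True, False),
    (True, True, True),
]
print(compositionality_gap(toy_results))
```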

As GPT-3 gets larger it’s remembering more facts but it’s not able to compose ~40% of these fact pairs, at all model sizes from 1B to 175B parameters! Maybe scale can’t solve everything?

This surprising result also occurs in the InstructGPT-3 family of models! The compositionality gap stays around 40% no matter how much we increase model size.

In the table below, we zoom in on the results for the best GPT-3 model, davinci-002. Some compositional question categories are really easy for it, like Birthplace/Domain Name (80% acc), but some are super hard, like Birth Year/Lit. Nobel Winner (1% acc). We’re not quite sure why.

And finally, the figure below presents an interesting finding: when GPT-3 (davinci-002) is very confident about two facts, it will be able to answer the compositional 2-hop question about them with much higher probability!

Our paper is available here, and the Compositional Celebrities dataset is available on GitHub.

Self-ask Prompting

2022-10-10T00:00:00+00:00

Self-ask is a new prompting method which improves the ability of language models to answer complex questions.

Normally a question answering prompt looks like this:

Question: Who lived longer, Muhammad Ali or Alan Turing?
Answer: Muhammad Ali 

Question: When was the founder of craigslist born?
Answer: December 6, 1952

Question: Who was the maternal grandfather of George Washington?
Answer: Joseph Ball 

Question: Are both the directors of Jaws and Casino Royale from the same country? 
Answer: No

In self-ask, we first have the model generate and then answer sub-questions about the main input question, before answering the input question. So our prompt would look like so:

Question: Who lived longer, Muhammad Ali or Alan Turing?
Are follow up questions needed here: Yes.
Follow up: How old was Muhammad Ali when he died?
Intermediate answer: Muhammad Ali was 74 years old when he died.
Follow up: How old was Alan Turing when he died?
Intermediate answer: Alan Turing was 41 years old when he died.
So the final answer is: Muhammad Ali 

Question: When was the founder of craigslist born?
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952

Question: Who was the maternal grandfather of George Washington?
Are follow up questions needed here: Yes.
Follow up: Who was the mother of George Washington?
Intermediate answer: The mother of George Washington was Mary Ball Washington.
Follow up: Who was the father of Mary Ball Washington?
Intermediate answer: The father of Mary Ball Washington was Joseph Ball.
So the final answer is: Joseph Ball 

Question: Are both the directors of Jaws and Casino Royale from the same country? 
Are follow up questions needed here: Yes. 
Follow up: Who is the director of Jaws? 
Intermediate Answer: The director of Jaws is Steven Spielberg. 
Follow up: Where is Steven Spielberg from? 
Intermediate Answer: The United States. 
Follow up: Who is the director of Casino Royale? 
Intermediate Answer: The director of Casino Royale is Martin Campbell. 
Follow up: Where is Martin Campbell from? 
Intermediate Answer: New Zealand. 
So the final answer is: No

This leads the model to solve test-time questions by first answering subquestions about them, and this leads to an increase in performance.

The structure of our prompt allows us to easily parse-out these subquestions and have Google Search answer them instead of the LM. We show that this further improves performance. In our paper we call this system Self-ask + Search Engine. Note that Google Search does not have a publicly available API, and so we use SerpApi, which is a cloud service that provides an easy to use API to Google Search.
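To show how easy the parsing is, here is a small sketch that pulls the sub-questions and the final answer out of self-ask formatted text with regular expressions. The function name is hypothetical and this is not the code from our repository; in a full system, each extracted sub-question would be sent to a search engine rather than just printed.

```python
import re

def parse_self_ask(text):
    """Extract follow-up questions and the final answer from self-ask style text."""
    follow_ups = re.findall(r"^Follow up:\s*(.+?)\s*$", text, flags=re.MULTILINE)
    final = re.search(r"^So the final answer is:\s*(.+?)\s*$", text, flags=re.MULTILINE)
    answer = final.group(1) if final else None
    return follow_ups, answer

example = """Question: When was the founder of craigslist born?
Are follow up questions needed here: Yes.
Follow up: Who was the founder of craigslist?
Intermediate answer: Craigslist was founded by Craig Newmark.
Follow up: When was Craig Newmark born?
Intermediate answer: Craig Newmark was born on December 6, 1952.
So the final answer is: December 6, 1952
"""

questions, answer = parse_self_ask(example)
print(questions)
print(answer)
```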

Watch our demo (embedded above) for a deeper overview of Self-ask and Self-ask + Google Search.

To learn more about the Compositional Celebrities dataset that we created to evaluate self-ask, and about the surprising compositionality gap that we discovered, go to this post.

To learn more about the Bamboogle dataset that we created, made up only of questions that Google can’t answer, go here.

Our paper is available here, and our code for Self-ask + Google Search is on GitHub.

The following video provides an in-depth overview of all of the topics listed above:

Improving Transformer Models by Reordering their Sublayers

2020-05-04T00:00:00+00:00

The transformer layer is currently the primary component in natural language processing, playing a leading role in recent innovations such as BERT and GPT-3. Each transformer layer consists of a self-attention sublayer (s) followed by a feedforward sublayer (f), creating an interleaving pattern of self-attention and feedforward sublayers throughout a multilayer transformer model.

Is this interleaved pattern the best way to order these sublayers?

In this post, I’ll explain how we recently found a better way to order these sublayers. That ordering leads to superior performance on multiple language modeling benchmarks.

We started by generating random transformer models, varying the number of each type of sublayer, and their ordering, while keeping the overall model size (number of parameters) constant. Here are a few of these randomly generated models:

(Note that one self-attention sublayer has half the parameters of a feedforward sublayer, so you’ll notice that models that have more feedforward sublayers are shallower. )

We trained these models on the standard WikiText-103 word-level language modeling benchmark. While most of these randomly generated models performed worse than the interleaved model, about a third of these random models outperformed it (mostly by a small margin). Our analysis shows that models with more self-attention toward the bottom and more feedforward sublayers toward the top tend to perform better in general:

We also observed that models that had an equal number of self-attention and feedforward sublayers tended to perform better than models that had an unequal number of self-attention and feedforward sublayers. Based on this insight, we design a new family of transformer models that follow a distinct sublayer ordering pattern: sandwich transformers. Sandwich transformers are made up of n self-attention sublayers, followed by the regular interleaved transformer pattern, followed by n feedforward sublayers:
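As a rough sketch of this pattern, the function below writes out the sublayer ordering as a string, with s for self-attention and f for feedforward. The function name is hypothetical, and the 16-layer depth is an assumption made for the example.

```python
def sandwich_ordering(n_layers, k):
    """Sublayer string for a sandwich transformer with sandwiching coefficient k.

    's' is a self-attention sublayer and 'f' is a feedforward sublayer.
    """
    if not 0 <= k <= n_layers:
        raise ValueError("k must be between 0 and n_layers")
    # k attention sublayers, then the interleaved pattern, then k feedforward sublayers.
    return "s" * k + "sf" * (n_layers - k) + "f" * k

baseline = sandwich_ordering(16, 0)  # plain interleaved model
sandwich = sandwich_ordering(16, 6)
print(baseline)
print(sandwich)
# Both orderings use the same number of each sublayer type.
print(sandwich.count("s"), sandwich.count("f"))
```

Since the count of each sublayer type never changes, every value of `k` gives a model with the same number of parameters as the baseline.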

Our experiments demonstrate that a sandwich transformer outperforms the baseline interleaved transformer model. This result is made more interesting by the fact that our sandwich transformer is simply a reordering of the sublayers in the baseline model, and does not require more parameters, memory, or training time. On WikiText-103, sandwich transformers improve perplexity while also reducing the variance caused by selecting different random seeds:

Finally, we demonstrate that even though the sandwich transformer is motivated by random search experiments on WikiText-103, it can improve performance on additional domains and tasks. Sandwich transformers achieve state-of-the-art results on the enwik8 character-level language modeling dataset and on an additional word-level corpus. We conjecture that tuning transformer reorderings to specific tasks could yield even larger gains, and that further exploration of the ordering space may provide universally beneficial patterns.

Other conclusions and insights:

The transformer layer is not the smallest indivisible unit. The self-attention or feedforward sublayers can each function independently.
The transformer architecture is quite robust to sublayer order changes. A non-negligible number of the random orderings that we trained performed just as well as (and sometimes better than) the baseline.
The ‘extreme sandwich’ ordering s¹⁶f¹⁶ (shown below) works almost as well as the baseline on WikiText-103.

The optimal transformer ordering is not identical across different datasets. For example, the best sandwiching coefficient for WikiText-103 is 6, but the best coefficient for the Toronto Book Corpus language modeling dataset is 7. For character level language modeling the optimal sandwiching coefficients were also different.

The paper is available here. We also have a video presentation available here.

Neural Language Models Explained

2017-09-07T00:00:00+00:00

Language models assign probability values to sequences of words. Those three words that appear right above your keyboard on your phone that try to predict the next word you’ll type are one of the uses of language modeling. In the case shown below, the language model is predicting that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only gets to see the top three most probable words.

Neural language models are a fundamental part of many systems that attempt to solve natural language processing tasks such as machine translation and speech recognition. Currently, all state of the art language models are neural networks.

The first part of this post presents a simple feedforward neural network that solves this task. In the second part of the post, we will improve the simple model by adding to it a recurrent neural network (RNN). The final part will discuss two recently proposed regularization techniques for improving RNN based language models.

A simple model

To begin, we will build a simple model that, given a single word taken from some sentence, tries to predict the word that follows it.

We represent words using one-hot vectors: we decide on an arbitrary ordering of the words in the vocabulary and then represent the nth word as a vector of the size of the vocabulary (N), which is set to 0 everywhere except element n which is set to 1.

The model can be separated into two components:

We start by encoding the input word. This is done by taking the one-hot vector representing the input word (c in the diagram), and multiplying it by a matrix of size (N,200) which we call the input embedding (U). This multiplication results in a vector of size 200, which is also referred to as a word embedding. This embedding is a dense representation of the current input word. This representation is both of a much smaller size than the one-hot vector representing the same word, and also has some other interesting properties. For example, while the distance between every two words represented by one-hot vectors is always the same, these dense representations have the property that words that are close in meaning will have representations that are close in the embedding space.
The second component can be seen as a decoder. After the encoding step, we have a representation of the input word. We multiply it by a matrix of size (200,N), which we call the output embedding (V). The resulting vector of size N is then passed through the softmax function, normalizing its values into a probability distribution (meaning each one of the values is between 0 and 1, and their sum is 1). This distribution is denoted by p in the diagram above.

The decoder is a simple function that takes a representation of the input word and returns a distribution which represents the model’s predictions for the next word: the model assigns to each word the probability that it will be the next word in the sequence.
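The two components can be sketched in a few lines of NumPy. The sizes here are tiny so the example runs instantly, and the weights are random rather than trained, so the predicted distribution is meaningless; the point is only to show the shapes and the order of operations. The helper names are hypothetical.

```python
import numpy as np

N, D = 10, 4  # toy vocabulary and embedding sizes (the post uses 10,000 and 200)
rng = np.random.default_rng(0)
U = rng.normal(size=(N, D))  # input embedding
V = rng.normal(size=(D, N))  # output embedding

def one_hot(n, size):
    vec = np.zeros(size)
    vec[n] = 1.0
    return vec

def softmax(x):
    z = np.exp(x - x.max())  # subtract the max for numerical stability
    return z / z.sum()

def predict_next(word_index):
    c = one_hot(word_index, N)
    embedding = c @ U              # encoder: selects row word_index of U
    return softmax(embedding @ V)  # decoder: scores turned into probabilities

p = predict_next(3)
print(p.shape, p.sum())
```

Multiplying a one-hot vector by U is the same as reading out a single row of U, which is why real implementations use an index lookup instead of a matrix product.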

To train this model, we need pairs of input and target output words. For the (input, target-output) pairs we use the Penn Treebank dataset which contains around 40K sentences from news articles, and has a vocabulary of exactly 10,000 words. To generate word pairs for the model to learn from, we will just take every pair of neighboring words from the text and use the first one as the input word and the second one as the target output word. So for example for the sentence “The cat is on the mat” we will extract the following word pairs for training: (The, cat), (cat, is), (is, on), and so on.
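Extracting the training pairs takes one line of Python. This sketch assumes simple whitespace tokenization; the function name is hypothetical.

```python
def word_pairs(sentence):
    """Return (input, target) pairs of neighboring words."""
    words = sentence.split()
    return list(zip(words[:-1], words[1:]))

print(word_pairs("The cat is on the mat"))
```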

We use stochastic gradient descent to update the model during training, and the loss used is the cross-entropy loss. Intuitively, this loss measures the distance between the output distribution predicted by the model and the target distribution for each pair of training words. The target distribution for each pair is a one-hot vector representing the target word.

The metric used for reporting the performance of a language model is its perplexity on the test set. It is defined as \(e^{-\frac{1}{N}\sum_{i=1}^{N} \ln p_{\text{target}_i}}\), where \(p_{\text{target}_i}\) is the probability given by the model to the ith target word. Perplexity is a decreasing function of the average log probability that the model assigns to each target word. We want to maximize the probability that we give to each target word, which means that we want to minimize the perplexity (the optimal perplexity is 1).

The perplexity for the simple model¹ is about 183 on the test set, which means that on average it assigns a probability of about \(0.005\) to the correct target word in each pair in the test set. It’s much better than a naive model which would assign an equal probability to each word (which would assign a probability of \(\frac {1} {N} = \frac {1} {10,000} = 0.0001\) to the correct word), but we can do much better.
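The relationship between perplexity and the probability given to the target words can be checked directly. Since the target distribution is one-hot, the cross-entropy loss for a pair is just the negative log of the probability given to the target word, and perplexity is the exponent of the average loss. The “average” probability here is a geometric mean. The function name is hypothetical and the probabilities are made up.

```python
import math

def perplexity(target_probs):
    """Exponent of the negative mean log probability given to the target words."""
    avg_log_prob = sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(-avg_log_prob)

# A model that always gives the target word probability 1/183 has perplexity of about 183.
print(perplexity([1 / 183] * 5))
# A uniform model over a 10,000-word vocabulary has perplexity of about 10,000.
print(perplexity([1 / 10_000] * 5))
```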

Using RNNs to improve performance

The biggest problem with the simple model is that to predict the next word in the sentence, it only uses a single preceding word. If we could build a model that would remember even just a few of the preceding words there should be an improvement in its performance. To understand why adding memory helps, think of the following example: what words follow the word “drink”? You’d probably say that “coffee”, “beer” and “soda” have a high probability of following it. If I told you the word sequence was actually “Cows drink”, then you would completely change your answer.

We can add memory to our model by augmenting it with a recurrent neural network (RNN), as shown below.

This model is similar to the simple one, just that after encoding the current input word we feed the resulting representation (of size 200) into a two layer LSTM, which then outputs a vector also of size 200 (at every time step the LSTM also receives a vector representing its previous state- this is not shown in the diagram). Then, just like before, we use the decoder to convert this output vector into a vector of probability values. (LSTM is just a fancier RNN that is better at remembering the past. Its “API” is identical to the “API” of an RNN- the LSTM at each time step receives an input and its previous state, and uses those two inputs to compute an updated state and an output vector².)
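To illustrate that “API”, here is a single-layer LSTM step written with NumPy, using the standard LSTM gate equations. It is a toy sketch with random, untrained weights and a tiny hidden size, not the two-layer model described in this post; all names are hypothetical.

```python
import numpy as np

D = 4  # toy size; the post uses 200
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * D, 2 * D))  # weights for the four gates
b = np.zeros(4 * D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, state):
    """One time step: (input, previous state) -> (output, updated state)."""
    h_prev, c_prev = state
    gates = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(gates, 4)  # input, forget, output gates and candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, (h, c)

state = (np.zeros(D), np.zeros(D))
for x in rng.normal(size=(3, D)):  # three toy input embeddings
    output, state = lstm_step(x, state)
print(output.shape)
```

The state that comes out of one step is fed back in at the next step, which is how information about earlier words reaches later predictions.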

Now we have a model that at each time step gets not only the current word representation, but also the state of the LSTM from the previous time step, and uses this to predict the next word. The state of the LSTM is a representation of the previously seen words (note that words that we saw recently have a much larger impact on this state than words we saw a while ago).

As expected, performance improves and the perplexity of this model on the test set is about 114. An implementation of this model³, along with a detailed explanation, is available in Tensorflow.

The importance of regularization

114 perplexity is good but we can still do much better. In this section I’ll present some recent advances that improve the performance of RNN based language models.

Dropout

We could try improving the network by increasing the size of the embeddings and LSTM layers (until now the size we used was 200), but soon enough this stops increasing the performance because the network overfits the training data (it uses its increased capacity to remember properties of the training set which leads to inferior generalization, i.e. performance on the unseen test set). One way to counter this, by regularizing the model, is to use dropout.

The diagram below is a visualization of the RNN based model unrolled across three time steps. x and y are the input and output sequences, and the gray boxes represent the LSTM layers. Vertical arrows represent an input to the layer that is from the same time step, and horizontal arrows represent connections that carry information from previous time steps.

We can apply dropout on the vertical (same time step) connections:

The arrows are colored in places where we apply dropout. A dropout mask for a certain layer indicates which of that layer’s activations are zeroed. In this case, we use different dropout masks for the different layers (this is indicated by the different colors in the diagram).

Applying dropout to the recurrent connections harms performance, and so in this initial use of dropout we apply it only to connections within the same time step. Using two LSTM layers, with each layer containing 1500 LSTM units, we achieve a perplexity of 78 (we drop activations with a probability of 0.65)⁴.
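The following sketch shows where dropout sits in such a stack: on the vertical connections only, with a separate mask for every layer, while the state carried along the horizontal connections is left untouched. `toy_layer` is a hypothetical stand-in for an LSTM layer, and rescaling the surviving activations by 1/(1 - p) ("inverted" dropout) is an assumption made here, not something the post specifies.

```python
import math
import random

def dropout(vec, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the survivors."""
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in vec]

def toy_layer(x, h_prev):
    """Hypothetical stand-in for an LSTM layer: mixes the input with the previous state."""
    return [math.tanh(a + b) for a, b in zip(x, h_prev)]

def run_stack(inputs, num_layers, drop_prob, rng):
    size = len(inputs[0])
    states = [[0.0] * size for _ in range(num_layers)]
    outputs = []
    for x in inputs:
        below = x
        for layer in range(num_layers):
            # Vertical connection: dropout on what arrives from the layer below.
            dropped = dropout(below, drop_prob, rng)
            # Horizontal connection: the previous state is passed in unchanged.
            states[layer] = toy_layer(dropped, states[layer])
            below = states[layer]
        # Vertical connection from the top layer into the decoder.
        outputs.append(dropout(below, drop_prob, rng))
    return outputs

sequence = [[0.5, -0.5, 0.25, 0.1]] * 3
train_outputs = run_stack(sequence, num_layers=2, drop_prob=0.65, rng=random.Random(0))
eval_outputs = run_stack(sequence, num_layers=2, drop_prob=0.0, rng=random.Random(0))
```

Setting `drop_prob=0.0`, as one would at test time, turns `dropout` into the identity, so the same function serves for both training and evaluation.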

The recently introduced variational dropout solves this problem by using the same dropout masks at each time step, which makes dropout usable on the recurrent connections as well, and improves the model’s performance even more (to 75 perplexity).
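The difference between the two schemes comes down to when the mask is sampled. The sketch below contrasts them on a toy sequence; the helper names `sample_mask` and `apply_mask`, the drop probability of 0.5, and the rescaling of kept units are illustrative assumptions, and only the masking of a layer's inputs is shown.

```python
import random

def sample_mask(size, drop_prob, rng):
    """A dropout mask: 0.0 for dropped units, 1/(1 - drop_prob) for kept units."""
    keep = 1.0 - drop_prob
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def apply_mask(vec, mask):
    return [a * m for a, m in zip(vec, mask)]

rng = random.Random(0)
sequence = [[1.0] * 8 for _ in range(3)]  # three time steps of toy activations

# Standard dropout: a fresh mask is sampled at every time step.
standard = [apply_mask(x, sample_mask(8, 0.5, rng)) for x in sequence]

# Variational dropout: one mask is sampled per sequence and reused at every step,
# so the same units are dropped throughout the sequence.
shared_mask = sample_mask(8, 0.5, rng)
variational = [apply_mask(x, shared_mask) for x in sequence]
```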

Weight Tying

The input embedding and output embedding have a few properties in common. The first property they share is that they are both of the same size (in our RNN model with dropout they are both of size (10000,1500)).

The second property that they share is a bit more subtle. In the input embedding, words that have similar meanings are represented by similar vectors (similar in terms of cosine similarity). This is because the model learns that it needs to react to similar words in a similar fashion (the words that follow the word “quick” are similar to the ones that follow the word “rapid”).

This also occurs in the output embedding. The output embedding receives a representation of the RNN’s belief about the next output word (the output of the RNN) and has to transform it into a distribution. Given the representation from the RNN, the probability that the decoder assigns to a word depends mostly on that word’s representation in the output embedding (the probability is exactly the softmax-normalized dot product of this representation and the output of the RNN).

Given the RNN output at a certain time step, the model would like to assign similar probability values to similar words. Therefore, similar words are represented by similar vectors in the output embedding. (Again, if a certain RNN output results in a high probability for the word “quick”, we expect that the probability for the word “rapid” will be high as well.)
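A small numeric sketch of this argument: the decoder scores each word by the dot product of its output-embedding row with the RNN output and normalizes the scores with a softmax, so words with similar rows receive similar probabilities. The three-word vocabulary, the vectors, and the name `decoder_probs` are made up for illustration.

```python
import math

def decoder_probs(output_embedding, rnn_output):
    """Softmax over the dot products of each word's row with the RNN output."""
    scores = [sum(w * h for w, h in zip(row, rnn_output)) for row in output_embedding]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["quick", "rapid", "slow"]
output_embedding = [
    [1.0, 0.0],   # quick
    [0.9, 0.1],   # rapid: a vector close to "quick"
    [-1.0, 0.0],  # slow: a very different vector
]
rnn_output = [2.0, 0.0]  # an RNN output that favors "quick"
probs = decoder_probs(output_embedding, rnn_output)
```

Because the rows for “quick” and “rapid” are close, any RNN output that gives one of them a high score gives the other a high score as well, while “slow” ends up with a much lower probability.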

These two similarities led us to recently propose a very simple method, weight tying, to reduce the number of parameters in the model and improve its performance. We simply tie its input and output embeddings (i.e. we set U=V, meaning that we now have a single embedding matrix that is used both as an input and output embedding). This reduces the perplexity of the RNN model that uses dropout to 73, and its size is reduced by more than 20%⁵.
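In code, tying amounts to keeping one matrix and using it in both places. The sketch below is a toy illustration under that reading; the class name `TiedEmbedding`, its methods, and the tiny matrix are hypothetical, and the RNN between the two uses is omitted.

```python
import math

class TiedEmbedding:
    """One matrix (a row per vocabulary word) shared by the encoder and the decoder."""

    def __init__(self, matrix):
        self.matrix = matrix

    def encode(self, word_id):
        # Input embedding: look up the word's row.
        return self.matrix[word_id]

    def decode(self, rnn_output):
        # Output embedding: score every word with the same rows, then apply softmax.
        scores = [sum(w * h for w, h in zip(row, rnn_output)) for row in self.matrix]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

tied = TiedEmbedding([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
vector = tied.encode(0)      # used on the way in
probs = tied.decode(vector)  # the same parameters used on the way out

# A training update to a row changes both the lookup and the decoding scores.
tied.matrix[2][0] = 5.0
updated_probs = tied.decode(vector)
```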

Why does weight tying work?

The perplexity of the variational dropout RNN model on the test set is 75. The same model achieves 24 perplexity on the training set. So the model performs much better on the training set than it does on the test set. This means that it has started to remember certain patterns or sequences that occur only in the training set and do not help the model to generalize to unseen data. One of the ways to counter this overfitting is to reduce the model’s ability to ‘memorize’ by reducing its capacity (number of parameters). By applying weight tying, we remove a large number of parameters.
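A back-of-the-envelope count shows how many parameters tying removes for the sizes mentioned above (a vocabulary of 10000 words, and two LSTM layers of 1500 units each). The count assumes a standard LSTM parameterization with four gates and one bias vector per gate, and it ignores any bias in the decoder; both are assumptions of this sketch rather than details given in the post.

```python
def lstm_layer_params(input_size, hidden_size):
    """Four gates, each with weights over [input, previous hidden state] and a bias."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

vocab_size, hidden_size, num_layers = 10000, 1500, 2

embedding_params = vocab_size * hidden_size  # one (10000, 1500) matrix
lstm_params = num_layers * lstm_layer_params(hidden_size, hidden_size)

untied_total = embedding_params + lstm_params + embedding_params  # input + LSTM + output
tied_total = embedding_params + lstm_params                       # one shared embedding
reduction = 1 - tied_total / untied_total
```

Under these assumptions the untied model has about 66 million parameters and the tied one about 51 million, a reduction of about 22.7%, which is consistent with the “more than 20%” figure quoted earlier.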

In addition to the regularizing effect of weight tying, we presented another reason for the improved results. We showed that in untied language models the word representations in the output embedding are of much higher quality than the ones in the input embedding. This is shown using embedding evaluation benchmarks such as SimLex-999. In a weight-tied model, because the tied embedding’s parameter updates at each training iteration are very similar to the updates of the output embedding of the untied model, the tied embedding performs similarly to the output embedding of the untied model. So in the tied model, we use a single high-quality embedding matrix in two places in the model. This contributes to the improved performance of the tied model⁶.

To summarize, this post presented how to improve a very simple feedforward neural network language model, by first adding an RNN, and then adding variational dropout and weight tying to it.

In recent months, we’ve seen further improvements to the state of the art in RNN language modeling. The current state-of-the-art results are held by two recent papers by Melis et al. and Merity et al. These models make use of most, if not all, of the methods shown above, and extend them by using better optimization techniques, new regularization methods, and by finding better hyperparameters for existing models.

¹ This model is the skip-gram word2vec model presented in Efficient Estimation of Word Representations in Vector Space.
² For a detailed explanation of this, watch Edward Grefenstette’s Beyond Seq2Seq with Augmented RNNs lecture.
³ This model is the small model presented in Recurrent Neural Network Regularization.
⁴ This is the large model from Recurrent Neural Network Regularization.
⁵ In parallel to our work, an explanation for weight tying based on Distilling the Knowledge in a Neural Network was presented in Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling.
⁶ Our paper explains this in detail.