How to Add Special Tokens to LLMs Safely

Learn how to add special tokens to LLMs during fine-tuning without causing catastrophic forgetting. Our guide covers smart initialization and PEFT/LoRA.
Bao Bao Suan Fa Bi Ji
6 min read
#add special tokens to LLM · #LLM fine-tuning · #catastrophic forgetting · #PEFT



How to Add Special Tokens to LLMs Without Catastrophic Forgetting

Adding special tokens to a Large Language Model (LLM) during Supervised Fine-Tuning (SFT) is a common technique for structuring conversations with tokens like <|user|> or for custom tasks. While it seems simple, this process can lead to a serious issue known as catastrophic forgetting, degrading your model's performance.

The problem is that the new tokens are unknown to the pre-trained LLM. Their vectors in the model's embedding matrix are randomly initialized, and the output layer (LM Head) has no trained weights for them, so the logits it produces for them are meaningless. Introducing these "noisy" tokens during LLM fine-tuning can destabilize the model and erase its pre-trained knowledge.

This guide explains how to add special tokens to LLMs safely. We'll cover the right way to initialize new token embeddings and use Parameter-Efficient Fine-Tuning (PEFT) to preserve your model's core capabilities.

The Risk of Catastrophic Forgetting When Adding Tokens

When you add a new special token without a proper strategy, you risk catastrophic forgetting by introducing instability at three key points in the model architecture.

1. Randomly Initialized Token Embeddings

When you call model.resize_token_embeddings(), the rows allocated for new tokens are, by default in most transformers versions, randomly initialized (recent releases expose a mean_resizing option, but you should not rely on library defaults). These random vectors are essentially noise compared to the model's existing embeddings, which have been trained on trillions of tokens to capture rich semantic meaning. The LLM has no basis for interpreting this new, random information.
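As a quick illustration, here is a minimal sketch that resizes the embedding matrix and inspects the appended rows. The "gpt2" checkpoint is purely an illustrative small model, and exactly how the new rows are filled depends on your transformers version:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

old_vocab_size = model.get_input_embeddings().weight.shape[0]
tokenizer.add_special_tokens({"additional_special_tokens": ["<|user|>"]})
model.resize_token_embeddings(len(tokenizer))

emb = model.get_input_embeddings().weight.data
# Trained rows carry learned structure; the appended rows do not.
# (Older transformers versions draw them randomly; newer ones may
# mean-initialize via the mean_resizing flag -- either way, the new
# rows carry no task-specific signal yet.)
print("trained rows, mean norm:", emb[:old_vocab_size].norm(dim=-1).mean().item())
print("new rows, mean norm:   ", emb[old_vocab_size:].norm(dim=-1).mean().item())
```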

2. Unstable LM Head Logits

The same issue affects the model's output layer, the LM Head, which predicts the next token. A new, randomly initialized weight row is added for your special token, so the logit it produces is meaningless at first. The model is essentially guessing whenever it tries to generate this token, leading to poor output quality.

3. Large and Unstable Gradients

During the initial stages of SFT, the model generates a large loss signal when it encounters these unlearned tokens. This triggers large, unstable gradients that propagate backward through the network, disrupting the model's carefully pre-trained weights and erasing its general knowledge.

How to Add Special Tokens to LLMs: A Step-by-Step Guide

To safely add special tokens during LLM fine-tuning and avoid catastrophic forgetting, follow these steps.

Step 1: Update the Tokenizer and Model Vocabulary

First, you must make both the tokenizer and the model aware of the new tokens.

  • Update the Tokenizer: Use the tokenizer's built-in methods to add your new special tokens. This ensures it can correctly encode them.
  • Resize Model Embeddings: Call model.resize_token_embeddings() to expand the model's vocabulary size and allocate space in the embedding matrix for the new tokens. Both steps are sketched below.
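A minimal sketch of Step 1 using the transformers library; the checkpoint name is an illustrative choice, so substitute your own base model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative Llama-style checkpoint; replace with your base model.
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Register the new special tokens with the tokenizer.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|user|>", "<|assistant|>"]}
)

# 2. Grow the embedding matrix (and tied LM head, if any) to the new vocab size.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")
```

The later snippets in this guide reuse model, tokenizer, and num_added from this step.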

Step 2: Smartly Initialize New Token Embeddings

This is the most critical step to prevent performance degradation. Never use the default random initialization. Instead, choose one of these methods to initialize new token embeddings.

Method 1: Initialize with the Mean of Existing Embeddings

This is the most robust approach. By initializing the new token's vector with the average of all existing token embeddings, you give it a neutral starting point. This places the new token at the "center of gravity" of the model's semantic space, reducing initial instability.
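Continuing from the Step 1 snippet (reusing model and num_added), a sketch of mean initialization for both the input embeddings and the LM head:

```python
import torch

with torch.no_grad():
    input_emb = model.get_input_embeddings().weight
    output_emb = model.get_output_embeddings().weight  # same tensor if weights are tied

    # Average only the pre-trained rows (everything before the appended rows),
    # then write that mean into each new row.
    input_emb[-num_added:] = input_emb[:-num_added].mean(dim=0, keepdim=True)
    output_emb[-num_added:] = output_emb[:-num_added].mean(dim=0, keepdim=True)
```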

Method 2: Initialize with Semantically Similar Tokens

If your new token has a clear meaning, you can initialize its embedding using the vector(s) of similar existing tokens. For example, to initialize <|user|>, you could average the embeddings for "user," "User," and "human." This anchors the new token in a relevant semantic area from the start.
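A sketch of this approach for <|user|>, again reusing model and tokenizer from Step 1; the anchor words are illustrative choices:

```python
import torch

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    new_id = tokenizer.convert_tokens_to_ids("<|user|>")

    # Collect the ids of semantically related tokens (each word may split
    # into multiple sub-word pieces, depending on the tokenizer).
    anchor_ids = []
    for word in ["user", "User", "human"]:
        anchor_ids.extend(tokenizer.encode(word, add_special_tokens=False))

    # Anchor the new token at the centroid of its semantic neighbors.
    # Mirror the same initialization in the LM head if weights are not tied.
    emb[new_id] = emb[anchor_ids].mean(dim=0)
```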

Step 3: Use Parameter-Efficient Fine-Tuning (PEFT)

How you train the model after initialization is just as important. PEFT methods like LoRA are the best defense against catastrophic forgetting.

LoRA (Low-Rank Adaptation) works by freezing the original pre-trained LLM weights and adding small, trainable "adapter" matrices to certain layers. This focuses the training effort only on learning the new task and the function of your special tokens, without overwriting the model's core knowledge.

Pro Tip: When using LoRA, leverage the modules_to_save parameter in your configuration. This lets you specify that the embedding layer and LM Head should be fully trained, not just adapted. This is crucial because the new token embeddings and their corresponding logits must be learned from scratch. This gives you the best of both worlds: targeted training for new tokens and strong protection for the base model.
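A sketch of such a configuration with the peft library; the target module names match Llama-style architectures and will differ for other model families:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                    # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Fully train (not just adapt) the layers that host the new tokens.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Note that modules_to_save creates full trainable copies of those layers, which are saved alongside the LoRA adapter weights.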

Alternative: Staged Full Fine-Tuning

If you cannot use PEFT, a staged or "warm-up" approach makes full fine-tuning less risky, though it remains a less optimal alternative to LoRA.

  1. Stage 1 (Warm-up): Freeze all model layers except for the embed_tokens and lm_head. Train for a few steps on data containing your new tokens. This allows the new embeddings to stabilize.
  2. Stage 2 (Full SFT): Unfreeze all layers and proceed with your complete Supervised Fine-Tuning process. This is more complex and computationally expensive than using LoRA. The freezing logic for both stages is sketched below.
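A sketch of the freezing logic, reusing model from the earlier snippets; the warm-up training loop itself is elided:

```python
# Stage 1 (warm-up): freeze everything except the embeddings and LM head.
for param in model.parameters():
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True
for param in model.get_output_embeddings().parameters():
    param.requires_grad = True

# ... run a short training loop here on data containing the new tokens ...

# Stage 2 (full SFT): unfreeze all layers and continue training.
for param in model.parameters():
    param.requires_grad = True
```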

Checklist for Adding Special Tokens to LLMs

Here is a quick checklist for adding special tokens during SFT:

  1. Add Tokens: Use tokenizer.add_special_tokens to update the tokenizer.
  2. Resize Model: Call model.resize_token_embeddings to expand the model's vocabulary.
  3. Initialize Smartly: Initialize new token embeddings with the mean of existing embeddings for stability.
  4. Use PEFT: Fine-tune using a PEFT method like LoRA to prevent catastrophic forgetting. Use modules_to_save for the embedding and LM head layers.
  5. Use High-Quality Data: Ensure your SFT dataset correctly and consistently uses the new special tokens. The model learns their function from your examples.
  6. Evaluate Thoroughly: After fine-tuning, test your model on general benchmarks (e.g., MMLU, GLUE) to ensure its core capabilities haven't regressed, in addition to testing your specific task.

By following these strategies, you can enhance your LLM for specific tasks while preserving the powerful, foundational knowledge that makes it so capable. This methodical approach ensures your LLM fine-tuning efforts build upon, rather than undermine, the model's core competence.

Key Takeaways

• Use smart initialization techniques to prevent catastrophic forgetting in LLMs.
• Implement PEFT/LoRA methods during fine-tuning for better token integration.
• Regularly evaluate model performance to detect and mitigate potential forgetting issues.
