Technology

Build a Llama-Style MoE Model From Scratch (Part 2)

Learn to train a language model from scratch. This guide covers the AdamW optimizer, cross-entropy loss, and a PyTorch training loop for text generation.
Ning Si Ai
9 min read
Tags: train language model, pytorch training loop, text generation, AdamW optimizer


In the previous article, we built the architecture for our character-level language model. We assembled the core Transformer components, including RoPE, RMSNorm, and a Mixture of Experts (MoE) layer, creating a model ready for inference.

Now, it's time to train our language model.

This guide provides a step-by-step walkthrough of the entire process. We will cover how to configure the optimizer and loss function, implement a complete PyTorch training loop, and use the trained model for text generation. By the end, you'll understand how to take a static model architecture and teach it to generate coherent text from a prompt. This is the pivotal step in learning how to train a language model from scratch.

Configuring Language Model Training Essentials

Before we can begin training, we need two critical components: an optimizer to update the model's parameters and a loss function to measure its performance.

Choosing an Optimizer: Why AdamW?

An optimizer intelligently updates a model's trainable parameters based on the gradients calculated during backpropagation, systematically improving performance. For this project, we'll use the AdamW optimizer, a robust and popular choice that consistently shows excellent results when you train a Transformer model.

To configure the optimizer for our language model, we first need to identify all trainable parameters—any weight tensor flagged with requires_grad=True. This includes:

  • The Token Embedding layer weights
  • The Self-Attention module weights
  • The RMSNorm layer weights
  • The MoE Gating Network weights
  • All Expert Network weights
  • The final Output Linear Layer weights

We gather these parameters and pass them to the AdamW optimizer, which will manage the update process.

import torch

# Collect all trainable parameters
params = [p for p in model.parameters() if p.requires_grad]

# Define the AdamW optimizer (learning_rate comes from your training configuration)
optimizer = torch.optim.AdamW(params, lr=learning_rate)

With this code, all 43 trainable tensors in our model, representing approximately 2.24 million parameters, are now managed by the AdamW optimizer. This scale is ideal for a research model—complex enough to feature a complete Transformer architecture but manageable enough for experimentation on limited hardware.
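
If you want to verify these counts on your own build, a quick sanity check over the collected params list prints both numbers:

# Optional sanity check: count trainable tensors and total parameters
num_tensors = len(params)
num_params = sum(p.numel() for p in params)
print(f'{num_tensors} trainable tensors, {num_params:,} parameters in total')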

Measuring Performance with Cross-Entropy Loss

To train a neural network, we need a loss function to quantify its performance. The loss function measures the gap between the model's predictions and the ground truth, providing a single number that represents the model's error.

Since our goal is to predict the next character from a fixed vocabulary, this is a multi-class classification problem. The industry-standard solution is Cross-Entropy Loss.

PyTorch's nn.CrossEntropyLoss requires two inputs:

  • Model Predictions (Logits): The raw, unnormalized scores for each token, with a shape of (batch_size * seq_len, vocab_size).
  • Targets: The correct next-character token IDs, with a shape of (batch_size * seq_len).

The function conveniently handles the softmax conversion internally before calculating the loss. This gives us a single scalar value representing the model's error for that batch.

Here’s how we define it:

import torch.nn as nn

# Define the loss function
criterion = nn.CrossEntropyLoss()
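
To see that the softmax really is handled internally, here is a small standalone check (dummy tensors, not our model): an explicit log-softmax followed by negative log-likelihood gives the same value as nn.CrossEntropyLoss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy logits and targets: 4 positions, vocabulary of 10 tokens
logits = torch.randn(4, 10)            # shape (N, C)
targets = torch.randint(0, 10, (4,))   # shape (N,)

ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, manual))      # True: cross-entropy == log-softmax + NLL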

During training, the optimizer's goal is to minimize this cross-entropy loss. As the loss decreases, the model becomes progressively better at predicting the next token, effectively learning the underlying patterns of the language.

Implementing the PyTorch Training Loop


With the optimizer and loss function configured, we can implement the core PyTorch training loop. This loop is the engine that drives model learning, iterating through the dataset to perform a forward pass, backpropagation, and parameter updates.

A standard language model training loop consists of five key steps, repeated for each batch of data over multiple epochs:

  1. Forward Pass: The model processes a batch of input data (inputs) to generate predictions (outputs).
  2. Loss Calculation: The Cross-Entropy Loss function compares the model's predictions to the correct targets to compute a single loss value.
  3. Zero Gradients: optimizer.zero_grad() is called to clear gradients from the previous step, preventing accumulation.
  4. Backward Pass (Backpropagation): loss.backward() calculates the gradient of the loss with respect to every trainable model parameter.
  5. Parameter Update: optimizer.step() updates the model's weights using the gradients calculated during the backward pass.
Putting these five steps together, the loop looks like this:

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        # 1. Forward pass
        outputs = model(inputs)
        
        # 2. Calculate loss
        # Reshape for CrossEntropyLoss, which expects (N, C) and (N)
        # outputs: (batch_size, seq_len, vocab_size) -> (batch_size * seq_len, vocab_size)
        # targets: (batch_size, seq_len) -> (batch_size * seq_len)
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
        
        # 3. Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # Print the loss for the last batch of the epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

As this loop runs, you'll see the model's loss decrease with each epoch, indicating that it is successfully learning.

Epoch [1/10], Loss: 2.8734
Epoch [2/10], Loss: 1.5421
...
Epoch [10/10], Loss: 0.0658

The falling loss confirms that our architectural choices—including MoE, RMSNorm, and RoPE—are working together to create a stable model with good convergence properties.

Generating Text with the Trained Language Model

With our model trained, it's time to see what it has learned. The text generation process involves saving the model's state and then using it to perform a text continuation task: we provide a starting prompt and let it generate the rest, character by character.

Step 1: Saving the Trained Model State

Training is computationally expensive, so it's crucial to save your progress. Saving the model's learned parameters (its state_dict) allows you to load it later for inference without retraining.

# Save the model's state dictionary
torch.save(model.state_dict(), 'moe_char_model.pth')
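
When you come back for inference, the saved weights can be restored into a freshly constructed model. A minimal sketch, assuming the same model class and configuration from Part 1 (the constructor call below is a placeholder, not the exact name used in the article):

# Rebuild the architecture, then load the saved weights
# (MoECharModel(config) is a placeholder for however you instantiate the Part 1 model)
model = MoECharModel(config)
model.load_state_dict(torch.load('moe_char_model.pth'))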

Step 2: The Autoregressive Generation Loop

Our approach to text generation is autoregressive, meaning the model predicts one token at a time. At each step, the model uses the previously generated text as context to predict the next most likely character.

The autoregressive generation loop follows these steps:

  1. Get Context: Take the most recent sequence of characters as input.
  2. Model Prediction: Feed the context into the model to get a probability distribution over the vocabulary.
  3. Sample Token: Sample a character from the distribution to add randomness and creativity.
  4. Update Sequence: Append the new character to the sequence and repeat.
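
Before running the loop, we need a prompt encoded as token IDs. Here is a minimal setup sketch: char_to_int is assumed to be the inverse of the int_to_char map from data preparation, max_seq_len is the model's context length from Part 1, and the prompt and num_generate values are just examples.

import torch
import torch.nn.functional as F

# Encode the starting prompt into a (1, prompt_len) tensor of token IDs
# (char_to_int is the inverse of int_to_char from data preparation -- an assumption here)
prompt = "The "
generated_sequence = torch.tensor([[char_to_int[c] for c in prompt]], dtype=torch.long)
num_generate = 200  # example: number of characters to generate

With the prompt encoded, the generation loop itself looks like this: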
# Switch the model to evaluation mode
model.eval()

# Use a torch.no_grad() block to disable gradient calculations
with torch.no_grad():
    for _ in range(num_generate):
        # 1. Get the context (crop to max_seq_len)
        context = generated_sequence[:, -max_seq_len:]
        
        # 2. Get the model's predictions (logits)
        logits = model(context)
        
        # 3. We only care about the prediction for the very last token
        last_logits = logits[:, -1, :]
        
        # 4. Apply softmax to convert logits to probabilities
        probs = F.softmax(last_logits, dim=-1)
        
        # 5. Sample one token from the probability distribution
        next_token = torch.multinomial(probs, num_samples=1)
        
        # 6. Append the sampled token to the sequence
        generated_sequence = torch.cat((generated_sequence, next_token), dim=1)

By wrapping the loop in with torch.no_grad(), we tell PyTorch not to track gradients, which significantly speeds up inference and reduces memory usage.

Step 3: Decoding Token IDs to Text

After the generation loop finishes, generated_sequence is a tensor of token IDs. To make this human-readable, we use the int_to_char map created during data preparation to decode each token ID back to its corresponding character.

# Convert the tensor of token IDs back into a string
generated_text = ''.join([int_to_char[i] for i in generated_sequence[0].tolist()])
print(generated_text)

Analyzing the Text Generation Results


How did our model perform? The results are impressive for its scale. It produces coherent English sentences that mimic the style and vocabulary of the training data, generates plausible words, uses punctuation correctly, and shows a basic sense of sentence flow. This suggests that the Mixture of Experts architecture is learning linguistic patterns effectively.

While this model excels at mimicry rather than unconstrained creativity, our primary goal was to build and validate a working Transformer with an MoE architecture, and in that, we have succeeded.

Conclusion: From Training Loop to Text Generation

In this project, we built and trained a complete LLaMA-style character-level language model from the ground up. We covered the entire pipeline:

  • Model Architecture: We constructed an efficient Transformer with modern components like RMSNorm, RoPE, and a Mixture of Experts (MoE) mechanism.
  • Data Preparation: We used character-level tokenization and a sliding window approach to create input-target pairs for our autoregressive task.
  • Model Training: We implemented a full PyTorch training loop, including a forward pass, loss calculation, backpropagation, and parameter updates.
  • Text Generation: We used an autoregressive generation method to produce new text, demonstrating the model's learned capabilities.

By following these steps, you've gained a deep understanding of the training and inference pipeline for a modern Transformer, providing a solid foundation for more advanced AI projects.

Key Takeaways

• Use the AdamW optimizer over all trainable parameters for effective language model training.
• Use cross-entropy loss to measure next-token prediction error during training.
• Follow a structured PyTorch training loop, then generate text autoregressively with the trained model.


