Technology

Build a Llama-Style MoE Model From Scratch (Part 2)

Learn to train a language model from scratch. This guide covers the AdamW optimizer, cross-entropy loss, and a PyTorch training loop for text generation.
Ning Si Ai
9 min read
Tags: train language model, pytorch training loop, text generation, AdamW optimizer


In the previous article, we built the architecture for our character-level language model. We assembled the core Transformer components, including RoPE, RMSNorm, and a Mixture of Experts (MoE) layer, creating a model ready for inference.

Now, it's time to train our language model.

This guide provides a step-by-step walkthrough of the entire process. We will cover how to configure the optimizer and loss function, implement a complete PyTorch training loop, and use the trained model for text generation. By the end, you'll understand how to take a static model architecture and teach it to generate coherent text from a prompt. This is the pivotal step in learning how to train a language model from scratch.

Configuring Language Model Training Essentials

Before we can begin training, we need two critical components: an optimizer to update the model's parameters and a loss function to measure its performance.

Choosing an Optimizer: Why AdamW?

An optimizer intelligently updates a model's trainable parameters based on the gradients calculated during backpropagation, systematically improving performance. For this project, we'll use the AdamW optimizer, a robust and popular choice that consistently shows excellent results when you train a Transformer model.

To configure the optimizer for our language model, we first need to identify all trainable parameters—any weight tensor flagged with requires_grad=True. This includes:

  • The Token Embedding layer weights
  • The Self-Attention module weights
  • The RMSNorm layer weights
  • The MoE Gating Network weights
  • All Expert Network weights
  • The final Output Linear Layer weights

We gather these parameters and pass them to the AdamW optimizer, which will manage the update process.

import torch

# Collect all trainable parameters
params = [p for p in model.parameters() if p.requires_grad]

# Define the AdamW optimizer (learning_rate comes from your training configuration)
optimizer = torch.optim.AdamW(params, lr=learning_rate)

With this code, all 43 trainable tensors in our model, representing approximately 2.24 million parameters, are now managed by the AdamW optimizer. This scale is ideal for a research model—complex enough to feature a complete Transformer architecture but manageable enough for experimentation on limited hardware.
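
If you want to verify these counts on your own build, a quick sanity check over the collected params list prints both numbers:

# Optional sanity check: count trainable tensors and total parameters
num_tensors = len(params)
num_params = sum(p.numel() for p in params)
print(f'{num_tensors} trainable tensors, {num_params:,} parameters in total')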

Measuring Performance with Cross-Entropy Loss

To train a neural network, we need a loss function to quantify its performance. The loss function measures the gap between the model's predictions and the ground truth, providing a single number that represents the model's error.

Since our goal is to predict the next character from a fixed vocabulary, this is a multi-class classification problem. The industry-standard solution is Cross-Entropy Loss.

PyTorch's nn.CrossEntropyLoss requires two inputs:

  • Model Predictions (Logits): The raw, unnormalized scores for each token, with a shape of (batch_size * seq_len, vocab_size).
  • Targets: The correct next-character token IDs, with a shape of (batch_size * seq_len).

The function conveniently handles the softmax conversion internally before calculating the loss. This gives us a single scalar value representing the model's error for that batch.

Here’s how we define it:

import torch.nn as nn

# Define the loss function
criterion = nn.CrossEntropyLoss()
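
To see that the softmax really is handled internally, here is a small standalone check (dummy tensors, not our model): an explicit log-softmax followed by negative log-likelihood gives the same value as nn.CrossEntropyLoss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Dummy logits and targets: 4 positions, vocabulary of 10 tokens
logits = torch.randn(4, 10)            # shape (N, C)
targets = torch.randint(0, 10, (4,))   # shape (N,)

ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, manual))      # True: cross-entropy == log-softmax + NLL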

During training, the optimizer's goal is to minimize this cross-entropy loss. As the loss decreases, the model becomes progressively better at predicting the next token, effectively learning the underlying patterns of the language.

Implementing the PyTorch Training Loop


With the optimizer and loss function configured, we can implement the core PyTorch training loop. This loop is the engine that drives model learning, iterating through the dataset to perform a forward pass, backpropagation, and parameter updates.

A standard language model training loop consists of five key steps, repeated for each batch of data over multiple epochs:

  1. Forward Pass: The model processes a batch of input data (inputs) to generate predictions (outputs).
  2. Loss Calculation: The Cross-Entropy Loss function compares the model's predictions to the correct targets to compute a single loss value.
  3. Zero Gradients: optimizer.zero_grad() is called to clear gradients from the previous step, preventing accumulation.
  4. Backward Pass (Backpropagation): loss.backward() calculates the gradient of the loss with respect to every trainable model parameter.
  5. Parameter Update: optimizer.step() updates the model's weights using the gradients calculated during the backward pass.
Putting these five steps together, the loop looks like this:

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        # 1. Forward pass
        outputs = model(inputs)
        
        # 2. Calculate loss
        # Reshape for CrossEntropyLoss, which expects (N, C) and (N)
        # outputs: (batch_size, seq_len, vocab_size) -> (batch_size * seq_len, vocab_size)
        # targets: (batch_size, seq_len) -> (batch_size * seq_len)
        loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
        
        # 3. Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # Print the loss for the last batch of the epoch
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

As this loop runs, you'll see the model's loss decrease with each epoch, indicating that it is successfully learning.

Epoch [1/10], Loss: 2.8734
Epoch [2/10], Loss: 1.5421
...
Epoch [10/10], Loss: 0.0658

The falling loss confirms that our architectural choices—including MoE, RMSNorm, and RoPE—are working together to create a stable model with good convergence properties.

Generating Text with the Trained Language Model

With our model trained, it's time to see what it has learned. The text generation process involves saving the model's state and then using it to perform a text continuation task: we provide a starting prompt and let it generate the rest, character by character.

Step 1: Saving the Trained Model State

Training is computationally expensive, so it's crucial to save your progress. Saving the model's learned parameters (its state_dict) allows you to load it later for inference without retraining.

# Save the model's state dictionary
torch.save(model.state_dict(), 'moe_char_model.pth')
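
When you come back for inference, the saved weights can be restored into a freshly constructed model. A minimal sketch, assuming the same model class and configuration from Part 1 (the constructor call below is a placeholder, not the exact name used in the article):

# Rebuild the architecture, then load the saved weights
# (MoECharModel(config) is a placeholder for however you instantiate the Part 1 model)
model = MoECharModel(config)
model.load_state_dict(torch.load('moe_char_model.pth'))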

Step 2: The Autoregressive Generation Loop

Our approach to text generation is autoregressive, meaning the model predicts one token at a time. At each step, the model uses the previously generated text as context to predict the next most likely character.

The autoregressive generation loop follows these steps:

  1. Get Context: Take the most recent sequence of characters as input.
  2. Model Prediction: Feed the context into the model to get a probability distribution over the vocabulary.
  3. Sample Token: Sample a character from the distribution to add randomness and creativity.
  4. Update Sequence: Append the new character to the sequence and repeat.
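
Before running the loop, we need a prompt encoded as token IDs. Here is a minimal setup sketch: char_to_int is assumed to be the inverse of the int_to_char map from data preparation, max_seq_len is the model's context length from Part 1, and the prompt and num_generate values are just examples.

import torch
import torch.nn.functional as F

# Encode the starting prompt into a (1, prompt_len) tensor of token IDs
# (char_to_int is the inverse of int_to_char from data preparation -- an assumption here)
prompt = "The "
generated_sequence = torch.tensor([[char_to_int[c] for c in prompt]], dtype=torch.long)
num_generate = 200  # example: number of characters to generate

With the prompt encoded, the generation loop itself looks like this: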
# Switch the model to evaluation mode
model.eval()

# Use a torch.no_grad() block to disable gradient calculations
with torch.no_grad():
    for _ in range(num_generate):
        # 1. Get the context (crop to max_seq_len)
        context = generated_sequence[:, -max_seq_len:]
        
        # 2. Get the model's predictions (logits)
        logits = model(context)
        
        # 3. We only care about the prediction for the very last token
        last_logits = logits[:, -1, :]
        
        # 4. Apply softmax to convert logits to probabilities
        probs = F.softmax(last_logits, dim=-1)
        
        # 5. Sample one token from the probability distribution
        next_token = torch.multinomial(probs, num_samples=1)
        
        # 6. Append the sampled token to the sequence
        generated_sequence = torch.cat((generated_sequence, next_token), dim=1)

By wrapping the loop in with torch.no_grad(), we tell PyTorch not to track gradients, which significantly speeds up inference and reduces memory usage.

Step 3: Decoding Token IDs to Text

After the generation loop finishes, generated_sequence is a tensor of token IDs. To make this human-readable, we use the int_to_char map created during data preparation to decode each token ID back to its corresponding character.

# Convert the tensor of token IDs back into a string
generated_text = ''.join([int_to_char[i] for i in generated_sequence[0].tolist()])
print(generated_text)

Analyzing the Text Generation Results


How did our model perform? The results are impressive for its scale. It produces coherent English sentences that mimic the style and vocabulary of the training data, generates plausible words, uses punctuation correctly, and shows a basic sense of sentence flow. This suggests that the Mixture of Experts architecture is learning linguistic patterns effectively.

While this model excels at mimicry rather than unconstrained creativity, our primary goal was to build and validate a working Transformer with an MoE architecture, and in that, we have succeeded.

Conclusion: From Training Loop to Text Generation

In this project, we built and trained a complete LLaMA-style character-level language model from the ground up. We covered the entire pipeline:

  • Model Architecture: We constructed an efficient Transformer with modern components like RMSNorm, RoPE, and a Mixture of Experts (MoE) mechanism.
  • Data Preparation: We used character-level tokenization and a sliding window approach to create input-target pairs for our autoregressive task.
  • Model Training: We implemented a full PyTorch training loop, including a forward pass, loss calculation, backpropagation, and parameter updates.
  • Text Generation: We used an autoregressive generation method to produce new text, demonstrating the model's learned capabilities.

By following these steps, you've gained a deep understanding of the training and inference pipeline for a modern Transformer, providing a solid foundation for more advanced AI projects.

Key Takeaways

• Use the AdamW optimizer over all trainable parameters for effective language model training.
• Use cross-entropy loss to measure next-token prediction error during training.
• Follow a structured PyTorch training loop, then generate text autoregressively with the trained model.


