Streamlining GPT-2 Model Training on a GPU VM with Ansible

June 13, 2023
Rss Fetcher

Just about two months ago, I became quite enamored with language models. It started with a GPT 3.5 Plus account and then rolled back to GPT2 so I could test my hand at working with language models. And I quickly ran into hurdles.

I have quite a powerful home lab server with 32 blazing-fast AMD Threadripper cores running at 4 GHz, 256GB DDR4 Ram, 27TB of usable ZFS storage, and a single 6GB RTX 1660 GPU. And this is where the problem started.

If you know anything about processing language models with PyTorch, then you already know what I am talking about. Even the simplest model can take hours to process on 24 CPU cores and the 6GB of Ram on my Nvidia 1660 GPU is enough to process very small batches. Though the GPU processes the data 10 times faster than even 24 CPU cores, you must process everything in very small batches while loading the previous batch as the training model. It is slow and time-consuming, human time-wise, and processor-wise. Then I started researching cloud GPUs. They had the power I needed, but at $6hr for a VM with 4 RTX6000, the bills can rack up quickly.

I generally knew I could automate this process with Ansible, but a lot of preparation work must go on before you can run the model. So, I came up with this Ansible playbook that I am now sharing with the world.

I have tested this playbook at least a dozen times, and it has worked in tests from a few minutes up to a few hours for a small dataset with 200 epochs.

This playbook uses the Hugging Face run_clm.py script to process the data. This playbook assumes you have a clean dataset formatted for Hugging Face Transformers. I have a Python script to clean my data, but it is specific to my data set. If you would like help cleaning your data get ahold of me through social media or the Contact page for this site.

Instead of one of my long and winding posts explaining the deep whys of this project, we will just jump straight into this one. First is a list of everything this playbook does.

Disclaimer

But there is a disclaimer first. This is only a playbook to train a language model with gpt2 using the Hugging Face Transformers library and the tools that go with it. This script will not start up or shut down your VM. It is up to you to get the VM started and shut it down when your job is complete. There are Ansible plugins to automatically create VMs and shut them down, and feel free to add those options to this playbook. I will not as the liability is too high if someone forgets to shut down their VM after running this.

My personal version uses the Linode API to do this automatically.

The Quick Why

When you process a language model, it takes tons of GPU memory and is quite slow on CPUs. But, once the model is trained, it takes much fewer resources to query the language model. While working on and testing this, I processed language models that took nearly all 26GB of memory across all 4 GPU cores, a total of 104GB, to process, but once I downloaded them back to the local workstation, the CPUs were not too bad at querying the model and the 6GB GPU has been able to handle it also. So the point is to use the expensive GPUs to build the model and then test the model on your local workstation.

And finally, this was designed to work with Ubuntu 22.04. It may work with other Debian distributions but will not work on any distros that do not use apt-get. You can probably modify this to work with, yum, dnf, or other package managers.

Notes on how this works

This playbook can work one of two ways. You can do everything as root, and then it will use the /root folder as the remote_work_dir, but you will have to update the playbook to set /root for everything remote except the areas where it uses /tmp.

Otherwise, like the route I use, you set up a user on your local workstation and then use a stack script, another Ansible script, or any other way you see fit to create a user on the remote machine. In most of my testing, I use gpuuser. Because of how Ansible works it is better to have an ssh key between the user, gpuuser in this case, and the server with the password disabled for that user so ssh key auth, and sudo can work without a password. As I mentioned, I use Linode, so here is the bash script that automatically runs on the first bootup of the VM when I set up a Linode for this work.

#!/bin/bash
USERNAME="gpuuser"
PASSWORD="yourpassword"
PUBLIC_KEY="Your public key, the contents of id_rsa.pub on most linux systems"

# Update the system
apt-get update -y && apt-get upgrade -y

# Create a new user
useradd -m -s /bin/bash "${USERNAME}"
echo "${USERNAME}:${PASSWORD}" | chpasswd

# Add the new user to the sudo group
# After the user is added we remove the password for this user
# As we will be using ssh keys only
usermod -aG sudo "${USERNAME}"
passwd -d "${USERNAME}"

# Set up the SSH directory and authorized_keys file for the new user
# This will fail if the /root user does not have a set of authorized_keys.
#This is just for backup in case for some event the user can not authenticate
#On Linode these root ssh keys are set by selecting the proper set of 
#ssh keys during install of the VM, either GUI or API
mkdir -p "/home/${USERNAME}/.ssh"
echo "${PUBLIC_KEY}" > "/home/${USERNAME}/.ssh/authorized_keys"
chown -R "${USERNAME}:${USERNAME}" "/home/${USERNAME}/.ssh"
chmod 700 "/home/${USERNAME}/.ssh"
chmod 600 "/home/${USERNAME}/.ssh/authorized_keys"
chmod 700 /home/$USERNAME/.ssh
chmod 600 /home/$USERNAME/.ssh/authorized_keys
chown -R $USERNAME:$USERNAME /home/$USERNAME/.ssh

To The Playbook

- name: Set up and train GPT-2 model on GPU VM
  hosts: gpu_vm
  become: yes

  vars:
    training_data_local_path:  /path/to/your/cleaned/data/happy.txt
    training_data_remote_path: /tmp/preprocessed_data.txt
    output_dir: /tmp/gpt2_trained_model
    remote_work_dir: /path/to/your/workfolder
    log_file: /tmp/training_output.log
    timestamp: "{{ ansible_date_time.iso8601_basic_short }}"
    local_output_dir: "/path/to/your/workfolder/trained_output_identifier"
    num_epochs: 10
    train_batch_size: 10
    eval_batch_size: 10

  tasks:
    - name: Update packages
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name:
          - python3
          - python3-pip
          - git
          - nvidia-driver-510
          - nvidia-cuda-toolkit
      register: apt_result

    - name: Reboot host after installing NVIDIA driver
      reboot:
        post_reboot_delay: 30
      when: apt_result.changed
    
    - name: Wait for host to come back up
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        delay: 10
        timeout: 300

    - name: Install Python libraries
      pip:
        name:
          - torch
          - torchvision
          - transformers
          - datasets
        state: present
        executable: pip3

    - name: Clone Hugging Face Transformers repository
      git:
        repo: 'https://github.com/huggingface/transformers.git'
        dest: '{{ remote_work_dir }}/transformers'
        force: yes

    - name: Install Transformers library from the repository
      pip:
        name: '{{ remote_work_dir }}/transformers'
        state: present
        executable: pip3

    - name: Install Transformers library in editable mode
      pip:
        name: '{{ remote_work_dir }}/transformers'
        editable: yes
        state: present
        executable: pip3

    - name: Install Transformers library dependencies
      pip:
        requirements: '{{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/requirements.txt'
        state: present
        executable: pip3

    - name: Upload preprocessed training data to VM
      synchronize:
        src: "{{ training_data_local_path }}"
        dest: "{{ training_data_remote_path }}"
        mode: push

    - name: Check if GPU is present
      shell: nvidia-smi -L
      register: gpu_check
      ignore_errors: true

    - name: Train model with GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs {{ num_epochs }}
        --per_device_train_batch_size {{ train_batch_size }}
        --per_device_eval_batch_size {{ eval_batch_size }}
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch 
        --output_dir "{{ output_dir }}" > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 1
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc == 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: gpu_train
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Train model without GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs 1
        --per_device_train_batch_size 1
        --per_device_eval_batch_size 1
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch 
        --output_dir "{{ output_dir }}" > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 24
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc != 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: cpu
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Download the trained model to the local machine
      synchronize:
        src: "{{ output_dir }}"
        dest: "{{ local_output_dir }}"
        mode: pull

    - name: Clean up remote files
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - "{{ training_data_remote_path }}"
        - "{{ output_dir }}"

The inventory file I have been using for this is quite simple since we define the variables in the actual playbook. Make sure to change 127.0.0.1 for the IP address of the VM you built to run this playbook on.

[gpu_vm]
  127.0.0.1
 [gpu_vm:vars]
 ansible_become=yes

And that is it. This playbook should build the entire system to train the language model in about 5 minutes and then it’s off to the races until everything is done and the trained model is downloaded back to your workstation. Don’t forget to shut down the VM and you should be good to go.

Have a wonderful day.

Streamlining GPT-2 Model Training on a GPU VM with Ansible was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.

Disclaimer

The Quick Why

Notes on how this works

To The Playbook

Previous Post

Next Post

Solutions

Regions Covered