
About two months ago, I became quite enamored with language models. It started with a GPT-3.5 Plus account, and then I rolled back to GPT-2 so I could try my hand at working with language models myself. I quickly ran into hurdles.

I have quite a powerful home lab server: 32 blazing-fast AMD Threadripper cores running at 4 GHz, 256GB of DDR4 RAM, 27TB of usable ZFS storage, and a single Nvidia GTX 1660 GPU with 6GB of memory. And this is where the problem started.

If you know anything about training language models with PyTorch, then you already know what I am talking about. Even the simplest model can take hours to train on 24 CPU cores, and the 6GB of RAM on my Nvidia 1660 GPU is only enough to process very small batches. Though the GPU churns through the data about 10 times faster than even 24 CPU cores, you must split everything into very small batches, reloading the previously trained checkpoint as you go. It is slow and time-consuming, both in human time and in processor time. Then I started researching cloud GPUs. They had the power I needed, but at $6/hr for a VM with 4 RTX 6000 cards, the bills can rack up quickly.

I knew I could automate this process with Ansible, but a lot of preparation work has to happen before you can run the model. So I came up with this Ansible playbook, which I am now sharing with the world.

I have tested this playbook at least a dozen times, and it has worked in runs ranging from a few minutes up to a few hours, including a small dataset trained for 200 epochs.

This playbook uses the Hugging Face run_clm.py script to train the model. It assumes you have a clean dataset formatted for Hugging Face Transformers. I have a Python script to clean my data, but it is specific to my dataset. If you would like help cleaning your data, get hold of me through social media or the Contact page for this site.
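
As a rough illustration, here is a minimal cleaning sketch in Python. It assumes your raw material is a folder of plain .txt files (I am calling it raw/ here) and that you simply want one normalized text file, matching the happy.txt name used in the playbook variables, that run_clm.py can consume with --train_file. Your own data will almost certainly need different rules.

#!/usr/bin/env python3
# Minimal dataset-cleaning sketch (hypothetical paths and rules).
# Concatenates every .txt file under raw/ into one plain-text training file.
from pathlib import Path

RAW_DIR = Path("raw")         # folder holding your source .txt files
OUT_FILE = Path("happy.txt")  # matches training_data_local_path in the playbook

with OUT_FILE.open("w", encoding="utf-8") as out:
    for txt in sorted(RAW_DIR.glob("*.txt")):
        for line in txt.read_text(encoding="utf-8", errors="ignore").splitlines():
            line = " ".join(line.split())  # collapse runs of whitespace
            if line:                       # drop empty lines
                out.write(line + "\n")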

Instead of one of my long and winding posts explaining the deep whys of this project, we will just jump straight into this one. First is a list of everything this playbook does.

Disclaimer

But first, a disclaimer. This is only a playbook to train a language model with GPT-2 using the Hugging Face Transformers library and the tools that go with it. It will not start up or shut down your VM. It is up to you to get the VM started and to shut it down when your job is complete. There are Ansible plugins that can create and destroy VMs automatically, and feel free to add those options to this playbook. I will not, as the liability is too high if someone forgets to shut down their VM after running this.

My personal version uses the Linode API to do this automatically.

The Quick Why

When you train a language model, it takes tons of GPU memory and is quite slow on CPUs. But once the model is trained, it takes far fewer resources to query it. While working on and testing this, I trained models that needed nearly all 26GB of memory on each of the four GPUs, 104GB in total, but once I downloaded them back to the local workstation, the CPUs were not too bad at querying the model, and the 6GB GPU has been able to handle it as well. So the point is to use the expensive GPUs to build the model and then test the model on your local workstation.
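
For example, querying the downloaded model locally can be as simple as the following Python sketch. The model path is just wherever the trained model folder ended up after the playbook pulls it down (I am using the trained_output_identifier name from the vars); the prompt and generation settings are placeholders.

# Minimal local-inference sketch using the Transformers pipeline API.
# Assumes the trained model folder was pulled down to ./trained_output_identifier.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="./trained_output_identifier",  # wherever your local_output_dir points
    device=-1,                            # -1 = CPU; use 0 for the first GPU
)

print(generator("Once upon a time", max_new_tokens=50)[0]["generated_text"])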

And finally, this was designed to work with Ubuntu 22.04. It may work with other Debian-based distributions, but it will not work on any distros that do not use apt-get. You can probably modify it to work with yum, dnf, or other package managers.

Notes on how this works

This playbook can work one of two ways. You can do everything as root, in which case it will use the /root folder as the remote_work_dir, but you will have to update the playbook to point everything remote at /root except the places where it uses /tmp.

Otherwise, and this is the route I use, you set up a user on your local workstation and then use a StackScript, another Ansible script, or any other way you see fit to create a matching user on the remote machine. In most of my testing, I use gpuuser. Because of how Ansible works, it is better to have an SSH key between that user (gpuuser in this case) and the server, with the password disabled for the user, so that SSH key authentication and passwordless sudo both work. As I mentioned, I use Linode, so here is the bash script that runs automatically on the first boot of the VM when I set up a Linode for this work.

#!/bin/bash
USERNAME="gpuuser"
PASSWORD="yourpassword"
PUBLIC_KEY="Your public key, the contents of id_rsa.pub on most linux systems"

# Update the system
apt-get update -y && apt-get upgrade -y

# Create a new user
useradd -m -s /bin/bash "${USERNAME}"
echo "${USERNAME}:${PASSWORD}" | chpasswd

# Add the new user to the sudo group
# After the user is added we remove the password for this user
# As we will be using ssh keys only
usermod -aG sudo "${USERNAME}"
passwd -d "${USERNAME}"

# Set up the SSH directory and authorized_keys file for the new user.
# Root's own authorized_keys are left in place as a backup in case this
# user cannot authenticate for some reason; on Linode those root SSH keys
# are set by selecting the proper keys when the VM is created, either
# through the GUI or the API.
mkdir -p "/home/${USERNAME}/.ssh"
echo "${PUBLIC_KEY}" > "/home/${USERNAME}/.ssh/authorized_keys"
chown -R "${USERNAME}:${USERNAME}" "/home/${USERNAME}/.ssh"
chmod 700 "/home/${USERNAME}/.ssh"
chmod 600 "/home/${USERNAME}/.ssh/authorized_keys"

To The Playbook

- name: Set up and train GPT-2 model on GPU VM
  hosts: gpu_vm
  become: yes

  vars:
    training_data_local_path: /path/to/your/cleaned/data/happy.txt
    training_data_remote_path: /tmp/preprocessed_data.txt
    output_dir: /tmp/gpt2_trained_model
    remote_work_dir: /path/to/your/workfolder
    log_file: /tmp/training_output.log
    timestamp: "{{ ansible_date_time.iso8601_basic_short }}"
    local_output_dir: "/path/to/your/workfolder/trained_output_identifier"
    num_epochs: 10
    train_batch_size: 10
    eval_batch_size: 10

  tasks:
    - name: Update packages
      apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install required packages
      apt:
        name:
          - python3
          - python3-pip
          - git
          - nvidia-driver-510
          - nvidia-cuda-toolkit
      register: apt_result

    - name: Reboot host after installing NVIDIA driver
      reboot:
        post_reboot_delay: 30
      when: apt_result.changed

    - name: Wait for host to come back up
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        delay: 10
        timeout: 300

    - name: Install Python libraries
      pip:
        name:
          - torch
          - torchvision
          - transformers
          - datasets
        state: present
        executable: pip3

    - name: Clone Hugging Face Transformers repository
      git:
        repo: 'https://github.com/huggingface/transformers.git'
        dest: '{{ remote_work_dir }}/transformers'
        force: yes

    - name: Install Transformers library from the repository
      pip:
        name: '{{ remote_work_dir }}/transformers'
        state: present
        executable: pip3

    - name: Install Transformers library in editable mode
      pip:
        name: '{{ remote_work_dir }}/transformers'
        editable: yes
        state: present
        executable: pip3

    - name: Install Transformers library dependencies
      pip:
        requirements: '{{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/requirements.txt'
        state: present
        executable: pip3

    - name: Upload preprocessed training data to VM
      synchronize:
        src: "{{ training_data_local_path }}"
        dest: "{{ training_data_remote_path }}"
        mode: push

    - name: Check if GPU is present
      shell: nvidia-smi -L
      register: gpu_check
      ignore_errors: true

    - name: Train model with GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs {{ num_epochs }}
        --per_device_train_batch_size {{ train_batch_size }}
        --per_device_eval_batch_size {{ eval_batch_size }}
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch
        --output_dir "{{ output_dir }}" > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 1
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc == 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: gpu_train
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Train model without GPU
      shell: >
        python3 {{ remote_work_dir }}/transformers/examples/pytorch/language-modeling/run_clm.py
        --model_name_or_path gpt2
        --train_file "{{ training_data_remote_path }}"
        --do_train
        --num_train_epochs 1
        --per_device_train_batch_size 1
        --per_device_eval_batch_size 1
        --save_steps 10000
        --save_total_limit 2
        --evaluation_strategy epoch
        --output_dir "{{ output_dir }}" > /tmp/training_output.log 2>&1
      args:
        creates: "{{ output_dir }}"
      environment:
        OMP_NUM_THREADS: 24
        PYTHONUNBUFFERED: 1
      when: gpu_check.rc != 0
      async: 0
      poll: 0
      ignore_errors: yes
      register: cpu
      delegate_to: "{{ inventory_hostname }}"
      run_once: true

    - name: Download the trained model to the local machine
      synchronize:
        src: "{{ output_dir }}"
        dest: "{{ local_output_dir }}"
        mode: pull

    - name: Clean up remote files
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - "{{ training_data_remote_path }}"
        - "{{ output_dir }}"

The inventory file I have been using for this is quite simple, since we define the variables in the playbook itself. Make sure to change 127.0.0.1 to the IP address of the VM you built to run this playbook against.

[gpu_vm]
127.0.0.1
[gpu_vm:vars]
ansible_become=yes
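
Assuming you save the playbook as gpt2-train.yml and the inventory as inventory.ini (both names are just my placeholders), kicking everything off is a single command run as the remote user you created:

ansible-playbook -i inventory.ini gpt2-train.yml -u gpuuser

While it runs, you can SSH into the VM and follow the training output with tail -f /tmp/training_output.log, since the playbook redirects the run_clm.py output there.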

And that is it. This playbook should set up the entire system for training the language model in about 5 minutes, and then it’s off to the races until training finishes and the trained model is downloaded back to your workstation. Don’t forget to shut down the VM, and you should be good to go.

Have a wonderful day.


Streamlining GPT-2 Model Training on a GPU VM with Ansible was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.
