Disclaimer: I used ChatGPT's assistance in composing this article because English is not my native language, and I wanted to make it genuinely enjoyable to read. The first time I tried to write it, the result was like a roller coaster, talking about everything and nothing at the same time. Ha, ha.
In the fast-paced startup world, it's not uncommon for a small developer team to handle an entire project. I experienced this firsthand as a full-stack developer at a transport consultancy, where I became the sole caretaker of a critical project. With my team members having moved on to other ventures, the responsibility for tackling the project's challenges fell squarely on my shoulders.
Our project focused on managing transport system transactions and processing the incoming data to produce essential reports and information. Because we dealt with a constant influx of new transactions, a reliable and efficient database solution was essential.
Amidst the daily hustle of managing the system, a game-changing client request landed on our desks: migrate our colossal 3 TB PostgreSQL transactional database from AWS to Azure. The motive behind this move was crystal clear: consolidate our entire system within the Azure ecosystem, streamlining operations and unlocking growth potential.
However, the clock was ticking, and we faced a daunting challenge. The client set a tight deadline, granting us three weeks to execute the migration. The urgency added an extra layer of pressure, as we knew success hinged on meticulous planning, flawless execution, and a touch of innovation.
In this article, I'll share the experiences and insights I gained during the migration. While I am not a PostgreSQL expert, my journey as a lone developer navigating this complex task was filled with challenges, significant discoveries, and the development of practices that I believe will be of great value to others facing similar situations. Migrating such a sizable database demanded careful planning, strategic decision-making, and a deep understanding of the intricacies involved.
Preparing for the Migration
The preparation phase proved to be one of the most intricate stages of the migration. The project's history of unsuccessful attempts to migrate the database with pg_dump, back when it was a 2 TB database, had left me uncertain about how to proceed. Although I had a general understanding of the database's size and had developed features and complex queries for it, the sheer magnitude of over 200 tables made the task far more complex than I initially imagined.
To tackle this challenge, the first step was to map the entire database comprehensively. I meticulously documented critical columns, indexes, relationships, constraints, triggers and the tables affected by them, and the essential sequences. This mapping gave me a clear picture of what I was up against, enabling me to analyze and prepare for the migration in the best possible way.
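To give a sense of what that mapping looked like in practice, here is a minimal sketch of the kind of catalog queries involved. The $SOURCE connection string is a placeholder, and the queries are illustrative rather than an exhaustive inventory of what I documented.
# Illustrative catalog queries for mapping a database before migration.
# $SOURCE is a placeholder connection string for the source database.
psql "$SOURCE" <<'SQL'
-- Tables with their approximate live row counts
SELECT relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;

-- Foreign key constraints and the tables they connect
SELECT conname, conrelid::regclass AS table_name, confrelid::regclass AS referenced_table
FROM pg_constraint
WHERE contype = 'f';

-- User-defined triggers and the tables they fire on
SELECT tgname, tgrelid::regclass AS table_name
FROM pg_trigger
WHERE NOT tgisinternal;

-- Sequences defined in the database
SELECT schemaname, sequencename FROM pg_sequences;
SQL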
This mapping process took me almost a week to complete. During that time, one crucial concern loomed over us: the continuous influx of new transactions kept the main database growing incessantly.
Traditional backup and restore methods could lead to desynchronization of data, and halting production during migration was not an option due to the critical nature of our services.
Being unfamiliar with incremental backups and facing time constraints, I discussed the situation with the technical leader, and together we devised an alternative approach. The plan involved redirecting new incoming transactions to another database while I focused on migrating the existing 3 TB database. This meant we would eventually have to migrate both the 3 TB database and the new database receiving the incoming transactions.
Though complex, this alternative approach promised to save time and effort, since the much smaller second database could be migrated separately later. Despite the challenges, we completed the migration successfully, ensuring minimal disruption to ongoing operations. The experience taught me valuable lessons in handling large-scale database migration projects efficiently and responsibly.
Choosing the Right Migration Approach
After analyzing the database, I found myself weighing different migration approaches. The first option was pg_dump/pg_restore, despite our previous unsuccessful attempts with this method. Another approach involved Pentaho, a data integration tool known for its versatility in handling large datasets. Lastly, my technical leader suggested a manual pagination method using OFFSET/LIMIT, which caught my attention during a demo.
Let me walk you through these three migration approaches, their respective outcomes, and the challenges we encountered. Despite the initial appeal of each method, we eventually settled on one approach that proved to be the most suitable for our unique situation.
Pentaho
The project’s history of unsuccessful attempts with pg_dump/pg_restore made the Pentaho approach the first option to try. To determine if Pentaho was a suitable candidate for the migration, I conducted a simple demo, attempting to migrate the information from a massive table with approximately 700 million records and around 40 columns.
After setting up the environment during the day, I left the migration running overnight to assess its performance.
The following morning, upon checking the migration's progress, I was disheartened to discover the IO:DataFileRead wait event occurring. This was concerning because it indicated that the migration was struggling to fetch the necessary data from disk. Even after more than 12 hours, only around 250 million records had been migrated, and I was shocked by the lackluster results. I began to analyze the reasons behind this issue.
At that moment, my thoughts centered on the table's size and its high number of columns. It appeared that, with the default 8 kB page size, the database had to fetch an enormous number of pages, potentially in random order, causing data fetching to stall. The database had to retrieve thousands of pages in this manner and hold them in RAM before returning results, significantly slowing down the migration process.
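For reference, the page (block) size a server was built with can be confirmed directly; 8192 bytes (8 kB) is the standard default. The $SOURCE connection string is again a placeholder.
# Confirm the page (block) size of the source server; 8192 bytes (8 kB) is the default.
psql "$SOURCE" -c "SHOW block_size;"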
I attempted multiple demos with similar tables, hoping to find a way to make the Pentaho approach work, but unfortunately, it did not yield the desired results. As the clock continued ticking, time was becoming a precious commodity.
I dedicated four days to this approach, but despite my efforts, it became evident that Pentaho was not the right fit for our migration requirements. With the deadline looming and the need to explore other options, I decided to move on to the next approach. Adapting quickly and making informed decisions became paramount as a solo developer handling this critical project.
Manual Pagination Using OFFSET/LIMIT
While the Pentaho demo was running, I used some of the downtime to explore other possibilities. My technical leader had suggested trying the manual pagination method, although I was skeptical about its chances of success. During this time, I stumbled upon something intriguing.
I created a simple Node.js project to fetch the data from the 700-million-record table using pagination, stepping through it with OFFSET and LIMIT in batches of 100,000 records. Surprisingly, when the query reached around 220 million records, it suddenly struggled and got stuck, triggering the dreaded IO:DataFileRead wait event. Perplexed by this sudden slowdown, I decided to conduct a thorough analysis to understand the root cause.
During my investigation, I learned something interesting about OFFSET. I ran the EXPLAIN ANALYZE command on the query to gain deeper insight into how PostgreSQL executes it. This exploration helped me better understand the database's behavior and the challenges it faces when handling such large datasets.
Query example
EXPLAIN ANALYZE SELECT * FROM table_name OFFSET 200000 LIMIT 100000;
Using OFFSET doesn't cheaply skip records; PostgreSQL still reads all the rows before the offset and only then discards them. This explains why the query kept slowing down: it wasn't fetching just the expected 100,000 records but scanning past roughly 220 million rows at that point, only to throw most of them away. It was an intriguing observation, and it confirmed my initial suspicions.
As we explored alternative options, we considered using the COPY command to back up the data in a paginated manner. However, an unfortunate obstacle presented itself: the tables' primary keys contained UUID values. The UUIDs provided no natural sequential order to page over, making it impractical to use the COPY command with range filters for the migration.
Query example
COPY (SELECT * FROM your_table WHERE id BETWEEN 1 AND 100) TO '/table-backup.csv' CSV HEADER;
This discovery led us to a critical decision point, requiring us to reevaluate our approach and search for a method that could work effectively despite the constraints imposed by the UUID primary keys. As a solo developer, finding a suitable solution amidst these challenges demanded a blend of creativity and technical expertise.
pg_dump/pg_restore
After an entire week of working on two different approaches, the last approach I tried was using pg_dump and pg_restore. During this time, I took the initiative to research these commands to better understand how to use them effectively for optimal performance.
However, a significant challenge was dealing with Azure Database for PostgreSQL Flexible Server: we didn't have superuser access. This limitation posed a problem because the ideal strategy would have been to disable triggers during the migration to speed up the restore, and that wasn't possible without superuser privileges. The setback forced me to look for alternative ways to carry out the migration efficiently.
Throughout the process, I kept the client updated about the ongoing developments as we had already surpassed the initial deadline, making them aware that the task was complex. To address the challenges, the client even hired a PostgreSQL expert to assist us, but after an additional two weeks of collaborative efforts, we still hadn’t achieved a successful migration. Consequently, the client extended the deadline to accommodate the complexities we encountered.
After nearly two weeks of intensive research and extensive experimentation with multiple pg_dump and pg_restore strategies, I crafted a comprehensive plan to expedite the migration of both the 3 TB database and the smaller one.
- Using no compression for pg_dump (-Z 0): By specifying the -Z 0 parameter during the backup phase, I disabled compression, prioritizing speed over disk space. This accelerated the backup process, even though it required additional disk space to store the dump.
- Employing jobs for concurrent backup and restoration (-j <num>): To further improve performance and reduce migration time, I used the -j <num> parameter to run backups and restores in parallel across multiple tables. By setting <num> to the desired number of parallel jobs, I executed multiple tasks simultaneously, significantly increasing the efficiency of the migration.
- Disabling triggers during restoration (--disable-triggers): To maintain data consistency and prevent conflicts arising from trigger executions, I disabled triggers during the restoration phase using the --disable-triggers parameter. This precaution helped avoid integrity issues during data restoration. (A sketch of the resulting commands follows this list.)
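As a rough sketch, the resulting commands looked something like the following. The hosts, database, user, table name, and job counts are placeholders, and I assume the directory output format (-Fd), since pg_dump only supports parallel jobs in that format; the restore is data-only because the schema had already been created on the target.
# Sketch only: hosts, database, user, table name, and job counts are placeholders.
# Directory format (-Fd) is required for pg_dump to run parallel jobs (-j).
pg_dump -h source-host -U app_user -d transactions_db \
  -Fd -Z 0 -j 4 \
  -t big_table \
  -f /backups/big_table_dump

# Data-only restore (-a), since the schema was created on the target beforehand.
pg_restore -h target-host.postgres.database.azure.com -U app_user -d transactions_db \
  -a -j 4 --disable-triggers \
  /backups/big_table_dump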
My step-by-step plan was as follows (a sketch of the constraint and index handling appears after the list):
- Back up the database schema and migrate the sequences, while creating the necessary database roles on the new server.
- Drop foreign key constraints to reduce unnecessary I/O and improve migration speed.
- Remove large indexes to further reduce I/O during the data load.
- Migrate the data table by table or in groups of related tables. Because of the lack of superuser permissions, I had to carefully choose which foreign key relationships to drop to keep the migration smooth.
- After migrating the information, restore the constraints, keeping the relationships between tables in mind.
- Finally, rebuild the large indexes concurrently after the data had been migrated successfully.
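To make steps 2, 3, 5, and 6 more concrete, here is a minimal sketch using psql against the target database. The constraint, index, table, and column names are hypothetical, used purely for illustration.
# Hypothetical names throughout; a sketch of steps 2, 3, 5, and 6 of the plan.
TARGET="host=target-host dbname=transactions_db user=app_user"

# Steps 2 and 3: drop foreign keys and large indexes on the target before loading data.
psql "$TARGET" -c "ALTER TABLE transactions DROP CONSTRAINT IF EXISTS transactions_account_id_fkey;"
psql "$TARGET" -c "DROP INDEX IF EXISTS idx_transactions_created_at;"

# (Step 4: restore the data table by table with pg_restore, as sketched earlier.)

# Step 5: restore the foreign keys once the related tables are loaded.
psql "$TARGET" -c "ALTER TABLE transactions
  ADD CONSTRAINT transactions_account_id_fkey
  FOREIGN KEY (account_id) REFERENCES accounts (id);"

# Step 6: rebuild the large indexes without blocking writes.
psql "$TARGET" -c "CREATE INDEX CONCURRENTLY idx_transactions_created_at ON transactions (created_at);"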
Once I had finalized the migration plan, I reviewed it with my technical leader and began setting up the environment for the actual migration. This plan allowed me to work around the limitations imposed by the lack of superuser permissions.
Setting Up the Migration Environment
To set up the migration environment, we decided to use a Virtual Machine (VM) to take the backups and perform the restoration into the final database. The VM was created in the same region as the destination database. To ensure secure and efficient communication between Azure resources, we configured the necessary permissions to allow the VM to connect to the database over an Azure Virtual Network (VNet).
Using an Azure VNet for communication between resources is advantageous, as it offers better performance and security than communicating over the public internet.
Here are the steps taken to prepare the VM for the migration:
- Creating the VM: We created a Virtual Machine in the same Azure region as the final database. This ensures proximity, which can enhance data transfer speeds.
- Installing PostgreSQL: Once the VM was set up, we installed PostgreSQL on it. The VM served as our migration server, where we stored the database backup files and ran the restoration.
- Configuring Permissions and Networking: To establish a secure and efficient connection between the VM and the destination database, we configured the necessary permissions and networking settings. By utilizing Azure VNET, we can ensure that communication is restricted to the virtual network, reducing exposure to the public internet and improving overall security.
- Testing Connections: With PostgreSQL installed and networking set up, we ran connection tests to verify that the VM could successfully reach both the source and destination databases (a quick check of the kind sketched after this list). This step is crucial to ensure the migration can proceed smoothly without connectivity issues.
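As a quick sanity check of that connectivity, something like the following is enough; the host names, user, and database are placeholders.
# Basic connectivity checks from the VM; hosts, user, and database are placeholders.
pg_isready -h source-host -p 5432
pg_isready -h target-host.postgres.database.azure.com -p 5432

# Confirm authentication and TLS end to end with a trivial query.
psql "host=target-host.postgres.database.azure.com dbname=postgres user=app_user sslmode=require" \
  -c "SELECT version();"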
At that point, the VM was ready to serve as the migration server, and we were prepared to proceed with the database migration: take backups from the source database onto the VM and then restore the data into the final database.
Executing the Migration
On the day of the 3 TB migration, I prepared meticulously by following the established plan. Before executing any backup or restore operations, I fine-tuned the server configuration parameters (the postgresql.conf settings) to optimize performance and avoid potential issues, such as autovacuum kicking in mid-restore. I followed the parameter recommendations from Azure's resources, ensuring adherence to best practices when using pg_dump and pg_restore.
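I won't reproduce the exact values here, but on Azure Database for PostgreSQL Flexible Server these settings are exposed as server parameters rather than a postgresql.conf file you edit by hand, so adjusting them looks roughly like this. The resource group, server name, parameter choices, and values below are placeholders, not the ones used in the actual migration.
# Illustrative only: resource group, server name, parameters, and values are placeholders.
# These are settings commonly relaxed for a one-off bulk restore.
az postgres flexible-server parameter set \
  --resource-group my-rg --server-name my-target-server \
  --name autovacuum --value off

az postgres flexible-server parameter set \
  --resource-group my-rg --server-name my-target-server \
  --name maintenance_work_mem --value 1048576   # this parameter is measured in kB

az postgres flexible-server parameter set \
  --resource-group my-rg --server-name my-target-server \
  --name max_wal_size --value 16384             # this parameter is measured in MB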
Once the configuration was optimized, I created backup and restore scripts, streamlining the process and eliminating the need for manual input. I double-checked the scripts to ensure their accuracy before initiating any commands.
With everything in place, I informed the team and client about the migration process commencement and began executing each step outlined in the plan. Throughout the migration, I closely monitored the progress, ensuring each phase was executed as intended. Regularly communicating the status and updates to the team, I kept everyone informed about the migration’s progress.
By meticulously preparing and following the plan, I successfully managed the migration process, ensuring a smooth and efficient transition for the 3 TB database. The combination of an optimized configuration, automated scripts, and constant communication contributed to the migration's successful completion.
We informed the client that incoming transactions would be queued while the smaller database was being migrated. During that window, we managed the incoming transactions carefully and ensured that no data was lost or compromised. Once the smaller database had been migrated, we unified the information and resumed transaction processing in the new environment.
With all the information successfully migrated and transaction reception resumed, the system operated seamlessly in the new environment. We conducted thorough testing and validation to guarantee data integrity and consistent performance.
The client was pleased with the results, and the strategic handling of the migration process allowed us to minimize downtime and avoid any significant disruptions to their operations.
Monitoring and Data Integrity
During the migration process, ensuring data integrity and monitoring the progress are critical aspects. To achieve this, I employed three monitoring methods:
- Network Monitoring: I monitored the Virtual Machine (VM) and database server using network monitoring tools to observe the data flow between them. By closely monitoring the network activity, I could ensure that the information was being migrated as expected and identify any potential network-related issues that could impact the migration.
- PgAdmin Dashboard: The PgAdmin Dashboard provided real-time insights into the progress of the pg_restore process. By monitoring the dashboard, I could track the execution of restoration tasks, identify any wait events, and observe the overall health of the database during the migration.
- Database Metadata Queries: I frequently used various queries to check the database metadata, which allowed me to ensure that the information was migrated successfully and verify data integrity. Some of the most important queries I used included:
- Live and Dead Tuples: I monitored the number of live and dead tuples in the tables to gauge how much space was utilized and how well the data was distributed. This helped identify potential data distribution and space utilization issues during the migration.
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_all_tables
WHERE n_live_tup > 0 OR n_dead_tup > 0;
-- TO SEARCH FOR A SPECIFIC TABLE
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_all_tables
WHERE relname = 'table_name';
- Table Size: I regularly checked the size of the tables to track the progress of data migration. This allowed me to ensure that the tables were being populated with data as expected and helped identify any discrepancies that could indicate migration issues.
SELECT
  relname AS "relation",
  pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
  AND C.relkind <> 'i'
  AND nspname !~ '^pg_toast'
  AND relname = 'table_name'
ORDER BY pg_total_relation_size(C.oid) DESC;
- Database Size: Monitoring the overall size of the database.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;
By using these monitoring methods and running essential queries, I could confidently ensure the success of the migration process, maintain data integrity, and address any issues promptly during the migration.
Post-Migration Optimizations
After the migration, I fine-tuned the PostgreSQL server settings to achieve better performance. By default, Azure Database for PostgreSQL adjusts these parameters automatically based on the server's resources, but I tuned them further to better fit our workload.
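If you want to verify what is actually in effect after this kind of tuning, the live values can be read from pg_settings; the parameter names below are just common examples, not necessarily the ones I changed, and $TARGET is a placeholder connection string.
# Read the effective values of a few commonly reviewed settings (names are examples).
psql "$TARGET" -c "SELECT name, setting, unit, source
                   FROM pg_settings
                   WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
                                  'max_wal_size', 'autovacuum');"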
Personal Thoughts and Feelings
At the beginning, I have to admit, I saw the migration as an impossible task. In the past, even with the help of the team's senior developer and software architect, we had struggled to back up the database. When the client made the request, I knew it would fall on my shoulders to handle it, and I didn't have anyone else to turn to for help since I was the only one left from the original team.
To be honest, even the other teams and my technical leaders were scared of the request. And I felt the same way too. But then I remembered some advice from a really good content creator named midudev. He said that if no one else wants to touch a project, not even with a stick, it might be an opportunity to challenge yourself, learn a lot, and become a better software engineer.
So, with that mindset, I decided to embrace the challenge in a positive way. I knew I wasn’t a PostgreSQL expert at all, but I had some experience migrating smaller databases of 5 GB or 20 GB. However, handling a massive 3 TB transactional database was a whole different story.
Despite my initial fear, I took the leap and led the migration. I have to say, this challenge brought out the best in me. I’m happy to say I successfully completed the migration and even found ways to optimize certain aspects of the database.
Throughout the process, I learned so much. The migration really helped me understand the fundamentals of databases and how they work behind the scenes. It was an incredibly enriching experience for me as a professional.
Looking back, if I faced the same situation again, I would approach it just like before. The sense of accomplishment and the knowledge gained were truly worth it.
Now that I know more about PostgreSQL, I would definitely consider using other migration tools like Barman or pgcopydb for incremental backups and more advanced migration tasks. Manual migration can be prone to mistakes, and these tools can provide more automation and reliability.
I’ve decided to write this article to share my entire migration experience. I believe it could be helpful to someone else who finds themselves in a similar situation. Sharing my journey and the lessons I’ve learned might offer valuable insights and encouragement to others facing challenging migrations.
Conclusion
In conclusion, migrating a colossal 3 TB database was a challenging and complex task. It required in-depth knowledge of database fundamentals and involved a process of trial and error. Despite the difficulties, I learned valuable lessons and gained a deeper understanding of PostgreSQL.
Monitoring the migration progress was crucial for maintaining data integrity and performance. The experience pushed me to excel and optimize the database successfully.
I am eager to explore other migration tools for future projects. Writing this article has been a reflective journey, and I hope it can inspire and assist others facing similar challenges. Overall, the migration experience has strengthened my skills as a software engineer and reinforced my dedication to embracing technical challenges.