Tens of thousands of real-world messages: cleaned and prepared using Python for all your creative ideas
Introduction
I like to experiment with APIs and build neat stuff. I have a lot of ideas for projects relating to iMessage data. Unfortunately, accessing this data is not well documented, the database schema is confusing, and the data is inconsistent.
The iMessage data requires extra decoding and doesn’t contain Contact details, and the Address Book database (at least for me) had a lot of inconsistencies with phone numbers that require cleaning.
There must be a way to join the tables, but I can’t seem to find a solution, and even then, I need to do some post-processing. If you know how to join these tables, show me!
This guide will show how I access, merge, and clean iMessage and Contact data. We’ll produce a clean set of data with 150,000+ messages.
Table of Contents
Project Ideas
Enable Full Disk Access for Terminal.App
Extracting iMessage Data
Extracting Address Book Data
Cleaning Address Book Data
Merge iMessage + Address Book Data
Get Creative
Completed Code
Project Ideas
- Message analytics dashboard: (send/reply ratio, favorite contacts, graph conversation engagement over time)
- Sentiment Analysis: Feed the conversation data into ML APIs like IBM Watson or Google NLP APIs
- iMessage + ChatGPT chatbot: Check out my article and try it for yourself!
- Business Applications: Use iMessage to boost deliverability to iPhone users
Enable Full Disk Access for Terminal.App
iMessages are stored in a local SQLite chat.db Ensure you have full Disk access enabled for Terminal.App or the query will fail to run. To do this on MacOS Ventura (similar to earlier).
Open System Settings > Privacy and Security > Full Disk Access > Terminal.App
Extracting iMessage Data
iMessage data is stored in a local iMessage Database. Below we’ll locate the database and review the code to query recent messages
Locate the iMessage Database: Open a terminal window and try the following commands. The database should be in the same location for the most recent versions of MacOS. Replace userName with your userName.
cd /Users/userName/LibraryMessages
ls
You should see something like this. Ensure that chat.db isn’t present. If you do not see this database, open the Messages app and ensure you have iCloud Sync turned on.
Jot down the path of chat.db. We’ll use this later.
/Users/kellygold/LibraryMessages/chat.db
Query the iMessageDatabase
The iMessage database can be accessed via SQLite3. Below are a couple of queries you can use to explore the database. We’ll use this simple query to grab some recent messages. You can use the free version of TablePlus to view and check out the database or run sqlite3 from the terminal.
SELECT message.ROWID, message.date, message.text, message.attributedBody, handle.id, message.is_from_me, message.cache_roomnames
FROM message
LEFT JOIN handle ON message.handle_id = handle.ROWID
ORDER BY message.date DESC LIMIT 100;
This produces the following:
Notice that for the attributedBody field, some of the data is encoded. Luckily, there is some Python code that will decode this for us. I’ll spare you the details. The code below connects to the DB, queries the DB, decodes the data, converts to human-readable dates (skip this if you want the epoch timestamp with nanoseconds for accuracy… replacing limit with n.
import sqlite3
import datetime
import json
# Function to read messages from a sqlite database
def read_messages(db_location, n, self_number='Me', human_readable_date=True):
# Connect to the database and execute a query to join message and handle tables
conn = sqlite3.connect(db_location)
cursor = conn.cursor()
query = """
SELECT message.ROWID, message.date, message.text, message.attributedBody, handle.id, message.is_from_me, message.cache_roomnames
FROM message
LEFT JOIN handle ON message.handle_id = handle.ROWID
"""
if n is not None:
query += f" ORDER BY message.date DESC LIMIT {n}"
results = cursor.execute(query).fetchall()
# Initialize an empty list for messages
messages = []
# Loop through each result row and unpack variables
for result in results:
rowid, date, text, attributed_body, handle_id, is_from_me, cache_roomname = result
# Use self_number or handle_id as phone_number depending on whether it's a self-message or not
phone_number = self_number if handle_id is None else handle_id
# Use text or attributed_body as body depending on whether it's a plain text or rich media message
if text is not None:
body = text
elif attributed_body is None:
continue
else:
# Decode and extract relevant information from attributed_body using string methods
attributed_body = attributed_body.decode('utf-8', errors='replace')
if "NSNumber" in str(attributed_body):
attributed_body = str(attributed_body).split("NSNumber")[0]
if "NSString" in attributed_body:
attributed_body = str(attributed_body).split("NSString")[1]
if "NSDictionary" in attributed_body:
attributed_body = str(attributed_body).split("NSDictionary")[0]
attributed_body = attributed_body[6:-12]
body = attributed_body
# Convert date from Apple epoch time to standard format using datetime module if human_readable_date is True
if human_readable_date:
date_string = '2001-01-01'
mod_date = datetime.datetime.strptime(date_string, '%Y-%m-%d')
unix_timestamp = int(mod_date.timestamp())*1000000000
new_date = int((date+unix_timestamp)/1000000000)
date = datetime.datetime.fromtimestamp(new_date).strftime("%Y-%m-%d %H:%M:%S")
messages.append(
{"rowid": rowid, "date": date, "body": body, "phone_number": phone_number, "is_from_me": is_from_me,
"cache_roomname": cache_roomname, 'group_chat_name' : mapped_name})
conn.close()
return messages
We’ll also use another query to retrieve the iMessage group chat mappings/names, not much to see here, but this code links an ID to the Display name.
Here’s the sqlite query:
SELECT room_name, display_name FROM chat;
And here’s the Python code to do the same:
def get_chat_mapping(db_location):
conn = sqlite3.connect(db_location)
cursor = conn.cursor()
cursor.execute("SELECT room_name, display_name FROM chat")
result_set = cursor.fetchall()
mapping = {room_name: display_name for room_name, display_name in result_set}
conn.close()
return mapping
Here’s the Python file so far. You should be able to run this and have a JSON response with messages.
Note: Remove the last two lines after you test it’s working.
import sqlite3
import datetime
import subprocess
import os
import json
def get_chat_mapping(db_location):
conn = sqlite3.connect(db_location)
cursor = conn.cursor()
cursor.execute("SELECT room_name, display_name FROM chat")
result_set = cursor.fetchall()
mapping = {room_name: display_name for room_name, display_name in result_set}
conn.close()
return mapping
# Function to read messages from a sqlite database
def read_messages(db_location, n, self_number='Me', human_readable_date=True):
# Connect to the database and execute a query to join message and handle tables
conn = sqlite3.connect(db_location)
cursor = conn.cursor()
query = """
SELECT message.ROWID, message.date, message.text, message.attributedBody, handle.id, message.is_from_me, message.cache_roomnames
FROM message
LEFT JOIN handle ON message.handle_id = handle.ROWID
"""
if n is not None:
query += f" ORDER BY message.date DESC LIMIT {n}"
results = cursor.execute(query).fetchall()
# Initialize an empty list for messages
messages = []
# Loop through each result row and unpack variables
for result in results:
rowid, date, text, attributed_body, handle_id, is_from_me, cache_roomname = result
# Use self_number or handle_id as phone_number depending on whether it's a self-message or not
phone_number = self_number if handle_id is None else handle_id
# Use text or attributed_body as body depending on whether it's a plain text or rich media message
if text is not None:
body = text
elif attributed_body is None:
continue
else:
# Decode and extract relevant information from attributed_body using string methods
attributed_body = attributed_body.decode('utf-8', errors='replace')
if "NSNumber" in str(attributed_body):
attributed_body = str(attributed_body).split("NSNumber")[0]
if "NSString" in attributed_body:
attributed_body = str(attributed_body).split("NSString")[1]
if "NSDictionary" in attributed_body:
attributed_body = str(attributed_body).split("NSDictionary")[0]
attributed_body = attributed_body[6:-12]
body = attributed_body
# Convert date from Apple epoch time to standard format using datetime module if human_readable_date is True
if human_readable_date:
date_string = '2001-01-01'
mod_date = datetime.datetime.strptime(date_string, '%Y-%m-%d')
unix_timestamp = int(mod_date.timestamp())*1000000000
new_date = int((date+unix_timestamp)/1000000000)
date = datetime.datetime.fromtimestamp(new_date).strftime("%Y-%m-%d %H:%M:%S")
mapping = get_chat_mapping(db_location) # Get chat mapping from database location
try:
mapped_name = mapping[cache_roomname]
except:
mapped_name = None
messages.append(
{"rowid": rowid, "date": date, "body": body, "phone_number": phone_number, "is_from_me": is_from_me,
"cache_roomname": cache_roomname, 'group_chat_name' : mapped_name})
conn.close()
return messages
def print_messages(messages):
print(json.dumps(messages))
# ask the user for the location of the database
db_location = input("Enter the location of the database: ")
# ask the user for the number of messages to read
n = input("Enter the number of messages to read: ")
# Remove the 2 lines below after testing -- they are for testing only
output = read_messages(db_location, n)
print_messages(output)
# Remove the 2 lines above after testing -- they are for testing only
Extracting Address Book Data
Locate the Address Book Database. The file path should be something like the following the image below.
Note: I had issues accessing the file at this path. Try adjusting the permissions of parent directories or copy the DB file to another location.
~/Library/"Application Support"/AddressBook/Sources/*/AddressBook-v22.abcddb
We want to get the Names and Phone Numbers for our use case. Here is the sqlite query and python code.
Here’s the sqlite3 command that you can run in the terminal in JSON mode:
sqlite3 ~/Library/"Application Support"/AddressBook/Sources/*/AddressBook-v22.abcddb<<EOF
.mode json
SELECT DISTINCT
ZABCDRECORD.ZFIRSTNAME [FIRST NAME],
ZABCDRECORD.ZLASTNAME [LAST NAME],
ZABCDPHONENUMBER.ZFULLNUMBER [FULL NUMBER]
FROM
ZABCDRECORD
LEFT JOIN ZABCDPHONENUMBER ON ZABCDRECORD.Z_PK = ZABCDPHONENUMBER.ZOWNER
ORDER BY
ZABCDRECORD.ZLASTNAME,
ZABCDRECORD.ZFIRSTNAME,
ZABCDPHONENUMBER.ZORDERINGINDEX ASC;
EOF
Interested in extracting more fields? Try this query instead (On one line to save space).
select * FROM ZCNCDUNIFIEDCONTACTINFO;SELECT DISTINCT ZABCDRECORD.ZFIRSTNAME [FIRST NAME], ZABCDRECORD.ZLASTNAME [LAST NAME], ZABCDRECORD.ZMAIDENNAME [MAIDEN NAME], ZABCDRECORD.ZMIDDLENAME [MIDDLE NAME], ZABCDRECORD.ZNICKNAME [NICKNAME], ZABCDRECORD.ZSUFFIX [SUFFIX], ZABCDRECORD.ZTITLE [TITLE], ZABCDRECORD.ZJOBTITLE [JOB TITLE], ZABCDRECORD.ZORGANIZATION [ORGANIZATION], ZABCDRECORD.ZLASTSAVEDVERSION [LAST SAVED VERSION], ZABCDRECORD.ZSYNCANCHOR [SYNC ANCHOR], ZABCDPOSTALADDRESS.ZCITY [CITY], ZABCDPOSTALADDRESS.ZSTATE [STATE], ZABCDPOSTALADDRESS.ZSTREET [STREET], ZABCDPOSTALADDRESS.ZZIPCODE [ZIPCODE], ZABCDPOSTALADDRESS.ZCOUNTRYCODE [COUNTRY CODE], ZABCDPOSTALADDRESS.ZCOUNTRYNAME [COUNTRY NAME], ZABCDNOTE.ZTEXT [TEXT(NOTE)], ZABCDNOTE.ZRICHTEXTDATA [RICH TEXT DATA (NOTE BLOB)], ZABCDPHONENUMBER.ZFULLNUMBER [`FULL NUMBER`], ZABCDPHONENUMBER.ZAREACODE [AREACODE], ZABCDPHONENUMBER.ZEXTENSION [EXTENSION], ZABCDPHONENUMBER.ZLABEL [LABEL (PHONE NUMBER)], ZABCDPHONENUMBER.ZISPRIMARY [ISPRIMARY (PHONE NUMBER)], ZABCDPHONENUMBER.ZORDERINGINDEX [ORDERINGINDEX (PHONE NUMBER)], ZABCDRELATEDNAME.ZNAME AS 'Related Name', ZABCDRELATEDNAME.ZLABEL AS 'Related Name – Label', ZABCDURLADDRESS.ZURL AS 'URL' FROM ZABCDRECORD LEFT JOIN ZABCDPOSTALADDRESS ON ZABCDRECORD.Z_PK = ZABCDPOSTALADDRESS.ZOWNER LEFT JOIN ZABCDNOTE ON ZABCDRECORD.Z_PK = ZABCDNOTE.ZCONTACT LEFT JOIN ZABCDPHONENUMBER ON ZABCDRECORD.Z_PK = ZABCDPHONENUMBER.ZOWNER LEFT JOIN ZABCDRELATEDNAME ON ZABCDRECORD.Z_PK = ZABCDRELATEDNAME.ZOWNER LEFT JOIN ZABCDURLADDRESS ON ZABCDRECORD.Z_PK = ZABCDURLADDRESS.ZOWNER ORDER BY ZABCDRECORD.ZLASTNAME, ZABCDRECORD.ZFIRSTNAME, ZABCDPHONENUMBER.ZORDERINGINDEX ASC
Python code
def get_address_book(address_book_location):
conn = sqlite3.connect(address_book_location)
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT ZABCDRECORD.ZFIRSTNAME [FIRST NAME], ZABCDRECORD.ZLASTNAME [LAST NAME], ZABCDPHONENUMBER.ZFULLNUMBER [FULL NUMBER] FROM ZABCDRECORD LEFT JOIN ZABCDPHONENUMBER ON ZABCDRECORD.Z_PK = ZABCDPHONENUMBER.ZOWNER ORDER BY ZABCDRECORD.ZLASTNAME, ZABCDRECORD.ZFIRSTNAME, ZABCDPHONENUMBER.ZORDERINGINDEX ASC")
result_set = cursor.fetchall()
json_output = json.dumps([{"FIRST NAME": t[0], "LAST NAME": t[1], "FULL NUMBER": t[2]} for t in result_set])
print(json_output)
conn.close()
return json_output
Cleaning Address Book Data
The phone numbers are in different formats when you run this code, such that even the phonenumbers library fails! This is because some numbers contain the country code, some contain plus signs, and some don’t. It’s nuts!
Here’s what I mean. Running the query above, here are the various formats for phone numbers I see. To match the phone numbers with the iMessage data, they need to match.
(000) 111-2222
+1 (000) 111-2222
+10001112222
0001112222
For the iMessage DB query, I inspected the numbers, and it seems that for all numbers in my contacts (other than shortcode numbers and emails), the number is in the following format:
+10002223333
Let’s adjust the address book function to clean up the phone numbers. Here’s the Python code that cleans up the phone numbers:
def get_address_book(address_book_location):
conn = sqlite3.connect(address_book_location)
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT ZABCDRECORD.ZFIRSTNAME [FIRST NAME], ZABCDRECORD.ZLASTNAME [LAST NAME], ZABCDPHONENUMBER.ZFULLNUMBER [FULL NUMBER] FROM ZABCDRECORD LEFT JOIN ZABCDPHONENUMBER ON ZABCDRECORD.Z_PK = ZABCDPHONENUMBER.ZOWNER ORDER BY ZABCDRECORD.ZLASTNAME, ZABCDRECORD.ZFIRSTNAME, ZABCDPHONENUMBER.ZORDERINGINDEX ASC")
result_set = cursor.fetchall()
#Convert tuples to json
json_output = json.dumps([{"FIRST NAME": t[0], "LAST NAME": t[1], "FULL NUMBER": t[2]} for t in result_set])
json_list = json.loads(json_output)
conn.close()
for obj in json_list:
# Get the phone number from the object
phone = obj["FULL NUMBER"]
if phone is None:
continue
# Remove all non-numeric characters from the phone number
phone = "".join([c for c in phone if c.isnumeric()])
#if the phone number is 10 digits, add "+1" to the beginning, if it's 11 digits, add "+"
if len(phone) == 10:
phone = "+1" + phone
elif len(phone) == 11:
phone = "+" + phone
# Add the phone number to the object
obj["NUMBERCLEAN"] = phone
new_json_output = json.dumps(json_list)
return new_json_output
Merge iMessage + Address Book Data
Finally, we want to add the Contact Names from our Address Book Data. There must be a way to join these two tables, but I couldn’t figure it out. If you have any ideas, let me know! Instead, I have this Python code to create a unified data set:
def combine_data(recent_messages, addressBookData):
#convert addressBookData to a list of dictionaries
addressBookData = json.loads(addressBookData)
#loop through each message
for message in recent_messages:
phone_number = message["phone_number"]
for contact in addressBookData:
# if contact does not have property NUMBERCLEAN, skip it
if "NUMBERCLEAN" not in contact:
continue
else:
contact_number = contact["NUMBERCLEAN"]
#if the phone number from the message matches the phone number from the contact add the names to the message
if phone_number == contact_number:
message["first_name"] = contact["FIRSTNAME"]
message["last_name"] = contact["LASTNAME"]
return recent_messages
The output of the function should give you the following:
...
{
"rowid": 153672,
"date": "2023-03-03 22:53:57",
"body": "On my way! back",
"phone_number": "+1532223323",
"is_from_me": 0,
"cache_roomname": null,
"group_chat_name": "",
"first_name": "Alexa",
"last_name": "Gallegos"
},
{
"rowid": 153671,
"date": "2023-03-03 21:29:06",
"body": "Your **** verification code is 38017.",
"phone_number": "+18773223333",
"is_from_me": 0,
"cache_roomname": null,
"group_chat_name": ""
},
...
Contact Names now appear as first_name and last_name in the message object.
Get Creative
With a unified data set and a flexible piece of code, you can retrieve a large set of personal, real-world conversation data. This is over a150,000 messages with distribution: 50.42% Received Messages and 49.78% Sent Messages.
I mentioned a few ideas I have for this data. Check out my iMessage + ChatGPT bot, and leave me ideas for additional projects! I’d love to hear what awesome things can be done!
Would you like to see the same for Gmail?
Completed Code
Here it is, all in one copypasta block for you. Enjoy!
import sqlite3
import datetime
import json
def get_chat_mapping(chatdb_location):
conn = sqlite3.connect(chatdb_location)
cursor = conn.cursor()
cursor.execute("SELECT room_name, display_name FROM chat")
result_set = cursor.fetchall()
mapping = {room_name: display_name for room_name, display_name in result_set}
conn.close()
return mapping
# Function to read messages from a sqlite database
def read_messages(chatdb_location, n, self_number='Me', human_readable_date=True):
# Connect to the database and execute a query to join message and handle tables
conn = sqlite3.connect(chatdb_location)
cursor = conn.cursor()
query = """
SELECT message.ROWID, message.date, message.text, message.attributedBody, handle.id, message.is_from_me, message.cache_roomnames
FROM message
LEFT JOIN handle ON message.handle_id = handle.ROWID
"""
if n is not None:
query += f" ORDER BY message.date DESC LIMIT {n}"
results = cursor.execute(query).fetchall()
# Initialize an empty list for messages
messages = []
# Loop through each result row and unpack variables
for result in results:
rowid, date, text, attributed_body, handle_id, is_from_me, cache_roomname = result
# Use self_number or handle_id as phone_number depending on whether it's a self-message or not
phone_number = self_number if handle_id is None else handle_id
# Use text or attributed_body as body depending on whether it's a plain text or rich media message
if text is not None:
body = text
elif attributed_body is None:
continue
else:
# Decode and extract relevant information from attributed_body using string methods
attributed_body = attributed_body.decode('utf-8', errors='replace')
if "NSNumber" in str(attributed_body):
attributed_body = str(attributed_body).split("NSNumber")[0]
if "NSString" in attributed_body:
attributed_body = str(attributed_body).split("NSString")[1]
if "NSDictionary" in attributed_body:
attributed_body = str(attributed_body).split("NSDictionary")[0]
attributed_body = attributed_body[6:-12]
body = attributed_body
# Convert date from Apple epoch time to standard format using datetime module if human_readable_date is True
if human_readable_date:
date_string = '2001-01-01'
mod_date = datetime.datetime.strptime(date_string, '%Y-%m-%d')
unix_timestamp = int(mod_date.timestamp())*1000000000
new_date = int((date+unix_timestamp)/1000000000)
date = datetime.datetime.fromtimestamp(new_date).strftime("%Y-%m-%d %H:%M:%S")
mapping = get_chat_mapping(chatdb_location) # Get chat mapping from database location
try:
mapped_name = mapping[cache_roomname]
except:
mapped_name = None
messages.append(
{"rowid": rowid, "date": date, "body": body, "phone_number": phone_number, "is_from_me": is_from_me,
"cache_roomname": cache_roomname, 'group_chat_name' : mapped_name})
conn.close()
return messages
def print_messages(messages):
print(json.dumps(messages))
def get_address_book(address_book_location):
conn = sqlite3.connect(address_book_location)
cursor = conn.cursor()
cursor.execute("SELECT DISTINCT ZABCDRECORD.ZFIRSTNAME [FIRST NAME], ZABCDRECORD.ZLASTNAME [LAST NAME], ZABCDPHONENUMBER.ZFULLNUMBER [FULL NUMBER] FROM ZABCDRECORD LEFT JOIN ZABCDPHONENUMBER ON ZABCDRECORD.Z_PK = ZABCDPHONENUMBER.ZOWNER ORDER BY ZABCDRECORD.ZLASTNAME, ZABCDRECORD.ZFIRSTNAME, ZABCDPHONENUMBER.ZORDERINGINDEX ASC")
result_set = cursor.fetchall()
#Convert tuples to json
json_output = json.dumps([{"FIRSTNAME": t[0], "LASTNAME": t[1], "FULLNUMBER": t[2]} for t in result_set])
json_list = json.loads(json_output)
conn.close()
for obj in json_list:
# Get the phone number from the object
phone = obj["FULLNUMBER"]
if phone is None:
continue
# Remove all non-numeric characters from the phone number
phone = "".join([c for c in phone if c.isnumeric()])
#if the phone number is 10 digits, add "+1" to the beginning, if it's 11 digits, add "+"
if len(phone) == 10:
phone = "+1" + phone
elif len(phone) == 11:
phone = "+" + phone
# Add the phone number to the object
obj["NUMBERCLEAN"] = phone
new_json_output = json.dumps(json_list)
return new_json_output
#combine recent messages and address book data
def combine_data(recent_messages, addressBookData):
#convert addressBookData to a list of dictionaries
addressBookData = json.loads(addressBookData)
#loop through each message
for message in recent_messages:
phone_number = message["phone_number"]
for contact in addressBookData:
# if contact does not have property NUMBERCLEAN, skip it
if "NUMBERCLEAN" not in contact:
continue
else:
contact_number = contact["NUMBERCLEAN"]
#if the phone number from the message matches the phone number from the contact add the names to the message
if phone_number == contact_number:
message["first_name"] = contact["FIRSTNAME"]
message["last_name"] = contact["LASTNAME"]
return recent_messages
# ask the user for the location of the database
chatdb_location = input("Enter the absolute path of the chat database: ")
#chatdb_location = "/Users/<userName>/Library/Messages/chat.db"
# ask the user for the location of the address book database:
address_book_location = input("Enter the absolute path of the address book database : ")
address_book_location = "~/Library/Application Support/AddressBook/Sources/*/AddressBook-v22.abcddb"
# ask the user for the number of messages to read
n = input("Enter the number of messages to read: ")
recent_messages = read_messages(chatdb_location, n)
#print_messages(recent_messages)
addressBookData = get_address_book(address_book_location)
#print(addressBookData)
combined_data = combine_data(recent_messages, addressBookData)
print_messages(combined_data)
Extract iMessage Data For Analytics and Conversational AI projects was originally published in Better Programming on Medium, where people are continuing the conversation by highlighting and responding to this story.