WhatsApp's export format is deliberately simple - a plain text log and a folder of media files - but that simplicity hides a significant amount of structural complexity. Understanding exactly how the data is laid out, where edge cases lurk, and how WaChat to PDF transforms the raw export into a structured JSON format is valuable for anyone who wants to work with WhatsApp data programmatically.
This guide is written for developers, data scientists, and technically inclined users who want to go beyond the PDF output and work directly with the underlying data. It covers the raw export format, the parsing challenges it presents, and the clean structured format that the WaChat to PDF JSON export produces.
The Raw Export Format
A WhatsApp export ZIP contains two things: a _chat.txt file and (if you chose 'Include Media') a folder containing all media files. The _chat.txt is a UTF-8 encoded plain text file with one or more lines per message. Each message begins with a timestamp and sender on the first line, followed by the message body which may span multiple lines. Media references appear as inline text strings pointing to filenames in the media folder.
The filename convention for media files encodes information about the file type and creation date. For example, a file named IMG-20240315-WA0001.jpg is an image created on 15 March 2024, and PTT-20240315-WA0001.opus is a voice note (Push-To-Talk) from the same date. Understanding this naming convention is useful when correlating media files back to their context in the conversation.
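The naming convention above can be decoded with a small regular expression. The following is a minimal sketch rather than a complete decoder: only the IMG and PTT prefixes come from the examples above, while the other prefixes in the mapping and the function name parse_media_filename are assumptions made for illustration.

```python
import re
from datetime import datetime

# Only IMG and PTT are taken from the article's examples; the remaining
# prefixes are assumptions and may not match every export.
MEDIA_PREFIXES = {
    "IMG": "image",
    "PTT": "audio",
    "VID": "video",
    "AUD": "audio",
    "DOC": "document",
    "STK": "sticker",
}

# PREFIX-YYYYMMDD-WAnnnn.ext
MEDIA_NAME = re.compile(
    r"^(?P<prefix>[A-Z]+)-(?P<date>\d{8})-WA(?P<seq>\d+)\.(?P<ext>\w+)$"
)


def parse_media_filename(name):
    """Return a dict describing the file, or None if the name does not fit."""
    match = MEDIA_NAME.match(name)
    if match is None:
        return None
    try:
        created = datetime.strptime(match.group("date"), "%Y%m%d").date()
    except ValueError:
        # Eight digits that do not form a valid calendar date.
        return None
    return {
        "type": MEDIA_PREFIXES.get(match.group("prefix"), "unknown"),
        "date": created,
        "sequence": int(match.group("seq")),
        "extension": match.group("ext").lower(),
    }
```

Returning None for names that do not fit the pattern keeps renamed or non-WhatsApp files from being silently misread.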
Parsing the Timestamp
Timestamp format is one of the most significant sources of variation in WhatsApp exports. iOS exports typically use a 12-hour clock with AM/PM and format dates as DD/MM/YYYY in most locales (or MM/DD/YYYY in US locale). Android exports may use 24-hour time and different date separators. The locale of the device at the time of export determines the format, and the same conversation exported from an iPhone and an Android phone may have different timestamp formats.
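One way to cope with this variation is to try a list of candidate formats against every timestamp in the file and keep the first format that parses all of them. The sketch below illustrates that idea. The format strings, the bracket stripping, and the handling of the narrow no-break space are assumptions about what an export may contain, not a definitive list, and the AM/PM matching relies on an English or C locale.

```python
from datetime import datetime

# Illustrative candidates only; real exports may need additional entries.
CANDIDATE_FORMATS = [
    "%d/%m/%Y, %I:%M:%S %p",  # 12-hour clock, day first
    "%m/%d/%Y, %I:%M:%S %p",  # 12-hour clock, month first (US locale)
    "%d/%m/%Y, %H:%M",        # 24-hour clock, day first
    "%d.%m.%Y, %H:%M",        # 24-hour clock, dot separators
]


def _clean(raw):
    # Assumption: some exports wrap the timestamp in square brackets and
    # use a narrow no-break space (U+202F) before AM/PM.
    return raw.strip().strip("[]").replace("\u202f", " ")


def detect_format(samples, formats=CANDIDATE_FORMATS):
    """Return the first format that parses every sample, or None."""
    if not samples:
        return None
    for fmt in formats:
        try:
            for raw in samples:
                datetime.strptime(_clean(raw), fmt)
        except ValueError:
            continue  # at least one sample failed, try the next format
        return fmt
    return None


def parse_timestamp(raw, fmt):
    """Parse one timestamp with a format chosen by detect_format()."""
    return datetime.strptime(_clean(raw), fmt)
```

Checking the whole file rather than a single line matters because a date such as 03/04/2024 is valid both day-first and month-first. The ambiguity only resolves once some timestamp has a day greater than 12; if none does, this sketch falls back to the earlier entry in the list.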
Timestamps in the export are in local device time - the timezone that was active on the phone at the time the message was processed. There is no explicit timezone indicator in the export file, which means timestamps cannot be reliably converted to UTC without additional context. This is an important caveat for any analysis that compares timestamps across parties or converts times for international legal proceedings.
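If the device's timezone is known from some other source, the conversion to UTC can be made explicit rather than implied. The sketch below is hypothetical: the function name to_utc is an assumption, and the timezone must be supplied by the caller because the export itself never provides it.

```python
from datetime import datetime, timedelta, timezone


def to_utc(naive_local, tz):
    """Interpret a naive export timestamp in a caller-supplied timezone.

    `tz` is any tzinfo object. It has to come from outside the export,
    for example from knowledge of where the device was at the time.
    """
    if naive_local.tzinfo is not None:
        raise ValueError("expected a naive timestamp from the export")
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)


# Example: the exporting phone is assumed to have been on UTC+1.
local = datetime(2024, 3, 15, 10, 15)
as_utc = to_utc(local, timezone(timedelta(hours=1)))
```

A fixed offset ignores daylight saving changes. Passing a zoneinfo.ZoneInfo object such as ZoneInfo("Europe/London") instead lets the standard library apply the correct offset for each date, provided timezone data is installed on the system.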
Sender Attribution
In individual chats, the sender line contains either the contact's saved name (if they are in the device's contacts) or their phone number in international format (+44 followed by the number, for example). In group chats, every message is attributed to a sender name or number, and system messages (group joins, exits, admin notifications) appear with no sender - just the timestamp and the message text.
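The attribution rule above can be sketched as a split on the first colon-and-space separator, treating lines without one as system messages. This is a simplified heuristic, not WaChat to PDF's actual parser: the function name split_sender is hypothetical, the input is assumed to be the text that follows the timestamp, and a system message whose own text contains a colon followed by a space would be misattributed.

```python
def split_sender(rest):
    """Split the text after the timestamp into (sender, body).

    Returns (None, rest) when no "Sender: " prefix is found, which this
    sketch treats as a system message.
    """
    sender, separator, body = rest.partition(": ")
    if not separator or not sender.strip():
        return None, rest
    return sender.strip(), body
```

Splitting on the first separator only means that colons later in the message body, such as the one in a time like 10:30, are left untouched.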
A subtle issue arises when the same person is saved under different names on different devices. If you export the same group chat from two different phones, messages from the same sender may appear under different names depending on how that person was saved in each exporter's contacts. WaChat to PDF normalises the sender to what appears in the export and does not attempt to resolve contacts across devices.
Message Types in the Export
The _chat.txt contains several distinct message types, all formatted as plain text lines. Standard text messages appear as the message body directly. Media messages appear as a filename reference such as 'IMG-20240315-WA0001.jpg (file attached)'. Deleted messages appear as the placeholder text 'This message was deleted' or 'You deleted this message' (the exact text varies by WhatsApp version and locale). Call log entries appear as 'Missed voice call' or 'Voice call, X:XX' with a duration.
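These patterns lend themselves to a small classifier over the message body. The sketch below covers only the English strings quoted above. The function name classify and the coarse labels it returns are assumptions for illustration, and other WhatsApp versions or locales would need their own patterns.

```python
import re

# Matches bodies such as "IMG-20240315-WA0001.jpg (file attached)".
ATTACHMENT = re.compile(r"^(?P<filename>\S+\.\w+) \(file attached\)$")

# English placeholders only; the exact text varies by version and locale.
DELETED = {"This message was deleted", "You deleted this message"}

# "Missed voice call" or "Voice call, 3:07".
CALL = re.compile(r"^(Missed voice call|Voice call, \d+:\d{2})$")


def classify(body):
    """Return (kind, filename); filename is None unless kind is 'media'."""
    text = body.strip()
    match = ATTACHMENT.match(text)
    if match:
        return "media", match.group("filename")
    if text in DELETED:
        return "deleted", None
    if CALL.match(text):
        return "call", None
    return "text", None
```

Anything that matches none of the patterns falls through to plain text, so an unrecognised placeholder is preserved as content rather than dropped. Media messages that carry a caption on following lines are not handled by this sketch.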
System messages - group name changes, participant additions and removals, admin changes, encryption notifications - appear at the message position where they occurred in the conversation timeline. They have no sender attribution. Identifying and separating these from actual user messages is one of the core parsing tasks, as they are meaningful for understanding the group's history but should be distinguished from actual communications for most analytical purposes.
The WaChat to PDF JSON Structure
WaChat to PDF's AI-ready JSON export transforms the raw text into a clean array of message objects. Each object has: a sender field (string, normalised to the name as it appeared in the export), a timestamp field (ISO 8601 string with date and time), a messageType field (one of 'text', 'image', 'video', 'audio', 'document', 'sticker', 'contact', 'location', 'call', 'system', 'deleted'), a content field (the message text for text messages, or a descriptive label for other types), and a mediaRef field (the original filename for media messages, or null otherwise).
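For readers who want to check a file against this description, the five fields can be written down as a type plus a light validation pass. The sketch below is based only on the field list above. The names Message and validate are hypothetical, the sample values are invented, and treating sender as possibly null for system messages is an assumption.

```python
from datetime import datetime
from typing import Optional, TypedDict

MESSAGE_TYPES = {
    "text", "image", "video", "audio", "document", "sticker",
    "contact", "location", "call", "system", "deleted",
}

FIELDS = ("sender", "timestamp", "messageType", "content", "mediaRef")


class Message(TypedDict):
    sender: Optional[str]    # assumption: may be null for system messages
    timestamp: str           # ISO 8601 date and time
    messageType: str         # one of MESSAGE_TYPES
    content: str
    mediaRef: Optional[str]  # original filename, or None


def validate(msg):
    """Return a list of problems found in one message object."""
    problems = [f"missing field: {key}" for key in FIELDS if key not in msg]
    if msg.get("messageType") not in MESSAGE_TYPES:
        problems.append(f"unknown messageType: {msg.get('messageType')!r}")
    if "timestamp" in msg:
        try:
            datetime.fromisoformat(msg["timestamp"])
        except (TypeError, ValueError):
            problems.append("timestamp is not ISO 8601")
    return problems


# Invented sample object, shaped after the field list above.
sample: Message = {
    "sender": "Alice",
    "timestamp": "2024-03-15T10:15:00",
    "messageType": "image",
    "content": "Image",
    "mediaRef": "IMG-20240315-WA0001.jpg",
}
```

An empty list from validate(sample) means the object has all five fields, a recognised type, and a parseable timestamp.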
Timestamps in the JSON are converted to ISO 8601 format for consistency across export locales. Because the export carries no absolute timezone, the values still represent the device's local time at export rather than a verified UTC time. The JSON structure is designed to be easy to load in any language with a JSON parser: a single call to JSON.parse() (JavaScript) or json.load() (Python) produces a ready-to-use array.
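As a brief illustration of working with the loaded array, the sketch below parses a small inline sample and counts messages per sender. The sample data is invented, the null sender on the system message is an assumption, and count_by_sender is a hypothetical helper. With a real export, json.load() on the open file would replace the json.loads() call.

```python
import json
from collections import Counter

# Invented sample in the shape described above.
RAW = """[
  {"sender": "Alice", "timestamp": "2024-03-15T10:15:00",
   "messageType": "text", "content": "Hello", "mediaRef": null},
  {"sender": "Bob", "timestamp": "2024-03-15T10:16:00",
   "messageType": "text", "content": "Hi", "mediaRef": null},
  {"sender": null, "timestamp": "2024-03-15T10:17:00",
   "messageType": "system", "content": "Alice added Carol", "mediaRef": null},
  {"sender": "Alice", "timestamp": "2024-03-15T10:18:00",
   "messageType": "image", "content": "Image",
   "mediaRef": "IMG-20240315-WA0001.jpg"}
]"""


def count_by_sender(messages):
    """Count messages per sender, leaving out system entries."""
    return Counter(
        m["sender"] for m in messages if m["messageType"] != "system"
    )


messages = json.loads(RAW)
counts = count_by_sender(messages)
```

Filtering on messageType rather than on a missing sender keeps the analysis independent of how system messages happen to be attributed.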
Edge Cases to Handle When Parsing
Multi-line messages are the trickiest parsing challenge. When a message body contains line breaks, the extra lines appear in the _chat.txt as continuation lines with no timestamp or sender prefix. A robust parser must therefore decide, for each line that does not begin with a timestamp, whether it continues the previous message or signals a structural anomaly in the file.
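A common way to handle continuation lines is to treat any line that does not start with a timestamp as part of the previous message. The sketch below takes that approach. The MESSAGE_START pattern is an assumption about what a timestamp prefix looks like and would need adjusting per locale, and group_lines is a hypothetical helper name.

```python
import re

# Assumed prefix shape: optional "[", a numeric date, a comma, then h:mm.
# Examples: "15/03/2024, 10:15 - " or "[15/03/2024, 10:15:32] ".
MESSAGE_START = re.compile(
    r"^\[?\d{1,2}[/.]\d{1,2}[/.]\d{2,4}, \d{1,2}:\d{2}"
)


def group_lines(lines):
    """Merge continuation lines into the message they belong to."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        if MESSAGE_START.match(line) or not messages:
            # A new message, or an orphan line at the very top of the file.
            messages.append(line)
        else:
            # No timestamp prefix: continuation of the previous message.
            messages[-1] += "\n" + line
    return messages
```

This heuristic fails when a continuation line itself begins with something that looks like a timestamp, for example a pasted log excerpt, so stricter parsers often also require the sender separator to follow the timestamp.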
Other edge cases worth handling: messages containing colons (the same character that separates sender from content), emoji built on digit characters (such as keycap emoji) that a naive pattern can mistake for timestamp digits, contacts with no name (just a phone number containing '+'), unsaved numbers with international format variations, and group system messages in different locales (a Spanish-language WhatsApp produces different system message text than an English one). WaChat to PDF's parser handles all of these cases, so the JSON output is clean and consistently structured regardless of the input format.