Opening a WhatsApp export and finding a wall of question marks, boxes, or seemingly random symbols where your messages should be is a disorienting experience. These symptoms almost always trace back to a character encoding mismatch - a situation where the software reading the file assumes a different encoding standard than the one used to write it.
Character encoding is the system that maps text characters to the binary bytes stored in a file. WhatsApp exports use UTF-8, the universal standard that supports virtually every language and emoji. Problems arise when a tool in the chain - a text editor, email client, or conversion tool - misidentifies the encoding and reads the bytes according to a different scheme.
Why Encoding Issues Happen
UTF-8 has been the near-universal standard for text files for over a decade, and WhatsApp exports use it consistently. However, some older tools and operating system components default to legacy encodings such as Windows-1252 (Western European), Latin-1, or Shift-JIS (Japanese) when no encoding is explicitly declared. When these tools read a UTF-8 file, multi-byte UTF-8 sequences - used for accented characters, non-Latin scripts, and all emoji - are misinterpreted and rendered as garbage.
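The misreading described above can be reproduced in a few lines. This is a minimal sketch, not part of any WhatsApp tooling: the sample string is made up, and `cp1252` is Python's name for Windows-1252.

```python
# Simulate a legacy tool reading UTF-8 bytes as Windows-1252.
original = "café 😀"

utf8_bytes = original.encode("utf-8")    # how the export stores the text
misread = utf8_bytes.decode("cp1252")    # what a mismatched reader displays

# 6 characters become 10 bytes: "é" takes 2 bytes and the emoji takes 4.
print(len(original), len(utf8_bytes))

# Each byte of a multi-byte sequence is shown as its own Western character.
print(misread)

# Nothing is lost yet: reversing the two steps recovers the original text.
recovered = misread.encode("cp1252").decode("utf-8")
print(recovered == original)
```

The round trip only works while the garbled text has not been saved or edited further, which is why it helps to diagnose the problem before re-saving the file.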
The transfer process itself can occasionally introduce encoding corruption. Opening a _chat.txt in Microsoft Word or Excel and saving it can silently re-encode the file. Some email clients apply character set transformations to attachments. Older FTP clients running in ASCII (text) mode rewrite line endings in transit, and some have been known to damage non-ASCII bytes in the process; binary mode avoids this. Any of these intermediary steps can produce a file that looks structurally correct but contains damaged character data.
Symptoms of Encoding Problems
The most obvious symptom is the replacement character - a diamond with a question mark inside it (�), which a decoder substitutes for any byte sequence it cannot decode. Rows of these characters where message text should be usually mean the file is being decoded as UTF-8 but contains bytes that are not valid UTF-8, typically because it was re-saved in a legacy encoding or damaged in transit. Question marks in place of expected characters, or sequences of characters that bear no resemblance to the language of the conversation, also point to an encoding mismatch somewhere in the chain.
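To see where the replacement character comes from, the sketch below decodes bytes that were saved in a legacy encoding as if they were UTF-8. The sample text is hypothetical; `errors="replace"` is the standard Python option that substitutes U+FFFD for undecodable bytes.

```python
# Bytes as a legacy tool might have saved them (Windows-1252, not UTF-8).
legacy_bytes = "café au lait".encode("cp1252")   # "é" is the single byte 0xE9

# A strict UTF-8 decoder rejects the file outright.
try:
    legacy_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for the byte it cannot decode.
shown = legacy_bytes.decode("utf-8", errors="replace")

print(strict_ok)
print(shown)
print(shown.count("\ufffd"))
```

Once a file has been saved with U+FFFD in place of the original bytes, the original characters cannot be recovered from that copy, so it is worth keeping the untouched export.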
Mojibake is a related phenomenon where a file is read in the wrong encoding and produces garbled but consistent-looking text - for example, Arabic text read as Windows-1252 might produce a coherent-looking sequence of Western symbols that are simply wrong. Emojis in the file can be entirely absent, replaced by boxes, or rendered as their Unicode code point text (e.g., 'U+1F600') rather than the visual glyph.
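Both symptoms can be counted automatically. The following is a rough heuristic only: `scan_symptoms` and the marker list are illustrative names chosen for this sketch, and the markers cover only the UTF-8-read-as-Windows-1252 case. Legitimate text can contain some of these characters (for example "Ã" in Portuguese), so a non-zero count is a hint rather than proof.

```python
REPLACEMENT_CHAR = "\ufffd"

# Fragments that commonly appear when UTF-8 is misread as Windows-1252:
# "Ã"/"Â" from accented Latin letters, "â€" from curly quotes and dashes,
# "ðŸ" from the first two bytes of most emoji.
MOJIBAKE_MARKERS = ("Ã", "Â", "â€", "ðŸ")


def scan_symptoms(text: str) -> dict:
    """Count replacement characters and likely mojibake fragments in text."""
    return {
        "replacement_chars": text.count(REPLACEMENT_CHAR),
        "mojibake_markers": sum(text.count(m) for m in MOJIBAKE_MARKERS),
    }


garbled = "café 😀".encode("utf-8").decode("cp1252")
print(scan_symptoms(garbled))
print(scan_symptoms("plain English text"))
```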
Languages Most Affected
Languages that use non-Latin scripts are disproportionately affected because their characters require multi-byte UTF-8 sequences. Arabic, Hebrew, Persian, and Urdu are common victims, as are Chinese (Simplified and Traditional), Japanese, Korean, Thai, Hindi, and other scripts. If a conversation mixes a Latin-script language with any of these, you may see the Latin text display correctly while all non-Latin text is garbled.
Emoji-heavy chats are also vulnerable since all emoji require multi-byte encoding. A chat that is otherwise entirely in English but contains frequent emoji may show correct text but blank boxes or question marks wherever an emoji appears. Right-to-left languages (Arabic and Hebrew) have an additional complication: even when the encoding is correct, some tools fail to apply the right-to-left text direction, rendering the characters in the wrong order.
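The difference between scripts is easy to measure. This small sketch prints how many bytes UTF-8 uses for one sample character from several scripts; the characters are arbitrary examples.

```python
# One sample character per script (arbitrary choices for illustration).
samples = {
    "Basic Latin": "a",
    "Accented Latin": "é",
    "Arabic": "ع",
    "Hindi": "ह",
    "Chinese": "中",
    "Emoji": "😀",
}

# Number of bytes each character occupies when encoded as UTF-8.
byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

for name, length in byte_lengths.items():
    print(f"{name}: {length} byte(s)")
```

Only the single-byte characters survive a mismatched reader unchanged, which is why plain English text often looks fine while everything else in the same chat is garbled.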
How to Check the Encoding
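Visual Studio Code is the easiest tool for checking and correcting file encoding. Open the _chat.txt file in VS Code - the encoding is displayed in the status bar at the bottom right of the window. Note that this label shows the encoding VS Code used to open the file, and by default it assumes UTF-8 unless the files.autoGuessEncoding setting is enabled. If it shows 'UTF-8' and the text looks correct, the file itself is correctly encoded and the problem lies with whatever tool you used to view it. If the text is garbled under 'UTF-8', or the label shows 'Windows 1252' or another legacy encoding, the file has probably been re-encoded and you need to convert it back.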
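Notepad++ on Windows provides similar functionality: open the file, check the Encoding menu at the top, and look for which encoding Notepad++ detected. To convert the file, use the 'Convert to UTF-8' option in that same menu, which produces UTF-8 without a BOM (Byte Order Mark), the correct format for WhatsApp exports. The 'Encode in ...' entries (shown without the 'Encode in' prefix in recent versions) only change how the existing bytes are interpreted and do not convert the file.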
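The same check can be done without an editor. The sketch below is one possible approach: `inspect_encoding` is a hypothetical helper name, and it only tests whether the bytes are valid UTF-8 and whether they start with a BOM. Valid UTF-8 does not rule out mojibake that was already saved into the file.

```python
import codecs


def inspect_encoding(data: bytes) -> dict:
    """Report whether raw file bytes are valid UTF-8 and whether a BOM is present."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")          # strict decode: fails on any invalid byte
        valid, bad_offset = True, None
    except UnicodeDecodeError as err:
        valid, bad_offset = False, err.start   # offset of the first invalid byte
    return {"valid_utf8": valid, "has_bom": has_bom, "first_bad_offset": bad_offset}


# Usage with a real export (the path is an assumption):
#     with open("_chat.txt", "rb") as fh:
#         print(inspect_encoding(fh.read()))
print(inspect_encoding("héllo 😀".encode("utf-8")))
print(inspect_encoding("café".encode("cp1252")))
```

Reading the file in binary mode matters here: opening it in text mode would already apply some encoding before the check runs.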
Fixing Encoding Before Converting
If the file has been saved in the wrong encoding, you need to re-encode it to UTF-8. In VS Code, click the encoding label in the status bar, select 'Reopen with Encoding', and choose the encoding the file was actually saved in (e.g., 'Western (Windows 1252)' if it was re-saved by a Western European Windows application). Verify that the text now looks correct, then click the encoding label again and choose 'Save with Encoding → UTF-8'.
For command-line users, the iconv utility on macOS and Linux can convert between encodings with a single command. Python's codecs module offers a programmatic approach for batch processing multiple files. Once the _chat.txt is correctly encoded as UTF-8, you can re-zip it with the media folder and <a href='/upload'>upload your export to WaChat</a> for conversion.
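A programmatic conversion might look like the sketch below. It uses the built-in `bytes.decode` and `str.encode` methods, which rely on the same codec machinery as the `codecs` module. The function names, the default source encoding of `cp1252`, and the paths in the usage comment are assumptions for illustration; the source encoding has to match whatever the file was really saved in.

```python
from pathlib import Path


def reencode(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a known legacy encoding and return them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    which usually means the guess about the source encoding is wrong.
    """
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of src to dst, leaving line endings untouched."""
    # Working on raw bytes avoids any newline translation by text-mode I/O.
    dst.write_bytes(reencode(src.read_bytes(), source_encoding))


# Batch usage (folder names are hypothetical):
#     out_dir = Path("exports_utf8")
#     out_dir.mkdir(exist_ok=True)
#     for path in Path("exports").glob("*.txt"):
#         convert_file(path, out_dir / path.name)
print(reencode(b"caf\xe9\r\n"))
```

Writing to a separate output file keeps the original export intact in case the chosen source encoding turns out to be wrong.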
How WaChat to PDF Handles Encoding
WaChat to PDF reads all input files as UTF-8 and implements fallback detection for common encoding variants. The parser handles Arabic and Hebrew text with appropriate right-to-left rendering, supports the full Unicode emoji range, and preserves characters from all major world scripts. For the vast majority of exports, no manual encoding intervention is required - the converter handles it automatically.
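For readers curious what fallback detection can look like in general, here is a minimal sketch. It is not WaChat to PDF's actual implementation, which is not described here; `decode_with_fallback`, the candidate order, and the `cp1252` fallback are all assumptions made for illustration.

```python
import codecs


def decode_with_fallback(data: bytes, fallbacks=("cp1252",)):
    """Return (text, encoding_used), trying strict UTF-8 before legacy encodings."""
    candidates = []
    # A UTF-16 byte order mark is an unambiguous signal, so try that first.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        candidates.append("utf-16")
    # "utf-8-sig" decodes plain UTF-8 and also strips a UTF-8 BOM if present.
    candidates.append("utf-8-sig")
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


print(decode_with_fallback("שלום 😀".encode("utf-8")))
print(decode_with_fallback("café".encode("cp1252")))
```

Trying strict UTF-8 before any single-byte encoding is the important design choice: legacy encodings such as Windows-1252 accept almost any byte sequence, so testing them first would hide genuine UTF-8 behind mojibake.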
If a chat exported in a language with non-Latin script displays correctly in the WaChat to PDF preview, your encoding is fine. If characters appear as boxes or question marks in the preview, contact support with a short sample of the raw _chat.txt so the team can identify the specific encoding variant and advise on the best path forward.
Ready to convert? Upload your export to WaChat to PDF and see the preview.