They are the nightmare of every PHP programmer: unknown characters suddenly showing up on your pages. The basic workaround is to run everything through htmlentities. But one day you’ll get fed up with this dirty workaround and will want to get to the bottom of the problem. That’s what happened to me yesterday.
ISO-8859-1 (or Latin 1) is a single-byte character encoding containing most Western European characters. However, ISO-8859-1 does not contain characters like ™ and €. Why doesn’t it contain these? Because there’s no room for them: a single byte only leaves room for 256 characters.
However, ISO-8859-1 did reserve the range 0x80 to 0x9F for some rarely used control characters. Microsoft decided these were useless and created Windows-1252 (commonly called ANSI), replacing the control characters with some commonly used printable ones. So, as far as printable characters go, ISO-8859-1 is a subset of ANSI.
UTF-8 is a variable-length encoding of Unicode that uses 1 to 4 bytes per character, and it covers all ANSI characters and many more. For this reason UTF-8 is your choice when developing advanced PHP apps. However, how do you make sure all content is stored in UTF-8? And how do you get this to work in a browser?
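To make the variable length concrete, here is a minimal sketch that prints how many bytes a few characters take in UTF-8. The helper name utf8ByteLength and the sample characters are just illustrations, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: strlen() counts bytes, not characters,
// so on a UTF-8 string it gives the encoded size.
function utf8ByteLength(string $utf8): int
{
    return strlen($utf8);
}

// "\u{...}" escapes always produce UTF-8 bytes,
// whatever encoding this source file happens to be saved in.
$samples = [
    'A'       => "A",         // U+0041
    'e-acute' => "\u{E9}",    // U+00E9
    'euro'    => "\u{20AC}",  // U+20AC
    'emoji'   => "\u{1F600}", // U+1F600
];

foreach ($samples as $name => $char) {
    echo $name, ': ', utf8ByteLength($char), " byte(s)\n";
}
```

Plain ASCII stays at one byte, which is why English-only pages look fine in almost any encoding and the trouble only starts with accented characters and symbols.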
Once you are programming, content originates from different sources, all stored in a specific encoding:
- The document you are writing (e.g. index.php) is encoded
- The MySQL database has an encoding
- The content that is sent to the server by POST/GET is encoded
- Files that are included/read/crawled are encoded
- The output to the browser is encoded
Now you want all of those different pieces to be UTF-8 or convert them to UTF-8 if they are not.
When editing in Notepad++, documents are stored in ANSI by default, so make sure you save your docs (PHP, HTML) as UTF-8. There is even an option to convert your existing docs. The content sent to the browser is then automatically UTF-8 encoded as well, since Apache does not re-encode it.
With UTF-8 rolling from your PHP-script, we need to make sure your browser knows it is UTF-8, otherwise it will start guessing, probably assuming it is ISO-8859-1, resulting in question marks. This is done by one of the following options:
- a PHP header() call
- the Apache configuration
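Both options could look roughly like the sketch below. The helper name contentTypeHeader is made up for illustration, and the Apache directive is shown as a comment because it belongs in httpd.conf or .htaccess, not in PHP.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for a given charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent to the browser.
if (!headers_sent()) {
    header(contentTypeHeader());
}

// Apache alternative (httpd.conf or .htaccess), not PHP:
//   AddDefaultCharset UTF-8
```

The PHP header is handy when you cannot touch the server configuration. The Apache directive covers every text/html and text/plain response at once.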
To make sure the data from your database is UTF-8, we first need to make sure that the data that is IN your database is UTF-8. Using phpMyAdmin you can change the encoding of a database, its tables and their fields. Set all of them to utf8_general_ci (a UTF-8, case-insensitive collation). Wish you had done this on setup, right? So the data goes in as UTF-8, is stored as UTF-8 and comes out as UTF-8, great!
Note: PostgreSQL automatically converts between encodings when the client (your script) and the database do not use the same one, so no worries there.
If you’ve properly set up your PHP script to output UTF-8, you’ll get UTF-8 on your POSTs and GETs. However, someone could post from another website to yours. If that other website is ISO encoded, you’ll have to convert the data. For this reason (and not only this reason) ALWAYS check your POSTed data. How to do so is described in the next section, as it also (and foremost) applies to foreign files and resources.
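As a first line of defence, a check along these lines could flag input that is not valid UTF-8. This is only a sketch: the function name isValidUtf8Input is invented here, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: returns true when every string in a
// (possibly nested) input array such as $_POST is valid UTF-8.
function isValidUtf8Input(array $input): bool
{
    $valid = true;
    array_walk_recursive($input, function ($value) use (&$valid) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $valid = false;
        }
    });
    return $valid;
}
```

You would call it as isValidUtf8Input($_POST) and then either reject the request or convert the offending values.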
Includes, file_get_contents, crawlers
When obtaining data from other sources, you are most likely to run into encoding problems. When crawling other sites you could simply trust the HTTP header, but it is better to actually try to detect the encoding. PHP’s mbstring extension provides mb_detect_encoding for this.
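A minimal sketch of such a detection call, assuming mb_detect_encoding from the mbstring extension is the function meant. The wrapper name detectEncoding and the candidate list are my own choices. Strict mode is switched on so that encodings in which the bytes are invalid get ruled out.

```php
<?php
// Hypothetical wrapper around mb_detect_encoding with an assumed candidate list.
function detectEncoding(string $data): string
{
    // Third argument true = strict: reject encodings in which the bytes are invalid.
    $encoding = mb_detect_encoding($data, ['UTF-8', 'ISO-8859-1'], true);
    return $encoding === false ? 'unknown' : $encoding;
}

// 0xE9 on its own is not valid UTF-8, so only ISO-8859-1 remains.
echo detectEncoding("caf\xE9"), "\n";

// 0x80 is the euro sign in Windows-1252, yet it is still reported as ISO-8859-1.
echo detectEncoding("\x80"), "\n";
```

Because every byte is valid in ISO-8859-1, it acts as a catch-all: anything that is not valid UTF-8 ends up labelled ISO-8859-1.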
Another function, such as mb_convert_encoding or iconv, can then convert the data to UTF-8.
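A minimal sketch of the conversion step, assuming mb_convert_encoding is the function used (iconv would work along the same lines). The helper name latin1ToUtf8 is hypothetical.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
// Note the argument order: data, target encoding, source encoding.
function latin1ToUtf8(string $latin1): string
{
    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$utf8 = latin1ToUtf8("caf\xE9");

// The single byte E9 becomes the two-byte sequence C3 A9.
echo bin2hex($utf8), "\n"; // prints 636166c3a9
```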
Now there is one slight problem: the detection function does not (yet?) detect ANSI and will return ISO-8859-1 for both encodings. So if your data is ANSI and contains one of the ANSI characters not contained in ISO-8859-1 (e.g. the €-sign), you’ll apply the wrong conversion and end up with question marks. So, better safe than sorry: always convert ISO-8859-1 to UTF-8 as if it were ANSI.
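The sketch below shows one way to apply that rule. It is an assumption on my side, not necessarily the original snippet: it checks for valid UTF-8 with mb_check_encoding instead of relying on detection, and treats everything else as Windows-1252. The name toUtf8 is made up.

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, and treat anything else
// as Windows-1252 (ANSI) rather than ISO-8859-1.
function toUtf8(string $data): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data;
    }
    return mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
}

$ansiEuro = "\x80"; // the euro sign in Windows-1252

// Converted as ISO-8859-1, the byte becomes the control character U+0080.
echo bin2hex(mb_convert_encoding($ansiEuro, 'UTF-8', 'ISO-8859-1')), "\n"; // prints c280

// Converted as Windows-1252, it becomes a real euro sign, U+20AC.
echo bin2hex(toUtf8($ansiEuro)), "\n"; // prints e282ac
```

Since the two encodings agree on every printable ISO-8859-1 character, treating Latin 1 data as ANSI costs nothing for ordinary text.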
This is also why utf8_encode won’t work: it assumes its input is ISO-8859-1, so it does not understand an ANSI €-sign.
These results were found with a small test case. We tested three cases:
- An ANSI encoded PHP file, output set to UTF-8
- A UTF-8 (without BOM) encoded PHP file, output set to UTF-8
- An ANSI encoded PHP file, output set to ISO-8859-1
Although obviously erroneous, the first option is the one most often used, since ANSI is the default encoding of many editors and programmers are often still trying to figure out how to output UTF-8.
After some tests with the encoding and conversion of its own contents, some external files are included to try to output the correct result.