Getting proper UTF-8 output in PHP

Posted: May 29th, 2009 | Filed under: PHP

They are the nightmare of every PHP programmer: unknown characters suddenly showing up on your pages. The basic workaround is to run everything through htmlentities(). But one day you’ll get fed up with this dirty workaround and will want to get to the bottom of the problem. That’s what happened to me yesterday.

Encoding basics

ISO-8859-1 (or Latin-1) is a single-byte character encoding, containing the most common Western European characters. It does not contain characters like ™ or €. Why doesn’t it contain these? Because there’s no room for them: a single byte only has room for 256 values.

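However, ISO-8859-1 does reserve part of its range (0x80 to 0x9F) for rarely used control characters. Microsoft decided these were useless and created what Windows calls ANSI (formally Windows-1252), replacing the control characters with some commonly used ones such as € and ™. So the printable characters of ISO-8859-1 are a subset of ANSI.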

UTF-8 is a variable-width encoding of Unicode: each character takes 1 to 4 bytes. It contains all ANSI characters and many more. For this reason UTF-8 is your choice when developing advanced PHP apps. However, how do you make sure all content is stored in UTF-8? And how do you get this to work in a browser?
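The difference between these encodings is easiest to see at the byte level. The following is a minimal sketch, assuming PHP 7 or later with the mbstring extension loaded; the sample characters and labels are only illustrative.

```php
<?php
// Sample characters written as Unicode escapes (PHP 7+), so the
// encoding of this file itself does not influence the result.
$samples = [
    'letter A'  => "A",         // plain ASCII
    'e-acute'   => "\u{E9}",    // present in ISO-8859-1 and ANSI
    'euro sign' => "\u{20AC}",  // present in ANSI, not in ISO-8859-1
];

foreach ($samples as $label => $utf8) {
    // The same character as a single Windows-1252 ("ANSI") byte.
    $ansi = mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
    printf(
        "%s: UTF-8 bytes %s (length %d), ANSI byte %s\n",
        $label,
        bin2hex($utf8),
        strlen($utf8),
        bin2hex($ansi)
    );
}
```

Note that strlen() counts bytes, not characters, which is why the euro sign reports a length of 3 in UTF-8.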

Encoding sources

While programming, you deal with content from different sources, each stored in a specific encoding:

  • The document you are writing (e.g. index.php) is encoded
  • The MySQL database has an encoding
  • The content that is sent to the server by POST/GET is encoded
  • Files that are included/read/crawled are encoded
  • The output to the browser is encoded

Now you want all of those different pieces to be UTF-8 or convert them to UTF-8 if they are not.

The document

Notepad++ stores documents as ANSI by default, so make sure you save your docs (PHP, HTML) as UTF-8. There is even an option to convert your existing docs. The content sent to the browser is then automatically UTF-8 encoded as well, since Apache passes the bytes through untouched.
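If you are not sure how an existing file was saved, you can check its raw bytes. The sketch below assumes the mbstring extension; the function name inspectSource is made up for this example. It reports whether the bytes are valid UTF-8 and whether they start with a byte order mark (BOM), which matters because a BOM in a PHP file is sent to the browser as output.

```php
<?php
// Inspect the raw bytes of a source file: are they valid UTF-8,
// and do they start with a UTF-8 byte order mark?
function inspectSource(string $bytes): array
{
    $bom = "\xEF\xBB\xBF"; // the three BOM bytes
    return [
        'has_bom'    => strncmp($bytes, $bom, 3) === 0,
        'valid_utf8' => mb_check_encoding($bytes, 'UTF-8'),
    ];
}

// "caf\xE9" is the word "cafe" with an e-acute saved as ANSI.
var_dump(inspectSource("caf\xE9"));
// The same word saved as UTF-8 without BOM.
var_dump(inspectSource("caf\xC3\xA9"));
```

For a real file you would pass in the result of file_get_contents(), for example for index.php. A file containing only ASCII characters is valid in all three encodings, so this check only says something once the file contains accented or special characters.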

The browser

With UTF-8 rolling out of your PHP script, we need to make sure the browser knows it is UTF-8. Otherwise it will start guessing, probably assuming ISO-8859-1, resulting in question marks or garbled characters. You can declare the encoding with one of the following options:

  • PHP header
  • Apache config
  • .htaccess
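
For the PHP option, a minimal sketch follows. The helper name contentTypeHeader is made up for this example and text/html is assumed as the content type. For the other two options, Apache’s AddDefaultCharset directive (AddDefaultCharset UTF-8) serves the same purpose in the server config or in .htaccess.

```php
<?php
// Build the Content-Type header line for the given charset.
function contentTypeHeader(string $charset = 'UTF-8'): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Headers can only be sent before any output, and that includes
// a BOM or stray whitespace in front of the opening PHP tag.
if (!headers_sent()) {
    header(contentTypeHeader());
}
```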

The database

To make sure the data coming from your database is UTF-8, we first need to make sure that the data that is IN your database is UTF-8. Using phpMyAdmin you can change the encoding of a database, its tables and their fields. Set all of them to utf8_general_ci (UTF-8, case-insensitive). Wish you had done this at setup, right? Provided the connection between PHP and MySQL is also set to UTF-8 (for example by sending SET NAMES utf8 after connecting), the data goes in as UTF-8, is stored as UTF-8 and comes out as UTF-8, great!
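
If you prefer SQL over clicking through phpMyAdmin, the same change can be scripted. The sketch below only builds the statements and does not execute them; the function names and the database and table names (blog, posts, comments) are hypothetical.

```php
<?php
// Quote a MySQL identifier with backticks (doubling any backtick inside).
function quoteIdentifier(string $name): string
{
    return '`' . str_replace('`', '``', $name) . '`';
}

// Build the statements that switch a database and its tables to UTF-8.
function utf8MigrationSql(string $database, array $tables): array
{
    $sql = [
        'ALTER DATABASE ' . quoteIdentifier($database)
            . ' CHARACTER SET utf8 COLLATE utf8_general_ci',
    ];
    foreach ($tables as $table) {
        // CONVERT TO also converts the existing text columns of the table.
        $sql[] = 'ALTER TABLE ' . quoteIdentifier($table)
            . ' CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci';
    }
    return $sql;
}

// Hypothetical names, for illustration only.
foreach (utf8MigrationSql('blog', ['posts', 'comments']) as $statement) {
    echo $statement, ";\n";
}
```

Make a backup before running statements like these: converting a table rewrites its data, and the result is only correct if the columns were labelled with the encoding they really contained. Also note that MySQL’s utf8 stores at most 3 bytes per character; newer MySQL versions offer utf8mb4 for the full range.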

Note: PostgreSQL automatically converts between the client encoding and the database encoding when the two are not identical, so no worries there.

POST/GET contents

If you’ve properly set up your PHP script to output UTF-8, you’ll get UTF-8 on your POSTs and GETs. However, someone could post to your website from another website. If that other website is ISO encoded, you’ll have to convert the data. For this reason (and not only this reason) ALWAYS check your POSTed data. How to do this is described in the next section, as it also (and foremost) applies to foreign files and resources.
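
As a first check, you can list which submitted fields are not valid UTF-8 before deciding whether to convert or reject them. This is a sketch assuming the mbstring extension; the function name invalidUtf8Keys and the sample form data are made up for this example.

```php
<?php
// Return the keys of all string values that are not valid UTF-8.
// Nested arrays (e.g. from name="tags[]") are checked recursively.
function invalidUtf8Keys(array $input, string $prefix = ''): array
{
    $bad = [];
    foreach ($input as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            $bad = array_merge($bad, invalidUtf8Keys($value, $path));
        } elseif (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $bad[] = $path;
        }
    }
    return $bad;
}

// Simulated form data standing in for $_POST.
$post = [
    'name' => "Zo\xC3\xAB",        // valid UTF-8
    'city' => "caf\xE9",           // ISO-8859-1/ANSI bytes
    'tags' => ["ok", "\x80 5"],    // second tag starts with an ANSI euro sign
];
print_r(invalidUtf8Keys($post));
```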

Includes, file_get_contents, crawlers

Obtaining data from other sources is where you’ll most often run into encoding problems. When crawling other sites you could simply trust the HTTP header, but it is better to actually try to detect the encoding. The function to detect encoding is:

….

Using another function, we can convert to UTF-8:

Now there is one slight problem: the detection function does not (yet?) detect ANSI, but returns ISO-8859-1 in both cases. So if your data is ANSI and contains one of the ANSI characters not contained in ISO-8859-1 (e.g. the € sign), you’ll apply the wrong encoding conversion and end up with question marks. So, better safe than sorry: always convert ISO-8859-1 to UTF-8 as if it is ANSI:

This is also why utf8_encode won’t work: it assumes its input is ISO-8859-1, so the ANSI byte for the € sign is converted to a control character instead of a € sign.
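
The original code snippets for this section are missing, so the following is only a sketch of the advice above, not necessarily the functions the author used. It assumes the mbstring extension. Instead of guessing with a detection function, it checks whether the bytes are already valid UTF-8 and otherwise treats them as ANSI (Windows-1252). The function name toUtf8 is made up for this example.

```php
<?php
// Convert a string of unknown origin to UTF-8.
// Assumption: anything that is not valid UTF-8 is Windows-1252 ("ANSI").
function toUtf8(string $data): string
{
    if (mb_check_encoding($data, 'UTF-8')) {
        return $data; // already UTF-8 (plain ASCII also lands here)
    }
    // Windows-1252 rather than ISO-8859-1, so that bytes such as
    // 0x80 (the euro sign) survive the conversion.
    return mb_convert_encoding($data, 'UTF-8', 'Windows-1252');
}

$ansi = "\x80 5,00"; // a price with a euro sign, as ANSI bytes
echo bin2hex(toUtf8($ansi)), "\n";
```

The assumption is the weak spot: data in some other single-byte encoding (ISO-8859-2, for example) would be converted without an error but with the wrong characters, so when a source declares its encoding reliably, that declaration is worth using.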

The Test

These results were found with a small test case. We tested 3 cases:

  • An ANSI encoded PHP file, output set to UTF-8
  • A UTF-8 without BOM encoded PHP file, output set to UTF-8
  • An ANSI encoded PHP file, output set to ISO-8859-1

Although obviously erroneous, the first option is the one most often used, since ANSI is the default encoding of many editors and programmers are often still trying to figure out how to output UTF-8.

After some tests with the encoding and conversion of the script’s own contents, some external files are included to try to output the correct result.

