How to detect the page encoding with PHP?

8

I want to create a function that always saves data to the database in the correct encoding (my database is UTF-8), based on the encoding detected in the incoming data.

Is there any native PHP function to do this? Is there any other way?

    
asked by anonymous 19.12.2013 / 12:57

5 answers

9

Assuming your server is serving pages encoded as UTF-8, the default behavior of most user agents (browsers, etc.) is to use that same encoding when sending data back to the server (from forms via POST, for example). You can also specify the accepted encodings via the form's accept-charset attribute. That way you do not have to "detect" anything: you are instructing the client side to send data already in the desired encoding.

See also this answer on the English-language Stack Overflow. One of the important points is that a standards-compliant browser will respect the requested encoding, but it is always possible for a client (perhaps maliciously) to send data in a different encoding. In that case, it is up to you to decide whether to try to fix the problem the client created or to leave the burden on the client. Ordinary users with modern browsers will almost certainly not run into this (but it costs nothing to run some tests, depending on your target audience).
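If you decide to leave the burden on the client, one option is simply to reject input that is not valid UTF-8. The following is a minimal sketch of that check, not part of the original answer; the function name and the sample data are hypothetical, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper (not from the answer): returns the names of the
// submitted fields whose values are not valid UTF-8.
function findInvalidUtf8Fields(array $fields): array
{
    $invalid = [];
    foreach ($fields as $name => $value) {
        // Only string values are checked in this sketch.
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $invalid[] = $name;
        }
    }
    return $invalid;
}

// Made-up data: "\xE7" is "ç" in ISO-8859-1, which is not valid UTF-8 on its own.
$bad = findInvalidUtf8Fields(['name' => "Jo\xC3\xA3o", 'city' => "Igua\xE7u"]);
```

In a real script the argument would typically be $_POST, and a non-empty result would trigger an error response instead of a database write.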

Update: based on your replies and @Guerra's answer, I think you do not need to detect anything; simply using utf8_decode should be enough (since your users will always send UTF-8 and your connection to the database always expects ISO 8859-1, regardless of the encoding the database itself uses).

But if you want a robust solution, here's what I suggest:

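function fixEncoding($in_str)
{
   // mbstring names this encoding "ISO-8859-1" (with hyphens). The candidate
   // list is explicit and strict, so invalid UTF-8 is never reported as UTF-8.
   $cur_encoding = mb_detect_encoding($in_str, array("UTF-8", "ISO-8859-1"), true);

   if($cur_encoding === "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
   {
       // Valid UTF-8 (plain ASCII included): convert to ISO-8859-1.
       return utf8_decode($in_str);
   }
   elseif($cur_encoding === "ISO-8859-1" && mb_check_encoding($in_str,"ISO-8859-1"))
   {
       // Already in the encoding the connection expects.
       return $in_str;
   }
   else
   {
       // Untested:
       // return iconv($cur_encoding, "ISO-8859-1", $in_str);
       throw new Exception('Unsupported encoding.');
   }
}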
    
19.12.2013 / 13:16
9

Your question is somewhat vague about the specific problem you are encountering, so here are some things to consider for correct interaction with user data, with data going to and from the server, and with the database, on the stated premise that your database uses the UTF-8 charset.

Note: This may not answer your question directly, but it seems relevant enough to help when dealing with encoding problems. Much more information could be added; feel free to suggest additions in the comments.


Declarations for the browser

  • HTML Pages

    HTML pages should always indicate, through a META tag in the head, the charset that the browser must use to present and receive data:

    HTML 5 example

    <!doctype html>
    <html>
      <head>
        <meta charset="UTF-8">
      </head>
      ...
    

    HTML 4 Example

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
    "http://www.w3.org/TR/html4/strict.dtd">
    
    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      </head>
      ...
    
  • PHP Files

    The main file responsible for outputting the HTML and handling user interaction (usually index.php) should declare the charset at its very beginning, before any other output is sent to the browser:

    /* Setting charset for proper language
     * support, DB interaction, etc.
     */
    header('Content-Type: text/html; charset=UTF-8');
    

    This tells the browser to treat the page as UTF-8, and by default it will also send back the data collected from the page in UTF-8.

  • Server posts via HTML > PHP

    If PHP and the HTML page header indicate the same charset, as seen above, a normal form post from the page sends the browser's data to the server in UTF-8.

    However, there is a way to indicate that the form should send the data to the server in a specific Charset:

    <form action="mytargetfile.php" accept-charset="UTF-8">
    

    This is not strictly necessary, because the normal procedure is to apply the points above, but it can be used without problems.

  • Server posts via Ajax > PHP

    Posts made via Ajax send their data following the charset indicated by the HTML page. That data should arrive at a destination file that also declares the charset to use.

    However, here too you can indicate which Charset to use for sending data:

    $.ajax({
      data: parameters,
      type: "POST",
      url: ajax_url,
      contentType: "application/x-www-form-urlencoded; charset=UTF-8",
      success: callback
    });
    

    The content type will, of course, vary according to the content being sent, but it is always followed by the charset to use.


Taking care of files

When we edit or create a file, we must always keep in mind that it must be saved in the same charset as the information that will pass through it.

This is a small detail, but it ensures that the information is handled correctly with respect to its encoding.


Interaction with the Database

Here it is important to note that the connection we open to the database to save or read data must use the same charset as the data and as the code responsible for the operation:

Example of connecting to the database via PDO indicating the charset:

<?php
/**
 * Instantiates a new database connection (method excerpt from a class)
 * @return PDO instance of PDO connection
 */
protected function InitConnetion()
{
    // The host and dbname values in the DSN must not be quoted.
    $dbh = new PDO(
        'mysql:host=meuServidor;dbname=minhaBD;',
        "utilizador",
        "password",
        array(
            PDO::ATTR_PERSISTENT => false,
            PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true,
            PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
            PDO::MYSQL_ATTR_INIT_COMMAND => "SET NAMES utf8"
        )
    );
    return $dbh;
}
?>

Notice that I am using "utf8" instead of "utf-8", because that is the name under which the database knows this charset (in MySQL the names are "utf8" and "utf8mb4", never "utf-8"). If you indicate a name that does not exist on your server, be it "utf-8" or "bananas", you receive an error and know that you have to change it.
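As an alternative to the SET NAMES init command, the PDO MySQL driver also accepts a charset parameter directly in the DSN (PHP 5.3.6 and later). A minimal sketch follows; the helper name is hypothetical, and the "utf8mb4" default assumes a MySQL server that supports it (5.5.3 or later). The host, database, and credentials are the placeholders from the example above.

```php
<?php
// Hypothetical helper: builds a MySQL DSN that carries the connection charset.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = buildMysqlDsn('meuServidor', 'minhaBD');

// Opening the connection needs a reachable server, so it is left commented out:
// $dbh = new PDO($dsn, 'utilizador', 'password', [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
```

MySQL's "utf8" covers only characters of up to three bytes, while "utf8mb4" covers all of UTF-8, which is why it is used as the default in this sketch.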

    
19.12.2013 / 14:20
3

The best way I found to convert ISO 8859-1 text to UTF-8 was this:

function fixEncoding($in_str)
{
  $cur_encoding = mb_detect_encoding($in_str) ;
  if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
    return $in_str;
  else
    return utf8_encode($in_str);
}
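Since utf8_encode and utf8_decode are deprecated as of PHP 8.2, the same idea can be sketched with mb_convert_encoding. This variant is not from the original answer; the function name is hypothetical, and it keeps the same assumption as the function above: anything that is not valid UTF-8 is treated as ISO-8859-1.

```php
<?php
// Hypothetical variant of fixEncoding() that avoids utf8_encode().
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function toUtf8(string $in_str): string
{
    if (mb_check_encoding($in_str, 'UTF-8')) {
        return $in_str; // already valid UTF-8 (plain ASCII included)
    }
    return mb_convert_encoding($in_str, 'UTF-8', 'ISO-8859-1');
}

$fromLatin1  = toUtf8("a\xE7\xE3o");         // ISO-8859-1 bytes for "ação"
$alreadyUtf8 = toUtf8("a\xC3\xA7\xC3\xA3o"); // UTF-8 bytes for "ação"
```

Both calls end up with the same UTF-8 byte sequence, so the function can be applied safely to data whose origin is uncertain, as long as the ISO-8859-1 assumption holds.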

But in the case of HTML files, just use this meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 

I strongly recommend reading this article (in English); I found it very useful for understanding the encoding headaches that PHP sometimes causes.

For other formats, the most appropriate method would be iconv, but you would have to run some tests to make it work dynamically with whatever the current encoding is. See the PHP iconv documentation.

Source: Here

    
19.12.2013 / 13:06
1

Based on the answer from @Guerra I was able to find the solution. My HTML page has its charset set to UTF-8, and so does my MySQL database. That makes it strange that, when the function detects the text as UTF-8, I need to use utf8_decode for accented characters to be stored correctly in the database.

As far as I understand, utf8_decode converts the text to ISO-8859-1. Can anyone give a better explanation in the comments?

  function fixEncoding($in_str)
  {
       $cur_encoding = mb_detect_encoding($in_str) ;

       if($cur_encoding == "UTF-8" && mb_check_encoding($in_str,"UTF-8"))
       {
           return utf8_decode($in_str);
       }
       else
       {
           return $in_str;
       }
  }
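A likely explanation, consistent with the update in the first answer, is that the connection to MySQL is using latin1 rather than UTF-8, so UTF-8 bytes sent through it get encoded a second time. The sketch below only simulates that effect with mb_convert_encoding; it is a simplified illustration under that assumption, not a diagnosis of this particular setup.

```php
<?php
// "ç" as UTF-8 is the two bytes C3 A7.
$utf8 = "\xC3\xA7";

// Simulated latin1 connection writing to a UTF-8 column: each byte is read
// as one ISO-8859-1 character and then re-encoded as UTF-8 (double encoding).
$doubleEncoded = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

// Converting to ISO-8859-1 first (what utf8_decode does) sends the single
// byte E7, which the same connection then stores as a correct "ç".
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$stored = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
```

Here $doubleEncoded holds the bytes for "Ã§", the typical garbled form of "ç", while $stored is identical to the original UTF-8 string. If this is the cause, setting the connection charset (for example with SET NAMES utf8) removes the need for utf8_decode.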
    
19.12.2013 / 13:25
0

Portuguese language programmers: our charset is UTF8!

In short, for PHP programmers this fact calls for two precautions:

  • Pages, data, PHP scripts: everything should be encoded in UTF-8. Be suspicious of any architecture, library, environment, or anything else that is not represented in UTF-8.

  • Be careful with PHP: it is not "natively UTF-8", and this can cause trouble. To overcome this problem, check out the tips and details in this answer.

  • Edit (after Bacco comment)

    It's not a matter of "personal preference", it's a matter of respect, just as traffic signs are respected, regardless of whether we like them or not.

    Respect for the following conventions, "de jure" and "de facto":

    13.04.2017 / 14:59