Encoding Help Needed.

Here’s the plan: There’s an XML file that contains plain text data.

[Attempt 1]: Use PHP and some simple parsing to convert the data to look pretty.
[Results]: Crazy characters like “Joan Miró” and “peoples’”

[Attempt 2]: Use C# to parse the XML and output the file.
[Results]: Crazy characters like “Joan Miró” and “peoples’”

After a bunch of research, I found that the TextWriter class can encode the file:
TextWriter tw = new StreamWriter("fileName" + ".txt");
TextWriter tw = new StreamWriter("fileName" + ".txt", false, Encoding.Default);
TextWriter tw = new StreamWriter("fileName" + ".txt", false, Encoding.ASCII);
TextWriter tw = new StreamWriter("fileName" + ".txt", false, Encoding.UTF8);

None of them worked. I’ve been using UltraEdit for a while and it can do multiple file conversions, so I gave it a try: ASCII to Unicode, UTF-8 to Unicode, UTF-8 to ASCII, Unicode to ASCII, DOS to UNIX, UNIX/MAC to DOS.

It all comes down to setting the C# encoding to Encoding.Default and then converting the file from UTF-8 to ASCII. I haven’t found any other way. It sucks. Any suggestions?
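
For reference, garbage of this kind is typically what appears when text is encoded as UTF-8 and then decoded with a single-byte encoding. A minimal sketch of that mismatch, assuming ISO-8859-1 as the wrong decoder; the class and method names are made up for illustration, and it is not confirmed that this is exactly what happened to the files above:

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text as UTF-8, then decodes the bytes as ISO-8859-1 (the wrong encoding).
    public static string DecodeUtf8BytesAsLatin1(string text)
    {
        // "ó" becomes the two bytes 0xC3 0xB3 in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

        // ISO-8859-1 maps every byte to one character, so the two bytes
        // come back as "Ã" (U+00C3) followed by "³" (U+00B3).
        return Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
    }

    static void Main()
    {
        string original = "Joan Miró";
        string garbled = DecodeUtf8BytesAsLatin1(original);

        // The garbled string is one character longer than the original.
        Console.WriteLine(original.Length); // 9
        Console.WriteLine(garbled.Length);  // 10
    }
}
```

The fix in a case like this is to decode with the same encoding that produced the bytes, rather than to convert the output afterwards.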

UPDATE
This is what I originally had at the top of the XML file.
<?xml version="1.0" encoding="utf-8"?>

This is what I now have at the top of the XML file.
<?xml version="1.0" encoding="utf-16"?>

I also had to change the TextWriter initialization from:
TextWriter tw = new StreamWriter("fileName" + ".txt");

to:
TextWriter tw = new StreamWriter("fileName" + ".txt", false, Encoding.Default);
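
A minimal sketch of the read-then-write step with the output encoding made explicit, assuming the encoding declared in the XML matches the file's actual bytes. `XmlToText`, `WriteText`, and the choice of UTF-8 with a byte order mark for the output are illustrative assumptions, not the code used above:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

static class XmlToText
{
    // Loads an XML file and writes the text content of its root element to a text file.
    public static void WriteText(string xmlPath, string textPath)
    {
        // XDocument.Load reads through an XmlReader, which picks the encoding
        // from the byte order mark or the XML declaration.
        XDocument doc = XDocument.Load(xmlPath);

        // new UTF8Encoding(true) writes a byte order mark, which helps editors
        // recognize the file as UTF-8 instead of guessing a legacy code page.
        using (var writer = new StreamWriter(textPath, false, new UTF8Encoding(true)))
        {
            writer.Write(doc.Root.Value);
        }
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("usage: XmlToText <input.xml> <output.txt>");
            return;
        }
        WriteText(args[0], args[1]);
    }
}
```

If the declaration and the bytes disagree (for example, `utf-16` declared on a file saved as UTF-8), the load step is expected to fail rather than silently produce wrong characters.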

Thanks, Abhi and kashif for the input.

BTW: “PHP assumes your XML is in ISO-8859-1!” Even if you have it set as UTF-8. This is also why PHP isn’t going to work for these files, BOO. If there is a real PHP solution, let me know. We’re also trying another approach using a third-party PHP DB interface.

4 thoughts on “Encoding Help Needed.”

  1. Hey Tom,

    So the problem is that you don’t know which encoding the file uses?

    I just had a look at our code from last semester and we did it the same way. But when I looked at RSSBandit, which we also used, they did something different:

    writer = new XmlTextWriter(fileName, System.Text.Encoding.GetEncoding("ISO-8859-1"));

    Have you tried ISO-8859-1?

    Also, if you open the XML file in Mozilla/Firefox (and I think IE as well), it shows you the encoding of the file.

    -Abhi

  2. Hey Tom, this is Jose. I met you when I joined the cougarcs before you graduated. About your problem with PHP: the comment about PHP assuming ISO-8859-1 is right. To solve this problem, you have to have PHP5 or above and use the function iconv, but you have to know which encoding your database uses. It is used like this: $utf8 = iconv("CurrentEncoding", "NewEncoding", $data). For example, to convert from UTF-16 to UTF-8 you would do $utf8 = iconv("UTF-16", "UTF-8", $data). If you want to see the original characters, you will have to change the browser's encoding so it can display them. I hope this helps. BTW check my blog at wwww.onedeveloper.net/blog
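
For comparison with the iconv call above, the same kind of byte-level conversion can be sketched in C# with `Encoding.Convert`. This is a minimal sketch that assumes the input is little-endian UTF-16 without a byte order mark; `TranscodeDemo` and `Utf16ToUtf8` are hypothetical names used only for illustration:

```csharp
using System;
using System.Text;

static class TranscodeDemo
{
    // Converts little-endian UTF-16 bytes to UTF-8 bytes.
    public static byte[] Utf16ToUtf8(byte[] utf16Bytes)
    {
        // Encoding.Unicode is .NET's name for little-endian UTF-16.
        return Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
    }

    static void Main()
    {
        byte[] utf16 = Encoding.Unicode.GetBytes("Joan Miró");
        byte[] utf8 = Utf16ToUtf8(utf16);

        // UTF-16 uses two bytes for each of the nine characters;
        // UTF-8 uses one byte per ASCII character and two for "ó".
        Console.WriteLine(utf16.Length); // 18
        Console.WriteLine(utf8.Length);  // 10
    }
}
```

As with iconv, the conversion is only correct if the source encoding named in the call is the one the bytes were actually written in.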
