BOM-problems (UTF-8)

Discussions about editing and cleaning data
Anne
Posts: 104
Joined: January 11th, 2012, 12:55 am

Post by Anne »

I'm having trouble with the Byte Order Mark (BOM) at the beginning of the data files: I like to use other tools together with CSPro to manipulate my data files, but many text editing tools don't like the BOM. And according to Wikipedia, a BOM is not needed for UTF-8 (CSPro data files are UTF-8). I have found a tool to remove it, and it seems that CSPro can still read the files, but my question is: is it OK to remove it, or will I run into problems later?

(btw: Wikipedia says: "The Unicode Standard allows that the BOM 'can serve as signature for UTF-8 encoded text where the character set is unmarked'.[42] Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages. However RFC 3629, the UTF-8 standard, recommends that byte order marks be forbidden in protocols using UTF-8..." Why did the US Census Bureau go against the recommendations in this case?)

Anne
Gregory Martin
Posts: 1777
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: BOM-problems (UTF-8)

Post by Gregory Martin »

The Unicode Standard that you quote also explains why we decided to use the BOM: "can serve as signature for UTF-8 encoded text where the character set is unmarked."

CSPro data files have no metadata attached to them, so the file is unmarked. We needed a way to differentiate between ANSI files created in versions of CSPro before 5 and UTF-8 files created from version 5 onward. By contrast, a web page might mark its encoding with code like this:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

That marks the file, so there's no need to write the BOM in that case. But with our files, there is no way that we could add such text to our data files without screwing up many other applications written to work with flat text files.
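To illustrate the idea of using the BOM as a signature, here is a minimal Python sketch of how a reader could choose between the two encodings. It is not CSPro's actual code. The function names are made up for this example, and it assumes that "ANSI" means Windows-1252, which may not match every system's code page.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Pick a decoder based on whether the UTF-8 BOM is present."""
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return "utf-8-sig"  # decodes UTF-8 and drops the leading BOM
    return "cp1252"  # assumption: "ANSI" means Windows-1252


def read_text(data: bytes) -> str:
    """Decode raw file bytes using the encoding guessed from the BOM."""
    return data.decode(guess_encoding(data))
```

With this sketch, the UTF-8 bytes for "café" preceded by a BOM decode back to "café", while the same bytes without the BOM are taken for ANSI and come out as "cafÃ©".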

You can remove the BOM safely as long as you're using only numbers and unaccented Latin letters (plain ASCII) in your file. If you use accented characters or characters from other character sets (Chinese, Arabic, etc.), removing the BOM and then passing the file back to CSPro will really mess with the way alphas are interpreted by CSPro.
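One way to check this condition before removing the BOM is to test whether the rest of the file is pure ASCII. The following is a rough Python sketch; the function name is hypothetical, and it assumes the whole file fits in memory and Python 3.7 or later (for `bytes.isascii`).

```python
import codecs


def bom_removal_is_safe(data: bytes) -> bool:
    """Return True if the content, ignoring a leading UTF-8 BOM, is pure ASCII."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Pure ASCII bytes are identical in ANSI code pages and in UTF-8,
    # so the text reads the same with or without the BOM.
    return data.isascii()
```

A file containing only "123 Smith" passes the check; a file containing a name such as "Zoë" does not, because "ë" is stored as two bytes above 0x7F in UTF-8.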
Anne

Re: BOM-problems (UTF-8)

Post by Anne »

Thank you.

This makes sense, so I'll stop hating the BOM from now on. Luckily my current data is plain English and numbers :)

(A couple of weeks from now, I'll start working on the French data set, so I guess I'm gonna have to figure out another way to do it then. Shouldn't be impossible: use bomremover.exe to remove the BOM, process my data using my favourite tools, Google for a bomadder.exe or similar, and then the data is OK for CSPro) ;)
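The remove-then-restore round trip could also be done with a few lines of Python instead of separate tools. This is only a sketch: the helper names and the file name in the usage note are invented for illustration, and it assumes the tools used in between keep the file in UTF-8.

```python
import codecs


def strip_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 BOM unless the data already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def rewrite_file(path: str, transform) -> None:
    """Read a file as raw bytes, apply transform, and write it back in place."""
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(transform(data))
```

For example, `rewrite_file("survey.dat", strip_bom)` before using the other tools and `rewrite_file("survey.dat", add_bom)` afterwards, where `survey.dat` is a placeholder name. Both helpers leave the file unchanged if it is already in the desired state, so running them twice does no harm.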

Anne