concatenation behaviour

Discussions about tools to complement CSPro data processing
iip
Posts: 32
Joined: January 19th, 2012, 11:30 pm

concatenation behaviour

Post by iip »

Hi,

We have a problem with concatenation in CSPro 5: some .dat files are generated but empty (this happens when the puncher cancels data entry). All of the files have a BOM in them, for example:

f1 (3 bytes/only BOM)
f2 (3 bytes/only BOM)
f3 (3 bytes/only BOM)

The concatenated file, fx, is 9 bytes and contains EF BB BF 0D 0A 0D 0A 0D 0A. In my opinion it should contain only the BOM (EF BB BF), right? After all, those are empty files.

Regards,

-iip-
Gregory Martin
Posts: 1793
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: concatenation behaviour

Post by Gregory Martin »

You're right ... this is something that we didn't think about when doing the Unicode conversion. I've fixed the problem and the fix will come out in the 5.0.3 release (probably next month).

Thanks for reporting this bug!
Anne
Posts: 104
Joined: January 11th, 2012, 12:55 am

Re: concatenation behaviour

Post by Anne »

About the BOM character: according to Wikipedia, it is not needed in UTF-8 files (only in UTF-16). Why does CSPro use it?

I'm making an application where I don't know in advance how many interviewers, and hence how many data files, I will have. So rather than using the concatenate tool in CSPro, I wrote a nice bat file to concatenate the data. But my batch program doesn't work because of the BOM character, and so far I haven't found an easy way to remove it.

Guess I'll make a CSPro batch to do it instead, but I really don't like it :)
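
For anyone who wants to script this outside CSPro, here is a minimal Python sketch of the idea: strip the BOM from each input, skip files that contain nothing else, and write a single BOM at the start of the output. The function name `concatenate` and the file names in the usage note are made up for illustration, and this is not how CSPro's own concatenate tool is implemented.

```python
import codecs
from pathlib import Path

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def concatenate(inputs, output):
    """Concatenate UTF-8 data files, keeping one BOM at the start of the output."""
    with open(output, "wb") as out:
        out.write(BOM)
        for path in inputs:
            data = Path(path).read_bytes()
            if data.startswith(BOM):
                data = data[len(BOM):]  # drop this file's own BOM
            if not data:
                continue  # BOM-only (empty) file: contributes nothing
            out.write(data)
            if not data.endswith(b"\n"):
                # Assumption: records are line-based, so make sure the
                # next file starts on a fresh line.
                out.write(b"\r\n")
```

Usage might look like `concatenate(sorted(Path("data").glob("*.dat")), "combined/all.dat")`, with the output written outside the input folder so it is not picked up as an input. With three BOM-only inputs, the result is just the 3-byte BOM.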
Gregory Martin
Posts: 1793
Joined: December 5th, 2011, 11:27 pm
Location: Washington, DC

Re: concatenation behaviour

Post by Gregory Martin »

The problem we had when designing the Unicode version is that CSPro uses simple text files as its data files, so we don't have any useful way of storing metadata about the file. For example, if we used a binary format, we could have a flag that indicated whether the data was ANSI or UTF-8, but we don't have that.

So we had to figure out a way to distinguish between data files created in older versions of CSPro and data files created in the newer version. The answer was adding the BOM to the data files. For example, if we didn't have it, and we encountered a character like 'ü' in the data file, we wouldn't necessarily know if it was a German accented letter, or if it was the beginning of a UTF-8 character sequence. The BOM helps us interpret all characters correctly.
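
As a rough illustration of that rule, the Python sketch below treats a file that starts with the BOM as UTF-8 and anything else as legacy ANSI. It assumes "ANSI" means Windows-1252, which in practice depends on the machine that wrote the file, and `read_data_file` is a hypothetical helper, not CSPro code.

```python
import codecs


def read_data_file(path):
    """Decode a data file, using the BOM to choose between UTF-8 and ANSI."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith(codecs.BOM_UTF8):
        # Newer-style file: drop the BOM and decode the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: assume a legacy ANSI file (Windows-1252 is an assumption here).
    return raw.decode("cp1252", errors="replace")
```

The ambiguity is easy to reproduce: 'ü' is the single byte FC in Windows-1252 but the two bytes C3 BC in UTF-8, and those same two bytes read as Windows-1252 come out as 'Ã¼'. Without a marker such as the BOM, both readings are valid text.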
sofiajoe
Posts: 1
Joined: October 4th, 2014, 1:04 am

Re: concatenation behaviour

Post by sofiajoe »

What I need is to make an application that exports the data for use with the productionRunner tool (this tool basically just runs through all the .pff and .bat files you specify). I have already made the application (the .exf file) and the .pff file. In the .exf file I have already specified which fields and which universe I want, and all of the ID elements are included among the fields I chose.