regex matching "?

Discussions about CSEntry
Anne
Posts: 104
Joined: January 11th, 2012, 12:55 am

Post by Anne »

Hi,
I'm reading tab-separated files of imperfect quality: sometimes there is a tab within a column, like this:

"This is a column containing a tab <tab> blahblah"

In these cases the field is always enclosed in double quotes, so it should be easy to find the problem using this regex:
\".*\t.*\"

But of course, it needs to be enclosed in quotes in the regexmatch function:

if regexmatch(theLine, "\".*\t.*\"") then ..

but this gives a compilation error. I also tried enclosing it in single quotes, but CSPro still didn't seem to understand what I wanted. Any suggestions on how to solve this?
Anne

Re: regex matching "?

Post by Anne »

After racking my brain over this, I did find a solution to the problem: if there are quotes in the line I'm reading, it is because there is a tab (or a line break) in the quoted string, so I don't need the regex.

But I'm still curious about how it could be done using regex..

(And I also noticed that my regex is very greedy: if there were two quoted strings in the line, it wouldn't work. But that wouldn't have been a problem, as there is only one text field in the file; all the rest are numbers.)
josh
Posts: 2401
Joined: May 5th, 2014, 12:49 pm
Location: Washington DC

Re: regex matching "?

Post by josh »

Using \" doesn't work, but using single quotes should work, so you could do:

regexmatch(line, '".*\t.*"')

Alternatively you can use two consecutive double quotes inside the string. CSPro interprets that as just one double quote.

regexmatch(line, """.*\t.*""");

I'm not sure what you mean about it being too greedy. It should return true if there is AT LEAST one quoted string with a tab in it. I suppose if you had two quoted strings with no tabs in them but there was a tab in between them it would match that because when you have two sets of quotes it will match the outer quotes rather than the first pair. For example:

"first string" 123124<tab>3534"second string"

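If you used a regex like this '"[^"]*\t[^"]*"' it would only match a tab that sits between two consecutive double quotes. Note that the closing quote of the first string and the opening quote of the second also form such a pair, so the line above would still match; to reject it, the quotes also need to be tied to the field boundaries.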
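
As a rough check of how the two patterns behave on the example line, here is a small Python sketch. It uses Python's `re` module as a stand-in, so CSPro's regex engine may differ in details, and the variable names are made up for illustration. In this sketch both patterns find a match: the greedy one spans the whole line, while the negated-class one matches the stretch between the closing quote of the first string and the opening quote of the second.

```python
import re

# The two patterns from this post, written as Python raw strings.
# Assumption: Python's `re` stands in for CSPro's regex engine here.
GREEDY = re.compile(r'".*\t.*"')
NEGATED = re.compile(r'"[^"]*\t[^"]*"')

# The example line: the only tab sits between the two quoted strings.
line = '"first string" 123124\t3534"second string"'

greedy_match = GREEDY.search(line)
negated_match = NEGATED.search(line)

# Show exactly which part of the line each pattern matched.
print(repr(greedy_match.group(0)))
print(repr(negated_match.group(0)))
```

So as a yes/no test the two patterns agree on this line; they differ only in which part of the line they match.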
Anne

Re: regex matching "?

Post by Anne »

Thank you Josh,

As I really didn't feel like working on the things I should have been doing, I came up with the perfect regex for this task:
'(^|\t)"[^"]*\t[^"]*"(\t|$)'

(interpretation:
(^|\t) = start of line or a tab
"[^"]* = a " followed by any number of non-" characters
\t = a tab
[^"]*" = any number of non-" characters followed by a "
(\t|$) = a tab or end of line)

or I could have just used your suggestion without wasting time on this, but wasting time is more fun :D
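
For anyone who wants to try the anchored pattern outside CSPro, here is a minimal Python sketch. It assumes Python's `re` module behaves like CSPro's regex engine for this pattern, which has not been verified; the helper name `has_quoted_tab` and the sample lines are invented for illustration.

```python
import re

# The anchored pattern from this post, as a Python raw string.
# Assumption: Python's `re` stands in for CSPro's regex engine.
ANCHORED = re.compile(r'(^|\t)"[^"]*\t[^"]*"(\t|$)')


def has_quoted_tab(line: str) -> bool:
    """Return True if a quoted field that spans a whole column contains a tab."""
    return ANCHORED.search(line) is not None


# Made-up sample lines.
quoted_tab = '12\t"a column containing a tab \t blahblah"\t34'
tab_between = '"first string" 123124\t3534"second string"'
no_quotes = '12\tno quotes here\t34'

for sample in (quoted_tab, tab_between, no_quotes):
    print(has_quoted_tab(sample), repr(sample))
```

Because the opening quote must follow the start of the line or a tab, and the closing quote must be followed by a tab or the end of the line, a tab that merely sits between two separate quoted strings is not reported.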