Hi,
I'm reading tab-separated files of less than perfect quality: sometimes there is a tab within a column, like this:
"This is a column containing a tab <tab> blahblah"
In these cases the field is always enclosed in double quotes, so it should be easy to find the problem using this regex:
\".*\t.*\"
But of course, it needs to be enclosed in quotes in the regexmatch function:
if regexmatch(theLine, "\".*\t.*\"") then ..
but this gives a compilation error. I also tried enclosing it in single quotes, but CSPro still didn't seem to understand what I wanted. Any suggestions on how to solve this?
regex matching "?
Re: regex matching "?
After racking my brain over this, I found a way around the problem: if there are quotes in the line I'm reading, it is because there is a tab (or a line break) in the quoted string, so I don't need the regex.
But I'm still curious about how it could be done using regex.
(I also noticed that my regex is very greedy: if there were two quoted strings in the line, it wouldn't work. That wouldn't have been a problem, though, as there is only one text field in the file; all the rest are numbers.)
Re: regex matching "?
Using \" doesn't work, but using single quotes should, so you could do:
regexmatch(line, '".*\t.*"')
Alternatively you can use two consecutive double quotes inside the string. CSPro interprets that as just one double quote.
regexmatch(line, """.*\t.*""");
I'm not sure what you mean about it being too greedy. It should return true if there is AT LEAST one quoted string with a tab in it. I suppose that if you had two quoted strings with no tabs in them, but with a tab in between them, it would match, because with two sets of quotes it matches from the first quote to the last one rather than just the first pair. For example:
"first string" 123124<tab>3534"second string"
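If you used a regex like this '"[^"]*\t[^"]*"' it would only match the text between two consecutive double quotes. Note that it still matches the line above, though, because the closing quote of the first string and the opening quote of the second are themselves consecutive quotes with a tab between them. To rule that out, the quotes would have to be tied to the field boundaries.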
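For anyone who wants to check how the two patterns behave, here is a minimal sketch. It uses Python's re module as a stand-in, on the assumption that CSPro's regex engine treats .*, \t and [^"] the same way. The names GREEDY, PAIRED and matched_text are made up for illustration.

```python
import re

# Python's re is only a stand-in here; CSPro's regexmatch is assumed
# (not verified) to handle these constructs the same way.
GREEDY = re.compile(r'".*\t.*"')          # first quote ... tab ... last quote
PAIRED = re.compile(r'"[^"]*\t[^"]*"')    # tab between two consecutive quotes

# Two quoted strings with no tab inside either, but a tab between them.
between = '"first string" 123124\t3534"second string"'
# One quoted field that really does contain a tab.
inside = '12\t"a column containing a tab \t blahblah"\t34'


def matched_text(pattern, line):
    """Return the text the pattern matched in line, or None if no match."""
    m = pattern.search(line)
    return m.group(0) if m else None


if __name__ == "__main__":
    for name, pattern in (("GREEDY", GREEDY), ("PAIRED", PAIRED)):
        for line in (between, inside):
            print(name, repr(matched_text(pattern, line)))
```

On the example line, the first pattern matches the entire line, while the second matches only the span from the closing quote of the first string to the opening quote of the second.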
Re: regex matching "?
Thank you Josh,
As I really didn't feel like working on the things I should have been doing, I came up with the perfect regex for this task:
'(^|\t)"[^"]*\t[^"]*"(\t|$)'
Interpretation:
(^|\t) = start of line or tab
"[^"]* = a " followed by any number of non-" characters
\t = tab
[^"]*" = any number of non-" characters followed by a "
(\t|$) = tab or end of line
Or I could have just used your suggestion without wasting time on this, but wasting time is more fun.
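A quick way to try the anchored pattern on a few sample lines is sketched below. It is written in Python rather than CSPro logic, on the assumption that both regex engines treat ^, $, \t and [^"] alike. The function name quoted_field_has_tab and the sample lines are invented for illustration.

```python
import re

# Pattern copied from the post above. Python's re is an assumed stand-in
# for the regex engine behind CSPro's regexmatch.
ANCHORED = re.compile(r'(^|\t)"[^"]*\t[^"]*"(\t|$)')


def quoted_field_has_tab(line: str) -> bool:
    """True if a complete quoted field in a tab-separated line contains a tab."""
    return ANCHORED.search(line) is not None


if __name__ == "__main__":
    samples = [
        '12\t"a column containing a tab \t blahblah"\t34',  # tab inside the field
        '12\t"no tab here"\t34',                            # clean quoted field
        '"first string" 123124\t3534"second string"',       # tab between two strings
    ]
    for line in samples:
        print(quoted_field_has_tab(line), repr(line))
```

Because the opening quote must follow the start of the line or a tab, and the closing quote must be followed by a tab or the end of the line, a tab that merely sits between two separate quoted strings is not reported.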