regex - How to count occurrences of separator in string excluding those in quotes -


i have file comma separated values this:

1.42104e+16,220899,1,,e-page remote auto,,,"allied martian banks, p.l.c.",moon,mn,, 1.42105e+16,637039,1,,e-page remote auto,,,bank of jupiter,europa,io,, 

i count number of commas excluding in quotation marks such "allied martian banks, p.l.c.".

i know that:

length(i.data_record)-length(replace(i.data_record,',','')) 

would return number of commas, count comma in 1st line compared 2nd when, purposes, should counted having same number.

is there quick , simple way of ignoring commas in quotation marks?

i understand create loop , start breaking string in bits, counting them, , whenever find quotation mark ignore commas until find quotation mark, know if there's simpler, more streamlined way of achieving without resorting loops.

many thanks!

eliminate delimited content first, count afterwards:

regexp_count (     regexp_replace (         regexp_replace (             i.data_record           , '(^|,)"[^"]*"(,|$)'           , '\1\2'         )       , '(^|,)"[^"]*"(,|$)'       , '\1\2'     )   , ','  )  

the nesting of regexp_replace calls unfortunately necessary in order handle consecutive quote-delimited fields correctly: separating comma consumed regexp pattern , willnot taken account subsequent match.

oracle's regexen not support lookahead operator natural way handle situation.

given performance hit of regexp_... calls might better off use

length(i.data_record) - length ( replace ( regexp_replace ( i.data_record, '(^|,)"[^"]*"(,|$)', '\1\2' ),',','' ) ) 

caveat

this solution not handle dquotes within field values, represented "" or \".

the former case can handled elegantly: instead of interpreting "" inside quote-delimited field, consider whole field content juxtaposition of 1 or more dquote-delimited strings not contain dquotes. while wouldn't follow route in processing data (all dquotes lost), may employ perspective sake of counting:

regexp_count (     regexp_replace (         regexp_replace (             i.data_record           , '(^|,)("[^"]*")+(,|$)'  -- changed           , '\1\3'                  -- changed         )       , '(^|,)("[^"]*")+(,|$)'   -- changed       , '\1\3'                   -- changed     )   , ','  )  

test cases

-- works select regexp_count ( regexp_replace ( regexp_replace ( '1,"data,and more so","more data,and more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) dual; select regexp_count ( regexp_replace ( regexp_replace ( '1,"data,and more so",2,"more data,and more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) dual;  select regexp_count ( regexp_replace ( regexp_replace ( '1,"""data"",and more so",2,"more data,and more so"', '(^|,)("[^"]*")+(,|$)', '\1\3' ), '(^|,)("[^"]*")+(,|$)', '\1\3' ), ',' ) dual;  -- fails select regexp_count ( regexp_replace ( '1,"data,and more so","more data,and more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) dual; select regexp_count ( regexp_replace ( '1,"data,and more so",2,"more data,and more so"', '(^|,)"[^"]*"(,|$)', '\1\2' ), ',' ) dual; 

Comments