java - Regular expression to split by forward slash


I have a parse tree that includes some information. To extract the information I need, I am using code that splits the string on the forward slash (/), but it is not perfect code. I explain in more detail here:

I had used this code in a project earlier, and it worked perfectly. But the parse trees of my new dataset are more complicated, and the code sometimes makes wrong decisions.

The parse tree looks like this:

(top~did~1~1 (s~did~2~2 (npb~i~1~1 i/prp ) (vp~did~3~1 did/vbd not/rb (vp~read~2~1 read/vb (npb~article~2~2 the/dt article/nn ./punc. ) ) ) ) )  

As you can see, the leaves of the tree are the words right before the forward slashes. To get these words, I have used this code before:

parse_tree.split("/"); 

But now, in my new data, I see instances like these:

1) (top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )

where there are multiple slashes due to website addresses (in this case, only the last slash is the separator of the word).

2) (npb~sister~2~2 your/prp$ sister/nn //punc: )

where the slash is a word itself.

Could you please help me replace my current simple regular expression with an expression that can manage these cases?

To summarize what I need: I need a regular expression that can split on the forward slash, but it must be able to manage 2 exceptions: 1) if there is a website address, it must split on the last slash only. 2) if there are 2 consecutive slashes, it must split on the second slash (the first slash must not be considered a separator, since it is a word).
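To make the two exceptions concrete, here is a minimal sketch of what the plain split does on the two problem lines. The class name `NaiveSplitDemo` and the method `naiveSplit` are illustrative names only, not part of the project.

```java
import java.util.Arrays;

final class NaiveSplitDemo {
    // Plain split on every forward slash, as in the one-liner shown earlier.
    static String[] naiveSplit(String parseTree) {
        return parseTree.split("/");
    }

    public static void main(String[] args) {
        String withUrl = "(top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )";
        String withSlashWord = "(npb~sister~2~2 your/prp$ sister/nn //punc: )";

        // The URL is cut at every one of its internal slashes.
        System.out.println(Arrays.toString(naiveSplit(withUrl)));
        // The slash that is itself a word is consumed as a separator,
        // leaving an empty piece between the two slashes.
        System.out.println(Arrays.toString(naiveSplit(withSlashWord)));
    }
}
```

The first line comes back in 10 pieces instead of the 4 that are wanted, and the second line comes back in 5 pieces, one of them empty, with the slash word lost.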

I achieved what you requested by following this article:

http://www.rexegg.com/regex-best-trick.html

Just to summarize, here is the strategy:

First, you need to create a regex in this format:

notthis | neitherthis | (iwantthis) 

After that, capture group $1 will contain only the slashes you are interested in for performing the splits.

You can then replace them with something less likely to occur in the text, and after that perform the split on that replaced term.
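As a minimal sketch of this strategy, reduced to the double-slash exception only: the names `BestTrickSketch`, `PLACEHOLDER`, and `splitOnWantedSlashes` are illustrative, and the placeholder text is an assumption (it must be something that never occurs in the input).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BestTrickSketch {
    // Assumed never to occur in the input; any sufficiently unlikely token works.
    static final String PLACEHOLDER = "<<SEP>>";

    // notthis | (iwantthis): the first alternative consumes the first slash of "//",
    // so only the slashes matched by the second alternative end up in group 1.
    static final Pattern SLASHES = Pattern.compile("/(?=/)|(/)");

    static String[] splitOnWantedSlashes(String input) {
        Matcher m = SLASHES.matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Wanted slash: swap in the placeholder. Unwanted match: put it back unchanged.
            String replacement = (m.group(1) != null) ? PLACEHOLDER : m.group(0);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        // Pattern.quote treats the placeholder as literal text, not as a regex.
        return sb.toString().split(Pattern.quote(PLACEHOLDER));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnWantedSlashes("sister/nn //punc:")));
    }
}
```

For the input `sister/nn //punc:` this yields the three pieces `sister`, `nn /`, and `punc:`, so the slash word survives the split.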

So, having this strategy in mind, here's the code:

regex:

\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/) 

explanation:

The notthis term is a double slash with a lookahead (so that it takes only the 1st slash):

\\/(?=\\/) 

The neitherthis term is just a basic URL check with a lookahead, so that the last \/ is not captured:

(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/) 

The iwantthis term is simply the slash:

(\\/) 
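To see which alternative fires for each match, a small sketch like the following can help. It uses the same regex; the class `AlternativeInspector` and the method `describe` are illustrative names, and the labels "separator" and "skipped" are my own wording.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternativeInspector {
    static final Pattern P = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // One entry per match: the matched text, labelled by whether group 1 took part.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            String label = (m.group(1) != null) ? "separator: " : "skipped: ";
            out.add(label + m.group(0));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "(top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )";
        for (String entry : describe(line)) System.out.println(entry);
    }
}
```

On the URL line this reports four matches: a separator slash, the URL up to (but not including) its last slash as a skipped match, and then two more separator slashes.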

In Java code you can put it all together like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");
Matcher m = p.matcher("(top~did~1~1 (s~did~2~2 (npb~i~1~1 i/prp ) (vp~did~3~1 did/vbd not/rb (vp~read~2~1 read/vb (npb~article~2~2 the/dt article/nn ./punc. ) ) ) ) )\n(top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )\n(npb~sister~2~2 your/prp$ sister/nn //punc: )");
StringBuffer b = new StringBuffer();
while (m.find()) {
    // Group 1 is set only for the slashes we want to split on.
    if (m.group(1) != null) m.appendReplacement(b, "superman");
    // Otherwise put the match back unchanged; quoteReplacement keeps any '$' or '\' literal.
    else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
}
m.appendTail(b);
String replaced = b.toString();
System.out.println("\n" + "*** replacements ***");
System.out.println(replaced);

// Split on the placeholder that now marks every real separator.
String[] splits = replaced.split("superman");
System.out.println("\n" + "*** splits ***");
for (String split : splits) System.out.println(split);

output:

*** replacements ***
(top~did~1~1 (s~did~2~2 (npb~i~1~1 isupermanprp ) (vp~did~3~1 didsupermanvbd notsupermanrb (vp~read~2~1 readsupermanvb (npb~article~2~2 thesupermandt articlesupermannn .supermanpunc. ) ) ) ) )
(top sourcesupermannn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmsupermanx .superman. )
(npb~sister~2~2 yoursupermanprp$ sistersupermannn /supermanpunc: )

*** splits ***
(top~did~1~1 (s~did~2~2 (npb~i~1~1 i
prp ) (vp~did~3~1 did
vbd not
rb (vp~read~2~1 read
vb (npb~article~2~2 the
dt article
nn .
punc. ) ) ) ) )
(top source
nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm
x .
. )
(npb~sister~2~2 your
prp$ sister
nn /
punc: )
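If a placeholder word feels fragile (it would break if the placeholder ever appeared in the data), a variant is to cut the string directly at the positions where group 1 matched. This is only a sketch under the same regex; `SeparatorSplitter` and `split` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SeparatorSplitter {
    static final Pattern P = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Cuts the input at every slash captured by group 1; no placeholder is needed.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = P.matcher(input);
        int last = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                parts.add(input.substring(last, m.start(1)));
                last = m.end(1);
            }
            // Matches without group 1 (slash words, URLs) stay inside the current piece.
        }
        parts.add(input.substring(last));
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("(npb~sister~2~2 your/prp$ sister/nn //punc: )")) {
            System.out.println(part);
        }
    }
}
```

One difference from `String.split` is that this version keeps a trailing empty piece when the input ends with a separator slash.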
