I have a parse tree that includes some information. To extract the information I need, I am using code that splits the string on the forward slash (/), but the code is not perfect. I explain the details here:
I had used this code in an earlier project, where it worked perfectly. But the parse trees of my new dataset are more complicated, and the code sometimes makes wrong decisions.
The parse tree looks like this:
(top~did~1~1 (s~did~2~2 (npb~i~1~1 i/prp ) (vp~did~3~1 did/vbd not/rb (vp~read~2~1 read/vb (npb~article~2~2 the/dt article/nn ./punc. ) ) ) ) )
As you can see, the leaves of the tree are the words right before the forward slashes. To get these words, I have used this code before:
parse_tree.split("/");
But now, in my new data, I see instances like these:
1) (top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )
where there are multiple slashes due to website addresses (in this case, only the last slash is the separator of the word).
2) (npb~sister~2~2 your/prp$ sister/nn //punc: )
where the slash is a word itself.
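To make the failure concrete, here is a minimal sketch of how the plain split behaves on these two instances. The class name `NaiveSplitDemo` and the helper `naiveSplit` are made up for illustration; only `String.split` comes from the code above.

```java
import java.util.Arrays;

class NaiveSplitDemo {
    // Splits on every forward slash, exactly like parse_tree.split("/").
    static String[] naiveSplit(String parseTree) {
        return parseTree.split("/");
    }

    public static void main(String[] args) {
        String withUrl = "(top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )";
        String slashWord = "(npb~sister~2~2 your/prp$ sister/nn //punc: )";

        // The URL is cut at each of its internal slashes, so the leaf word is lost.
        String[] urlPieces = naiveSplit(withUrl);
        System.out.println(urlPieces.length + " pieces: " + Arrays.toString(urlPieces));

        // The slash that is itself a word disappears: an empty piece is left
        // between the two consecutive slashes.
        String[] slashPieces = naiveSplit(slashWord);
        System.out.println(slashPieces.length + " pieces: " + Arrays.toString(slashPieces));
    }
}
```

The first string has only three real word/tag separators, so a correct split would give four pieces rather than the ten produced here.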
Could you please help me replace my current simple regular expression with an expression that can manage these cases?
To summarize what I need: a regular expression that splits on the forward slash but handles two exceptions: 1) if there is a website address, it must split on the last slash only; 2) if there are two consecutive slashes, it must split on the second slash (the first slash must not be considered a separator; it is a word).
I achieved what you requested by following this article:
http://www.rexegg.com/regex-best-trick.html
Just to summarize, here is the overall strategy:
First, you need to create a regex in this format:
notthis | neitherthis | (iwantthis)
After that, capture group $1 will contain only the slashes you are interested in for performing the splits.
You can then replace those slashes with something less likely to occur in the text, and afterwards perform the split on the replacement term.
So, with this strategy in mind, here is the code:
Regex:
\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)
Explanation:
The notthis term is a double slash matched with a lookahead (so that it takes only the first slash):
\\/(?=\\/)
The neitherthis term is just a basic URL check with a lookahead, so that it does not consume the last \/:
(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)
The iwantthis term is simply the slash:
(\\/)
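To check which slashes actually end up in group 1, a small sketch like the following may help. The class name `SeparatorFinder`, the method `separatorOffsets`, and the shortened `www.example.com` URL are made up for illustration; the pattern is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorFinder {
    // notthis | neitherthis | (iwantthis), as explained above.
    static final Pattern SLASHES = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Returns the offsets of the slashes captured by group 1,
    // i.e. the slashes that should be treated as separators.
    static List<Integer> separatorOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = SLASHES.matcher(text);
        while (m.find()) {
            // Group 1 is null when one of the first two alternatives matched.
            if (m.group(1) != null) {
                offsets.add(m.start(1));
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Double slash: only the second slash is a separator.
        System.out.println(separatorOffsets("sister/nn //punc:"));
        // URL: only the slash after ".htm" and the one in "./." are separators.
        System.out.println(separatorOffsets("http://www.example.com/a/b.htm/x ./."));
    }
}
```

Note that the URL alternative requires the literal `www`, so an address without it would not be protected by this pattern and its internal slashes would be captured as separators.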
In Java code, you can put it all together like this:
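import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlashSplit {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");
        Matcher m = p.matcher("(top~did~1~1 (s~did~2~2 (npb~i~1~1 i/prp ) (vp~did~3~1 did/vbd not/rb (vp~read~2~1 read/vb (npb~article~2~2 the/dt article/nn ./punc. ) ) ) ) )\n(top source/nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm/x ./. )\n(npb~sister~2~2 your/prp$ sister/nn //punc: )");
        StringBuffer b = new StringBuffer();
        while (m.find()) {
            if (m.group(1) != null) {
                // A separator slash: swap it for the placeholder.
                m.appendReplacement(b, "superman");
            } else {
                // A protected match (first slash of "//" or a URL): keep it unchanged.
                // quoteReplacement guards against '$' or '\' in the matched text.
                m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));
            }
        }
        m.appendTail(b);
        String replaced = b.toString();
        System.out.println("\n" + "*** replacements ***");
        System.out.println(replaced);

        // Split on the placeholder instead of on the raw slash.
        String[] splits = replaced.split("superman");
        System.out.println("\n" + "*** splits ***");
        for (String split : splits) {
            System.out.println(split);
        }
    }
}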
Output:

*** replacements ***
(top~did~1~1 (s~did~2~2 (npb~i~1~1 isupermanprp ) (vp~did~3~1 didsupermanvbd notsupermanrb (vp~read~2~1 readsupermanvb (npb~article~2~2 thesupermandt articlesupermannn .supermanpunc. ) ) ) ) )
(top sourcesupermannn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htmsupermanx .superman. )
(npb~sister~2~2 yoursupermanprp$ sistersupermannn /supermanpunc: )

*** splits ***
(top~did~1~1 (s~did~2~2 (npb~i~1~1 i
prp ) (vp~did~3~1 did
vbd not
rb (vp~read~2~1 read
vb (npb~article~2~2 the
dt article
nn .
punc. ) ) ) ) )
(top source
nn http://www.alwatan.com.sa/daily/2007-01-31/first_page/first_page01.htm
x .
. )
(npb~sister~2~2 your
prp$ sister
nn /
punc: )
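If the placeholder word could itself appear in the data, one possible variation is to cut the string directly at the group 1 offsets instead of replacing and re-splitting. This is only a sketch under that assumption, not part of the original answer; the class name `SlashSplitter` and the method `splitOnSeparators` are hypothetical, and the pattern is the same as above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SlashSplitter {
    static final Pattern SLASHES = Pattern.compile(
            "\\/(?=\\/)|(?:http:\\/\\/)?www[\\w\\.\\/\\-]*(?=\\/)|(\\/)");

    // Splits the text at every slash captured by group 1, without a placeholder.
    static List<String> splitOnSeparators(String text) {
        List<String> pieces = new ArrayList<>();
        Matcher m = SLASHES.matcher(text);
        int pieceStart = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                // Emit everything since the previous separator, then skip the slash.
                pieces.add(text.substring(pieceStart, m.start(1)));
                pieceStart = m.end(1);
            }
        }
        // The remainder after the last separator.
        pieces.add(text.substring(pieceStart));
        return pieces;
    }

    public static void main(String[] args) {
        for (String piece : splitOnSeparators("(npb~sister~2~2 your/prp$ sister/nn //punc: )")) {
            System.out.println(piece);
        }
    }
}
```

Unlike `String.split`, this version keeps trailing empty pieces, which may or may not matter for a given dataset.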