Python: Excluding Numbers From A Text Corpus

I'm trying to make a corpus of unique lexemes (with their frequencies) extracted from a Persian text file. I turned the text into a list and, based on that, created a list of unique tokens. I wanted to exclude numeral tokens before writing the final list to the corpus, so I wrote:

    import re

    fileobj = open('textoftexts.txt', 'r', encoding='utf8')
    corpobj = open('mycorpus.txt', 'w', encoding='utf8')
    folist = fileobj.read().split(' ')
    digilist = re.findall(r'\b[+-]?\d+\b', fileobj.read())  # list of numeral tokens
    unitokens = list(set(folist))  # list of unique tokens

    for unilex in unitokens:
        if unilex not in digilist:  # this if-block is supposed to exclude numbers
            unicount = folist.count(unilex)  # counts frequencies of tokens
            corpobj.write(unilex + '\t' + unicount + '\n')
    fileobj.close()
    corpobj.close()

I've tested the contents of digilist. The regex correctly lists the integers that exist in the main text file, but the text corpus has still got the integers and their frequencies inside it. What am I doing wrong? How can I drop the integers before they enter the corpus?

P.S.: I also wrote this if-block, but it didn't do the desired job either:

    if unilex in digilist:
        continue
    else:
        ...
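For comparison, here is a minimal sketch of one way the filtering could be done. It is not a confirmed fix for the code above, only an illustration under a few assumptions. Three things in the posted code look suspicious: `fileobj.read()` is called twice on the same file object, and the second call returns an empty string once the file has been read to the end; `split(' ')` splits only on single spaces, so a token such as `'12\n'` would not equal `'12'`; and `unilex + '\t' + unicount` concatenates a string with an integer. The names `NUMERAL`, `build_corpus` and `write_corpus` are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Matches a token made up only of an optionally signed run of digits.
# For str patterns in Python 3, \d also matches other Unicode decimal
# digits, which includes Persian digits such as ۱۲۳.
NUMERAL = re.compile(r'[+-]?\d+')


def build_corpus(text):
    """Return a Counter of the tokens in text, skipping numeral tokens."""
    tokens = text.split()  # no argument: split on any whitespace run
    return Counter(tok for tok in tokens if not NUMERAL.fullmatch(tok))


def write_corpus(in_path, out_path):
    """Read in_path once and write 'token<TAB>count' lines to out_path."""
    with open(in_path, 'r', encoding='utf8') as src:
        text = src.read()  # read once and reuse the string
    counts = build_corpus(text)
    with open(out_path, 'w', encoding='utf8') as dst:
        for token, count in counts.most_common():
            dst.write(f'{token}\t{count}\n')
```

With the file names from the question, this would be called as `write_corpus('textoftexts.txt', 'mycorpus.txt')`. Testing each token with `fullmatch` avoids building a separate list of numerals, and `Counter` counts every token in one pass instead of calling `folist.count` once per unique token.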
