Calculate monogram in Python


I have two text files. One of them holds the whole text (text1); the other holds the unique words in text1. I need to calculate the monogram for each of those words and write the result to a file. I've tried this:

    def countwords(mytext):
        import codecs
        file = codecs.open(mytext, 'r', 'utf_8')
        count = 0
        mytext = file.readlines()
        for line in mytext:
            words = line.split()
            for word in words:
                count = count + 1
        file.close()
        return(count)

    def calculatemonogram(path, lex):
        fid = open(path, 'r', encoding='utf_8')
        mypath = fid.read().split()
        fid1 = open(lex, 'r', encoding='utf_8')
        mylex = fid1.read().split()
        for word1 in mylex:
            if word1 in mypath:
                x = dict((word1, mypath.count(word1)) for word1 in mylex)
            for value in x:
                monogram = '\t' + str(value / countwords(lex))
                table.write(monogram)

You can use collections.Counter and re.sub:

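    import re
    import collections

    with open("input.txt") as f1, open("sub_input.txt") as f2:
        # Anything that is not an ASCII letter gets stripped from each word.
        pattern = "[^a-zA-Z]"
        frequencies = collections.Counter(
            [re.sub(pattern, "", word.strip())
             for line in f1.readlines() for word in line.split()])
        # Look up the count of every word listed in sub_input.txt.
        print([frequencies[word] for line in f2.readlines() for word in line.split()])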

The above prints [4, 2] for this input.txt:

asd, asd. lkj lkj  sdf sdf .asd  wqe qwe kl dsf asd,. wqe 

And this sub_input.txt:

asd sdf 
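
As a quick check of the output above, the same counting can be reproduced without any files. This is only a sketch: the variable names `text`, `lexicon` and `counts` are made up for illustration, and the two strings simply copy the example file contents.

```python
import collections
import re

# Same contents as the example input.txt and sub_input.txt, held in memory.
text = "asd, asd. lkj lkj  sdf sdf .asd  wqe qwe kl dsf asd,. wqe"
lexicon = "asd sdf"

pattern = "[^a-zA-Z]"
# Strip non-letters from every whitespace-separated token, then count them.
frequencies = collections.Counter(
    re.sub(pattern, "", word) for word in text.split())
# Look up the count of each word in the lexicon, in order.
counts = [frequencies[word] for word in lexicon.split()]
print(counts)  # [4, 2]
```

"asd" is counted four times because "asd,", "asd.", ".asd" and "asd,." all reduce to "asd" once the punctuation is removed.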

Breaking it down in case the code is unclear:

  • collections.Counter(iterable) constructs an unordered collection with the elements of iterable as dictionary keys and the number of times they occur as dictionary values.
  • The regex pattern [^a-zA-Z] matches any character not in the range a-z or A-Z. re.sub(pattern, substitute, string) replaces the substrings of string matched by pattern with substitute. In this case, it replaces every non-letter character with an empty string.
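
The snippet above only prints raw counts, while the question also asks for the values to be written to a file. A minimal sketch of that last step follows, assuming that "monogram" means the relative frequency of a word (its count divided by the total number of cleaned tokens in the text). That definition, the function names `monogram_frequencies` and `write_monograms`, and the tab-separated output format are all assumptions, not something stated in the question.

```python
import collections
import re


def monogram_frequencies(text, lexicon_words):
    """Return {word: count / total_tokens} for each word in lexicon_words."""
    # Remove non-letters from each token; drop tokens that become empty.
    cleaned = (re.sub("[^a-zA-Z]", "", token) for token in text.split())
    counts = collections.Counter(token for token in cleaned if token)
    total = sum(counts.values())
    if total == 0:
        # Empty text: avoid dividing by zero.
        return {word: 0.0 for word in lexicon_words}
    # A Counter returns 0 for missing keys, so unseen words get 0.0.
    return {word: counts[word] / total for word in lexicon_words}


def write_monograms(text_path, lex_path, out_path):
    """Read the text and the word list, write one 'word<TAB>value' per line."""
    with open(text_path, encoding="utf_8") as f:
        text = f.read()
    with open(lex_path, encoding="utf_8") as f:
        lexicon_words = f.read().split()
    freqs = monogram_frequencies(text, lexicon_words)
    with open(out_path, "w", encoding="utf_8") as out:
        for word in lexicon_words:
            out.write(f"{word}\t{freqs[word]}\n")
    return freqs
```

It could be called as `write_monograms("input.txt", "sub_input.txt", "monograms.txt")`, where the output file name is again hypothetical. Note that the pattern [^a-zA-Z] also strips non-ASCII letters, so text in other alphabets would need a different pattern.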
