I have two text files. One of them contains the whole text (text1); the other contains a number of unique words from text1. I need to calculate the monogram (unigram) values and write them to a file. I've tried this:

    def countwords(mytext):
        import codecs
        file = codecs.open(mytext, 'r', 'utf_8')
        count = 0
        mytext = file.readlines()
        for line in mytext:
            words = line.split()
            for word in words:
                count = count + 1
        file.close()
        return(count)

    def calculatemonogram(path, lex):
        fid = open(path, 'r', encoding='utf_8')
        mypath = fid.read().split()
        fid1 = open(lex, 'r', encoding='utf_8')
        mylex = fid1.read().split()
        for word1 in mylex:
            if word1 in mypath:
                x = dict((word1, mypath.count(word1)) for word1 in mylex)
                for value in x:
                    monogram = '\t' + str(value / countwords(lex))
                    table.write(monogram)

You can use collections.Counter and re.sub:
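
    import re
    import collections

    with open("input.txt") as f1, open("sub_input.txt") as f2:
        # Matches any character that is not an ASCII letter.
        pattern = "[^a-zA-Z]"
        # Count every word of the full text after stripping non-letters.
        frequencies = collections.Counter(
            [re.sub(pattern, "", word.strip())
             for line in f1.readlines() for word in line.split()]
        )
        # Look up the count of each word listed in the second file.
        print([frequencies[word]
               for line in f2.readlines() for word in line.split()])
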
The above prints [4, 2] with input.txt:

    asd, asd. lkj lkj sdf sdf .asd wqe qwe kl dsf asd,. wqe

and sub_input.txt:

    asd sdf

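The snippet above prints raw counts, while the question also asks for monogram values written to a file. A minimal sketch of that step follows. It assumes a monogram value is the word's count divided by the total number of words in the text, which the question does not state explicitly. The names `unigram_frequencies` and `write_unigram_table` and the tab-separated output layout are hypothetical. It also keeps the ASCII-only pattern from the answer, so it would need a different pattern for non-English text.

```python
import collections
import re

NON_LETTERS = re.compile(r"[^a-zA-Z]")


def unigram_frequencies(text_path, lexicon_path):
    """Return {word: count / total_tokens} for each word in the lexicon file."""
    with open(text_path, encoding="utf-8") as f:
        # Strip non-letter characters from every whitespace-separated token.
        tokens = [NON_LETTERS.sub("", word) for word in f.read().split()]
    # Drop tokens that were pure punctuation and are now empty.
    tokens = [token for token in tokens if token]
    counts = collections.Counter(tokens)
    total = len(tokens)

    with open(lexicon_path, encoding="utf-8") as f:
        lexicon = f.read().split()

    # Assumption: monogram value = occurrences / total tokens in the text.
    return {word: (counts[word] / total if total else 0.0) for word in lexicon}


def write_unigram_table(freqs, out_path):
    """Write one 'word<TAB>frequency' line per lexicon word."""
    with open(out_path, "w", encoding="utf-8") as out:
        for word, freq in freqs.items():
            out.write(f"{word}\t{freq}\n")
```

It could be called as `write_unigram_table(unigram_frequencies("input.txt", "sub_input.txt"), "table.txt")`, where `table.txt` is a placeholder output name.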
Breaking it down in case the code is unclear:

- collections.Counter(iterable) constructs an unordered collection with the elements of iterable as dictionary keys and the number of times they occur as dictionary values.
- The regex pattern [^a-zA-Z] matches any character not in the range a-z or A-Z.
- re.sub(pattern, substitute, string) substitutes the substrings matched by pattern with substitute in string. In this case, it replaces non-letter characters with an empty string.
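
The pieces listed above can be tried in isolation, without any files. This is a small sketch that uses the sample line from input.txt as an in-memory string; the variable names are only illustrative.

```python
import collections
import re

text = "asd, asd. lkj lkj sdf sdf .asd wqe qwe kl dsf asd,. wqe"

# re.sub removes every character that is not an ASCII letter.
cleaned = [re.sub("[^a-zA-Z]", "", word) for word in text.split()]

# Counter maps each distinct cleaned word to its number of occurrences.
frequencies = collections.Counter(cleaned)

print(re.sub("[^a-zA-Z]", "", "asd,."))        # asd
print(frequencies["asd"], frequencies["sdf"])  # 4 2
print(frequencies["missing"])                  # 0 (Counter returns 0 for unseen keys)
```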