i graphing out letter frequency in large academic documents. part of process, sorting letters large clippings of documents alphabetical order. using python's
built in sorted function, , started wonder if make faster. wrote following function:
def count_sort(l): items = {'a':0,'b':0,'c':0,'d':0,'e':0,'f':0,'g':0,'h':0,'i':0,'j':0,'k':0,'l':0,'m': 0,'n':0,'o':0,'p':0,'q':0,'r':0,'s':0,'t':0,'u':0,'v':0,'w':0,'x':0,'y':0,'z' :0} item in l: items[item] += 1 sort_l = [] key in items: sort_l += key*items[key] return sort_l
when testing code versus sorted
on 10000
letter long string of text, 20x
faster.
with such performance boost, why isn't sorting algorithm in standard libs
?
you have rediscovered counting sort algorithm.
to quote wikipedia:
for problem instances in maximum key value smaller number of items, counting sort can highly space-efficient, storage uses other input , output arrays count array uses space o(k).
the counting sort algorithm becomes more , more efficient(relatively) greater difference between total number of items being sorted, , number of unique items being sorted.
you can see why must looking @ own code, or wikipedia example code:
# calculate histogram of key frequencies: x in input: count[key(x)] += 1 # calculate starting index each key: total = 0 in range(k): # = 0, 1, ... k-1 oldcount = count[i] count[i] = total total += oldcount # copy output array, preserving order of inputs equal keys: x in input: output[count[key(x)]] = x count[key(x)] += 1 return output
you have 2 loops in function: first iterate on letters sorting, , second iterate on items dictionary. posited previously, makes sense of items dictionary considerably smaller list sorting, becomes inefficient if number number of unique elements increases relative number of items being sorted.
like @brenbarn answered, when know characters expect , you're willing ignore other characters. while seems counting sort highly efficient in example gave, problem of sorting letters hardly common sorting problem.
below have fixed function print letters iterating through list rather iterating through keys in dictionary(as python's dictionaries not ordered)
def count_sort(l): letters = [chr(i) in range(97, 122)] items = dict() letter in letters: items[letter] = 0 item in l: items[item] += 1 sort_l = list() letter in letters: sort_l.extend(letter*items[letter]) return sort_l
Comments
Post a Comment