Odelimitedcharactersinthemforexamplethumbgreenappleactiveassignmentweeklymetapho I am using this quick-and-dirty 125k-word dictionary I put together from a small subset of Wikipedia.īefore: thumbgreenappleactiveassignmentweeklymetaphor.Īfter: thumb green apple active assignment weekly metaphor.īefore: thereismassesoftextinformationofpeoplescommentswhichisparsedfromhtmlbuttherearen Which you can use with s = 'thumbgreenappleactiveassignmentweeklymetaphor' # Backtrack to recover the minimal-cost string. Return min((c + wordcost.get(s, 9e999), k+1) for k,c in candidates) # Returns a pair (match_cost, match_length).Ĭandidates = enumerate(reversed(cost)) # been built for the i-1 first characters. # Find the best match for the i first characters, assuming cost has """Uses dynamic programming to infer the location of spaces in a string Wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words)) Words = open("words-by-frequency.txt").read().split() # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability). Instead of directly using the probability we use a cost defined as the logarithm of the inverse of the probability to avoid overflows. The most likely sentence is the one that maximizes the product of the probability of each individual word, and it's easy to compute it with dynamic programming. Once you have fixed the model, you can use dynamic programming to infer the position of the spaces. It is reasonable to assume that they follow Zipf's law, that is the word with rank n in the list of words has probability roughly 1/( n log N) where N is the number of words in the dictionary. Then you only need to know the relative frequency of all words. A good first approximation is to assume all words are independently distributed. The best way to proceed is to model the distribution of the output. (If you want an answer to your original question which does not use word frequency, you need to refine what exactly is meant by "longest word": is it better to have a 20-letter word and ten 3-letter words, or is it better to have five 10-letter words? Once you settle on a precise definition, you just have to change the line defining wordcost to reflect the intended meaning.) The idea Here is a 20-line algorithm that exploits relative word frequency to give accurate results for real-word text. A naive algorithm won't give good results when applied to real-world data.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |