One of the classical problems in Natural Language Processing is parts-of-speech tagging i.e. given a sentence of length n, find the sequence of part-of-speech tags for each word in the sentence. For example,

There are several approaches to parts-of-speech tagging, and I’ve managed to build a fairly simple tagger based on something called Hidden Markov Models. There are several resources explaining HMMs out there already so I won’t try explaining that in detail here.

Basically what you need to know is that we model the tag sequence to be a first-order Markov process. This means that the likelihood of a tag occurring is determined by what the previous tag was. So we’re going to have to obtain a map that looks like this:

{:noun {:verb 0.7
        :article 0.1
        :noun 0.2}
 :article {:noun 0.8
           :verb 0.1
           :article 0.1}
 :verb {:noun 0.4
        :verb 0
        :article 0.6}

The probability of a verb following a noun is 0.7, an article following a noun is 0.1, and so on. These are called the transition probabilities.

The next thing we need to obtain is called the emission probabilities, which is the probability of a tag emitting a word. When represented in a Clojure map, this looks something like:

{:noun {"man" 0.04
        "samrat" 0.001
 :verb {"walks" 0.002}
 } ;; and so on

The Brown tagged corpus

To obtain these two maps, we will use the Brown corpus which is a pretty substantial collection of sentences tagged painstakingly by humans way back in the 60s. A sentence in the Brown corpus looks like this:

 The/at three/cd men/nns stepped/vbd out/rp to/in the/at side/nn
 to/to wait/vb for/in Captain/nn-tl Clemens'/np$ signal/nn ./.

Here, “The” is tagged with “at”, “three” with “cd”, and so on. To understand what these tags mean, you could open up this reference but that is unnecessary for now.

Instead of dealing with the raw Brown corpus, we will use a file containing the processed information we need. The contents of this file looks like this:

2 WORDTAG nns 1880s
9 WORDTAG in underneath
1 WORDTAG vbn agitated
13510 1-GRAM cd 
3282 1-GRAM bed 
49 1-GRAM rb-hl
265 2-GRAM vb wrb 
20 2-GRAM jj-tl jj 
5 3-GRAM in jj-tl jj 
6 3-GRAM cc rbr vbn 
1 3-GRAM beg pp$ nn 

The code used to obtain these counts is here. I won’t discuss that here either. You can find the counts file here.

The code(finally!)

The namespace declaration:

(ns pos-tagger.hmm
  (:require [clojure.string :as str]
            [clojure.core.match :refer [match]]))

Next is the code to read brown.counts file mentioned above:

(defn read-brown-counts
  (with-open [rdr ( counts-file)]
    (reduce (fn [[ngram-counts word-tag-counts] line]
              (let [split-line (str/split line #"\s")]
                (match split-line
                       [c "3-GRAM" t u v] [(assoc ngram-counts [t u v] (Integer/parseInt c))
                       [c "2-GRAM" t u] [(assoc ngram-counts [t u] (Integer/parseInt c))
                       [c "1-GRAM" t] [(assoc ngram-counts [t] (Integer/parseInt c))
                       [c "WORDTAG" t word] [ngram-counts
                                             (assoc-in word-tag-counts [t word] (Integer/parseInt c))])))
            [{} {}]
            (line-seq rdr))))

This is simply loading what was in the counts file into memory. This gives us a vector of two maps which we’ll use next:

(let [counts (read-brown-counts "brown.counts")]
  (def ngram-count (first counts))
  (def word-tag-count (second counts))
  (def tag-space (conj (keys word-tag-count) "START")))

tag-space is the set of all possible tags. The "START" tag is something the code producing brown.counts adds to the beginning of sentences. It is useful to have that because it makes the job of calculating the transition probability of the first tag($t(\text{START}\Rightarrow{t_1})$, which is really called the initial state probability) easier.

With this we can obtain the transition and emission probabilities we wanted:

(defn transition-probability
  (try (/ (ngram-count ngram)
          (ngram-count (butlast ngram)))
       (catch Exception _ 0)))

(defn emission-probability
  "Probability of tag emitting word."
  [[tag word]]
  (try (/ (get-in word-tag-count [tag word])
          (apply + (vals (word-tag-count tag))))
       (catch Exception _ 0)))

Now, using these two functions it is possible to calculate the probability of any tag sequence, $t_1,t_2,..t_n$ for a given sentence, $x_1,x_2,..x_n$:

However, what we want to find is the most likely tag sequence, which is simply the tag sequence for which $P(t_1,t_2,..t_n\vert{x_1,x_2,..x_n})$ is the maximum. Now, if given a sentence, an index(posn) and the tag at posn, the following function gives the probability of the most likely tag sequence upto posn:

(def viterbi
  "Returns the probability of the most likely tag sequence upto posn."
   (fn [sentence tag posn]
     (if (= posn 1)
       (* (emission-probability [tag (first sentence)])
          (transition-probability ["START" tag]))
       (* (emission-probability [tag (nth sentence (dec posn))])
          (apply max (map (fn [prev-tag]
                            (* (transition-probability [prev-tag tag])
                               (viterbi sentence prev-tag (dec posn))))

The function is memoized in order to avoid doing repeated computations which occur due to the recursive calls to itself.

To find the most likely tag for a posn, we find the $argmax$ of viterbi:

(defn viterbi-tag
  "Return the most likely tag for word at posn."
  [sentence posn]
  (apply max-key (fn [tag] (viterbi sentence tag posn))

Finally, to get the tag sequence:

(defn tag-sequence
  (map (partial viterbi-tag sentence)
       (range 1 (inc (count sentence)))))