TRIE木全体を1つの論理式で表す

あらすじ

reTRIEval treeの名の通り、恐るべき検索効率を誇るTRIE木。これをPrologで実装しようとした所、最終的には「単語集合を1つの節に変換するRubyのスクリプト」ができあがったので、記録として書いておきます。なお、この記事はTRIE木もRubyもPrologも知っているという人を対象に書かれています。

アイディア

例えば、{cat, cake, bcas}という単語集合からTRIEを作る場合は、以下のような節になります*1。
意味としては、「(第1字=c and 第2字=a and (第3字=t or (第3字=k and 第4字=e))) or (第1字=b and 第2字=c and 第3字=a and 第4字=s)」といったところでしょうか。

word(T0) :-
(T0 = [H1 | T1], (
(H1 = 99, (
  (T1 = [H2 | T2], (
  (H2 = 97, (
    (T2 = [H3 | T3], (
    (H3 = 116, (
      T3 = []
    ))
      ;
    (H3 = 107, (
      (T3 = [H4 | T4], (
      (H4 = 101, (
        T4 = []
      ))
      ) )
    ))
    ) )
  ))
  ) )
))
  ;
(H1 = 98, (
  (T1 = [H2 | T2], (
  (H2 = 99, (
    (T2 = [H3 | T3], (
    (H3 = 97, (
      (T3 = [H4 | T4], (
      (H4 = 115, (
        T4 = []
      ))
      ) )
    ))
    ) )
  ))
  ) )
))
) ).

Ruby script

100行足らずなので、ここに貼っておく。使い方は以下のようなかんじ。

ruby gentrie.rb <list.txt >trie.pl

#coding:utf-8

# =usage
# ruby gentrie.rb [clause_name] <input_file >output_file

class Trie
	class Node
		def initialize
			@value = []
			@children = {}
			@term = false
		end

		attr_accessor :value
		attr_accessor :children
		attr_accessor :term
	end

	def initialize
		@root = createNode('* root *')
	end

	def insert str
		def insert_rec(array, node)
			if array.size == 0
				node.term = true
				return
			end

			head = array.shift
			rest = array
			if node.children[head] == nil
				node.children[head] = createNode(head)
			end
			insert_rec(rest, node.children[head])
		end

		insert_rec(str.split(//), @root)
	end

	def createNode(value)
		n = Node.new
		n.value.push value
		return n
	end

	def to_clause(sep = false, clause = 'word')
		def myprint(lv, str)
			print '  '*lv
			puts str
		end
		def to_clause2(node, lv)
			if node.term
				myprint(lv, "T%d = []%s"
					% [lv, (node.children.size > 0 ? ";" : '')])
			end
			return if node.children.size == 0

			myprint(lv, "(T%d = [H%d | T%d], (" % [lv, lv+1, lv+1])
			node.children.keys.each_with_index do |k, idx|
				myprint(lv+1, ';') if idx != 0
				ch = node.children[k]

				myprint(lv, "(H%d = %d, (" % [lv+1, k.ord])
				to_clause2(ch, lv+1)
				myprint(lv, '))')
			end
			myprint(lv, ") )%s" % (lv > 0 ? "" : "."))
		end

		if sep
			# 最初の1文字ごとに節を分割する
			@root.children.keys.each do |k|
				puts "#{clause}(T0) :-"
				to_clause2(@root.children[k], 0)
			end
		else
			puts "#{clause}(T0) :-"
			to_clause2(@root, 0)
		end
	end
end

clause_name = ARGV.shift || 'word'
t = Trie.new

# input
while l = gets
	t.insert l.chomp
end

# convert
t.to_clause(false, clause_name)

実際にやってみる

サイズ

772867語、UTF-8でのべ15MBくらいの、平仮名の単語のリストが、136MBくらいの節になりました。

実行例

「つくば」で始まる単語を列挙してみました。

$ swipl -G1500M -f trie.pl

?- append("つくば", _, W), word(W), string_to_list(Str, W).
W = [12388, 12367, 12400],
Str = "つくば" ;
W = [12388, 12367, 12400, 12356],
Str = "つくばい" ;
W = [12388, 12367, 12400, 12358],
Str = "つくばう" ;
W = [12388, 12367, 12400, 12358, 12385, 12421, 12358],
Str = "つくばうちゅう" ;
W = [12388, 12367, 12400, 12358, 12385, 12421, 12358, 12379, 12435|...],
Str = "つくばうちゅうせんたー" ;

…… 中略

W = [12388, 12367, 12400, 12406, 12427, 12540, 12409, 12426, 12540],
Str = "つくばぶるーべりー" ;
false.

このTRIEで辞書情報をどうやって格納するか

割と簡単な話。実装してないけど、以下のようにすればできるんじゃないかと思う。

word(Str, Feature) :-
% 中略
(H4 = 'a'
 (T4 = [], Feature = ['動詞', '連用形', 'カ行下二段変格活動', ....]);
 .....
)))).

*1:機械的に生成したので、インデントが汚いけどお許しください