Abstract
Pattern discovery is a ubiquitous problem in many disciplines. It is especially prominent in recent years due to our greatly improved data-generation capabilities in science and technologies. The model and methods I present here is motivated by the "motif-finding" and "module-finding" problems in biology, the "market-basket problem" in data mining, and text analysis in studying chinese history books. In these problems, a common challenge is to discover which "items" (or, key words in text mining, and regulatory elements in biology) tend to co-occur with which others, i.e., to find association rules among the items. In market-basket problems, the observations are customers' transactions (i.e., "bastkets"), each contains multiple items. We can imagine that each basket is composed by a few "themes" selected by the custermer and each theme is a set of items that are bought together (an analogy is stamp-collecting: a person's collection of stamps can be organized as "sets"). Our goal is to discover these themes from only the transactions. Inspired by a dictionary model proposed by Bussemaker, Li and Siggia (2000), we propose a "theme dictionary model", which prescribes a probabilistic rule for generating each transaction. We then used both the EM and Monte Carlo strategies to aid our inference of the themes.
In text analysis and biological sequence analysis, an added difficulty is that the "items" are some phrases and sequence patterns, which are not all known in advance. In this case, we can combine a motif finding strategy with the theme dictionary model to complete the analysis. Existing motif-finding methods are mostly "bottom-up" approaches, i.e., to build up the dictionary starting with single-letter words and then concatenate some existing words that appear to occur next to each other in sentences more frequently than chance. Our new approach is a top-down strategy, which uses a tree structure to represent the relationship among all possible existing words and uses the EM algorithm to estimate the usage frequency of each word. It automatically trims down most of the incorrect "words" by letting their usage frequencies converge to zero.
I will demonstrate its applications in a few examples including an analysis of a Chinese novel, some Chinese history books, and publications in PNAS in the past 50 years. This is based on a joint work with Ke Deng, Zhi Geng, Chunlin Ji, and Peter Bol.