probability distribution estimation task
Estimating the (Joint) Probability Distribution. A set of data (of type
T) is often assumed to be a sample taken from a population according to a
probability distribution. A probability distribution/density function assigns a
non-negative probability/density to each object of type T. Probably the most
general data mining task (Hand et al. 2001) is the task of estimating the (joint)
probability distribution D over type T from a set of data items or a sample
drawn from that distribution.
As mentioned above, in the most typical case we would have T = Tuple(T1,
. . ., Tk), where each of T1, . . ., Tk is Boolean, Discrete(S) or Real. We talk about
the joint probability distribution to emphasize the difference to the marginal
distributions of each of the variables of type T1, . . ., Tk: the joint distribution
captures the interactions among the variables.
Representing multi-variate distributions is a non-trivial task. Two approaches
are commonly used in data mining. In the density-based clustering paradigm,
mixtures of multi-variate Gaussian distributions are typically considered (Hand
et al. 2001). Probabilistic graphical models, most notably Bayesian networks,
represent graphically the (in)dependencies between the variables: Learning their
structure and parameters is an important approach to the problem of estimating
the joint probability distribution.
data mining task