Cluster Analysis using R

Course Content:

Cluster analysis provides a suite of techniques for grouping observations, or variables, into clusters. Clusters are formed so that observations in the same group are more similar to each other than to observations in different groups. The methods are useful for discovering new sub-populations within a population and for optimally partitioning sets of observations. Use of these methods has grown in recent years, with applications ranging from analysing social/market segregation to optimally grouping students in classrooms according to skill mastery.

The course used neighbourhood statistics and simulated data to illustrate how cluster analysis techniques can be applied to urban socio-economic issues.

Course objectives:

  • To offer intensive training in cluster analysis using a variety of R packages including: stats, cluster, mclust, pgmm and poLCA (plus clustMD, flexmix and longclust if time allows)

Topics covered in this course:

The workshop required personal work and interaction amongst the participants and instructors. Each component of the workshop consisted of a lecture followed by a computer practical in R based on real case studies in the natural and social sciences. The one-day training programme consisted of the following components:

i. Introductions – outline; cluster analysis overview; R; R cluster packages; case study data sets; bibliography and resources
ii. Exploring data with visualisation methods – methods for spatially exploring univariate and bivariate data; multidimensional scaling
iii. Classical clustering methods – hierarchical clustering with different linkages; partitioning methods including k-means and k-medoids
iv. Model-based clustering – finite mixture models; information criteria; mixtures of factor analysers for high-dimensional data
v. Basic latent class analysis – clustering for categorical data
vi. Model-based clustering for mixed data (time permitting) – finite mixture clustering for data sets with both continuous and categorical variables
vii. Mixture of regressions (time permitting) – groups of regressions (rather than a single population regression)
viii. Longitudinal data clustering (time permitting) – finite mixture clustering for longitudinal data
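
As a flavour of the classical methods in component iii, the following is a minimal sketch in base R and is not taken from the course materials. It assumes simulated two-dimensional data with two well-separated groups; the object names (group_a, group_b, x, hc_labels) are illustrative only. It uses hclust() and kmeans() from the stats package listed in the course objectives.

```r
# Simulated data (an assumption for illustration): two well-separated
# groups of 50 observations each, in two dimensions
set.seed(1)
group_a <- matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2)
group_b <- matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2)
x <- rbind(group_a, group_b)

# Hierarchical clustering: average linkage on Euclidean distances,
# then cut the tree into two clusters
hc <- hclust(dist(x), method = "average")
hc_labels <- cutree(hc, k = 2)

# k-means with two clusters and several random starts
km <- kmeans(x, centers = 2, nstart = 10)

# Cross-tabulate the two partitions to see how far they agree
agreement <- table(hierarchical = hc_labels, kmeans = km$cluster)
print(agreement)
```

With groups this far apart both methods should recover the same partition, up to a relabelling of the clusters; on real data the choice of linkage and of the number of clusters typically matters much more.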

About the Trainer:

Dr Nema Dean’s research interests are in developing new clustering and classification methods. Past work has involved research on finite mixture model-based methods and variations that incorporate variable selection and semi-supervised updating. She is currently working on creating hybrid clustering methods using both parametric and classical algorithmic approaches. She has also developed new mixture model clustering methods for discrete and space-restricted data. Social network analysis and dynamic treatment regimes are also current areas of interest. Application areas she has worked on include housing markets, cDNA microarrays, electronic educational testing, food authenticity studies and many others. Dr Dean is on the committee for an International Federation of Classification Societies initiative to promote good benchmarking practices in clustering research.

Request the training materials
