Tutorial for SDM18: The Canonical Polyadic Tensor Decomposition and Variants for Mining Multi-Dimensional Data

Tamara G. Kolda and Daniel M. Dunlavy, Sandia National Laboratories

The SIAM International Conference on Data Mining (SDM18) will be held May 3-5, 2018 in San Diego, CA. The tutorial will be held on Friday, May 4th, in two parts: 1:15-3:15pm and 3:30-5:10pm.

Slides and Labs: The PDF slides (8 MB) are now available, and all of the labs can be downloaded from Google Drive. We have split out some of the data files for convenience; if you download the gas files separately, place them in the labs/data folder. You can download one of the following:

IMPORTANT: This tutorial features interactive lab exercises, but we need you to sign up with the organizers in advance so that we can send you additional information on software installation and data download ahead of the workshop. You are expected to complete the software installation and data download before the workshop begins.

Free MATLAB licenses are available, but you must sign up with the organizers in advance (no later than May 1st if you need a license).

Basic information

Description: Multi-dimensional or multi-way datasets are becoming increasingly common in science and engineering applications. Data that live in three or more dimensions often exhibit informative hidden structure that can be discovered and understood through tensor decompositions. The purpose of this tutorial is to dive deep into the canonical polyadic tensor decomposition (also known as CANDECOMP, PARAFAC, or just CP), giving attendees the mathematical and algorithmic tools to understand existing methods and a strong foundation for developing their own. The tutorial begins with the basics and builds up to very recent developments. It is appropriate for anyone at the graduate level or higher with a basic understanding of numerical methods. A unique feature of this tutorial is the set of hands-on exercises using the Tensor Toolbox for MATLAB to apply tensor decompositions to real-world, open-source datasets. Through these exercises, we hope to give attendees a glimpse into the application of these methods and the open problems that still exist (such as choosing the rank of the tensor decomposition). We expect that most attendees will already have access to MATLAB through their universities, but we also intend to work with MathWorks to obtain temporary licenses for participants. We will work with one dataset that is nearly 2 GB, so we invite participants to download the data ahead of time.
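For reference, the CP model approximates a three-way tensor by a sum of R rank-one terms; in standard notation (which may differ slightly from the notation used in the slides),

    \mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, \mathbf{a}_r \circ \mathbf{b}_r \circ \mathbf{c}_r,

where \circ denotes the vector outer product, the vectors \mathbf{a}_r, \mathbf{b}_r, \mathbf{c}_r are the columns of the factor matrices, and choosing the number of components R is one of the open problems mentioned above.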

Instructors: Tamara G. Kolda and Daniel M. Dunlavy, Sandia National Laboratories

Length: The course consists of two two-hour segments, for a total of four hours. The first two hours focus on mathematical background that is generally familiar only to those already working in tensor decompositions. This lays the groundwork for the second two hours, which cover more advanced topics such as missing data, alternative decompositions, larger data sets, and advanced algorithms. Attendees will benefit from a detailed review of mathematical background that is rarely presented in ordinary conference talks.

Audience

What background will be required of the audience? Students are expected to have a very basic familiarity with numerical algorithms. Experience with numerical linear algebra and optimization is helpful but not required; all definitions will be presented during the tutorial. The tutorial is intended for a broad audience of scientists and engineers without prior experience in tensor analysis.

Why is this topic important/interesting to the SIAM data mining community? Tensor decompositions are ubiquitous in data mining, but there are few books available on the topic.

What is the benefit to participants? Tutorial attendees can expect to gain: (1) experience in applying the CP tensor decomposition to interesting data sets, (2) an understanding of the mathematical formulation of CP and the algorithms used to compute it, and (3) ideas for open problems to solve.

Coverage

The tutorial will be divided into two two-hour parts, but the exact division of the content will be adapted to the participants as needed. Here is an outline of the tutorial topics:

Items marked with an asterisk indicate material that comes primarily from our own research, though they also include an introduction to general related concepts (statistical likelihood calculations, matrix sketching, etc.). The labs will involve real-world data sets from chemometrics, gas monitoring systems, and so on. We also include a very brief MATLAB primer for students who are unfamiliar with it; a small example of the kind of code involved appears below.
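To give a flavor of the hands-on exercises, here is a minimal sketch, assuming the Tensor Toolbox for MATLAB is installed and on the path; the tensor sizes, rank, and random data are illustrative and not taken from the actual labs.

% Minimal sketch: fit a CP model to a synthetic low-rank tensor using cp_als
% from the Tensor Toolbox for MATLAB. Sizes, rank, and data are illustrative only.
R = 3;                                            % number of components (rank)
A = rand(50,R); B = rand(40,R); C = rand(30,R);   % random factor matrices
X = full(ktensor({A,B,C}));                       % dense 50 x 40 x 30 tensor of rank <= 3
M = cp_als(X, R);                                 % CP fit via alternating least squares
fprintf('Relative error: %g\n', norm(X - full(M)) / norm(X));

In the labs, the same workflow is applied to real data (for example, the gas monitoring data mentioned above), where choosing the number of components R is no longer obvious.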

Instructor Biographies

Tamara G. Kolda is a Distinguished Member of Technical Staff at Sandia National Laboratories. Her research interests include multilinear algebra and tensor decompositions, data mining, network/graph algorithms and analysis, numerical optimization, parallel computing, and the design of scientific software. Dr. Kolda is a SIAM Fellow, an ACM Distinguished Member, and a recipient of several awards, including three best paper prizes and a 2003 Presidential Early Career Award for Scientists and Engineers (PECASE). Dr. Kolda wrote one of the key review papers on the topic, "Tensor Decompositions and Applications" (SIAM Review, 2009), which has been cited over 3600 times. She frequently gives keynote and invited talks on tensor decompositions, including an invited talk at MLconf in San Francisco and the SIAM Invited Address at the 2018 Joint Mathematics Meetings in San Diego. See www.kolda.net for more.

Daniel M. Dunlavy is a Principal Member of Technical Staff in the Center for Computing Research at Sandia National Laboratories in Albuquerque, NM. His research interests include tensor decompositions, numerical optimization, numerical linear algebra, machine learning, data mining, text analysis, parallel computing, and cyber security.

Many thanks to Jed Duersch and Kina Kincher-Winoto for major contributions to this tutorial.