SDM 2015 Tutorial

Methods and Applications of Network Sampling


Abstract

Network data appears in various domains, including the social, communication, and information sciences. Analyzing such data is crucial for making inferences and predictions about these networks and, moreover, for understanding the processes that drive their evolution. A major bottleneck to such analysis, however, is the massive size of real-life networks, which can make modeling and analyzing them infeasible. Further, many networks, particularly those in the social and communication domains, are not publicly visible due to privacy concerns, and others, such as the Web, are accessible only via crawling. To overcome these challenges, researchers overwhelmingly rely on network sampling as a key statistical approach for selecting a sub-population of interest that can be studied thoroughly.

In this tutorial, we aim to cover a diverse collection of methodologies and applications of network sampling. We will begin with a discussion of the problem setting in terms of objectives (such as sampling a representative subgraph or sampling graphlets), populations of interest (vertices, edges, motifs), and sampling methodologies (such as Metropolis-Hastings, random walk, and snowball sampling). We will then present a number of applications of these methods, outlining both the opportunities and the possible biases that each method introduces in each application.
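To make one of these methodologies concrete, here is a minimal, illustrative sketch (not taken from the tutorial materials) of random-walk node sampling over a plain adjacency-list graph; the function name and toy graph are invented for illustration. Note the characteristic bias this tutorial will discuss: the walk visits nodes in proportion to their degree.

```python
import random

def random_walk_sample(adj, start, n_steps, seed=None):
    """Collect a node sample by a simple random walk on a graph.

    adj: dict mapping each node to a list of its neighbors.
    The resulting sample is degree-biased: high-degree nodes are
    visited more often, in proportion to their degree.
    """
    rng = random.Random(seed)
    current = start
    visited = [current]
    for _ in range(n_steps):
        neighbors = adj[current]
        if not neighbors:          # dead end: restart at the seed node
            current = start
        else:
            current = rng.choice(neighbors)
        visited.append(current)
    return visited

# Toy graph: a 4-node path 0 - 1 - 2 - 3
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
sample = random_walk_sample(adj, start=0, n_steps=100, seed=42)
```

In practice, the degree bias is either corrected after the fact (e.g., by re-weighting estimates with inverse degrees) or avoided by changing the walk itself, as in the Metropolis-Hastings variant.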

Slides will be available soon!

Tutorial Outline (preliminary, subject to change)

This tutorial will begin by illustrating the problem setting with examples of applications in data collection, structure analysis, parameter estimation, classification, A/B testing, and information diffusion. It will then present the different objectives, populations of interest, and sampling methods.
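As a taste of the sampling methods the outline refers to, the following is an illustrative sketch (again, invented names and a toy graph, not tutorial code) of a Metropolis-Hastings random walk. By accepting a move from node u to neighbor v with probability min(1, deg(u)/deg(v)), the walk's stationary distribution becomes uniform over nodes, removing the degree bias of the simple random walk.

```python
import random

def metropolis_hastings_walk(adj, start, n_steps, seed=None):
    """Metropolis-Hastings random walk over an adjacency-list graph.

    Unlike the simple random walk, whose stationary distribution is
    proportional to node degree, this walk samples nodes uniformly:
    moves toward higher-degree nodes are probabilistically rejected.
    """
    rng = random.Random(seed)
    u = start
    visited = [u]
    for _ in range(n_steps):
        v = rng.choice(adj[u])
        # Accept with probability min(1, deg(u)/deg(v));
        # on rejection, stay at u (a self-loop in the Markov chain).
        if rng.random() < len(adj[u]) / len(adj[v]):
            u = v
        visited.append(u)
    return visited

# Star graph: hub 0 connected to leaves 1..4. A simple random walk
# would spend half its time at the hub; this walk visits all five
# nodes roughly equally.
adj = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
sample = metropolis_hastings_walk(adj, start=0, n_steps=2000, seed=7)
```

The rejected moves are what slow the walk down; this mixing-time cost of bias correction is one of the trade-offs a tutorial like this weighs when comparing sampling methods.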

About the Instructors

Mohammad A. Hasan is an Assistant Professor of Computer Science at Indiana University-Purdue University Indianapolis (IUPUI). Before that, he was a Senior Research Scientist at eBay Research Labs, San Jose, CA. He received a Ph.D. in Computer Science from Rensselaer Polytechnic Institute (RPI) in 2009 and an MS in Computer Science from the University of Minnesota, Twin Cities in 2002. His research focuses on developing novel algorithms in data mining, data management, information retrieval, machine learning, social network analysis, and bioinformatics. He is particularly interested in algorithms for sampling small substructures from large networks, and has developed methods for (1) sampling frequent subgraphs from a graph database, (2) sampling triangles and graphlets from a large network, and (3) sampling interesting subgraph patterns using interactive feedback, all based on Markov chain Monte Carlo (MCMC) sampling. His doctoral dissertation won the ACM SIGKDD Doctoral Dissertation Award in 2010, and he received an NSF CAREER Award in 2012.

Nesreen K. Ahmed is a Ph.D. candidate in the Computer Science Department at Purdue University, advised by Dr. Jennifer Neville. She received an MS in statistics and computer science from Purdue University. Her dissertation focuses on statistical sampling and estimation for large streaming graphs, and her broader research develops fast parallel algorithms for sampling, estimation, feature learning, and relational knowledge discovery in network data. She has worked as a research and data science intern at Facebook, Intel, Adobe Labs, and the Data Mining and Computer Modeling Center of Excellence in Cairo.

Jennifer Neville is an associate professor at Purdue University with a joint appointment in the Departments of Computer Science and Statistics. She received her PhD from the University of Massachusetts Amherst in 2006. In 2012, she was awarded an NSF Career Award, in 2008 she was chosen by IEEE as one of "AI's 10 to watch", and in 2007 was selected as a member of the DARPA Computer Science Study Group. Her research focuses on developing data mining and machine learning techniques for relational domains, including citation analysis, fraud detection, and social network analysis.