Joint Fudan-HKBU Workshop
on Data Science

11 - 13 May 2015
Hong Kong Baptist University

Fudan Speakers:
Weiguo Gao (School of Mathematical Sciences)
Wei Lin (School of Mathematical Sciences and Centre for Computational Systems Biology)
Shuai Lu (School of Mathematical Sciences)
Xiaoyang Sean Wang (School of Computer Science)
Zongmin Wu (School of Mathematical Sciences)
Jungong Xue (School of Mathematical Sciences)
Shuqin Zhang (School of Mathematical Sciences)
Xinsheng Zhang (Department of Statistics)
HKBU Speakers:
William Cheung (Department of Computer Science)
Xiaowen Chu (Department of Computer Science)
Haiping Lu (Department of Computer Science)
Michael Ng (Department of Mathematics)
Henry Ngan (Department of Mathematics)
Celine Song (Department of Journalism)
Tiejun Tong (Department of Mathematics)
Can Yang (Department of Mathematics)
Tieyong Zeng (Department of Mathematics)
11 May (Monday)
RRS905, Sir Run Run Shaw Building, Ho Sin Hang Campus
09:30-09:40 Welcoming Remarks and Group Photo
09:40-10:20 Zongmin Wu
Density Functions Estimation and Wasserstein Distance

In probability and statistics, estimating density functions from observed sample data is a fundamental problem. In traditional analysis, the most widely used measure of discrepancy between the density estimator \hat{f}(x) and the true density f(x) is the mean integrated squared error (MISE), which is the expectation of the squared L2-norm of the error and requires continuity of the density functions. In most applications, however, the density functions are not continuous, and they may not even lie in the L2-space. Furthermore, most traditional metrics, such as the L1-norm, Sobolev norm and Besov norm, cannot properly depict the distance between two density functions. We therefore choose a statistical distance, the Wasserstein distance, to measure the difference between two density functions. We obtain two results. First, based on the classical Bernstein approximation, we present a scheme for estimating density functions or distribution functions with the error measured in the Wasserstein metric. Second, since the kernel method is widely applicable, we discuss the kernel method and, in comparison with the MISE, present convergence in probability with respect to the Wasserstein distance.
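
As a minimal illustration of the distance named in this abstract (not the speaker's estimation scheme), the sketch below computes the Wasserstein-1 distance between two one-dimensional empirical samples. It assumes both samples have the same size and equal weights, in which case the distance reduces to the mean absolute difference of the sorted values; the function name wasserstein1_equal_size is hypothetical.

```python
def wasserstein1_equal_size(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Assumes both samples are non-empty, equally weighted and of equal size,
    so the optimal transport plan simply pairs the sorted values.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    # Mean absolute difference between matched order statistics.
    return sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted)) / len(xs_sorted)


# A sample shifted by one unit is at distance one from the original.
shift_distance = wasserstein1_equal_size([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

Because the distance is defined directly on samples, it stays finite for point masses, for which no L2 density exists.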

10:20-10:40 Coffee Break
10:40-11:20 Xinsheng Zhang
Statistical Inference for High-dimensional Data

In recent decades, high-dimensional data analysis has been an active area, and methodology for high-dimensional data is one of the most important research topics in statistics. For high-dimensional data, the number of explanatory variables is much larger than the sample size, so the conventional methods of statistical inference are no longer valid. In this talk I will briefly review recent developments in the statistical methodology and theory for the analysis of high-dimensional data. I will also introduce our recent results on the comparison of Kendall's tau matrices from two independent samples in the high-dimensional setting.

11:20-12:00 Tiejun Tong
Shrinkage-Based Diagonal Hotelling's Tests for High-Dimensional Small Sample Size Data

High-dimensional small sample size data such as microarrays bring novel tools and also statistical challenges to genetic research. In addition to detecting differentially expressed genes, testing the significance of gene sets or pathway analysis has been recognized as an equally important problem. Owing to the "large p small n" paradigm, the traditional Hotelling's T2 test suffers from the singularity problem and therefore is not valid in this setting. In this paper, we propose a shrinkage-based diagonal Hotelling's test for both one-sample and two-sample cases. We also suggest several different ways to derive the approximate null distribution under different scenarios of p and n for our proposed shrinkage-based test. Simulation studies show that the proposed method performs comparably to existing competitors when n is moderate or large, but it is better when n is small. In addition, we analyze four gene expression data sets and they demonstrate the advantage of our proposed shrinkage-based diagonal Hotelling's test.
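
To make the "diagonal" idea concrete, here is a hedged sketch of a plain one-sample diagonal Hotelling-type statistic, which replaces the full sample covariance matrix by its diagonal so that no matrix inversion is needed when p exceeds n. It uses the ordinary sample variances; the shrinkage variance estimator and the null distributions proposed in the talk are not reproduced, and the function name is hypothetical.

```python
from statistics import mean, variance


def diagonal_hotelling_one_sample(rows, mu0):
    """Sum over coordinates of n * (sample mean - mu0)^2 / sample variance.

    rows: list of n observations, each a list of p values (n >= 2).
    mu0:  hypothesised mean vector of length p.
    Only the diagonal of the covariance matrix is used, so p may exceed n.
    """
    n = len(rows)
    stat = 0.0
    for j, mu_j in enumerate(mu0):
        column = [row[j] for row in rows]
        # Unshrunk sample variance; a shrinkage estimator would go here.
        stat += n * (mean(column) - mu_j) ** 2 / variance(column)
    return stat


# Three observations of two variables, tested against a zero mean vector.
example_stat = diagonal_hotelling_one_sample(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]], [0.0, 0.0]
)
```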

12:00-14:00 Lunch at HSH Campus Staff Canteen (Invitation Only)
14:00-14:40 Xiaoyang Sean Wang
Data Management Issues in Big Data Analytics

In contrast to traditional data analytics, a distinctive feature of big data analytics is its desire to include a large variety of data types in one analysis job. This gives rise to several data management challenges that have not been adequately dealt with in the past. This talk will summarize what these problems might be and describe a research agenda with technical considerations that aim to solve these problems. In fact, the problems boil down to a data integration one, but need careful study with respect to the volume and velocity properties of the big data as well as its rich semantics that may arise in an ad hoc manner.

14:40-15:20 Xiaowen Chu
Fermi, Kepler, Maxwell: the Evolution of GPU Memory Hierarchy

Memory access efficiency is a key factor for fully exploiting the computational power of Graphics Processing Units (GPUs). However, many details of the GPU memory hierarchy are not released by the vendors. We propose a novel fine-grained benchmarking approach and apply it to three generations of NVIDIA GPUs, namely Fermi, Kepler and Maxwell, to expose the previously unknown characteristics of their memory hierarchies. Specifically, we investigate the structures of different cache systems, such as data cache, texture cache, and the translation lookaside buffer (TLB). We also investigate the achieved throughput and memory access latency of GPU global and shared memory. Our micro-benchmark results offer a better understanding of the mysterious GPU memory hierarchy, which can help in software optimization and the modelling of GPU architectures.

15:20-15:40 Coffee Break
15:40-16:20 Shuai Lu
Oracle-type Posterior Contraction Rates in Bayesian Inverse Problems

We discuss Bayesian inverse problems in Hilbert spaces. The focus is on a fast concentration of the posterior probability around the unknown true solution as expressed in the concept of posterior contraction rates. This concentration is dominated by a parameter which controls the variance of the prior distribution. Previous results determine posterior contraction rates based on known solution smoothness. Here we show that an oracle-type parameter choice is possible. This is done by relating the posterior contraction rate to the root mean squared estimation error. In addition we show that the excess probability, which usually is bounded by using the Chebyshev inequality, has exponential decay, at least for a priori parameter choices. These results implement the exponential concentration of Gaussian measures in Hilbert spaces.

16:20-17:00 Tieyong Zeng
Convex Variational Model for Restoring Blurred Images with Rician Noise

In this talk, a new convex variational model for restoring images degraded by blur and Rician noise is proposed. The new method is inspired by previous works in which a non-convex variational model obtained by maximum a posteriori estimation was presented. Based on the statistical property of Rician noise, we propose adding an additional data-fidelity term to the non-convex model, which leads to a new strictly convex model under a mild condition. Due to the convexity, the solution of the new model is unique and independent of the initialization of the algorithm. We utilize a primal-dual algorithm to solve the model. Numerical results are presented at the end to demonstrate that, with respect to image restoration capability and CPU-time consumption, our model outperforms some of the state-of-the-art models on both medical and natural images.

18:00 Dinner at Shiu Pong Hall (Invitation Only)
12 May (Tuesday)
1/F Shiu Pong Hall, Ho Sin Hang Campus (9 Broadcast Drive)
09:30-10:10 Wei Lin
Detection of Causality Delay from Time Series

Time delay is ubiquitous in natural and man-made systems. In this talk, we propose a new approach called causal delayed spectrum (CDS) to detect time delay, where we emphasize the importance of incorporating the time delay into the cross correlation mapping causal analysis. We benchmark its performance in detecting time delay(s) with simple paradigmatic models, apply it to study the change of heartbeat-blood pressure coupling dynamics during normal aging, and analyze the California anchovy-shark data. This is joint work with my PhD students, Siyang Leng and Chenyang.

10:10-10:50 Can Yang
IMAC: A Flexible Statistical Approach to Integrating Multilayered Annotation for Characterizing Functional Roles of Genetic Variants that Underlie Human Complex Phenotypes

Recent international projects, such as the Encyclopedia of DNA Elements (ENCODE) project, the Roadmap project and the Genotype-Tissue Expression (GTEx) project, have generated vast amounts of genomic annotation data measured at multiple layers, e.g., epigenome and transcriptome. These multilayered annotation data offer us unprecedented opportunities to characterize functional roles of genetic variants that underlie human complex phenotypes, such as height, weight, blood pressure and disease status. To establish the causal link from genotypes to organismal phenotypes, there is a great need to perform integrative analysis of multilayered annotation data.
A big challenge in integrative analysis is how to put multilayered information into a unified model and automatically select the most relevant genomic features from a potentially huge set of genomic features. In this talk, we introduce a flexible statistical approach, named IMAC, to integrating multilayered annotation for characterizing functional roles of genetic variants that underlie human complex phenotypes. IMAC enabled us to automatically perform feature selection from a large number of annotated genomic features and naturally incorporate the selected features for prioritization of genetic risk variants. IMAC not only demonstrated remarkable computational efficiency (e.g., it took about 2-3 minutes to handle millions of genetic variants and thousands of functional annotations), but also allowed rigorous statistical inference of the model parameters and false discovery rate control in risk variant prioritization. With the IMAC approach, we performed integrative analysis of genome-wide association studies on multiple complex human traits and genome-wide annotation resources, e.g., expression QTL and splicing QTL. The analysis results revealed interesting regulatory patterns of risk variants. These findings deepen our understanding of the genetic architectures of complex traits.
The underlying statistical principle in the design of IMAC is fairly general, and the key idea can be applied to other Big Data applications.

10:50-11:10 Coffee Break
11:10-11:50 Shuqin Zhang
Community Identification in Networks

Networks are commonly used to model complex systems in many areas. As a fundamental problem in network analysis, community identification has attracted much attention from different research fields. In this talk, we will introduce our proposed community identification methods both in one single network and in multiple networks. Theoretical analysis and numerical experiments are given to show the performance of our methods.

11:50-12:30 Haiping Lu
Multilinear Subspace Learning: Compact Feature Learning from Big Data via Tensor Representation

In this big data era, it is important to learn compact features for efficient processing. Most big data are multidimensional and can be represented as tensors. Based on our recent book, this talk focuses on learning compact features from big data via tensor representation. In particular, we study multilinear subspace learning (MSL), a dimension reduction technique for tensors adapted from tensor decompositions. MSL directly maps input tensors to a low-dimensional subspace, without reshaping into high-dimensional vectors. It preserves data structure, obtains more compact features, and processes big data more efficiently. The mapping can be done through tensor-to-tensor projection and tensor-to-vector projection, which are adaptations of Tucker decomposition and the canonical polyadic decomposition (or PARAFAC/CANDECOMP), respectively. We will examine MSL algorithms and MSL feature characteristics, explore various MSL applications, and outline future research directions in learning compact features via tensors for big data science.
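
The tensor-to-tensor projection mentioned in this abstract can be sketched for the simplest case, a second-order tensor (a matrix), where it reads Y = U1^T X U2. The code below is an illustrative assumption rather than an algorithm from the talk: the projection matrices are given by hand instead of being learned, and the function name project_matrix is hypothetical.

```python
def project_matrix(x, u1, u2):
    """Tensor-to-tensor projection of a second-order tensor: Y = U1^T X U2.

    x:  I1 x I2 input (list of rows), kept in matrix form, never vectorised.
    u1: I1 x P1 projection matrix for mode 1.
    u2: I2 x P2 projection matrix for mode 2.
    Returns the P1 x P2 projected tensor.
    """
    i1, i2 = len(x), len(x[0])
    p1, p2 = len(u1[0]), len(u2[0])
    y = [[0.0] * p2 for _ in range(p1)]
    for a in range(p1):
        for b in range(p2):
            y[a][b] = sum(
                u1[i][a] * x[i][j] * u2[j][b]
                for i in range(i1)
                for j in range(i2)
            )
    return y


# Project a 2 x 3 input onto a 1 x 2 feature: sum the rows, keep two columns.
features = project_matrix(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[1.0], [1.0]],
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
)
```

The two small matrices here hold 2 + 6 entries, whereas a single projection acting on the vectorised 6-dimensional input with a 2-dimensional output would need 12; the gap widens quickly as the order and size of the tensor grow.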

12:30-14:30 Lunch at Shiu Pong Hall (Invitation Only)
14:30-15:10 Jungong Xue
Complex Nonsymmetric Algebraic Riccati Equations Arising in Markov Modulated Fluid Flows

Motivated by the transient analysis of stochastic fluid flow models, we introduce a class of complex nonsymmetric algebraic Riccati equations. The existence and uniqueness of the extremal solutions to these equations are proved. Numerical methods for the extremal solutions are discussed.

15:10-15:50 Celine Song
Discussing Food Safety on Weibo: Participation, Expression and Virality

To understand the structural and sentimental features of online discussion networks in China, we analyze tweets about food safety posted on SINA Weibo over the course of 70 days with a focus on participation structure, the content of participant contributions and responses, and the emotions that are present. Despite growing scholarly attention to sentiments in social media research, researchers have predominantly looked at the dimension of valence (positive affect and negative affect), which may miss important nuances in mood expression. This study takes a psychological approach to understanding how emotion affects information propagation. Our findings suggest the relationship between emotion and social transmission is more complex than valence alone. Content that evokes high-arousal negative emotions (e.g., anger and fear) is shown to be more viral than content that evokes deactivating negative emotions (e.g., sadness). This study also examines the emotion homophily phenomenon and reports on the relationship between moods and participatory patterns of individuals in terms of conversational engagement.

15:50-16:30 William Cheung
Mining Interaction, Mobility and Information Propagation Patterns in Network Data Using the Probabilistic Modeling Approach

The recent advent of ubiquitous computing and sensor technologies has enabled digital traces of human mobility and online interaction to be easily collected. This can in turn support behavioral studies in different domains including marketing, media analysis, healthcare, etc. Mining hidden behavioral patterns from the digital traces is non-trivial due to the patterns' stochastic structural and spatio-temporal variations, even for the same type of activities. In this talk, I will present our recent work on using the generative modeling approach for modeling and mining salient interaction, mobility and information propagation patterns in temporal network data. I will also show how some of the methods can be applied to problems like followee recommendation in social networks and activity characterization in smart homes.

16:30 Coffee Break
17:00 Research Exchange and Dinner (Invitation Only)
13 May (Wednesday)
RRS905, Sir Run Run Shaw Building, Ho Sin Hang Campus
09:30-10:10 Weiguo Gao
Accelerating ADMM for Group LASSO with Overlap

We discuss algorithms for group lasso with overlap in this talk. Matrix-vector multiplications can be carried out efficiently. We further show that this problem can be formulated in the inner-outer regime. We propose an acceleration technique based on solving the inner problem, which can improve the overall performance. Convergence is guaranteed by a rigorous proof.
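
For orientation only, the sketch below shows group (block) soft-thresholding, the proximal operator of a scaled Euclidean norm, which commonly appears as a building block in splitting methods for group lasso. It is an assumption that a step of this kind is relevant here; the inner-outer formulation and the acceleration technique of the talk are not reproduced, and the names group_soft_threshold and kappa are hypothetical.

```python
import math


def group_soft_threshold(v, kappa):
    """Proximal operator of kappa * ||v||_2 applied to one group of coefficients.

    Shrinks the whole group towards zero and sets it exactly to zero when its
    Euclidean norm does not exceed kappa, which is what produces group sparsity.
    """
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= kappa:
        return [0.0] * len(v)
    scale = 1.0 - kappa / norm
    return [scale * c for c in v]


# A group with norm 5 and threshold 1 is scaled by 0.8.
shrunk = group_soft_threshold([3.0, 4.0], 1.0)
```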

10:10-10:50 Henry Ngan
A Comparative Study of Outlier Detection for Large-scale Traffic Data by One-class SVM and Kernel Density Estimation

This talk presents a comparative study of outlier detection (OD) for large-scale traffic data. Traffic data nowadays are massive in scale and collected every second throughout any modern city. In this research, the traffic flow dynamic is collected from one of the busiest 4-armed junctions in Hong Kong over a 31-day sampling period (with 764,027 vehicles in total). The traffic flow dynamic is expressed in a high-dimensional spatial-temporal (ST) signal format (i.e. 80 cycles), which has a high degree of similarity within the same signal and across different signals in one direction. A total of 19 traffic directions are identified at this junction and many ST signals are collected in the 31-day period (i.e. 874 signals). To reduce their dimension, the ST signals first undergo a principal component analysis (PCA) and are represented as (x,y)-coordinates. These PCA (x,y)-coordinates are then assumed to follow a Gaussian distribution. Under this assumption, the data points are evaluated by (a) a correlation study with three variant coefficients, (b) one-class support vector machine (SVM) and (c) kernel density estimation (KDE). The correlation study could not give any explicit OD result, while the one-class SVM and KDE provide average DSRs of 59.61% and 95.20%, respectively.
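
A minimal sketch of the KDE step, under stated assumptions: the points are taken to be two-dimensional coordinates already (the PCA step is skipped), a Gaussian kernel with a hand-picked bandwidth is used, and the points with the lowest estimated density are reported as outlier candidates. The data, the bandwidth and the function names are hypothetical and do not come from the study.

```python
import math


def gaussian_kde_2d(points, query, bandwidth):
    """Gaussian kernel density estimate at `query` from 2-D `points`."""
    h2 = bandwidth ** 2
    total = 0.0
    for px, py in points:
        d2 = (query[0] - px) ** 2 + (query[1] - py) ** 2
        total += math.exp(-d2 / (2.0 * h2))
    # Normalising constant of an isotropic 2-D Gaussian kernel.
    return total / (len(points) * 2.0 * math.pi * h2)


def lowest_density_indices(points, bandwidth, k):
    """Indices of the k points with the smallest estimated density."""
    densities = [gaussian_kde_2d(points, p, bandwidth) for p in points]
    order = sorted(range(len(points)), key=lambda i: densities[i])
    return order[:k]


# Four clustered points and one isolated point (made-up coordinates).
sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
candidates = lowest_density_indices(sample, 0.5, 1)
```

In practice the bandwidth and the number of flagged points would have to be chosen from the data; fixed values are used here only to keep the example short.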

10:50-11:40 Michael Ng
Discussing Transfer Learning Problems

In this talk, I share some of my recent results in transfer learning. Some algorithms and applications are discussed. Experimental results are given to illustrate the effectiveness of these algorithms.

11:40 Coffee Break
12:00 Research Exchange and Lunch (Invitation Only)
Campus Map
Organized by:
The Centre for Mathematical Imaging and Vision (CMIV), Hong Kong Baptist University