Data Science

ABOUT

The Data Science Group was established in 2019 and focuses on a) methodological research questions related to data science and algorithmic artificial intelligence, b) foundational questions regarding the best achievable performance limits in various tasks of data modeling, analysis and inference, and c) interdisciplinary research questions from a wide variety of research domains, including Social Sciences, Geo-Sciences, Bioinformatics and Biomedical Engineering, Environmental and Transportation Engineering, among others. Current activity areas include frequentist and Bayesian spatio-temporal modelling, the development of information-theoretic tools, high-dimensional and functional data analysis, time-series forecasting, and online prediction and anomaly detection algorithms. Researchers from the Data Science Group participate in the Statistical Learning Lab, which fosters multidisciplinary collaborations, produces software products, research publications and patents, and contributes to the education and training of students and young researchers.

RESEARCH AND DEVELOPMENT ACTIVITIES

Space-time Models: Data Science Group members are actively involved in the development of Bayesian and frequentist models for spatio-temporal data, including models with spatially varying coefficients. Conditional autoregressive priors, Moran’s eigenvector filtering, dimension reduction algorithms, penalized LASSO-type estimates and bootstrap-based uncertainty quantification are among the methodological tools that are used to estimate parameters and conduct inference on such models. Typical applications of space-time models in Geosciences include a) the evaluation of outputs from Regional Climate Models and b) the analysis of time series of remotely sensed imagery. Similar models have been developed to analyze the dynamics of regional economic variables.

Mathematics of Information: Information is a core notion in many engineering and scientific disciplines. For example, much of modern statistics may be characterised as the process of extracting information from data. Over the past 60 years, information, along with its mathematical description via the Boltzmann-Shannon entropy, has played a crucial role in science and technology, both as a central metaphor providing intellectual guidelines and as a mathematically specific, technical and precisely measurable quantity. Members of the Data Science Group have been exploring the development of rigorous tools for the mathematical description of information, as well as the analysis of practical questions arising in a variety of applications, ranging from modern digital communications networks to finance, neuroscience and bioinformatics.
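
As a small illustration of information as a precisely measurable quantity, the sketch below computes the Shannon entropy, H(p) = -sum_i p_i log2(p_i), of a discrete probability distribution, measured in bits. The function name `shannon_entropy` is hypothetical and chosen for this example only; it is not part of any software released by the group.

```python
import math


def shannon_entropy(probs):
    """Return the Shannon entropy (in bits) of a discrete distribution.

    `probs` is a sequence of non-negative probabilities summing to one.
    Terms with zero probability contribute nothing, following the usual
    convention 0 * log(0) = 0.
    """
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to one")
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Example: a fair coin versus a heavily biased coin.
fair_coin = shannon_entropy([0.5, 0.5])
biased_coin = shannon_entropy([0.9, 0.1])
```

A fair coin attains the maximum entropy for two outcomes, while the biased coin is more predictable and therefore carries less information per toss.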

Incident/Anomaly Detection: Group members develop real-time algorithms for incident/anomaly detection, focusing mostly on network activity (e.g. vehicular networks). Decision trees and (nonparametric) quantile regressions are among the methodological tools that have been employed as key components of such algorithms. Typical applications include incident detection on vehicular networks based on loop detector data and fraud detection in bank transactions.

Functional Data Analysis: Functional data analysis (FDA) focuses on data that can be curves, surfaces or anything else varying over a continuum. For instance, plasma thermograms are curves associated with a person’s health status; a group of such curves can be analyzed using FDA-analogues of conventional statistical procedures. The Data Science group develops segment-wise supervised classification schemes for multivariate functional data, which may reject noisy domains of the functional data and assign larger weights to segments that contain useful information for the classification groups of each study. A typical application of the proposed algorithms is disease identification/diagnosis.

Forecasting Multivariate Time Series: The group’s activities include research on linear (ARIMA) and nonlinear (smooth-transition or threshold autoregressions) parametric time series models that are estimated using a) fast penalized LASSO-type estimators or b) Bayesian posteriors based on shrinkage priors (e.g. horseshoe). Such models are used in (real-time) forecasting applications, including, for example, short-term forecasts of a) vehicular counts in a transportation network, b) energy outputs from wind or solar farms and c) emissions from heavy-duty diesel engines.

Deep Generative Learning: Generative models based on Deep Neural Networks have shown unprecedented capabilities in sampling data from complex but unknown distributions. Researchers in the Data Science Group develop novel algorithms for training generative models, focusing on Generative Adversarial Networks (GANs). GANs have been used in data augmentation schemes to generate synthetic data that follow the same distributional characteristics as the original dataset (which may contain sensitive information or a limited number of cases). The proposed methodology has been applied to identify dyslexia in children, using measurements from specialized eye trackers.

Uncertainty Quantification in Stochastic Systems and Multilevel Models: Uncertainty quantification (UQ) is essential for the reliable and robust modelling of complex stochastic dynamics. The Data Science Group develops novel information-theoretic tools to address the challenges concerning the feasibility and efficiency of UQ in stochastic systems; these tools have been used to analyze biological reaction networks. Furthermore, Data Science Group members conduct research on computationally intensive methods (e.g. bootstrap-based confidence intervals) for uncertainty quantification in models that capture nested data structures (multilevel models for categorical responses).

Sparse Inference in Dynamical Systems: Data Science Group members develop methods and algorithms for learning the structure and the parameters of sparse dynamical systems from temporal measurements. Sparsity means that the dynamics of each state variable are typically driven by a relatively small number of variables. Sparsity is critical for learning large systems from finite data and constitutes a form of complexity penalization and regularization, thus favoring simpler solutions. Applications include, but are not limited to, biological networks of proteins, neuroscience, physics and finance.

Bayesian Networks: Bayesian networks, probabilistic graphical models defined on directed acyclic graphs, are frequently employed for modelling domain knowledge in Decision Support Systems, particularly in medicine. Researchers from the Data Science Group develop algorithms for Bayesian network structure learning, which can be used to automatically construct Decision Support Systems.

Feature Selection for Big and High-Dimensional Data: Feature (also called variable) selection for regression and classification is an essential problem that has received considerable attention during the last three decades. Specifically, feature selection is critical for constructing models which possess a) satisfactory predictive capability and b) interpretable structure. The Data Science Group designs feature-selection algorithms that can be utilized in challenging applications that analyze Big Data, in which the number of features may be larger than the number of cases (which can be in the order of millions). The developed algorithms can be implemented in numerous and diverse research domains; a typical application is the analysis of SNP data (Bioinformatics).

Compositional Data Analysis: Compositional data, which can be seen as vectors of proportions that sum to one, arise in many different fields, such as economics, archaeometry, ecology, geology and political sciences. Researchers from the Data Science Group have worked on the improvement of existing predictive models (e.g. regression and classification) for such data.
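
To make the notion of a composition concrete, the following minimal sketch normalizes a vector of positive parts so that it sums to one (the closure operation) and applies the classical centered log-ratio (clr) transformation. This is a textbook illustration only and is not the group's own methodology; the function names `closure` and `clr` are hypothetical.

```python
import math


def closure(parts):
    """Rescale non-negative parts so that they sum to one."""
    if any(p < 0 for p in parts):
        raise ValueError("parts must be non-negative")
    total = sum(parts)
    if total <= 0:
        raise ValueError("parts must have a positive sum")
    return [p / total for p in parts]


def clr(composition):
    """Centered log-ratio transform: log of each part minus the mean log.

    All parts must be strictly positive, since the logarithm of zero is
    undefined. The transformed vector always sums to zero.
    """
    if any(p <= 0 for p in composition):
        raise ValueError("clr requires strictly positive parts")
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Example: raw counts of three components, converted to proportions.
proportions = closure([2.0, 3.0, 5.0])
transformed = clr(proportions)
```

Because the clr transform depends only on ratios between parts, it gives the same result for the raw parts and for their closed proportions.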

Hazard Rate Estimation: Estimation of hazard rates is important in research areas such as reliability and survival analysis. The classical method of estimation involves modelling the failure times by appropriate probability distributions such as the exponential, gamma or Weibull. The Data Science Group focuses on nonparametric kernel methods, which are flexible as they are not based on distributional assumptions.
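
The idea of a nonparametric kernel hazard estimate can be sketched as follows, assuming fully observed (uncensored) failure times. The sketch smooths the increments of the Nelson-Aalen cumulative hazard with an Epanechnikov kernel; it is a standard textbook estimator and not necessarily the one studied by the group. The names `epanechnikov` and `kernel_hazard` and the bandwidth value are assumptions made for this example.

```python
import random


def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_hazard(t, failure_times, bandwidth):
    """Kernel-smoothed hazard rate estimate at time t (uncensored data).

    With sorted failure times x_(1) <= ... <= x_(n), the Nelson-Aalen
    increment at x_(i) is 1 / (number still at risk) = 1 / (n - i + 1).
    The estimate is the kernel-weighted sum of these increments.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    ordered = sorted(failure_times)
    n = len(ordered)
    total = 0.0
    for index, x in enumerate(ordered):
        at_risk = n - index  # subjects still at risk just before x
        total += epanechnikov((t - x) / bandwidth) / at_risk
    return total / bandwidth


# Example: simulated exponential failure times with true hazard rate 2.
rng = random.Random(0)
sample = [rng.expovariate(2.0) for _ in range(5000)]
estimate = kernel_hazard(0.3, sample, bandwidth=0.1)
```

No distributional form is assumed for the failure times; the only tuning choice is the bandwidth, which trades bias against variance. Estimates near the boundaries of the observed range are less reliable.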

Statistical Analysis of Transportation Emissions: The group’s activity emphasizes predictive modeling of high-frequency emissions rates from real-world experiments, similar to the ones conducted to define European and US emissions standards. Such models typically focus on pollutants which are hard to measure, such as particle numbers (PN). The implemented methodology a) combines parametric nonlinear time series models with robust quantile regressions for vehicle-specific predictions, and b) estimates multi-level specifications to compute predictions for homogeneous families of vehicles.

Predictive Models in Biomedical Engineering and Epidemiology: The group’s activities include a) predictive modeling for heart volumes which are used to perform virtual heart transplants, b) spatial statistical specifications for the growth of thrombus in abdominal aortic aneurysms, c) nonparametric functional classifiers for disease identification and d) spatio-temporal models for epidemics.

Deep Learning in Speech Processing: Deep Neural Networks have taken the engineering community by storm. Data-rich areas such as image processing and speech processing have been transformed in recent years. The Data Science Group combines its expertise in speech processing with deep learning techniques for applications such as voice conversion, speech synthesis and speech enhancement.

Parametric Sensitivity Analysis in Biochemical Reaction Networks: Data Science Group members have developed parametric sensitivity analysis tools for understanding and evaluating, mathematically and computationally, the behavior of complex phenomena in biochemical reaction networks.

Space-time Econometrics: Data Science Group members have constructed space-time models which aim to explain the dynamics of regional productivity in the EU. Currently, the group develops spatial and spatio-temporal models for the adoption of agricultural innovation in the US.

Supervised and Unsupervised Modeling: Researchers from the Data Science Group participate in projects that require the design, implementation, validation and deployment of predictive models. Applications are diverse, including energy production forecasting for wind farms, outbreak forecasting, Raman spectroscopy and the prediction of materials' properties.

Education and Training: The group contributes to the education and training of undergraduate and postgraduate students, as well as of PhD candidates and postdoctoral researchers.

RESEARCH AND DEVELOPMENT PROGRAMS

A. ONGOING PROJECTS


PUBLICATIONS

  • 2019

      • F. Alevizos, D. Bagkavos and D. Ioannides (2019) Efficient estimation of a distribution function based on censored data. Statistics & Probability Letters 145, 359-364.
      • G. Borboudakis and I. Tsamardinos (2019) Forward-backward selection with early dropping. The Journal of Machine Learning Research, 20 (1), 276-314.
      • D. Kyriakis, A. Kanterakis, T. Manousaki, A. Tsakogiannis, M. Tsagris, I. Tsamardinos, L. Papaharisis, D. Chatziplis, G. Potamias and C. S. Tsigenopoulos (2019) Scanning of genetic variants and genetic mapping of phenotypic traits in Gilthead sea bream through ddRAD sequencing. Frontiers in Genetics, 10:675.
      • P.J. Paine, S.P. Preston, M. Tsagris and A.T.A. Wood (2019) Spherical regression models with general covariates and anisotropic errors. Statistics and Computing, 1-13.
      • Y. Pantazis, M. Tsagris and A.T.A. Wood (2019) Gaussian Asymptotic Limits for the α-transformation in the Analysis of Compositional Data. Sankhya A, 1-20.
      • Y. Pantazis and I. Tsamardinos (2019) A unified approach for sparse dynamical system inference from temporal measurements. Bioinformatics, btz065.
      • M. Tsagris (2019) Bayesian network learning with the PC algorithm: an improved and correct variation. Applied Artificial Intelligence 33(2), 101-123.
      • M. Tsagris, A. Alenazi, K.M. Verrou and N. Pandis (2019) Hypothesis testing for two population means: parametric or non-parametric test? Journal of Statistical Computation and Simulation, DOI: 10.1080/00949655.2019.1677659.
      • M. Tsagris and I. Tsamardinos (2019) Feature selection with the R package MXM. F1000Research, 7:1505.
      • I. Tsamardinos, G. Borboudakis, P. Katsogridakis, P. Pratikakis and V. Christophides (2019) A greedy feature selection algorithm for Big Data of high dimensionality. Machine Learning, 108 (2), 149-202.
      • B. Wang, Y. Zheng, D. Fang, Y. Kamarianakis and J. Wilson (2019) Split bootstrap hierarchical modeling of antibiotics abuse in China. Statistics in Medicine, 38, 2282-2291.

PEOPLE

RESEARCHERS
  • Aggeliki Doxa
  • Evangellia Kaligianaki
  • Yiannis Kamarianakis
  • Ioannis Kontoyiannis
  • Michalis Loulakis
  • Yannis Pantazis
  • Antonis Papapantoleon
  • Michail Tsagris
  • Ioannis Tsamardinos
  • Ivi Tsantili

CONTACT US

For any information regarding the Group, please contact:

Data Science Group
Institute of Applied and Computational Mathematics
Foundation for Research and Technology - Hellas
Nikolaou Plastira 100, Vassilika Vouton,
GR 700 13 Heraklion, Crete
GREECE

Tel: +30 2810 391800
Contact: Mrs. Maria Papadaki

Tel: +30 2810 391805
Contact: Mrs. Yiota Rigopoulou