Statistical Analysis and Modeling Group

Welcome to the Statistical Analysis and Modeling Group! The Group is a part of the Institute of Computer Science of the Polish Academy of Sciences. The Group's research activities concern probabilistic and statistical modeling of natural phenomena and statistical inference for constructed models.

The Group maintains strong links with the Faculty of Mathematics and Information Sciences of the Warsaw University of Technology where several of its members teach courses and pursue joint research.


Research grants

Algorithmic models of prediction: formal properties and philosophical implications (grant NCN OPUS)

Dariusz Kalociński (principal investigator), Łukasz Dębowski (investigator)

Research topics

Below is a list of the most important research topics.

Uplift modelling

Uplift modelling concerns modelling the effect of a treatment (e.g., a marketing campaign or a medical therapy) on an individual by taking into account a control group not subjected to the treatment. Methods specially tailored to this setting include adaptations of Support Vector Machines and of committees of classifiers. The theory of linear models for the uplift case is also being developed.
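To illustrate the role of the control group, the following minimal sketch estimates uplift as the difference between the response rate in the treatment group and in the control group within each segment. It is only an illustration of the quantity being modelled, not one of the Group's methods; the function name `uplift_by_segment`, the segment names, and the toy data are assumptions made up for this example.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, treated, outcome) triples.

    `treated` and `outcome` are 0/1. The estimate is the response rate
    among treated units minus the response rate among control units.
    """
    # counts[segment][arm] = [number of units, number of positive outcomes]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for segment, treated, outcome in records:
        cell = counts[segment][treated]
        cell[0] += 1
        cell[1] += outcome

    uplift = {}
    for segment, by_arm in counts.items():
        (n_c, pos_c), (n_t, pos_t) = by_arm[0], by_arm[1]
        if n_c == 0 or n_t == 0:
            continue  # uplift cannot be estimated without both groups
        uplift[segment] = pos_t / n_t - pos_c / n_c
    return uplift


# Toy data, invented for illustration only.
records = (
    [("young", 1, 1)] * 30 + [("young", 1, 0)] * 70   # treated: 30% respond
    + [("young", 0, 1)] * 10 + [("young", 0, 0)] * 90  # control: 10% respond
    + [("old", 1, 1)] * 40 + [("old", 1, 0)] * 60      # treated: 40% respond
    + [("old", 0, 1)] * 40 + [("old", 0, 0)] * 60      # control: 40% respond
)
result = uplift_by_segment(records)
print(result)
```

In this toy data the "old" segment has the higher response rate under treatment but zero uplift, because the control group responds equally often. A model that ignores the control group would rank it first.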

Variable selection and interaction detection for high-dimensional data

Variable selection is studied for high-dimensional data, including the case in which the model is misspecified. Moreover, approaches based on information-theoretic measures such as mutual information and interaction information have been studied, in particular methods that take higher-order interactions into account. We are also interested in using information-theoretic measures to detect interactions, for example gene-gene interactions.
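As a hedged illustration of why interaction information matters for variable selection, the sketch below computes plug-in (empirical) estimates for discrete samples. It uses the convention I(X1,X2;Y) - I(X1;Y) - I(X2;Y), under which positive values indicate synergy; the opposite sign convention also appears in the literature. The function names and the XOR toy data are assumptions introduced for this example.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable symbols."""
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def interaction_information(x1, x2, y):
    """I(X1,X2;Y) - I(X1;Y) - I(X2;Y); positive values indicate synergy."""
    joint = list(zip(x1, x2))
    return (mutual_information(joint, y)
            - mutual_information(x1, y)
            - mutual_information(x2, y))


# XOR toy example: neither variable alone says anything about y,
# but together they determine it completely.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

mi_x1 = mutual_information(x1, y)
mi_x2 = mutual_information(x2, y)
ii = interaction_information(x1, x2, y)
print(mi_x1, mi_x2, ii)
```

A selection rule based only on the marginal mutual information of each variable with the response would discard both variables here, whereas the interaction term reveals one full bit of joint information.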

Probabilistic modelling of the natural language

Information-theoretic and probabilistic modelling of natural language includes the analysis of discrete stochastic processes with specific types of dependence, which are quantified, e.g., by the rate of increase of the block entropy and by the length of the maximal repetition. Such processes exhibit statistical properties close to those found in natural language production, e.g., they satisfy Hilberg's hypothesis about a power-law increase of mutual information.
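In symbols, Hilberg's hypothesis can be stated roughly as follows. The notation is an assumption introduced here for illustration: $(X_i)$ is a stationary process over a finite alphabet, $H(n)$ is the entropy of a block of $n$ consecutive symbols, and $I(n)$ is the mutual information between two adjacent blocks of length $n$.

```latex
\begin{align}
  H(n) &= H(X_1, \dots, X_n), \\
  I(n) &= I(X_1, \dots, X_n ; X_{n+1}, \dots, X_{2n}) = 2H(n) - H(2n), \\
  I(n) &\propto n^{\beta}, \qquad 0 < \beta < 1 .
\end{align}
```

The second line uses stationarity, which gives both blocks the same entropy $H(n)$. For comparison, for an independent and identically distributed process $H(n) = n H(1)$, so $I(n) = 0$ for every $n$.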

Multilabel classification

Multilabel classification is an extension of the standard classification problem in which many target variables (called labels) are predicted simultaneously. Of particular interest is the construction of effective methods for high-dimensional data, where high dimensionality refers both to a large number of potential features and to the dimensionality of the response. The aim of the research is to develop algorithms for feature selection and prediction in this set-up, and to analyse their performance theoretically. These algorithms rely, among others, on regularization methods adapted to such models.
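The following small sketch shows what predicting many labels simultaneously means in practice: each observation carries a binary vector of labels, and a prediction can be judged label by label (Hamming loss) or as a whole vector (subset accuracy). The function names and the toy label vectors are assumptions chosen for illustration and do not describe the Group's algorithms.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of observations whose whole label vector is predicted exactly."""
    exact = sum(rt == rp for rt, rp in zip(y_true, y_pred))
    return exact / len(y_true)


# Four observations with three labels each (toy data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 0, 1]]

h = hamming_loss(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(h, acc)
```

Only 2 of the 12 individual labels are wrong, yet half of the observations have an incorrect label vector, which shows how the two criteria can diverge as the number of labels grows.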

Positive unlabeled learning

Learning from positive and unlabelled data (PU learning) has attracted increasing interest in the machine learning literature, as this type of data arises naturally in many applications (under-reporting, text classification, disease gene identification). In the case of PU data, we have access to positive examples and unlabelled examples, and each unlabelled example can be either positive or negative. We study the problem of building models as well as the problem of feature selection for PU data.
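The structure of PU data can be sketched with a short simulation. The code below assumes that each positive example is labelled independently with a fixed probability `label_frequency`, an assumption often called "selected completely at random" that is not stated in the text above. Under that assumption the probability of being positive equals the probability of being labelled divided by the label frequency. All names and numbers are hypothetical.

```python
import random


def make_pu_labels(y_true, label_frequency, rng):
    """Return the observed indicator s: s = 1 only for positives, each of
    which is labelled independently with probability label_frequency."""
    return [
        1 if y == 1 and rng.random() < label_frequency else 0
        for y in y_true
    ]


def corrected_posterior(p_labelled, label_frequency):
    """Under the assumption above, P(y=1|x) = P(s=1|x) / label_frequency.
    The result is clipped to 1 to remain a valid probability."""
    return min(1.0, p_labelled / label_frequency)


rng = random.Random(0)               # fixed seed for reproducibility
y_true = [1] * 500 + [0] * 500       # true classes, unobserved in practice
s = make_pu_labels(y_true, label_frequency=0.3, rng=rng)

# Positives hidden among the unlabelled examples.
hidden_positives = sum(1 for y, lab in zip(y_true, s) if y == 1 and lab == 0)
print(sum(s), hidden_positives)
```

Treating every unlabelled example as negative would therefore mislabel all the hidden positives, which is why model building and feature selection need dedicated methods in this setting.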