Statistical Analysis and Modeling Group

Welcome to the website of the Statistical Analysis and Modeling Group! The Group is part of the Institute of Computer Science of the Polish Academy of Sciences. The Group's research activities concern probabilistic and statistical modeling of natural phenomena and statistical inference for the constructed models.

The Group maintains strong links with the Faculty of Mathematics and Information Sciences of the Warsaw University of Technology where several of its members teach courses and pursue joint research.

The seminar of the Statistical Analysis and Modeling Group usually takes place on Mondays at 10:00 AM in room 234 on the second floor of the Institute of Computer Science of the Polish Academy of Sciences (IPI PAN). The talks are usually delivered in Polish.

If you are interested in research or business collaboration related to data analysis and modeling, please contact us.

People

Prof. dr hab. Szymon Jaroszewicz

Full Professor, Head of the group

uplift modelling
causal discovery
machine learning
numerical computation with random variables

Personal webpage ORCID

dr hab. Łukasz Dębowski

Associate Professor

information theory
discrete stochastic processes
statistical language modelling

Personal webpage ORCID

dr Małgorzata Łazęcka

Assistant Professor

information theory
feature selection
probabilistic graphical models
Bayesian statistics

Personal webpage ORCID

dr Dariusz Kalociński

Assistant Professor

computable structure theory
algorithmic learning
philosophy of mathematics

Personal webpage ORCID

Prof. Stan Matwin

Full Professor

text mining
bioinformatics
data privacy

Personal webpage ORCID

prof. dr hab. Jan Mielniczuk

Full Professor

information theory
feature selection
dependence analysis
time series analysis
nonparametric methods of mathematical statistics
positive-unlabeled learning
distribution shift

Personal webpage ORCID

dr Krzysztof Rudaś

Assistant Professor

uplift modelling
causal discovery
probabilistic graphical models
Bayesian statistics

Personal webpage ORCID

dr hab. Paweł Teisseyre

Associate Professor

multi-label classification
feature selection
information theory
positive-unlabeled learning
distribution shift

Personal webpage ORCID

Research topics

Below is a selection of research topics that team members are working on.

Uplift modelling

Uplift modelling concerns modelling of individual treatment effects (e.g., marketing campaign or medical therapy) by taking into account a control group not subjected to the treatment. Specially tailored methods for such cases include adaptations of the Support Vector Machine methods and the Committees of Classifiers approaches. The theory of linear models for the uplift case is also being developed.
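As a rough illustration of the idea of comparing a treated group with a control group, the sketch below estimates uplift per segment as the difference between the mean outcome of treated and control individuals. It is a deliberately simple baseline, not one of the Group's methods (which rely on SVMs, committees of classifiers, and linear models); the function name `uplift_by_segment` and the toy data are hypothetical.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, treated, outcome) records.

    `treated` is 1 for the treatment group and 0 for the control group.
    Uplift is mean(outcome | treated) - mean(outcome | control).
    Segments lacking either group are skipped.
    """
    # Per segment: [sum of treated outcomes, treated count,
    #               sum of control outcomes, control count]
    stats = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for segment, treated, outcome in records:
        s = stats[segment]
        if treated:
            s[0] += outcome
            s[1] += 1
        else:
            s[2] += outcome
            s[3] += 1

    uplift = {}
    for segment, (sum_t, n_t, sum_c, n_c) in stats.items():
        if n_t > 0 and n_c > 0:
            uplift[segment] = sum_t / n_t - sum_c / n_c
    return uplift


# Hypothetical toy data: (segment, treated, outcome)
records = [
    ("young", 1, 1), ("young", 1, 1), ("young", 0, 0), ("young", 0, 1),
    ("old", 1, 0), ("old", 1, 1), ("old", 0, 1), ("old", 0, 1),
]
print(uplift_by_segment(records))
```

In this toy data the treatment appears to help the "young" segment and to hurt the "old" one, which is the kind of heterogeneity uplift models try to capture.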

Krzysztof Rudaś, Szymon Jaroszewicz, Regularization for Uplift Regression, ECML/PKDD, 2023.

Krzysztof Rudaś, Szymon Jaroszewicz, Shrinkage Estimators for Uplift Regression, ECML/PKDD, 2019.

Krzysztof Rudaś, Szymon Jaroszewicz, Linear regression for uplift modeling, Data Mining and Knowledge Discovery, 2018.

Piotr Rzepakowski, Szymon Jaroszewicz, Decision trees for uplift modeling with single and multiple treatments, Knowledge and Information Systems, 2012.

Positive unlabeled learning

Learning from positive and unlabelled data (PU learning) has attracted increasing interest in the machine learning literature, as this type of data arises naturally in many applications (under-reporting, text classification, disease gene identification). In the case of PU data, we have access to positive examples and unlabelled examples; the unlabelled examples can be either positive or negative. We study the problem of building models as well as the problem of feature selection for PU data.
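To make the PU setting concrete, the sketch below works under the Selected Completely at Random (SCAR) assumption, which is standard in the PU literature: every positive example is labelled with the same probability c (the label frequency), so that P(s=1|x) = c · P(y=1|x). This is an assumption for illustration only; several of the papers listed below address precisely the cases where it fails. The function names are hypothetical.

```python
def posterior_from_labeling_prob(p_labeled, label_frequency):
    """Recover P(y=1|x) from P(s=1|x) under the SCAR assumption.

    Under SCAR, P(s=1|x) = c * P(y=1|x), where c is the label frequency,
    so P(y=1|x) = P(s=1|x) / c. The result is clipped to at most 1.
    """
    if not 0.0 < label_frequency <= 1.0:
        raise ValueError("label_frequency must lie in (0, 1]")
    return min(1.0, p_labeled / label_frequency)


def class_prior(n_labeled, n_total, label_frequency):
    """Estimate P(y=1) from the share of labelled examples under SCAR.

    Averaging the SCAR identity over x gives P(s=1) = c * P(y=1).
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return posterior_from_labeling_prob(n_labeled / n_total, label_frequency)


# Hypothetical numbers: 100 labelled examples out of 1000,
# with one positive in four assumed to be labelled (c = 0.25).
print(class_prior(100, 1000, 0.25))
```

In practice c is unknown and must itself be estimated, which is one of the questions studied in the class prior estimation work listed below.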

Wojciech Rejchel, Paweł Teisseyre, Jan Mielniczuk, Joint empirical risk minimization for instance-dependent positive-unlabeled data, Knowledge-Based Systems, 2024.

Jan Mielniczuk, Adam Wawrzeńczyk, Augmented prediction of a true class for Positive Unlabeled data under selection bias, ECAI'24, 2024.

Paweł Teisseyre, Konrad Furmanczyk, Jan Mielniczuk, Verifying the Selected Completely at Random Assumption in Positive-Unlabeled Learning, ECAI’24, 2024.

Adam Wawrzeńczyk, Jan Mielniczuk, One-class classification approach to variational learning from biased positive unlabelled data, ECAI’23, 2023.

Konrad Furmanczyk, Jan Mielniczuk, Wojciech Rejchel, Paweł Teisseyre, Double Logistic Regression Approach to Biased Positive-Unlabeled Data, Proceedings of the European Conference on Artificial Intelligence ECAI’23, 2023.

Małgorzata Łazęcka, Jan Mielniczuk, Paweł Teisseyre, Estimating the class prior for positive and unlabelled data via logistic regression, Advances in Data Analysis and Classification, 2021.

Paweł Teisseyre, Jan Mielniczuk, Małgorzata Łazęcka, Different strategies of fitting logistic regression for positive and unlabelled data, ICCS’20, 2020.

Statistical analysis and modelling of natural language

We investigate properties of natural language and large language models using tools from statistics and information theory. We model texts as stochastic processes and seek theoretical links between various empirical laws of quantitative linguistics by proving theorems and constructing toy examples. Our specific interests cover non-ergodicity, excess entropy, Santa Fe processes, Zipf's law, Hilberg's law (neural scaling law), maximal repetition length, and long-range dependence.
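As a small illustration of one of the empirical laws mentioned above, Zipf's law states that the frequency of the r-th most frequent word decays roughly as a power of r. The sketch below builds a rank-frequency list and fits the exponent by ordinary least squares on log-log axes. This naive fit is chosen for brevity and is not the estimation procedure used in the papers listed below; the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent."""
    return sorted(Counter(tokens).values(), reverse=True)


def zipf_exponent(frequencies):
    """Fit f(r) ~ C / r**alpha by least squares on log-log axes.

    `frequencies` must be positive and ordered by rank (rank 1 first).
    Returns the estimated exponent alpha (the negated slope).
    """
    if len(frequencies) < 2:
        raise ValueError("at least two ranks are needed")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic frequencies following an exact 1/r law (hypothetical data).
synthetic = [1200 / rank for rank in range(1, 11)]
print(zipf_exponent(synthetic))

# Rank-frequency list of a tiny toy text.
print(rank_frequency("a b a c a b".split()))
```

On real text the fitted exponent depends on the range of ranks used, so a log-log regression of this kind is best treated as a first look rather than a reliable estimate.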

Łukasz Dębowski, Corrections of Zipf’s and Heaps’ Laws Derived from Hapax Rate Models, Journal of Quantitative Linguistics, 2025.

Łukasz Dębowski, Universal Densities Exist for Every Finite Reference Measure, IEEE Transactions on Information Theory, 2023.

Łukasz Dębowski, Information Theory Meets Power Laws: Stochastic Processes and Language Models, John Wiley and Sons, 2020.

Variable selection and interaction detection for high-dimensional data

Variable selection is studied for high-dimensional data, including the case of model misspecification. Moreover, approaches based on information-theoretic measures such as mutual information and interaction information have been studied, in particular methods that take higher-order interactions into account. We are also interested in detecting interactions using information-theoretic measures in the context of finding gene-gene interactions.
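To illustrate why interaction measures matter for feature selection, the sketch below computes plug-in (empirical) estimates of mutual information and interaction information for discrete variables. It uses the sign convention II(X1; X2; Y) = I((X1, X2); Y) - I(X1; Y) - I(X2; Y); other conventions exist in the literature. The function names and the XOR toy data are hypothetical, and plug-in estimates are biased for small samples.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marg_x = Counter(xs)
    marg_y = Counter(ys)
    return sum(
        (count / n) * math.log2(count * n / (marg_x[x] * marg_y[y]))
        for (x, y), count in joint.items()
    )


def interaction_information(x1, x2, y):
    """II(X1; X2; Y) = I((X1, X2); Y) - I(X1; Y) - I(X2; Y), in bits."""
    pair = list(zip(x1, x2))
    return (
        mutual_information(pair, y)
        - mutual_information(x1, y)
        - mutual_information(x2, y)
    )


# XOR toy data: neither feature alone is informative about y,
# but the pair determines y completely.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]

print(mutual_information(x1, y))
print(interaction_information(x1, x2, y))
```

A selection rule that scores features one at a time would discard both XOR features, whereas a criterion that includes an interaction term can retain them.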

Barbara Żogała-Siudem, Szymon Jaroszewicz, Variable screening for Lasso based on multidimensional indexing, Data Mining and Knowledge Discovery, 2023.

Piotr Pokarowski, Wojciech Rejchel, Agnieszka Sołtys, Michał Frej, Jan Mielniczuk, Improving Lasso for model selection and prediction, Scandinavian Journal of Statistics, 2022.

Barbara Żogała-Siudem, Szymon Jaroszewicz, Fast stepwise regression based on multidimensional indexes, Information Sciences, 2021.

Tomasz Klonecki, Paweł Teisseyre, Jaesung Lee, Cost-constrained feature selection in multilabel classification using an information-theoretic approach, Pattern Recognition, 2023.

Paweł Teisseyre, Jaesung Lee, Multilabel all-relevant feature selection using lower bounds of conditional mutual information, Expert Systems with Applications, 2023.

Małgorzata Łazęcka, Jan Mielniczuk, Squared Error Based Shrinkage Estimators of Discrete Probabilities and Their Application to Feature Selection, Statistical Papers, 2022.

Mariusz Kubkowski, Jan Mielniczuk, Paweł Teisseyre, How to Gain on Power: Novel Conditional Independence Tests Based on Short Expansion of Conditional Mutual Information, Journal of Machine Learning Research, 2021.

Computable structure theory

Computable structure theory studies the relationship between descriptional/computational complexity and countable mathematical structures, using computability-theoretic tools such as Turing degrees, the arithmetic and the hyperarithmetic hierarchies, etc. Recently, an increasing amount of work in computable structure theory has focused on structures that can be represented by primitive recursive algorithms. Independently, this line of research has the potential to inform ongoing debates in the philosophy of mathematics, particularly in mathematical structuralism. Some of these studies are conducted as part of the ongoing NCN OPUS grant "Computable Structure Theory, and Philosophy of Mathematical Structuralism" (no. 2023/49/B/HS1/03930), held by dr Dariusz Kalociński.

Dariusz Kalociński, Luca San Mauro, Michał Wrocławski, Punctual presentability in certain classes of algebraic structures, MFCS 2024.

Nikolay Bazhenov, Dariusz Kalociński, Degree spectra, and relative acceptability of notations, CSL 2023.

Nikolay Bazhenov, Dariusz Kalociński, Michał Wrocławski, Intrinsic complexity of recursive functions on natural numbers with standard order, STACS 2022.

Multilabel classification

Multilabel classification is an extension of the standard classification problem in which many target variables (called labels) are predicted simultaneously. Of particular interest is the construction of effective methods for high-dimensional data, where high dimensionality refers both to a large number of potential features and to the dimensionality of the response. The aim of the research is to develop algorithms for feature selection and prediction in this set-up and to analyse their performance theoretically. They rely, among others, on regularization methods adapted for such models.

Paweł Teisseyre, Classifier chains for positive unlabelled multi-label learning, Knowledge-Based Systems, 2021.

Paweł Teisseyre, Learning classifier chains using matrix regularization: application to multi-morbidity prediction, ECAI’20, 2020.

Paweł Teisseyre, Damien Zufferey, Marta Slomka, Cost-sensitive classifier chains: Selecting low-cost features in multi-label classification, Pattern Recognition, 2019.

Distribution shift

Traditional supervised learning methods are based on the classic assumption that training and test data are drawn from the same probability distribution. However, in many real applications this assumption is not met, i.e. the model is trained on data from one distribution (the source distribution) and then applied to new data that may come from another distribution (the target distribution). For example, in medical applications, a model predicting the occurrence of a disease can be trained on data about patients from one country and then used to predict the disease for patients from another country. In object recognition in images, the training set may contain objects captured in different scenery than the objects in the test data. The aim of the research is to develop methods that detect such situations and to build models that work effectively under various types of distribution shift.
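One well-known special case is label shift, where the class proportions differ between source and target while the class-conditional feature distributions stay the same. The sketch below applies the classical prior-ratio correction, reweighting source posteriors by the ratio of target to source class priors and renormalizing. It assumes the target priors are known, which is rarely true in practice, and it is not the Conditional Probability Shift Model from the manuscript listed below. The function name is hypothetical.

```python
def adjust_for_label_shift(source_posterior, source_prior, target_prior):
    """Re-weight class posteriors for a change in class priors.

    Assumes pure label shift: p(x | y) is the same in both domains, so
    p_target(y | x) is proportional to
    p_source(y | x) * p_target(y) / p_source(y).
    All three arguments are lists indexed by class.
    """
    if not (len(source_posterior) == len(source_prior) == len(target_prior)):
        raise ValueError("all inputs must have one entry per class")
    weights = [
        post * tgt / src
        for post, src, tgt in zip(source_posterior, source_prior, target_prior)
    ]
    total = sum(weights)
    if total == 0:
        raise ValueError("adjusted posterior has zero total mass")
    return [w / total for w in weights]


# Hypothetical example: a model trained on balanced classes is undecided
# about an example, but the disease is rare (10%) in the target population.
print(adjust_for_label_shift([0.5, 0.5], [0.5, 0.5], [0.9, 0.1]))
```

When source and target priors coincide, the correction leaves the posteriors unchanged, as expected.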

Paweł Teisseyre, Jan Mielniczuk, A generalized approach to label shift: the Conditional Probability Shift Model, manuscript, 2025.