Welcome to the website of the Statistical Analysis and Modeling Group! The Group is part of the Institute of Computer Science of the Polish Academy of Sciences. The Group's research concerns probabilistic and statistical modeling of natural phenomena and statistical inference for the constructed models.
The Group maintains strong links with the Faculty of Mathematics and Information Sciences of the Warsaw University of Technology, where several of its members teach courses and pursue joint research.
The seminar of the Statistical Analysis and Modeling Group usually takes place on Mondays at 10:00 AM in room 234 on the second floor of the Institute of Computer Science of the Polish Academy of Sciences (IPI PAN). The talks are usually delivered in Polish.
If you are interested in research or business collaboration related to data analysis and modeling, please contact us.
Below is a selected list of research topics that team members are working on.
Uplift modelling concerns the modelling of individual treatment effects (e.g., of a marketing campaign or a medical therapy) by taking into account a control group not subjected to the treatment. Methods specially tailored to such cases include adaptations of Support Vector Machines and of committees of classifiers. The theory of linear models for the uplift case is also being developed.
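To illustrate the role of the control group, here is a minimal sketch that estimates uplift within each segment as the difference between the response rate of treated individuals and that of controls. It is not one of the Group's methods: the function name `uplift_by_segment` and the record layout `(segment, treated, responded)` are assumptions made for this illustration only.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate uplift per segment from (segment, treated, responded) records.

    Uplift is taken here as: response rate among treated
    minus response rate among controls (a simple illustrative estimator).
    """
    # For each segment keep [responders, total] for treated (True) and control (False).
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, responded in records:
        cell = counts[segment][bool(treated)]
        cell[0] += int(bool(responded))
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        responders_t, total_t = groups[True]
        responders_c, total_c = groups[False]
        if total_t == 0 or total_c == 0:
            # Without both groups the treatment effect cannot be estimated.
            continue
        uplift[segment] = responders_t / total_t - responders_c / total_c
    return uplift
```

A segment with a high response rate but zero uplift responds equally well without the treatment, which is why a classifier trained on treated cases alone is not sufficient.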
Learning from positive and unlabelled data (PU learning) has attracted increasing interest in the machine learning literature, as this type of data arises naturally in many applications (underreporting, text classification, disease gene identification). In the case of PU data, we have access to positive examples and unlabelled examples. Unlabelled examples can be either positive or negative. We study the problem of building models as well as the problem of feature selection for PU data.
We investigate properties of natural language and large language models using tools from statistics and information theory. We model texts as stochastic processes and seek theoretical links between various empirical laws of quantitative linguistics by proving theorems and constructing toy examples. Our specific interests cover non-ergodicity, excess entropy, Santa Fe processes, Zipf's law, Hilberg's law (neural scaling law), maximal repetition length, and long-range dependence.
Variable selection is studied for high-dimensional data. The problem is also investigated for the model misspecification case. Moreover, approaches based on information-theoretic measures such as mutual information and interaction information have been studied, in particular methods that take higher-order interactions into account. We are also interested in detecting interactions using information-theoretic measures in the context of finding gene-gene interactions.
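As a small illustration of the information-theoretic measures mentioned above, the following sketch computes the plug-in (empirical) estimate of mutual information between two discrete variables, in bits. The function name `mutual_information` is an assumption for this example, and the plug-in estimator is only the simplest choice, not necessarily the one used in the Group's work.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    count_x = Counter(xs)         # marginal counts of x
    count_y = Counter(ys)         # marginal counts of y

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

A simple filter for variable selection could rank candidate features by this value against the response, although such a ranking ignores the higher-order interactions that interaction information is meant to capture.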
Computable structure theory studies the relationship between descriptional/computational complexity and countable mathematical structures, using computability-theoretic tools such as Turing degrees and the arithmetic and hyperarithmetic hierarchies. Recently, an increasing amount of work in computable structure theory has focused on structures that can be represented by primitive recursive algorithms. Independently, this line of research has the potential to inform ongoing debates in the philosophy of mathematics, particularly in mathematical structuralism. Some of these studies are conducted as part of the ongoing NCN OPUS grant "Computable Structure Theory, and Philosophy of Mathematical Structuralism" (no. 2023/49/B/HS1/03930), held by Dr Dariusz Kalociński.
Multilabel classification is an extension of the standard classification problem in which many target variables (called labels) are predicted simultaneously. Of particular interest is the construction of effective methods for high-dimensional data, where high dimensionality refers both to a large number of potential features and to the dimensionality of the response. The aim of the research is to develop algorithms for feature selection and prediction in this set-up and to analyse their performance theoretically. These algorithms rely on, among other things, regularization methods adapted for such models.
Traditional supervised learning methods are based on the classic assumption that training and test data are drawn from the same probability distribution. However, in many real applications this assumption is not met: the model is trained on data from one distribution (the source distribution) and then applied to new data that may come from another distribution (the target distribution). For example, in medical applications, a model predicting the occurrence of a disease can be trained on data about patients from one country and then used to predict the disease for patients from another country. In object recognition in images, the training set may contain objects captured in a different setting than the objects in the test data. The aim of the research is to develop methods that detect such situations and to build models that work effectively under various types of distribution shift.