Freiburg, 13/03/2025
Big Data, high-performance computers and AI are topics that have received a lot of attention lately. But how do scientists address problems when the available amounts of data are small? Since October 2023, researchers in the Collaborative Research Centre (CRC) “Small Data” have been studying and developing methods for such small data applications. Currently, thirty-six doctoral candidates are involved in this endeavour. We sat down with Maren Hackenberg, Lennart Purucker and Esma Secen to chat about their PhD projects, their experiences with interdisciplinary collaborations and the impact they hope the CRC will have.
Esma Secen (left), Maren Hackenberg (2nd from left) and Lennart Purucker (right) in conversation with Verena Krall (centre). Photo: Tobias Kupries-Thomma / University of Freiburg
Esma Secen: My project bridges the fields of biology and medicine and focusses on unravelling the genetic basis and molecular mechanisms of a rare neurodevelopmental disease, which is a form of intellectual disability. I aim to characterise gene mutations that lead to this disorder but have not previously been associated with it. These genetic determinants have hardly been studied before. Currently, my research involves wet lab experiments with neuronal cells, where I employ CRISPR/Cas9 technology to precisely edit genes of interest and investigate their functional roles. Later, I am also planning to work with model organisms such as zebrafish to study the broader developmental and systemic effects of these genetic mutations.
Maren Hackenberg: My background is in mathematics and I am working on modelling dynamic processes. The applications that I consider differ a lot: I collaborate with researchers from biological and clinical fields. But they all have one thing in common: the available data is limited. With a combination of tools, for example from mathematical modelling, statistical estimation and Machine Learning, I am developing methods to capture these dynamic processes. These methods are not only useful for the specific application, but also transferable to other disciplines.
Lennart Purucker: My PhD project is about fundamental research on tabular data. I want to apply Artificial Intelligence or, more specifically, Machine Learning methods to perform predictive tasks based on tabular data. For example, given a dataset on the clinical effectiveness of a drug for patients of varying age, symptoms and medical history, I want to predict how a new patient would react to this drug.
Maren Hackenberg: No, this term has to be put into perspective: the data is small relative to the amount of input that your model expects, or with respect to the level of noise or heterogeneity in your data. If the outcomes of one experiment are all very similar, then studying 50 cases may lead to the same results as studying 1000. However, if the 50 cases fall into 15 different subgroups, the problem at hand becomes very complex and your dataset is small.
Esma Secen: There are a multitude of reasons for it. A research group might not have enough time, money or personnel to collect a large dataset. There are also cases where the available data is naturally very limited. For example, the disease I study is already rare, and I focus on only a subset of mutations that may cause it. Furthermore, the effects of a genetic mutation depend on the specific circumstances: In what climate do the patients live, what do they eat and drink? Thus, in principle, we would need data from all over the world to make general predictions about this disease and cover its entire complexity.
Lennart Purucker: I develop fundamental methods, which should ideally work for use cases from many different disciplines. I test my methods on a variety of tables to see whether they work as they should. However, true, unfiltered feedback only reaches me through collaborations: Is my method useful for addressing the research questions of a given discipline? Do the predictions of my Machine Learning model also make sense to an expert in the specific field? With this input, I can then realign my work with the existing practical issues, which greatly increases its relevance. Dealing with concrete applications of my research also gives me a sense of meaning, because I see how my work really improves the world in some way.
Maren Hackenberg: Being exposed to perspectives from completely different fields is very inspiring for me. I get to rethink my approach to scientific problems and come across new ideas when talking to someone with another academic background. Also, I hope that it prevents me from reinventing the wheel: Many problems have been solved before in other disciplines, just under a different name than I would expect. Engaging with experts from various fields helps me to build on existing solutions and focus on new challenges that have not yet been addressed.
Esma Secen: Explaining my own research to someone outside of my field helps me a lot. Often, I am no longer aware of the technical jargon I tend to use. When talking to others, I am forced to focus on the key messages of my work. This reminds me of my main motivation and provides the broader picture relevant to my research.
Lennart Purucker: One big issue when using Machine Learning for small data is overfitting: The AI model does not understand the given dataset correctly because it focusses too much on the wrong parts of the data and thus generates wrong output. What I would love to develop at some point is a machine learning model that produces not only a result for a given task but also understands its own uncertainties, the questions it can and cannot answer given the amount and quality of the input data.
Esma Secen: To me, what is beautiful about the CRC is that it creates mutual understanding between the disciplines involved. This is something I would like to see in more areas of academia: A community striving towards a common goal, sharing their knowledge and insights without worrying so much about competition.
Maren Hackenberg: Often, groups are underrepresented in policy decisions because they do not have the resources to collect large amounts of data regarding their concerns. In a best-case scenario, research on small data could enable them to make the most of the data they have and thereby help democratise the usage of data.
Maren Hackenberg studied Mathematics and Classical Languages at the University of Freiburg and the University La Sapienza in Rome, Italy, and completed her Master’s degree in Mathematics at the University of Freiburg. Since 2020, she has been pursuing a PhD at the Institute of Medical Biometry and Statistics, where she works on methods for modelling dynamic processes in clinical and biomedical applications, using a combination of approaches from mathematical modelling, statistics, and deep learning. Since 2023, she has been part of the Small Data CRC.
Lennart Purucker has been a PhD student at the University of Freiburg since 2023 as part of the Small Data Initiative (CRC 1597, Project C05). His research interest is in Artificial Intelligence, with a focus on Machine Learning for small data. Mr. Purucker’s primary focus is on tabular data (e.g., Excel sheets), but he also works on vision, text, and time series data.
Esma Secen studied Molecular Biology and Genetics at University of Onsekiz Mart Canakkale, Turkey, and completed her Master’s degree in Molecular Medicine, majoring in Neurology, at Friedrich-Schiller-University Jena, Germany. Since 2023, she has been pursuing her PhD at the Small Data CRC, focussing on dissecting the molecular basis of monogenic neurodevelopmental disorders and investigating the genetic mechanisms underlying intellectual disability in humans.
Artificial intelligence (AI) techniques typically require large data sets, also called “big data”. Biomedical data sets, on the other hand, often only comprise a relatively small number of observations. These “small data” applications may seem more manageable at first glance, but they make it much more difficult to use data-hungry AI approaches. The Collaborative Research Centre 1597 “Small Data” is developing methods for using AI techniques and modelling to discover complex patterns even in such relatively small data sets. This requires a highly interdisciplinary approach that combines expertise from computer science, mathematics, statistics, medicine and systems modelling – and establishes a shared language among researchers from the different disciplines. The German Research Foundation (DFG) is funding the CRC with over 11 million euros until June 2027. If continuation applications are successful, the new CRC could run for a total of twelve years. The spokesperson is Prof. Dr Harald Binder, Professor of Medical Biometry and Statistics at the Medical Faculty of the University of Freiburg and the Medical Centre.