IDADrugDesign: Intelligent data acquisition for drug design

A recurring problem for machine learning methods in drug discovery is the need for standardized data. Methods for producing new data exist, and there is interest in using them, but material and budget constraints make it desirable for each iteration of data production to be as efficient as possible.

We investigate Active Learning in which model uncertainty, measured as the margin in model decisiveness, guides data acquisition. We demonstrate that models perform better with Active Learning than with random data acquisition, independent of the machine learning model and the starting knowledge.
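As an illustration of margin-based acquisition, the following is a minimal sketch and not the implementation used in the project. It assumes a classifier that outputs class probabilities for each unlabeled candidate, and it takes the margin to be the difference between the two highest probabilities. The names `margin`, `select_queries`, and `pool_probabilities` are hypothetical.

```python
def margin(probabilities):
    """Difference between the two highest class probabilities of one sample.

    A small margin means the model is undecided between its top two classes.
    """
    top, runner_up = sorted(probabilities, reverse=True)[:2]
    return top - runner_up


def select_queries(pool_probabilities, batch_size):
    """Return the indices of the batch_size samples with the smallest margin."""
    ranked = sorted(
        range(len(pool_probabilities)),
        key=lambda i: margin(pool_probabilities[i]),
    )
    return ranked[:batch_size]


# Hypothetical predicted probabilities for four unlabeled candidates.
pool_probabilities = [
    [0.95, 0.05],  # confident prediction, large margin
    [0.55, 0.45],
    [0.70, 0.30],
    [0.51, 0.49],  # least decisive prediction, smallest margin
]

queries = select_queries(pool_probabilities, batch_size=2)
print(queries)
```

In an acquisition loop, the selected candidates would be labeled (for example by running the experiment), added to the training set, and the model retrained before the next round. Random acquisition corresponds to drawing the same number of indices uniformly from the pool.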

We also study the multi-objective optimization problem of combinatorial library design. Here we present a framework that can process the output of generative models for molecular design and return an optimized library design. The results show that the framework successfully optimizes a library based on molecule availability, which the framework also attempts to identify using retrosynthesis prediction. We conclude that the next step in intelligent data acquisition is to combine the two methods and create a library design model that uses the information from previous libraries to guide subsequent designs.
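To make the library design problem concrete, here is a toy sketch under stated assumptions; it is not the published framework. It assumes a two-component library in which every chosen A block is combined with every chosen B block, a hypothetical dictionary `scores` holding a desirability score for each product proposed by a generative model, and a hypothetical set `available` standing in for the availability information that the framework would obtain, for instance, from retrosynthesis prediction. The two objectives are combined into one value by a simple penalty, and the search is exhaustive, which is only feasible for very small examples.

```python
from itertools import combinations, product


def design_library(scores, available, n_a, n_b, penalty=1.0):
    """Pick n_a A blocks and n_b B blocks maximizing score minus availability penalty.

    scores:    dict mapping (a_block, b_block) to a product score (assumed input)
    available: set of building blocks assumed to be obtainable
    penalty:   cost subtracted for each selected block that is not available
    """
    a_blocks = sorted({a for a, _ in scores})
    b_blocks = sorted({b for _, b in scores})
    best, best_value = None, float("-inf")
    for rows in combinations(a_blocks, n_a):
        for cols in combinations(b_blocks, n_b):
            # Products that the generative model did not propose score 0.
            value = sum(scores.get(pair, 0.0) for pair in product(rows, cols))
            # Penalize every selected block that is not available.
            value -= penalty * sum(block not in available for block in rows + cols)
            if value > best_value:
                best, best_value = (rows, cols), value
    return best, best_value


# Hypothetical scored products; block a3 gives the best products but is unavailable.
scores = {
    ("a1", "b1"): 0.9, ("a1", "b2"): 0.8, ("a1", "b3"): 0.2,
    ("a2", "b1"): 0.7, ("a2", "b2"): 0.6,
    ("a3", "b1"): 0.95, ("a3", "b2"): 0.9,
}
available = {"a1", "a2", "b1", "b2", "b3"}

library, value = design_library(scores, available, n_a=2, n_b=2)
print(library, value)
```

With the penalty in place, the search selects the blocks a1, a2 and b1, b2, even though a library containing a3 would have a higher raw score. Setting `penalty=0.0` removes the availability objective and lets a3 back in.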

For further information, contact Simon Johansson.


Members: Simon Johansson, Alexander Schliep. Collaborators: Ola Engkvist (AstraZeneca), Morteza Chehreghani (Chalmers).


Johansson et al. Diverse Data Expansion with Semi-Supervised k-Determinantal Point Processes. In 2023 IEEE International Conference on Big Data (BigData), IEEE Computer Society, 5260–5265, Dec 2023.

Johansson et al. De Novo Generated Combinatorial Library Design. Digital Discovery 2023, issue 1, 2024, 122–135. First published 27 Nov 2023.

Viet Johansson et al. Using active learning to develop machine learning models for reaction yield prediction. Molecular Informatics 2022, 41, 2200043.

Johansson et al. AI-assisted synthesis prediction. Drug Discovery Today: Technologies 2020.