Polymer materials are fundamental to countless applications across industries, from automotive components to medical devices. The ability to predict polymer properties from molecular structure alone represents a significant advancement in materials science, enabling rapid screening of new polymer designs without costly experimental synthesis and testing.
Business Problem
Traditional polymer development relies heavily on experimental trial and error, which is slow and expensive. This project addresses the need for computational tools that predict key polymer properties directly from molecular structure, accelerating materials discovery and reducing development costs.
Project Goals
Develop a robust multi-output neural network for simultaneous prediction of five key polymer properties
Handle extreme data sparsity (>90% missing values for most properties) using advanced imputation techniques
Extract meaningful molecular features from SMILES representations using cheminformatics
Optimize model performance through systematic hyperparameter tuning
Create an interpretable pipeline for polymer property prediction
Target Properties
The model predicts five critical polymer properties that determine material performance:
Tg (Glass Transition Temperature): Critical temperature affecting flexibility and processing conditions
FFV (Fractional Free Volume): Measure of polymer porosity, crucial for membrane and barrier applications
Tc (Critical Temperature): Thermodynamic property for phase transitions, important for processing stability
Density: Mass per unit volume, affecting mechanical properties and applications
Rg (Radius of Gyration): Measure of polymer chain size, related to molecular mobility
Figure 1: Typical molecular structure representation (SMILES) of a polymer unit used for property prediction.
Dataset Characteristics
The dataset contains 7,973 polymer samples with SMILES representations and property measurements. Key challenges include:
Extreme Sparsity: Most properties have >90% missing values
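One simple way to quantify this sparsity is to compute the fraction of missing entries in each property column. The sketch below assumes the data is loaded into a pandas DataFrame with one column per target property. The column names, the helper name `missing_fraction`, and the toy values are illustrative assumptions, not the project's actual data.

```python
import numpy as np
import pandas as pd

# Assumed column names for the five target properties.
TARGETS = ["Tg", "FFV", "Tc", "Density", "Rg"]

def missing_fraction(df: pd.DataFrame, columns) -> pd.Series:
    """Return the fraction of NaN entries per column, most sparse first."""
    return df[list(columns)].isna().mean().sort_values(ascending=False)

# Toy frame with made-up numbers, only to demonstrate the call.
toy = pd.DataFrame({
    "SMILES": ["*CC*", "*CC(*)C", "*CC(*)c1ccccc1", "*OCC*"],
    "Tg": [np.nan, np.nan, 100.0, np.nan],
    "FFV": [0.35, 0.37, np.nan, 0.36],
    "Tc": [np.nan, 0.2, np.nan, np.nan],
    "Density": [np.nan, np.nan, np.nan, 1.1],
    "Rg": [np.nan, np.nan, np.nan, np.nan],
})

sparsity = missing_fraction(toy, TARGETS)
print(sparsity)
```

Running the same function on the full dataset would show which properties fall above the 90% missing threshold and which are comparatively well populated.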
This project implements a comprehensive machine learning pipeline combining cheminformatics, advanced imputation, and deep learning to predict polymer properties from molecular structure.
Data Preprocessing & Feature Engineering
SMILES Validation: Verified molecular structure validity using RDKit
Molecular Descriptor Calculation: Extracted 200+ chemical descriptors including molecular weight, logP, topological indices, and electronic properties
Feature Selection: Removed constant and highly correlated features to reduce dimensionality
Data Normalization: Applied StandardScaler to handle different property scales
Missing Data Analysis: Comprehensive assessment of sparsity patterns across properties
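The validation, descriptor, and feature-selection steps above can be sketched with RDKit and pandas. This is a minimal illustration, not the project's exact code: the helper names (`featurize`, `build_feature_table`, `drop_uninformative`) and the 0.95 correlation threshold are assumptions. `Descriptors.descList` is RDKit's built-in list of (name, function) descriptor pairs.

```python
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def featurize(smiles: str):
    """Return a dict of RDKit descriptors, or None if the SMILES does not parse."""
    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    if mol is None:
        return None
    feats = {}
    for name, fn in Descriptors.descList:
        try:
            feats[name] = fn(mol)
        except Exception:
            # Defensive: record a descriptor that fails on this molecule as missing.
            feats[name] = np.nan
    return feats

def build_feature_table(smiles_list) -> pd.DataFrame:
    """Featurize every valid SMILES; rows are indexed by the SMILES string."""
    rows = {}
    for s in smiles_list:
        feats = featurize(s)
        if feats is not None:
            rows[s] = feats  # duplicate SMILES collapse into one row
    return pd.DataFrame.from_dict(rows, orient="index")

def drop_uninformative(X: pd.DataFrame, corr_threshold: float = 0.95) -> pd.DataFrame:
    """Drop constant columns, then one column from each highly correlated pair."""
    X = X.replace([np.inf, -np.inf], np.nan)
    X = X.loc[:, X.nunique(dropna=True) > 1]          # remove constant columns
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return X.drop(columns=to_drop)

# "C1CC" has an unclosed ring, so it is rejected during validation.
table = build_feature_table(["*CC*", "*CC(*)c1ccccc1", "*OCC*", "C1CC"])
filtered = drop_uninformative(table)
print(table.shape, filtered.shape)
```

The filtered table can then be passed to scikit-learn's `StandardScaler`, fitted on the training split only, so that descriptors on very different scales contribute comparably.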
Advanced Imputation Strategy
Addressed the extreme sparsity challenge using materials-specific techniques:
MatImputer Algorithm: Specialized imputation method designed for materials science data
Physical Relationship Preservation: Maintains correlations between related polymer properties
Iterative Refinement: Multiple imputation rounds to improve accuracy
Validation Strategy: Cross-validation to assess imputation quality
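MatImputer's interface is not shown in this write-up, so the sketch below swaps in scikit-learn's `IterativeImputer` as a stand-in to illustrate the validation idea: hide a fraction of the known entries, impute them, and score the imputed values against the hidden truth. The helper name `masked_imputation_mae`, the 20% holdout fraction, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer

def masked_imputation_mae(X, holdout_frac=0.2, seed=0):
    """Hide a random share of observed entries, impute, and return the MAE on them.

    Assumes every column still has observed values after masking; otherwise
    IterativeImputer would drop the empty column.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    known = np.argwhere(~np.isnan(X))                 # (row, col) of observed entries
    n_hide = max(1, int(holdout_frac * len(known)))
    hide = known[rng.choice(len(known), size=n_hide, replace=False)]
    rows, cols = hide[:, 0], hide[:, 1]

    X_masked = X.copy()
    X_masked[rows, cols] = np.nan                     # pretend these were never measured

    imputer = IterativeImputer(max_iter=10, random_state=seed)
    X_imputed = imputer.fit_transform(X_masked)
    return float(np.mean(np.abs(X[rows, cols] - X_imputed[rows, cols])))

# Synthetic, correlated "properties" with about 30% of entries missing at random.
rng = np.random.default_rng(42)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    -base + 0.1 * rng.normal(size=(200, 1)),
])
X[rng.random(X.shape) < 0.3] = np.nan

mae = masked_imputation_mae(X)
print(f"held-out imputation MAE: {mae:.3f}")
```

Repeating the procedure with several seeds gives a cross-validation-style estimate of imputation quality. With real data at >90% sparsity, the held-out set for each property is small, so the resulting error estimates should be read with caution.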
Multi-Output Neural Network Architecture
Developed a shared architecture for simultaneous property prediction:
Shared Feature Extraction: Common layers to learn molecular representations
Property-Specific Heads: Dedicated output layers for each target property
Regularization Techniques: Dropout and batch normalization to prevent overfitting
Custom Loss Function: Weighted loss to handle property importance differences
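A shared trunk with one head per property can be sketched in Keras as follows. This is an illustrative layout under stated assumptions, not the project's tuned architecture: the layer sizes, dropout rate, learning rate, uniform default loss weights, and the function name `build_model` are all hypothetical. The weighted loss is expressed through Keras's `loss_weights` argument, which scales each head's loss before they are summed.

```python
import numpy as np
from tensorflow import keras

TARGETS = ["Tg", "FFV", "Tc", "Density", "Rg"]

def build_model(n_features, loss_weights=None, hidden_units=(256, 128), dropout=0.3):
    """Shared dense trunk followed by one single-unit regression head per property."""
    inputs = keras.Input(shape=(n_features,), name="descriptors")
    x = inputs
    for units in hidden_units:
        # Shared feature extraction with batch normalization and dropout.
        x = keras.layers.Dense(units, activation="relu")(x)
        x = keras.layers.BatchNormalization()(x)
        x = keras.layers.Dropout(dropout)(x)

    # Property-specific heads, keyed by property name.
    outputs = {t: keras.layers.Dense(1, name=t)(x) for t in TARGETS}
    model = keras.Model(inputs=inputs, outputs=outputs)

    # Total loss is the weighted sum of the per-property MSE losses.
    weights = loss_weights or {t: 1.0 for t in TARGETS}
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),
        loss={t: "mse" for t in TARGETS},
        loss_weights=weights,
        metrics={t: ["mae"] for t in TARGETS},
    )
    return model

model = build_model(n_features=32)
preds = model.predict(np.zeros((4, 32), dtype="float32"), verbose=0)
print({name: p.shape for name, p in preds.items()})
```

Raising the weight for one property makes errors on that head count more toward the total loss, which is one way to express differences in property importance.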
Figure 2: Loss and Mean Absolute Error (MAE) convergence over 25 epochs, showing stable training for both training and validation sets.
Hyperparameter Optimization
Systematic exploration of model architecture space:
Keras Tuner Integration: Bayesian optimization for efficient search
Figure 3: Multi-output regression results comparing actual vs. predicted values for Density, Tg, FFV, Tc, and Rg. High R² scores across all properties indicate robust model performance.
Validation and Reliability
Comprehensive validation confirmed model reliability:
Cross-Validation: Consistent performance across different data splits
Physical Constraints: Predictions respect known physical limits
Uncertainty Quantification: Model provides confidence estimates for predictions
Environmental Impact: Assess sustainability and recyclability
Jupyter Notebook
The complete analysis, including code, visualizations, and detailed findings, is available in the embedded Jupyter notebook below. The notebook demonstrates the full machine learning pipeline from data preprocessing through model deployment.