Security ML | MSc Thesis | National College of Ireland

Detecting malicious web pages using ensemble models.

A machine learning research project on identifying fraudulent web pages from URL properties, JavaScript-derived features, and HTML keyword-density signals.

95% precision
95% recall
94% F1-score

Problem

Static blacklists struggle to cover recently infected web pages, because a page must be reported before it can be listed. The research explored whether combining multiple feature families could improve malicious page classification.

Approach

URL, JavaScript, and HTML-content features were analyzed independently, then modeled using SVM and Random Forest. A weighted-average ensemble combined the strongest signals.
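The weighted-average step can be illustrated with a minimal sketch. It assumes each feature-family model emits a probability that a page is malicious. The function names, the example scores, the weights, and the 0.5 threshold are all hypothetical and are not taken from the thesis.

```python
from typing import Dict


def weighted_average_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-model malicious-probability scores into a single score.

    `scores` maps a model name to its predicted probability in [0, 1];
    `weights` maps the same names to non-negative weights.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise by the total weight so the result stays in [0, 1]
    # even when the weights do not sum to exactly 1.
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def is_malicious(scores: Dict[str, float], weights: Dict[str, float],
                 threshold: float = 0.5) -> bool:
    """Flag a page when the combined score reaches the (assumed) threshold."""
    return weighted_average_score(scores, weights) >= threshold


# Hypothetical per-family scores for one page, and hypothetical weights.
example_scores = {"url": 0.9, "javascript": 0.6, "html": 0.3}
example_weights = {"url": 0.5, "javascript": 0.3, "html": 0.2}

combined = weighted_average_score(example_scores, example_weights)
```

In a setup like this, the weights would typically be tuned on validation data so that the more reliable feature family contributes more to the final score.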

System

The final architecture separated three stages: feature extraction, an independent predictive model for each feature family, and ensemble scoring. This separation meant each signal family could be evaluated and improved on its own.
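As a rough sketch of the feature-extraction stage, the functions below compute a few URL properties and an HTML keyword density. The specific features, the keyword set, and the function names are illustrative assumptions rather than the thesis's actual feature list, and JavaScript features are omitted for brevity.

```python
import re
from typing import Dict, Set
from urllib.parse import urlparse


def url_features(url: str) -> Dict[str, object]:
    """Extract a few simple lexical properties from a URL (illustrative only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        # A bare IPv4 address in place of a domain name is a common warning sign.
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        "uses_https": parsed.scheme == "https",
    }


def keyword_density(text: str, keywords: Set[str]) -> float:
    """Return the fraction of words in `text` that appear in `keywords`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in keywords)
    return hits / len(words)


# Hypothetical inputs for demonstration.
sample_url = "http://192.168.0.1/login.php"
sample_text = "Verify your account password now"
suspicious_keywords = {"verify", "password"}

sample_url_features = url_features(sample_url)
sample_density = keyword_density(sample_text, suspicious_keywords)
```

Keeping each extractor as a plain function that returns its own feature dictionary or value is one way to let each feature family feed a separate model before ensemble scoring.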

Outcome

The best model identified fraudulent URLs with 95% precision, 95% recall, and 94% F1-score, showing the value of combining multiple web-page signals.

[Figure: Architecture diagram for malicious web page detection using ensemble models. Model architecture for URL, JavaScript, HTML, and weighted-average ensemble scoring.]

Why it still matters

The project connects directly to modern AI platform thinking: feature quality, model boundaries, evaluation metrics, explainability, and the importance of combining signals instead of trusting one brittle classifier.
