Context-Aware PDF Clustering Using Social Spider Algorithm and Agglomerative Hierarchical Techniques
Abstract
Purpose – This study aims to develop an automated PDF document clustering system using a hybrid approach that combines Social Spider Optimization (SSO) with Agglomerative Clustering. It seeks to improve the efficiency of unsupervised document classification while addressing the limitations of traditional techniques when processing large-scale datasets.
Method – The researchers used SSO, a metaheuristic optimization algorithm, to refine the initialization and parameter selection of Agglomerative Clustering. Fifteen unclassified PDF documents were preprocessed through text extraction and TF-IDF vectorization. The hybrid model’s performance was evaluated using the Silhouette Coefficient, Dunn Index, and Cluster Purity, supported by visual analysis of cluster distribution diagrams and Principal Component Analysis (PCA) projections.
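The pipeline described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the function name search_configuration, and the candidate values are hypothetical, and a plain exhaustive search over the number of clusters and the linkage rule stands in for the SSO step, whose details the abstract does not give. The sketch assumes scikit-learn is installed.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical stand-in corpus; the study used text extracted from 15 PDFs.
docs = [
    "neural network training gradient descent",
    "deep neural network gradient optimization",
    "gradient descent training of deep network",
    "contract law court ruling appeal",
    "court appeal ruling on contract dispute",
    "law of contract and court appeal",
]

# TF-IDF rows are L2-normalized by default, so Euclidean distance
# between rows is a monotonic function of cosine distance.
X = TfidfVectorizer().fit_transform(docs).toarray()


def search_configuration(X, k_values, linkages):
    """Return (score, k, linkage, labels) with the highest silhouette.

    Assumption: this exhaustive search replaces the SSO optimizer,
    which would explore the same parameter space heuristically.
    """
    best = None
    for k in k_values:
        for linkage in linkages:
            model = AgglomerativeClustering(n_clusters=k, linkage=linkage)
            labels = model.fit_predict(X)
            score = silhouette_score(X, labels)
            if best is None or score > best[0]:
                best = (score, k, linkage, labels)
    return best


best_score, best_k, best_linkage, best_labels = search_configuration(
    X, range(2, 5), ["ward", "average", "complete"]
)
print(best_k, best_linkage, round(best_score, 3), list(best_labels))
```

For a corpus this small an exhaustive search is cheap; a metaheuristic such as SSO becomes relevant when the parameter space is too large to enumerate.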
Results – The hybrid SSO–Agglomerative Clustering model partitioned the fifteen unclassified PDF documents into thematically coherent clusters. Compared with standalone Agglomerative Clustering on the same cluster validity metrics, the Silhouette Coefficient improved from 0.63 to 0.82, the Dunn Index from 0.49 to 0.68, and Cluster Purity from 84.6% to 94.3%, indicating more accurate grouping of documents. Although computation time increased from 9.4 to 11.9 seconds, the resulting clusters were consistent and visually distinct, each representing a document category.
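For readers unfamiliar with the three reported metrics, the following sketch shows one common way to compute each of them. It is illustrative only: the abstract does not state which Dunn Index variant or which reference labels the authors used, so the definitions below (minimum between-cluster point distance over maximum within-cluster diameter, and purity against hypothetical ground-truth labels) are assumptions, as are the toy points.

```python
from collections import Counter
from itertools import combinations
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other
        m = max(a, b)
        scores.append((b - a) / m if m else 0.0)
    return sum(scores) / len(scores)


def dunn_index(points, labels):
    """Min between-cluster distance divided by max within-cluster diameter.

    Undefined when there is a single cluster or only singleton clusters.
    """
    pairs = list(combinations(range(len(points)), 2))
    inter = [dist(points[i], points[j]) for i, j in pairs if labels[i] != labels[j]]
    intra = [dist(points[i], points[j]) for i, j in pairs if labels[i] == labels[j]]
    return min(inter) / max(intra)


def purity(labels, truth):
    """Fraction of points belonging to the majority class of their cluster."""
    members = {}
    for c, t in zip(labels, truth):
        members.setdefault(c, Counter())[t] += 1
    return sum(max(counts.values()) for counts in members.values()) / len(labels)


# Hypothetical toy data: two well-separated pairs of 2-D points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
labels = [0, 0, 1, 1]
truth = ["a", "a", "b", "a"]  # assumed reference categories

print(round(silhouette(points, labels), 3))  # 0.802
print(dunn_index(points, labels))            # 5.0
print(purity(labels, truth))                 # 0.75
```

Higher values are better for all three metrics. Note that purity, unlike the other two, requires reference category labels for the documents.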
Conclusion – Applying SSO with Agglomerative Clustering demonstrates strong potential for automating PDF categorization by combining metaheuristic optimization with hierarchical clustering to handle unstructured textual information.
Recommendations – Future research should investigate the model’s scalability and validate it on larger datasets.
Research Implications – This study contributes a novel document clustering framework to data mining and machine learning. The technique offers practical benefits for information retrieval and decision support, particularly for organizations managing large document repositories.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.





