Context-Aware PDF Clustering Using Social Spider Algorithm and Agglomerative Hierarchical Techniques

Wellanie M. Molino; Francis A. Alfaro; Jonathan M. Caballero; Francis L. Dela Cruz; Jan Eilbert L. Lee; Jasmin D. Niguidula; Fernando L. Renegado; Ariel C. Tomagan; Darwin C. Vargas

Wellanie M. Molino Computer Studies Department, Technological University of the Philippines
Francis A. Alfaro Computer Studies Department, Technological University of the Philippines
Jonathan M. Caballero Computer Studies Department, Technological University of the Philippines
Francis L. Dela Cruz Computer Studies Department, Technological University of the Philippines
Jan Eilbert L. Lee Computer Studies Department, Technological University of the Philippines
Jasmin D. Niguidula Computer Studies Department, Technological University of the Philippines
Fernando L. Renegado Computer Studies Department, Technological University of the Philippines
Ariel C. Tomagan Computer Studies Department, Technological University of the Philippines
Darwin C. Vargas Computer Studies Department, Technological University of the Philippines

Abstract

Purpose – This study aims to develop an automated PDF document clustering system using a hybrid approach that combines Social Spider Optimization (SSO) with Agglomerative Clustering. It seeks to improve the efficiency of unsupervised document classification while addressing limitations commonly associated with traditional techniques when processing large-scale datasets.

Method – The researchers adopted SSO, a metaheuristic optimization technique, to refine the initialization and parameter optimization of Agglomerative Clustering. Fifteen unclassified PDF documents were preprocessed using text extraction and TF-IDF vectorization. The hybrid model’s performance was evaluated using the Silhouette Coefficient, Dunn Index, and Cluster Purity, supported by cluster distribution diagrams and Principal Component Analysis (PCA) visualizations.
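The baseline part of this pipeline (TF-IDF vectorization followed by agglomerative clustering) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the SSO optimization step is omitted, and the whitespace tokenizer, smoothed IDF formula, cosine distance, average linkage, and the four sample strings are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised TF-IDF dict per document (smoothed IDF, assumed)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vec = {t: (c / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = sum(cosine_distance(vectors[p], vectors[q])
                           for p in clusters[i] for q in clusters[j])
                dist /= len(clusters[i]) * len(clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

# Hypothetical stand-ins for text extracted from PDFs.
docs = [
    "spider optimization swarm search",
    "swarm optimization spider colony",
    "tax invoice payment receipt",
    "payment receipt invoice total",
]
labels = agglomerative(tfidf_vectors(docs), n_clusters=2)
```

In a hybrid setup such as the one described, a metaheuristic would presumably search over choices like the number of clusters or the linkage rule; here they are fixed by hand.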

Results – Using the hybrid SSO–Agglomerative Clustering model, fifteen unclassified PDF documents were efficiently partitioned into thematically coherent clusters. The model’s performance was compared with that of the standalone Agglomerative method using cluster validity metrics. The Silhouette Coefficient improved notably, from 0.63 to 0.82. The Dunn Index likewise showed a substantial gain, from 0.49 to 0.68. The increase in Cluster Purity from 84.6% to 94.3% indicates more accurate clustering of documents. Although computation time increased from 9.4 to 11.9 seconds, the resulting clusters were consistent and visually distinct, each representing a document category.
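To make two of the reported metrics concrete, the sketch below computes a Dunn Index and Cluster Purity on toy data. The abstract does not state which Dunn variant was used, so the classic form (smallest between-cluster distance over largest cluster diameter) is assumed; the points, labels, and reference classes are invented for illustration. Note that purity requires reference class labels for the documents.

```python
import math
from collections import Counter

def dunn_index(points, labels):
    """Smallest between-cluster distance divided by the largest cluster diameter."""
    min_between = math.inf
    max_within = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])  # Euclidean distance (Python 3.8+)
            if labels[i] == labels[j]:
                max_within = max(max_within, d)
            else:
                min_between = min(min_between, d)
    # Assumes at least one cluster has two distinct points, so max_within > 0.
    return min_between / max_within

def cluster_purity(labels, true_classes):
    """Fraction of items that belong to the majority class of their cluster."""
    majority_total = 0
    for cluster in set(labels):
        members = [c for l, c in zip(labels, true_classes) if l == cluster]
        majority_total += Counter(members).most_common(1)[0][1]
    return majority_total / len(labels)

# Toy data: two well-separated clusters, one item assigned against its class.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
true_classes = ["a", "a", "b", "a"]

print(dunn_index(points, labels))            # 5.0
print(cluster_purity(labels, true_classes))  # 0.75
```

Higher values are better for both metrics: a larger Dunn Index means clusters are compact relative to their separation, and a purity of 1.0 means every cluster contains a single class.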

Conclusion – Applying SSO with Agglomerative Clustering demonstrates strong potential for automating PDF categorization by combining metaheuristic optimization with hierarchical clustering to handle unstructured textual information.

Recommendations – Future research should investigate the model’s scalability and validate it on larger datasets.

Research Implications – This study contributes to data mining and machine learning through a novel framework for document clustering. The technique offers practical benefits for information retrieval and decision-making, particularly for organizations managing large document repositories.

Author Biographies

Wellanie M. Molino, Computer Studies Department, Technological University of the Philippines

Dr. Wellanie M. Molino is an Associate Professor III at the Computer Studies Department of the Technological University of the Philippines – Manila. She holds a Doctor of Information Technology degree and has established a research portfolio in the areas of local governance, data analytics, e-governance, and emerging technologies in education. Dr. Molino has co-authored several scholarly publications focusing on virtual-reality classrooms, NFC-based mobile solutions, citizen e-participation, and predictive analytics. She is experienced in teaching mobile and web applications to learners of varying abilities.

Francis A. Alfaro, Computer Studies Department, Technological University of the Philippines

Dr. Francis A. Alfaro received his Doctor of Education degree from the Technological University of the Philippines. He is a faculty member of the University Research and Development Services of TUP, with strong presentation, research, and communication skills. His research focuses on education, mathematics, and information technology, especially prototype development and database management systems. He is also in charge of the electronic data processing (SPSS, Turnitin plagiarism checker) of all research manuscripts of the University.

Jonathan M. Caballero, Computer Studies Department, Technological University of the Philippines

Dr. Jonathan M. Caballero is a faculty member at the Technological University of the Philippines – Manila, specializing in computing and management. He holds a Bachelor of Science in Computer Science and a Master of Science in Computer Science, and has earned a Doctor of Technology degree, demonstrating a strong foundation in both theoretical and applied aspects of the discipline. With his academic background and leadership role, he actively contributes to curriculum development, research initiatives, and instructional delivery. His expertise spans computing technologies and academic management, positioning him as a key figure in advancing the university’s IT education programs.

Francis L. Dela Cruz, Computer Studies Department, Technological University of the Philippines

Prof. Francis L. Dela Cruz is an Assistant Professor III in the Computer Studies Department at the Technological University of the Philippines – Manila. He earned a Bachelor of Science in Computer Science and a Master in Information Technology, and is currently pursuing a PhD in Technology Management at TUP Manila. His permanent faculty status reflects his stable commitment to the university’s academic mission. As a mid-career academic, he contributes to both teaching and curriculum delivery in IT-related courses. His educational background positions him well to support both undergraduate and graduate-level students in the Computer Studies program.

Jan Eilbert L. Lee, Computer Studies Department, Technological University of the Philippines

Prof. Jan Eilbert L. Lee is a faculty member in the Computer Studies Department at the Technological University of the Philippines – Manila. He holds a Bachelor of Science in Mathematics and a Master of Science in Mathematics from the Eulogio “Amang” Rodriguez Institute of Science and Technology (EARIST), reflecting a strong foundation in analytical and computational disciplines. At TUP, he teaches advanced subjects such as Artificial Intelligence and Data Analytics, where he integrates mathematical theory with practical computing applications. His expertise supports the department’s goal of producing graduates skilled in intelligent systems and data-driven decision-making. Through his academic background and teaching focus, Mr. Lee contributes significantly to fostering innovation and critical thinking among IT students.

Jasmin D. Niguidula, Computer Studies Department, Technological University of the Philippines

Dr. Jasmin D. Niguidula is a faculty member in the Computer Studies Department at the Technological University of the Philippines – Manila, with expertise in computing and academic management. She holds a Bachelor of Science in Computer Science, a Master in Information Technology, and a Doctor of Technology degree, reflecting her strong academic and professional foundation. Dr. Niguidula actively contributes to the enhancement of curriculum and instruction in the field of information technology, while also engaging in research and institutional initiatives. Her work integrates both technical knowledge and leadership, making her a valuable resource in shaping future-ready IT professionals. With her extensive experience and dedication to educational advancement, she plays a pivotal role in fostering academic excellence and innovation at TUP Manila.

Fernando L. Renegado, Computer Studies Department, Technological University of the Philippines

Associate Professor III Fernando L. Renegado is a distinguished faculty member in the Computer Studies Department at the Technological University of the Philippines–Manila, specializing in computing, administration, and quality management. He earned his Bachelor of Science in Computer Science and a Master in Engineering Management, and has undertaken doctoral coursework in Technology, reflecting his dedication to higher learning and academic excellence. As a Professional Board Examination Test (PBET) lecturer, he contributes significantly to professional certification preparation in computing and engineering fields. Known for his strong command of data structures, he is regarded as an expert in teaching complex programming concepts with clarity and depth. Through his leadership in curriculum, quality assurance, and scholarly contributions, he plays a vital role in advancing both academic rigor and educational innovation within TUP-Manila's Computer Studies program.

Ariel C. Tomagan, Computer Studies Department, Technological University of the Philippines

Prof. Ariel C. Tomagan is a faculty member in the Computer Studies Department at the Technological University of the Philippines – Manila. He holds a Bachelor of Science in Computer Science and a Master of Science in Information Science, and is currently pursuing a Doctor of Technology degree, with 18 academic units completed, including 12 through online learning. As a permanent faculty member, he contributes to the department through both classroom instruction and academic advising. His teaching focuses on core subjects such as Networking 1 and 2 and Programming Languages, reflecting his expertise in foundational and applied computing. With a commitment to continuous learning and technical excellence, Mr. Tomagan plays a vital role in shaping the competencies of future IT professionals at TUP.

Darwin C. Vargas, Computer Studies Department, Technological University of the Philippines

Prof. Darwin C. Vargas is a faculty member in the Computer Studies Department at the Technological University of the Philippines – Manila. He holds a Master’s degree in Information Technology and has a strong interest in programming, hardware embedding, and software–hardware integration. Prof. Vargas co-authored an April 2024 paper on geographic trend analysis for predicting crop-yield success rates in the Philippines. He also contributed to research on developing virtual-reality classroom environments as alternative learning platforms during the pandemic. Within campus life, he plays an active role as a faculty adviser to TUP’s robotics group, Graybots, working to mentor and inspire student innovators.
