Medical Text Classification using Large Language Models
Extracting Diagnostic Insights from Pathology Reports
In partnership with
National AI Campus
Project Liaisons:
Yimeng He & Xiuzhen Huang
Project Advisor:
Dr. Yuqing Zhu
Description:
This project focuses on leveraging natural language processing (NLP) techniques to extract critical information from pathology reports.
Participants will gain hands-on experience with a wide range of text classification methods, from traditional TF-IDF analysis to cutting-edge large language models.
The Cancer Genome Atlas (TCGA) pathology report corpus used in this project offers a unique opportunity for the development of advanced NLP technologies that can ultimately enhance patient diagnosis, treatment selection, and many other aspects of cancer care.
National AI Campus Link Website
Objectives:
Fall 2025: Built a Baseline of Understanding
- Learned how modern large language models (LLMs) are built from scratch
- Worked on real academic data, training the LLM to sort and classify specific research topics
- Implemented and tested multiple experiments in Jupyter notebooks, taking note of what improved accuracy and what did not
- Explored how machines "understand" technical language through natural language processing (NLP)
What We Left Fall With:
A working classification model and a clear picture of what data could do and could not do on its own.
Spring 2026: Improving the Model with RAG
- Applied what we learned in Fall to a harder problem: getting the AI to answer questions using real evidence, not just memorized knowledge
- Built a Retrieval-Augmented Generation (RAG) system
- Improved how the AI finds the right evidence with a hierarchical retrieval technique: documents are split into small searchable chunks, and the full relevant section is reassembled before being passed to the AI for a more complete answer
- Tested continuously against a biomedical question dataset to measure whether better retrieval led to better answers (it did)
What We Left Spring With:
A more trustworthy AI that can show why it gave an answer, not just what the answer was.
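The hierarchical retrieval idea described above (search over small chunks, then hand back the full parent section) can be sketched in a few lines of Python. This is a minimal illustration, not the team's implementation: the function names, the word-window chunking, the IDF-weighted keyword scoring, and the two toy sections are all assumptions made for the example. A real system would more likely use embedding-based search over the chunks.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_into_chunks(sections, chunk_size):
    """Split each parent section into small word windows.

    Returns (parent_index, chunk_text) pairs so every chunk remembers
    which section it came from.
    """
    chunks = []
    for parent_index, section in enumerate(sections):
        words = section.split()
        for start in range(0, len(words), chunk_size):
            chunk_text = " ".join(words[start:start + chunk_size])
            chunks.append((parent_index, chunk_text))
    return chunks


def retrieve_parents(query, sections, top_k=1, chunk_size=12):
    """Search over small chunks, then return the full parent sections."""
    chunks = split_into_chunks(sections, chunk_size)
    chunk_tokens = [set(tokenize(text)) for _, text in chunks]

    # Inverse document frequency: terms found in fewer chunks count for more.
    doc_freq = Counter(tok for toks in chunk_tokens for tok in toks)
    n = len(chunks)
    idf = {tok: math.log((n + 1) / (df + 1)) + 1.0 for tok, df in doc_freq.items()}

    # Score each chunk by the IDF weight of the query terms it contains.
    query_tokens = set(tokenize(query))
    scores = [sum(idf[tok] for tok in query_tokens & toks) for toks in chunk_tokens]

    # Walk the chunks best-first and collect their distinct parents.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    parents = []
    for i in ranked:
        if scores[i] <= 0 or len(parents) == top_k:
            break
        parent_index = chunks[i][0]
        if parent_index not in parents:
            parents.append(parent_index)
    return [sections[p] for p in parents]


# Toy sections written for this example (not from the project's dataset).
sections = [
    "Aspirin inhibits cyclooxygenase enzymes. This reduces prostaglandin "
    "synthesis and lowers inflammation, pain, and fever in most patients.",
    "Metformin lowers blood glucose. It decreases hepatic glucose production "
    "and improves insulin sensitivity in patients with type 2 diabetes.",
]
query = "How does metformin affect insulin sensitivity?"

result = retrieve_parents(query, sections)
print(result[0])
```

In this toy run the best-matching chunk is only the first twelve words of the metformin section, but the function returns the whole section, so the text that follows the matched chunk is not cut off before it reaches the language model.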
Project Layout:
| RAG Hierarchical / Parent-Document Retrieval | Rocio Hernandez, Yvan Kemsseu Yobeu, Georgina Mateo |
| RAG Neural Reranking | Kenia Sanchez-Macario, Haonan Ma, Steven Magana |
| RAG Hybrid Retrieval | Christopher Gonzales, Sean Santos, Alan Mai |
| RAG Hybrid Retrieval | Laura Rodriguez Zea |
| RAG Method Comparison | Joseph Howerton |
| Weekly Team/Advisor Meetings | Fridays | 9:45AM - 11:00AM |
| Biweekly Liaison Meetings | Fridays | 9:00AM - 9:45AM |
Student Team
- Christopher Gonzales
- Rocio Hernandez
- Joseph Howerton
- Yvan Kemsseu Yobeu
- Haonan Ma
- Steven Magana
- Alan Mai
- Georgina Mateo
- Laura Rodriguez Zea
- Kenia Sanchez-Macario
- Sean Santos
Resources
- National AI Campus - Presentation - Fall '25
- National AI Campus - Presentation - Spring '26
- National AI Campus - Software Design Document (SDD) - Fall '25
- National AI Campus - Software Design Document (SDD) - Spring '26
- National AI Campus - Software Requirements Specification (SRS) - Fall '25
- Project Poster
- Project Report