Medical Text Classification using Large Language Models

Extracting Diagnostic Insights from Pathology Reports

In partnership with
 National AI Campus 

Project Liaisons: 

Yimeng He & Xiuzhen Huang

Project Advisor:

Dr. Yuqing Zhu


Description: 

This project focuses on leveraging natural language processing (NLP) techniques to extract critical information from pathology reports.
Participants will gain hands-on experience with a wide range of text classification methods, from traditional TF-IDF analysis to cutting-edge large language models.
The Cancer Genome Atlas (TCGA) pathology report corpus used in this project offers a unique opportunity for the development of advanced NLP technologies that can ultimately enhance patient diagnosis, treatment selection, and many other aspects of cancer care.
National AI Campus website
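
As a rough illustration of the traditional end of that spectrum, the sketch below builds TF-IDF vectors by hand and assigns a report to the class whose average vector it most resembles. It is a minimal example under stated assumptions, not the project's actual pipeline: the four report snippets, the two labels, and every function name are made up for illustration, and the nearest-centroid classifier is simply one easy option to pair with TF-IDF.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector (term -> weight); unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(docs, labels, idf):
    """Average the TF-IDF vectors of each class into one centroid per label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for doc, label in zip(docs, labels):
        for t, w in tfidf_vector(doc, idf).items():
            sums[label][t] += w
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, idf, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = tfidf_vector(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Made-up snippets standing in for pathology reports (not TCGA data).
train_docs = [
    "Invasive ductal carcinoma of the breast, grade 2",
    "Breast tissue with ductal carcinoma in situ",
    "Lung adenocarcinoma with pleural invasion",
    "Squamous cell carcinoma of the lung, upper lobe",
]
train_labels = ["breast", "breast", "lung", "lung"]

idf = fit_idf(train_docs)
centroids = train_centroids(train_docs, train_labels, idf)
print(classify("Ductal carcinoma identified in breast biopsy", idf, centroids))
```

Terms that appear in few reports receive a higher weight, so site-specific words such as "ductal" or "lobe" influence the decision more than words shared across classes, such as "carcinoma".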

Objectives:

Fall 2025: Built a Baseline of Understanding 
- Learned how modern large language models (LLMs) are built from scratch.
- Worked with real academic data, training an LLM to sort and classify specific research topics.
- Implemented and tested multiple experiments in Jupyter notebooks, noting what improved accuracy and what did not.
- Explored how machines "understand" technical language through natural language processing (NLP).

What We Left Fall With:
A working classification model and a clear picture of what data could and could not do on its own.

Spring 2026: Improving the Model with RAG
- Applied what we learned to a harder problem: getting the AI to answer questions using real evidence, not just memorized knowledge.
- Built a Retrieval-Augmented Generation (RAG) system.
- Improved how the AI finds the right evidence with a hierarchical retrieval technique: documents are split into small searchable chunks, and the full relevant section is reassembled before being passed to the AI for a more complete answer.
- Tested continuously against a biomedical question dataset to measure whether better retrieval led to better answers (it did).
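
The hierarchical retrieval idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the team's implementation: the two sections are toy text, the names `Chunk`, `build_index`, `score`, and `retrieve_parents` are hypothetical, and simple word overlap stands in for the embedding similarity a real system would more likely use.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: int  # index of the section this chunk was cut from
    text: str


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(sections, chunk_size=8):
    """Split each parent section into child chunks of `chunk_size` words."""
    chunks = []
    for parent_id, section in enumerate(sections):
        words = section.split()
        for start in range(0, len(words), chunk_size):
            piece = " ".join(words[start:start + chunk_size])
            chunks.append(Chunk(parent_id, piece))
    return chunks


def score(query, chunk_text):
    """Count query tokens found in the chunk (stand-in for embedding similarity)."""
    query_counts = Counter(tokenize(query))
    chunk_tokens = set(tokenize(chunk_text))
    return sum(n for tok, n in query_counts.items() if tok in chunk_tokens)


def retrieve_parents(query, sections, chunks, top_k=2):
    """Search the small chunks, then return the full parent sections they came from."""
    ranked = sorted(chunks, key=lambda ch: score(query, ch.text), reverse=True)
    parent_ids = []
    for ch in ranked[:top_k]:
        # Skip chunks with no overlap and avoid returning the same parent twice.
        if score(query, ch.text) > 0 and ch.parent_id not in parent_ids:
            parent_ids.append(ch.parent_id)
    return [sections[pid] for pid in parent_ids]


# Toy sections for illustration only (not the project's biomedical dataset).
sections = [
    "Aspirin inhibits cyclooxygenase enzymes. This reduces prostaglandin "
    "synthesis and lowers inflammation, pain, and fever.",
    "Metformin lowers blood glucose mainly by reducing hepatic glucose "
    "production. It is a first-line therapy for type 2 diabetes.",
]
chunks = build_index(sections)
query = "How does metformin lower blood glucose?"
context = retrieve_parents(query, sections, chunks)
print(context)
```

Searching over small chunks keeps matching precise, while handing back the whole parent section gives the language model the surrounding sentences it needs to produce a complete, evidence-backed answer.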

What We Left Spring With:
A more trustworthy AI that can show why it gave an answer, not just what the answer was. 

Project Layout: 
RAG Hierarchical / Parent-Document Retrieval: Rocio Hernandez, Yvan Kemsseu Yobeu, Georgina Mateo
RAG Neural Reranking: Kenia Sanchez-Macario, Haonan Ma, Steven Magana
RAG Hybrid Retrieval: Christopher Gonzales, Sean Santos, Alan Mai
RAG Hybrid Retrieval: Laura Rodriguez Zea
RAG Method Comparison: Joseph Howerton

Meetings: 
Weekly Team/Advisor Meetings: Fridays, 9:45 AM - 11:00 AM
Biweekly Liaison Meetings: Fridays, 9:00 AM - 9:45 AM

Student Team
  • Christopher Gonzales
  • Rocio Hernandez
  • Joseph Howerton
  • Yvan Kemsseu Yobeu
  • Haonan Ma
  • Steven Magana
  • Alan Mai
  • Georgina Mateo
  • Laura Rodriguez Zea
  • Kenia Sanchez-Macario
  • Sean Santos
Project Sponsor
Project Liaisons
Faculty Advisors