Capstone Project: Discovery OCR & Party Information Extraction Tool

1. Project Overview

The Santa Barbara County Public Defender’s Office (SBC PDO) is sponsoring a capstone project to build a cloud-based pipeline that converts discovery stored in Box.com into structured, verified party data for eDefender. On retrieval, the system will capture and retain the Bates stamp number(s) that indicate where each party appears in the discovery. The extraction explicitly covers all party types (victims, witnesses, involved parties, and law enforcement officers) and includes available contact information (address, phone, email). Staff will verify extracted data before it is submitted to eDefender.

Project Lead: AJ Voisan. Project Sponsor: Deepak Budwani. SMEs: Shawna Mateer (LOP/Discovery Intake) and Angie Stokke (eDefender/CM).

1A. Discovery in California Criminal Law (Context)

In California, discovery is the pretrial exchange of evidence and information between the prosecution and defense, governed by Penal Code §§1054–1054.7. Typical discovery includes police and supplemental reports, witness statements, criminal histories, photographs, audio/video (e.g., body-worn camera (BWC) footage), forensic/lab reports, and other electronically stored information (ESI). Thorough, timely discovery is essential for due process, case assessment, negotiations, and trial preparation.

Operationally, SBC PDO centralizes digital discovery in Box.com and links working artifacts to eDefender. Entering all parties (including law enforcement officers) in eDefender supports conflict checks, targeted communications, subpoenas, and cross-case analysis (e.g., officer credibility patterns). Accurate party data makes conflict screening reliable and prevents delays or ethical issues later in the case lifecycle.

2. Current Process (Initial & Supplemental Discovery)

• Channels: e-Disclosure portal, Box.com handoff from DA, email attachments, and physical media.
• Intake: Discovery receipt emails are categorized; files are saved to case folders in Box.com using naming conventions; police reports trigger party review and entry in eDefender.
• Manual effort: Staff read reports, identify parties (incl. officers), note references, and hand-enter contact details; conflict check is initiated after entries are saved.
• Pain points: Non-searchable PDFs, retyping data, risk of missed names, and multi-hour reviews for large packets.

3. Digital Evidence & Discovery Context

The volume and heterogeneity of ESI continue to increase (BWC, CCTV, phone dumps, PDFs). A structured, automated approach bridges storage (Box.com) and case management (eDefender), preserving chain of custody while accelerating case readiness.

4. Project Rationale

Automating extraction and review reduces manual data entry, improves accuracy, supports reliable conflict checks, and delivers faster access to contactable parties for attorneys and investigators. Capturing Bates stamps enables precise source tracing.

5. Project Components

5.1 Component: Document Processing & OCR

Scope: Retrieve discovery from Box.com (case folders) via API or scheduled jobs. Normalize file types (PDF, images) and run OCR to generate machine-readable text. Capture document metadata (filename, PD#, Disc#, received date, source agency) and page-to-Bates mapping when available.

Inputs: Discovery PDFs/images, Box folder paths/IDs, Disc numbers, Bates-stamped pages.

Outputs: Searchable text per document/page, metadata record, Bates index.
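The page-to-Bates mapping described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes OCR text is already available as one string per page, and the Bates format (a short uppercase prefix followed by digits, such as the made-up SBPD000123) is a hypothetical convention. The function name build_bates_index is likewise illustrative.

```python
import re

# Hypothetical Bates format: 2-6 uppercase letters, an optional separator,
# then 4-8 digits. Real stamps vary by agency and would need tuning.
BATES_RE = re.compile(r"\b([A-Z]{2,6})[-_ ]?(\d{4,8})\b")


def build_bates_index(pages):
    """Map 1-based page numbers to the last Bates-like token on each page.

    The last match is used on the assumption that stamps sit in the footer.
    Pages without a recognizable stamp map to None.
    """
    index = {}
    for page_no, text in enumerate(pages, start=1):
        matches = BATES_RE.findall(text)
        if matches:
            prefix, digits = matches[-1]
            index[page_no] = f"{prefix}{digits}"
        else:
            index[page_no] = None
    return index


# Made-up OCR output for three pages.
pages = [
    "Incident report narrative...\nSBPD000123",
    "Witness statement continues\nSBPD-000124",
    "Illegible scan",
]
bates_index = build_bates_index(pages)
print(bates_index)
```

Pages that map to None are natural candidates for manual review, since a missing stamp may indicate an OCR failure rather than an unstamped page.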

5.2 Component: Information Extraction using NLP

Scope: Apply NER and pattern-based extraction to identify parties (victims, witnesses, involved parties, law enforcement officers), roles, and contact data (address, phone, email). Associate each extracted entity with page/Bates references. Approach: fine-tune a lightweight custom model on local samples and/or evaluate legal extraction APIs.

Inputs: OCR text, Bates index, model configuration, domain dictionaries (agencies, rank titles).

Outputs: Candidate entity list with fields (name, role, contact), confidence scores, Bates references.
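The pattern-based half of this component might look like the sketch below. It covers only phone numbers and e-mail addresses; names and roles would come from the NER model, which is not shown. The regular expressions, the function name extract_contacts, and the sample text are all assumptions for illustration, and the patterns target common US formats only.

```python
import re

# Simple illustrative patterns; neither is a complete validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}\b")


def extract_contacts(pages, bates_index):
    """Return contact candidates tagged with page, Bates ref, and char span."""
    candidates = []
    for page_no, text in enumerate(pages, start=1):
        for kind, pattern in (("email", EMAIL_RE), ("phone", PHONE_RE)):
            for m in pattern.finditer(text):
                candidates.append({
                    "kind": kind,
                    "value": m.group(0),
                    "page": page_no,
                    # Bates reference looked up from a page-to-Bates dict.
                    "bates": bates_index.get(page_no),
                    # Character offsets let a review UI highlight the snippet.
                    "span": (m.start(), m.end()),
                })
    return candidates


# Made-up page text and Bates mapping.
pages = ["W-1 Jane Doe can be reached at (805) 555-0142 or jane.doe@example.com."]
contacts = extract_contacts(pages, {1: "SBPD000123"})
for c in contacts:
    print(c["kind"], c["value"], c["bates"])
```

Keeping the character span with each candidate means the source snippet can be recovered later without re-running extraction.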

5.3 Component: Data Structuring

Scope: Transform extracted candidates into a standardized schema (JSON/CSV/table). Perform normalization (name splitting, address parsing), dedupe within-batch, and map fields to eDefender equivalents (party type, subtype, contact fields).

Inputs: Candidate entity list, schema map, normalization rules.

Outputs: Structured table with unique party rows, denormalized contact info, Bates array, source doc IDs.
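One possible shape for the normalization and within-batch dedupe steps is sketched below. The field names (first_name, last_name, role, phone, bates) and the merge key (name plus role, case-insensitive) are assumptions, not the eDefender schema. Real name parsing would also need to handle titles, suffixes, and middle names.

```python
import re


def split_name(raw):
    """Split 'Last, First' or 'First Last' into (first, last)."""
    raw = " ".join(raw.split())  # collapse repeated whitespace
    if "," in raw:
        last, first = [p.strip() for p in raw.split(",", 1)]
    else:
        parts = raw.split(" ")
        first, last = " ".join(parts[:-1]), parts[-1]
    return first, last


def normalize_phone(raw):
    """Reduce a phone string to 10 digits, or None if it does not fit."""
    digits = re.sub(r"\D", "", raw or "")
    return digits if len(digits) == 10 else None


def dedupe(candidates):
    """Merge candidates that share a case-insensitive name and role."""
    merged = {}
    for c in candidates:
        first, last = split_name(c["name"])
        key = (last.casefold(), first.casefold(), c["role"])
        row = merged.setdefault(key, {
            "first_name": first, "last_name": last, "role": c["role"],
            "phone": None, "bates": [],
        })
        row["phone"] = row["phone"] or normalize_phone(c.get("phone"))
        for ref in c.get("bates", []):
            if ref not in row["bates"]:
                row["bates"].append(ref)
    return list(merged.values())


# Made-up candidates: the first two refer to the same witness.
candidates = [
    {"name": "Doe, Jane", "role": "witness", "phone": "(805) 555-0142",
     "bates": ["SBPD000123"]},
    {"name": "jane  doe", "role": "witness", "bates": ["SBPD000124"]},
    {"name": "Smith, Robert", "role": "officer", "bates": ["SBPD000123"]},
]
rows = dedupe(candidates)
for row in rows:
    print(row)
```

Exact-key merging is deliberately conservative; near matches (for example, misspelled names) are left for fuzzy matching and staff review rather than merged automatically.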

5.4 Component: Human Review Interface (Staff Review UI / Dashboard)

Scope: Provide side-by-side source snippet and extracted fields; allow approve/edit/reject; add missing parties; merge duplicates; flag uncertain fields. Include filters by case, Disc#, confidence, and role. Show Bates/page references inline.

Inputs: Structured table, snippets (char offsets), user identity/roles.

Outputs: Approved records with audit of reviewer actions and timestamps.

5.5 Component: Data Validation

Scope: Automated checks for completeness (required fields), format (phone/email), duplicate detection (fuzzy match to existing eDefender persons, DOB when available), and role sanity (officer vs. civilian). Validation gates block export until passed.

Inputs: Approved records, validation rules, access to eDefender person index (read-only).

Outputs: Validated records, issue list for remediation.
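The validation gate could be prototyped along these lines. This sketch assumes records use the hypothetical fields first_name, last_name, role, email, and phone, and that the eDefender person index is available as a plain list of full names. Fuzzy matching uses difflib from the Python standard library with an arbitrary 0.9 threshold; a production system would likely use a dedicated matching approach and include DOB when available.

```python
import re
from difflib import SequenceMatcher

REQUIRED = ("first_name", "last_name", "role")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate(record, existing_names, threshold=0.9):
    """Return a list of issues; an empty list means the record may be exported."""
    issues = []
    for field in REQUIRED:
        if not record.get(field):
            issues.append(f"missing {field}")

    email = record.get("email")
    if email and not EMAIL_RE.match(email):
        issues.append("bad email format")

    phone = record.get("phone")
    if phone and not re.fullmatch(r"\d{10}", phone):
        issues.append("bad phone format")

    # Compare the full name against names already in the person index.
    full = f"{record.get('first_name', '')} {record.get('last_name', '')}"
    full = full.strip().casefold()
    for name in existing_names:
        ratio = SequenceMatcher(None, full, name.casefold()).ratio()
        if full and ratio >= threshold:
            issues.append(f"possible duplicate of {name}")
    return issues


# Made-up person index and records.
existing = ["Jane Doe", "John Roe"]
flagged = validate(
    {"first_name": "Jane", "last_name": "Doe", "role": "witness",
     "email": "jane.doe@example", "phone": "8055550142"},
    existing,
)
clean = validate(
    {"first_name": "Maria", "last_name": "Alvarez", "role": "victim"},
    existing,
)
print(flagged)
print(clean)
```

Returning a list of issues rather than a boolean gives reviewers the remediation list named in the outputs above.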

5.6 Component: Data Export & eDefender Integration

Scope: Export validated data in API payloads or batch files (CSV/XML) for import. Maintain mapping for party types/subtypes and contact fields. If API work exceeds Capstone timeline, package batch exports and a Phase 2 API plan.

Inputs: Validated records, field mappings, eDefender API specs and/or import templates.

Outputs: Submitted API transactions or import files; submission receipts/logs.
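For the batch-file fallback, a CSV export might be sketched as below. The column names (LastName, FirstName, PartyType, Phone, BatesRefs) are placeholders and do not come from an actual eDefender import template; the real mapping would follow whatever template or API specification eDefender provides.

```python
import csv
import io

# Pipeline field -> hypothetical import column. Not an actual eDefender template.
FIELD_MAP = {
    "last_name": "LastName",
    "first_name": "FirstName",
    "role": "PartyType",
    "phone": "Phone",
}
COLUMNS = list(FIELD_MAP.values()) + ["BatesRefs"]


def to_csv(records):
    """Serialize validated records to CSV text using the field mapping."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {col: rec.get(field) or "" for field, col in FIELD_MAP.items()}
        # Flatten the Bates array into one delimited cell.
        row["BatesRefs"] = ";".join(rec.get("bates", []))
        writer.writerow(row)
    return buf.getvalue()


# Made-up validated record.
records = [{
    "first_name": "Jane", "last_name": "Doe", "role": "witness",
    "phone": "8055550142", "bates": ["SBPD000123", "SBPD000124"],
}]
csv_text = to_csv(records)
print(csv_text)
```

Keeping the mapping in one dictionary means a later switch from batch files to API payloads can reuse the same field correspondence.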

5.7 Component: Logging & Audit Trail

Scope: Record document status, processing events, exceptions, reviewer actions, validation results, and export outcomes. Expose logs in a dashboard for monitoring and compliance review; retain hashes/IDs for traceability to Box source.

Inputs: Event hooks from all components, user actions, error handlers.

Outputs: Centralized audit log, metrics for throughput, accuracy, and turnaround.
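An audit event that retains a hash and a source ID might be structured as in the sketch below. The field names, the component and action labels, and the Box file ID are illustrative assumptions; the source does not specify a log schema. SHA-256 is used here as one reasonable choice of hash.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_event(component, action, box_file_id, content, actor="system"):
    """Build one audit record tying an action to a specific source file."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "component": component,
        "action": action,
        "actor": actor,
        # ID of the source file in Box (placeholder value in the example).
        "box_file_id": box_file_id,
        # Hash of the file bytes, so later changes to the source are detectable.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Made-up file ID and file bytes.
sample_bytes = b"%PDF-1.7 sample bytes"
event = audit_event("ocr", "text_extracted", "1234567890", sample_bytes)
log_line = json.dumps(event)  # one JSON object per line suits most log stores
print(log_line)
```

Emitting each event as a single JSON line keeps the log easy to ship to whichever centralized logging stack is chosen.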

6. Key Technologies

The following technologies are candidates for building and managing the components above. Further research into OCR engines and NLP/NER stacks is needed to determine which best suit the existing platforms.

• Box.com (storage/API/Box AI)
• OCR engine (e.g., Tesseract/Azure/AWS/Box AI)
• NLP/NER stack (spaCy/transformers or legal APIs)
• Lightweight database (staging) and JSON/CSV exports
• Web UI framework for Staff Review dashboard
• eDefender API or batch import templates
• Centralized logging (e.g., ELK/Cloud logs)

7. Expected Deliverables

1) OCR pipeline and metadata capture
2) NLP extraction with Bates references (incl. officers)
3) Standardized data schema and staging table
4) Staff Review UI with side-by-side snippets
5) Validation rules & dedupe routines
6) Export module; Phase 2 API plan if deferred
7) Logging & audit dashboard
8) Documentation and demo


Role                                  Name                 E-mail
Faculty Advisor                       Jungsoo Lim          jlim34@calstatela.edu
Project Lead                          Jennifer Lias        jlias2@calstatela.edu
Customer liaison/requirements lead    Nadia Hernandez      nherna170@calstatela.edu
Architecture/design lead              Joseph Lam           jlam87@calstatela.edu
UI Lead                               Lemeng Zhao          lzhao25@calstatela.edu
Backend Lead                          Addison Zhou         azhou19@calstatela.edu
QA/QC lead                            Jesus Villa          jvilla24@calstatela.edu
Documentation Lead                    Daniel Concepcion    dconcep@calstatela.edu
Demo Lead                             Thomas Ogden         togden3@calstatela.edu
Presentation Lead                     Tommy Works          tworks@calstatela.ed
Support Lead                          Jose Holguin         jholgu21@calstatela.edu
Co-Lead                               Peter Uy             puy@calstatela.edu




Team                             Members
Staff Review UI                  Thomas, Jen, Lemeng
Data Structure and Validation    Peter, Joseph
Document OCR and Extraction      Nadia, Daniel, Tommy
Logging and Audit                Jesus, Addison, Jose



Meeting                         Day       Time
Weekly advisor group meeting    Friday    8:00 AM - 9:00 AM
Bi-Weekly Liaison Meeting       Friday    9:00 AM - 10:00 AM
Weekly team meeting             Friday    10:00 AM - 11:00 AM

Student Team
  • Daniel Concepcion
  • Nadia Hernandez
  • Jose Holguin
  • Joseph Lam
  • Jennifer Lias
  • Thomas Ogden Jr
  • Peter Uy
  • Jesus Villa
  • Tommy Works
  • Lemeng Zhao
  • Addison Zhou