Capstone Project: Discovery OCR & Party Information Extraction Tool
1. Project Overview
The Santa Barbara
County Public Defender’s Office (SBC PDO) is sponsoring a capstone project to
build a cloud-based pipeline that converts discovery stored in Box.com into
structured, verified party data for eDefender. On retrieval, the system will
capture and retain the Bates stamp number(s) that indicate where each party
appears in the discovery. The extraction covers all party
types (victims, witnesses, involved parties, and law enforcement officers) and
includes available contact information (address, phone, email). Staff will
verify extracted data prior to submission to eDefender. Project Lead: AJ
Voisan. Project Sponsor: Deepak Budwani. SMEs: Shawna Mateer (LOP/Discovery
Intake) and Angie Stokke (eDefender/CM).
1A. Discovery in California Criminal Law (Context)
In California,
discovery is the pretrial exchange of evidence and information between the
prosecution and defense governed by Penal Code §§1054–1054.7. Typical discovery
includes police and supplemental reports, witness statements, criminal
histories, photographs, audio/video (e.g., BWC), forensic/lab reports, and
other ESI. Thorough, timely discovery is essential for due process, case
assessment, negotiations, and trial preparation.
Operationally, SBC
PDO centralizes digital discovery in Box.com and links working artifacts to
eDefender. Entering all parties (including law enforcement officers) in
eDefender supports conflict checks, targeted communications, subpoenas, and
cross-case analysis (e.g., officer credibility patterns). Accurate party data
makes conflict screening reliable and prevents delays or ethical issues later
in the case lifecycle.
2. Current Process (Initial & Supplemental Discovery)
• Channels:
e-Disclosure portal, Box.com handoff from DA, email attachments, and physical
media.
• Intake: Discovery receipt emails are categorized; files are saved to case
folders in Box.com using naming conventions; police reports trigger party
review and entry in eDefender.
• Manual effort: Staff read reports, identify parties (incl. officers), note
references, and hand-enter contact details; conflict check is initiated after
entries are saved.
• Pain points: Non-searchable PDFs, retyping data, risk of missed names, and
multi-hour reviews for large packets.
3. Digital Evidence & Discovery Context
Volume and
heterogeneity of ESI continue to increase (BWC, CCTV, phone dumps, PDFs). A
structured, automated approach bridges storage (Box.com) and case management
(eDefender), preserving chain of custody while accelerating case readiness.
4. Project Rationale
Automating extraction
and review reduces manual data entry, improves accuracy, supports reliable
conflict checks, and delivers faster access to contactable parties for
attorneys and investigators. Capturing Bates stamps enables precise source
tracing.
5. Project Components
5.1 Component: Document Processing & OCR
Scope: Retrieve
discovery from Box.com (case folders) via API or scheduled jobs. Normalize file
types (PDF, images) and run OCR to generate machine-readable text. Capture
document metadata (filename, PD#, Disc#, received date, source agency) and
page-to-Bates mapping when available.
Inputs: Discovery
PDFs/images, Box folder paths/IDs, Disc numbers, Bates-stamped pages.
Outputs: Searchable
text per document/page, metadata record, Bates index.
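Illustrative sketch (not part of the sponsor's specification): one possible way to build the page-to-Bates index from per-page OCR text. The Bates format used here (two to six uppercase letters, an optional separator, four to eight digits) and the name build_bates_index are assumptions; the actual stamp format would need to be confirmed with discovery intake staff, and a loose pattern like this one can also match other identifiers on the page.

```python
import re

# ASSUMPTION: Bates stamps look like "SBPD-000123" (letters, optional
# separator, digits). Confirm the real format before relying on this.
BATES_RE = re.compile(r"\b([A-Z]{2,6})[-_ ]?(\d{4,8})\b")


def build_bates_index(pages: list[str]) -> dict[int, list[str]]:
    """Map 1-based page numbers to the Bates stamps found in that page's OCR text."""
    index: dict[int, list[str]] = {}
    for page_no, text in enumerate(pages, start=1):
        stamps = [f"{prefix}{digits}" for prefix, digits in BATES_RE.findall(text)]
        # Keep first-seen order while dropping repeats on the same page.
        index[page_no] = list(dict.fromkeys(stamps))
    return index


# Made-up OCR output for three pages.
pages = [
    "Incident report narrative ... SBPD-000123",
    "Supplemental report, no stamp detected",
    "Witness statement continued SBPD 000125",
]
bates_index = build_bates_index(pages)
```

Pages with no detected stamp keep an empty list, so later components can tell "no stamp found" apart from "page not processed".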
5.2 Component: Information Extraction using NLP
Scope: Apply NER and
pattern-based extraction to identify parties (victims, witnesses, involved
parties, law enforcement officers), roles, and contact data (address, phone,
email). Associate each extracted entity with page/Bates references. Approach:
fine-tune a lightweight custom model on local samples and/or evaluate legal
extraction APIs.
Inputs: OCR text,
Bates index, model configuration, domain dictionaries (agencies, rank titles).
Outputs: Candidate
entity list with fields (name, role, contact), confidence scores, Bates
references.
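Illustrative sketch of the pattern-based half of this component only: regular expressions pull email addresses and US-style phone numbers out of page text and attach the page's Bates references and character offsets. Names and roles would come from an NER model, which is not shown. The Candidate class, extract_contacts, and the sample text are hypothetical, and no confidence scoring is attempted here.

```python
import re
from dataclasses import dataclass

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers such as "(805) 555-0100" or "805-555-0100".
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})\b")


@dataclass
class Candidate:
    field: str        # "email" or "phone"
    value: str        # normalized value
    page: int         # 1-based page number
    bates: list[str]  # Bates stamps recorded for that page
    start: int        # character offsets of the match in the page text
    end: int


def extract_contacts(pages: list[str], bates_index: dict[int, list[str]]) -> list[Candidate]:
    """Return email and phone candidates with page and Bates references."""
    found: list[Candidate] = []
    for page_no, text in enumerate(pages, start=1):
        bates = bates_index.get(page_no, [])
        for m in EMAIL_RE.finditer(text):
            found.append(Candidate("email", m.group(0).lower(), page_no, bates, m.start(), m.end()))
        for m in PHONE_RE.finditer(text):
            # Normalize every phone match to NNN-NNN-NNNN.
            found.append(Candidate("phone", "-".join(m.groups()), page_no, bates, m.start(), m.end()))
    return found


# Made-up page text and Bates index.
pages = ["W-1 Jane Roe can be reached at (805) 555-0100 or Jane.Roe@example.com."]
candidates = extract_contacts(pages, {1: ["SBPD000123"]})
```

Keeping the character offsets lets a review screen highlight the exact source snippet next to each extracted value.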
5.3 Component: Data Structuring
Scope: Transform
extracted candidates into a standardized schema (JSON/CSV/table). Perform
normalization (name splitting, address parsing), dedupe within-batch, and map
fields to eDefender equivalents (party type, subtype, contact fields).
Inputs: Candidate
entity list, schema map, normalization rules.
Outputs: Structured
table with unique party rows, denormalized contact info, Bates array, source
doc IDs.
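Illustrative sketch of name splitting and within-batch deduplication, assuming candidates arrive as simple dictionaries. PartyRow, split_name, and dedupe are hypothetical names, and the exact-match key (first name, last name, role) is a deliberately simple stand-in for whatever matching rule the team adopts.

```python
from dataclasses import dataclass, field


@dataclass
class PartyRow:
    first_name: str
    last_name: str
    role: str
    phones: list[str] = field(default_factory=list)
    emails: list[str] = field(default_factory=list)
    bates: list[str] = field(default_factory=list)


def split_name(raw: str) -> tuple[str, str]:
    """Handle 'Last, First' and 'First Last'; other shapes need human review."""
    raw = " ".join(raw.split())  # collapse stray whitespace from OCR
    if "," in raw:
        last, first = (part.strip() for part in raw.split(",", 1))
    else:
        first, _, last = raw.rpartition(" ")
    # NOTE: title() is naive (it turns "McDonald" into "Mcdonald").
    return first.title(), last.title()


def merge_unique(target: list[str], values: list[str]) -> None:
    """Append values not already present, preserving order."""
    for value in values:
        if value not in target:
            target.append(value)


def dedupe(candidates: list[dict]) -> list[PartyRow]:
    """Collapse candidates that share name and role into one row."""
    rows: dict[tuple[str, str, str], PartyRow] = {}
    for c in candidates:
        first, last = split_name(c["name"])
        key = (first.lower(), last.lower(), c["role"])
        row = rows.setdefault(key, PartyRow(first, last, c["role"]))
        merge_unique(row.phones, c.get("phones", []))
        merge_unique(row.emails, c.get("emails", []))
        merge_unique(row.bates, c.get("bates", []))
    return list(rows.values())


# Made-up candidates: the first two refer to the same person.
candidates = [
    {"name": "ROE, JANE", "role": "witness", "phones": ["805-555-0100"], "bates": ["SBPD000123"]},
    {"name": "Jane Roe", "role": "witness", "emails": ["jane.roe@example.com"], "bates": ["SBPD000125"]},
    {"name": "Sam Poe", "role": "officer", "bates": ["SBPD000123"]},
]
rows = dedupe(candidates)
```

Merging Bates references rather than overwriting them preserves every location where a party appears.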
5.4 Component: Human Review Interface (Staff Review UI / Dashboard)
Scope: Provide
side-by-side source snippet and extracted fields; allow approve/edit/reject;
add missing parties; merge duplicates; flag uncertain fields. Include filters
by case, Disc#, confidence, and role. Show Bates/page references inline.
Inputs: Structured
table, snippets (char offsets), user identity/roles.
Outputs: Approved
records with audit of reviewer actions and timestamps.
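Illustrative sketch of how approve/edit/reject actions might be recorded behind the review screen, with each action leaving a timestamped audit entry. ReviewRecord, apply_action, and the status values are assumptions made for illustration; the real UI and storage layer are still to be designed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

ALLOWED_ACTIONS = {"approve", "edit", "reject"}


@dataclass
class ReviewRecord:
    fields: dict[str, str]
    status: str = "pending"
    audit: list[dict] = field(default_factory=list)


def apply_action(record: ReviewRecord, action: str, reviewer: str,
                 changes: Optional[dict[str, str]] = None) -> ReviewRecord:
    """Apply one reviewer action and append an audit entry describing it."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    diff: dict[str, dict] = {}
    if action == "edit":
        # Edits change field values but leave the record pending approval.
        for name, new in (changes or {}).items():
            old = record.fields.get(name)
            if old != new:
                diff[name] = {"old": old, "new": new}
                record.fields[name] = new
    elif action == "approve":
        record.status = "approved"
    else:
        record.status = "rejected"
    record.audit.append({
        "reviewer": reviewer,
        "action": action,
        "at": datetime.now(timezone.utc).isoformat(),
        "changes": diff,
    })
    return record


# Made-up record: the reviewer fixes an OCR misread, then approves.
record = ReviewRecord({"name": "Jane Row", "role": "witness"})
apply_action(record, "edit", "reviewer1", {"name": "Jane Roe"})
apply_action(record, "approve", "reviewer1")
```

Storing old and new values for each edit makes it possible to reconstruct what the extractor originally proposed, which is useful when measuring extraction accuracy.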
5.5 Component: Data Validation
Scope: Automated
checks for completeness (required fields), format (phone/email), duplicate
detection (fuzzy match to existing eDefender persons, DOB when available), and
role sanity (officer vs. civilian). Validation gates block export until passed.
Inputs: Approved
records, validation rules, access to eDefender person index (read-only).
Outputs: Validated
records, issue list for remediation.
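Illustrative sketch of a validation gate covering required fields, phone and email format, and fuzzy duplicate detection against existing person names. It uses difflib.SequenceMatcher from the Python standard library as a stand-in for whatever fuzzy matcher the team selects. The field names, the 0.85 similarity threshold, and the plain list of existing names are assumptions, not eDefender specifics.

```python
import re
from difflib import SequenceMatcher

REQUIRED = ("first_name", "last_name", "role")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # assumes normalized NNN-NNN-NNNN


def validate(record: dict, existing_names: list[str], threshold: float = 0.85) -> list[str]:
    """Return a list of issues; an empty list means the record may be exported."""
    issues: list[str] = []
    for name in REQUIRED:
        if not record.get(name):
            issues.append(f"missing required field: {name}")
    for phone in record.get("phones", []):
        if not PHONE_RE.match(phone):
            issues.append(f"bad phone format: {phone}")
    for email in record.get("emails", []):
        if not EMAIL_RE.match(email):
            issues.append(f"bad email format: {email}")
    full_name = f"{record.get('first_name', '')} {record.get('last_name', '')}".strip().lower()
    for known in existing_names:
        # ratio() is 1.0 for identical strings and approaches 0.0 as they diverge.
        if SequenceMatcher(None, full_name, known.lower()).ratio() >= threshold:
            issues.append(f"possible duplicate of existing person: {known}")
    return issues


# Made-up record and person index.
record = {
    "first_name": "Jane",
    "last_name": "Roe",
    "role": "witness",
    "phones": ["805-555-0100"],
    "emails": ["jane.roe@example"],
}
issues = validate(record, existing_names=["Jane Row", "Sam Poe"])
can_export = not issues
```

A possible duplicate is reported as an issue rather than merged automatically, so a staff member decides whether the two names refer to the same person.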
5.6 Component: Data Export & eDefender Integration
Scope: Export
validated data in API payloads or batch files (CSV/XML) for import. Maintain
mapping for party types/subtypes and contact fields. If API work exceeds Capstone timeline, package batch exports and a Phase 2 API plan.
Inputs: Validated
records, field mappings, eDefender API specs and/or import templates.
Outputs: Submitted
API transactions or import files; submission receipts/logs.
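Illustrative sketch of the batch-file path: validated records are written to CSV through a field mapping. The column names in FIELD_MAP are hypothetical placeholders; the real names would come from the eDefender import template, which is not reproduced here. Joining multiple Bates references with a semicolon is likewise an assumption.

```python
import csv
import io

# HYPOTHETICAL mapping from staging fields to import columns.
FIELD_MAP = {
    "first_name": "FirstName",
    "last_name": "LastName",
    "role": "PartyType",
    "phone": "Phone",
    "email": "Email",
    "bates": "BatesRefs",
}


def export_csv(records: list[dict]) -> str:
    """Render validated records as CSV text using FIELD_MAP for the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_MAP.values()), lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = {}
        for src, dest in FIELD_MAP.items():
            value = rec.get(src, "")
            if isinstance(value, list):
                value = ";".join(value)  # flatten lists such as Bates references
            row[dest] = value
        writer.writerow(row)
    return buffer.getvalue()


# Made-up validated record.
records = [{
    "first_name": "Jane",
    "last_name": "Roe",
    "role": "witness",
    "phone": "805-555-0100",
    "email": "jane.roe@example.com",
    "bates": ["SBPD000123", "SBPD000125"],
}]
csv_text = export_csv(records)
```

Keeping the mapping in one dictionary means a change to the import template touches a single place, and the same mapping could later drive API payloads.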
5.7 Component: Logging & Audit Trail
Scope: Record
document status, processing events, exceptions, reviewer actions, validation
results, and export outcomes. Expose logs in a dashboard for monitoring and
compliance review; retain hashes/IDs for traceability to Box source.
Inputs: Event hooks
from all components, user actions, error handlers.
Outputs: Centralized
audit log, metrics for throughput, accuracy, and turnaround.
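Illustrative sketch of a structured audit event that carries a SHA-256 hash of the source file together with its Box file ID, so a log entry can be traced back to the exact bytes that were processed. The field names, the audit_event function, and the JSON-lines layout are assumptions; the file ID and content shown are made up.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Hex digest used to tie a log entry to the exact source bytes."""
    return hashlib.sha256(data).hexdigest()


def audit_event(component: str, event: str, box_file_id: str,
                content: bytes, **details) -> str:
    """Build one JSON log line describing a processing event."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "component": component,
        "event": event,
        "box_file_id": box_file_id,  # ID of the source file in Box
        "sha256": sha256_hex(content),
        "details": details,
    }
    return json.dumps(entry, sort_keys=True)


# Made-up file ID and content.
line = audit_event("ocr", "document_processed", "1234567890",
                   b"%PDF-1.7 sample bytes", pages=3)
parsed = json.loads(line)
```

One JSON object per line is easy to ship to a centralized log store and to aggregate later into throughput and turnaround metrics.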
6. Key Technologies
The following technologies are candidates for building and managing the
components above. Further research into OCR engines and NLP/NER stacks is
needed to determine which best fit the existing platforms:
• Box.com
(storage/API/Box AI)
• OCR engine (e.g., Tesseract/Azure/AWS/Box AI)
• NLP/NER stack (spaCy/transformers or legal APIs)
• Lightweight database (staging) and JSON/CSV exports
• Web UI framework for Staff Review dashboard
• eDefender API or batch import templates
• Centralized logging (e.g., ELK/Cloud logs)
7. Expected Deliverables
1) OCR pipeline and
metadata capture
2) NLP extraction with Bates references (incl. officers)
3) Standardized data schema and staging table
4) Staff Review UI with side-by-side snippets
5) Validation rules & dedupe routines
6) Export module; Phase 2 API plan if deferred
7) Logging & audit dashboard
8) Documentation and demo
| Role | Name | Email |
|---|---|---|
| Faculty Advisor | Jungsoo Lim | jlim34@calstatela.edu |
| Project Lead | Jennifer Lias | jlias2@calstatela.edu |
| Customer liaison/requirements lead | Nadia Hernandez | nherna170@calstatela.edu |
| Architecture/design lead | Joseph Lam | jlam87@calstatela.edu |
| UI Lead | Lemeng Zhao | lzhao25@calstatela.edu |
| Backend Lead | Addison Zhou | azhou19@calstatela.edu |
| QA/QC lead | Jesus Villa | jvilla24@calstatela.edu |
| Documentation Lead | Daniel Concepcion | dconcep@calstatela.edu |
| Demo Lead | Thomas Ogden | togden3@calstatela.edu |
| Presentation Lead | Tommy Works | tworks@calstatela.edu |
| Support Lead | Jose Holguin | jholgu21@calstatela.edu |
| Co-Lead | Peter Uy | puy@calstatela.edu |

| Team | Members |
|---|---|
| Staff Review UI | Thomas, Jen, Lemeng |
| Data Structure and Validation | Peter, Joseph |
| Document OCR and Extraction | Nadia, Daniel, Tommy |
| Logging and Audit | Jesus, Addison, Jose |

| Meeting | Day | Time |
|---|---|---|
| Weekly advisor group meeting | Friday | 8:00 AM - 9:00 AM |
| Bi-Weekly Liaison Meeting | Friday | 9:00 AM - 10:00 AM |
| Weekly team meeting | Friday | 10:00 AM - 11:00 AM |
- Daniel Concepcion
- Nadia Hernandez
- Jose Holguin
- Joseph Lam
- Jennifer Lias
- Thomas Ogden Jr
- Peter Uy
- Jesus Villa
- Tommy Works
- Lemeng Zhao
- Addison Zhou