After the Boston Marathon bombing, it took the FBI close to a year to process thousands of hours of CCTV footage. Today, the same volume would take two days. Tomorrow, it takes minutes. The evidence was always there. The eyes were not.
A homicide case generates 300 hours of body-cam footage, 200 hours of surveillance video, 50 hours of 911 recordings, 12 phone extractions, and 400 pages of forensic reports. No investigative team can review all of it. They sample. They prioritize. They make judgment calls about which footage to watch and which to skip. And in the footage they skip — the hours of seemingly empty parking lot recordings, the minutes of ambient noise on body-cam audio, the pages of routine lab results — lies the evidence that could change the case. A face in a car window at 2:14 AM. A voice on a 911 call that matches a witness who later recanted. A license plate that appears in three different surveillance feeds from three different cities on the same day.
The FBI learned this lesson in 2013, when thousands of hours of Boston Marathon CCTV footage overwhelmed their processing capacity. It took close to a year to fully process the data. Out of that crisis, they developed the Multimedia Processing Framework — a computer vision system that could extract faces, license plates, objects, and track subjects across massive datasets. Today, that same volume would take approximately two days. The technology exists. The question is whether every agency has access to it.
Vault's AI Analysis engine brings FBI-grade analytical capability to every agency in the Vault ecosystem. Computer vision detects and classifies every face, vehicle, weapon, and object in every frame of every video. Speech-to-text transcribes every word in every audio track with speaker diarization and named entity recognition. Cross-modal linking connects a face detected in surveillance footage to a voice identified in a 911 call to a license plate captured by an LPR system — building an evidence graph that reveals connections no manual review could construct. And natural language search makes the entire corpus queryable in plain English: "show me every instance of a red sedan near 4th and Main between 9 PM and midnight." The machine does not replace the investigator. It ensures the investigator sees everything the evidence contains — not just what they had time to watch.
From object detection to evidence graph construction, every piece of evidence analyzed, indexed, linked, and searchable — with every AI annotation provenance-tagged for court.
The human eye processes video at the speed of attention — one scene at a time, one angle at a time, one narrative at a time. An investigator watching body-cam footage focuses on the suspect, the officer, the confrontation. They do not see the face in the second-floor window behind the suspect. They do not notice the partial license plate on the vehicle parked four cars back. They do not register that the backpack on the bench in the background of frame 14,847 matches the backpack described in a separate witness statement filed three days later. The Computer Vision engine sees all of it simultaneously. Models trained on millions of law enforcement images — body-cam footage with its unique fisheye distortion, low-light conditions, rapid camera movement, and extreme angles — process every frame at 30fps, detecting and classifying every object that crosses a confidence threshold calibrated to the evidence type. Faces are detected regardless of angle, occlusion, or lighting. Vehicles are classified by make, model, color, and body type — not just "car," but "2019 BMW X5, silver, SUV." Weapons are detected and classified by type: handgun, rifle, knife, blunt object. Clothing is described by color, pattern, and type. Scene elements — storefronts, street signs, landmarks — are identified and geolocated. Every detection is timestamped to the frame, confidence-scored, and tagged with bounding box coordinates. The result is a searchable metadata layer overlaid on the original evidence — every frame annotated with everything the machine can see, ready for the investigator to query rather than watch.
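The metadata layer described above can be pictured as one small record per detection. The following is a minimal sketch, not Vault's actual schema: the `Detection` fields, the per-label thresholds, and the sample values are all hypothetical, chosen only to show frame-level timestamps, confidence scores, bounding boxes, and threshold filtering in one place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    frame: int                        # frame index in the source video
    label: str                        # e.g. "face", "vehicle", "handgun"
    confidence: float                 # model score in [0, 1]
    bbox: tuple                       # (x, y, width, height) in pixels


# Hypothetical thresholds; real values would be calibrated per evidence type.
THRESHOLDS = {"face": 0.80, "vehicle": 0.60, "handgun": 0.90}
DEFAULT_THRESHOLD = 0.75


def keep(detections):
    """Return only detections at or above the threshold for their label."""
    return [
        d for d in detections
        if d.confidence >= THRESHOLDS.get(d.label, DEFAULT_THRESHOLD)
    ]


def timestamp_seconds(frame, fps=30.0):
    """Convert a frame index to seconds from the start of the recording."""
    return frame / fps


raw = [
    Detection(14847, "backpack", 0.81, (412, 230, 60, 85)),
    Detection(14847, "face", 0.62, (120, 40, 32, 32)),
    Detection(14850, "vehicle", 0.71, (600, 300, 220, 140)),
]
kept = keep(raw)  # the low-confidence face is dropped
```

Keeping the threshold per label reflects the idea that a weapon detection and a vehicle detection may warrant different levels of certainty before they are surfaced to an investigator.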
Audio evidence contains information that video cannot capture — what was said, by whom, in what tone, at what moment. A suspect's statement during a body-cam encounter that contradicts their later deposition. A 911 caller's description of a vehicle that matches surveillance footage from two miles away. A witness in an interview room who names an individual that appears in a phone extraction's contact list. Without transcription, this information is locked inside audio files that investigators must listen to in real time — hours of footage producing hours of listening. Vault's Transcription engine unlocks the audio layer entirely. Every audio track in the evidence corpus — body-cam microphones, dash-cam audio, 911 dispatch recordings, interview room recordings, phone call intercepts, voicemails from phone extractions — is transcribed using speech-to-text models optimized for law enforcement audio environments. These environments are uniquely challenging: simultaneous speakers, radio chatter bleeding into body-cam microphones, ambient noise from traffic and weather, distance from the microphone as subjects move, accents and dialectal variation, and code-switching between languages within the same encounter. Speaker diarization identifies and labels each unique speaker within a recording — separating the officer's voice from the suspect's voice from the bystander's voice — enabling investigators to search for what a specific speaker said without reading the entire transcript. Named entity recognition extracts people, places, organizations, dates, phone numbers, and addresses from the transcript, linking spoken references to other evidence items in the case. The transcription engine supports 100+ languages with automatic language detection, ensuring that multilingual encounters are transcribed without requiring the investigator to know what language is being spoken.
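To illustrate why diarization matters for search, here is a small sketch of querying a diarized transcript by speaker. The `Segment` structure, the speaker labels, and the sample lines are assumptions made for the example; they are not the output format of any particular speech-to-text product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float    # seconds from the start of the recording
    end: float
    speaker: str    # diarization label, e.g. "SPEAKER_1"
    text: str


def said_by(segments, speaker, keyword):
    """Return segments where `speaker` said something containing `keyword`."""
    kw = keyword.lower()
    return [
        s for s in segments
        if s.speaker == speaker and kw in s.text.lower()
    ]


transcript = [
    Segment(0.0, 3.2, "SPEAKER_1", "Step out of the vehicle, please."),
    Segment(3.4, 6.0, "SPEAKER_2", "I was never near the silver BMW."),
    Segment(6.1, 8.5, "SPEAKER_1", "Dispatch, I need a plate check on a silver BMW."),
]

# Only the second speaker's mention is returned, with its timestamps.
hits = said_by(transcript, "SPEAKER_2", "bmw")
```

Without speaker labels, the same keyword search would return both mentions and the investigator would have to read each one to work out who was talking.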
The most valuable evidence is rarely a single file. It is the connection between files that reveals what happened. A face detected at 2:14 AM in CCTV footage from a gas station is meaningless in isolation. But suppose the evidence graph connects that face to a voice on a 911 call made from the same gas station's payphone at 2:16 AM; that 911 call mentions a vehicle matching a silver BMW X5 detected by LPR cameras at two intersections within a one-mile radius between 2:08 and 2:22 AM; and a phone extraction from a suspect's device shows GPS coordinates placing that phone at the gas station at 2:13 AM. The isolated detections then become a narrative that places a specific person at a specific location at a specific time, corroborated across four independent evidence sources. No investigator manually reviewing each evidence source independently would have constructed this chain. The Cross-Modal Linking engine builds these connections automatically. After computer vision, transcription, and entity extraction have processed the evidence corpus, the linking engine searches for four kinds of correspondence: temporal correlations (events occurring within configurable time windows across different evidence sources), spatial correlations (GPS coordinates, addresses, or landmarks appearing in multiple sources), entity correlations (the same name, phone number, plate number, or vehicle description appearing across different evidence types), and biometric correlations (voice prints matching across audio sources, face embeddings matching across video sources). Each discovered link is scored by confidence and presented as an edge in the evidence graph, with the linked evidence items as its nodes. The graph is a visual, queryable representation of every connection the AI has found across the entire evidence corpus. Investigators explore it interactively, following connections from one evidence item to the next, with each link documented by the specific detections that produced it.
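The temporal and spatial correlation step can be sketched in a few lines. This is an illustrative simplification under stated assumptions: the `Event` record, the place labels, the calendar date, and the five-minute window are hypothetical, and a real linking engine would also score each link rather than simply emit it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import combinations


@dataclass(frozen=True)
class Event:
    source: str      # evidence source, e.g. "cctv", "911_call"
    kind: str        # what was detected
    time: datetime
    place: str       # normalized location label (hypothetical)


def temporal_links(events, window):
    """Pair events from different sources at the same place within `window`."""
    links = []
    for a, b in combinations(events, 2):
        same_place = a.place == b.place
        close_in_time = abs(a.time - b.time) <= window
        if a.source != b.source and same_place and close_in_time:
            links.append((a, b))
    return links


day = (2024, 1, 1)  # arbitrary date for the example
events = [
    Event("cctv", "face", datetime(*day, 2, 14), "gas_station"),
    Event("911_call", "voice", datetime(*day, 2, 16), "gas_station"),
    Event("phone_gps", "device", datetime(*day, 2, 13), "gas_station"),
    Event("lpr", "plate", datetime(*day, 2, 8), "intersection_a"),
]

links = temporal_links(events, window=timedelta(minutes=5))
for a, b in links:
    print(f"{a.source}:{a.kind} <-> {b.source}:{b.kind}")
```

In this sketch the three gas station events link to one another, while the LPR read stays unlinked because it sits at a different place label. Connecting it would take a spatial rule based on distance rather than exact label equality.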
Traditional evidence search requires investigators to know where to look before they look. They must select the right camera, the right time window, the right case file, and then manually review the content within those parameters. If the evidence they need is in a different camera, a different time window, or a different file format entirely, they will not find it — not because it does not exist, but because they did not know to look in the right place. Vault's Natural Language Search eliminates this limitation by making the entire evidence corpus queryable in plain English. The investigator types a description of what they are looking for — not a file name, not a camera ID, not a timestamp, but a description of the content: "red sedan near 4th and Main between 9 PM and midnight," "person wearing a blue jacket carrying a bag," "any mention of the name Rodriguez in audio recordings," "all body-cam footage where an officer draws a weapon." The search engine translates this natural language query into a multi-modal search across every indexed evidence item: matching the visual description against computer vision detections (red sedan = vehicle detections classified as sedan, color: red), the location against geolocated evidence items and scene analysis (4th and Main = GPS coordinates or street sign detections matching the location), the time window against evidence metadata and transcript timestamps, and the spoken content against transcription indices. Results are returned in under 3 seconds, ranked by relevance, with each result linked to the specific timestamp and frame where the match was found. The investigator clicks a result and is taken directly to the moment in the evidence where the red sedan appears — no scrubbing, no scanning, no hours of manual review. 
CLIP-based semantic matching enables queries that go beyond literal keyword matching into conceptual search: "aggressive confrontation" returns body-cam segments where the AI's behavioral analysis detected raised voices, aggressive postures, and rapid movement — concepts that cannot be captured by keywords alone.
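The translation from a plain-English query into structured filters can be illustrated with a deliberately simple keyword parser. This is a toy sketch only: the vocabularies, the `parse_query` function, and the filter keys are hypothetical, and it does none of the semantic matching described above. It just shows the shape of the structured query that a multi-modal search might run against the detection, location, and time indices.

```python
import re

# Hypothetical vocabularies; a real system would not rely on fixed word lists.
COLORS = {"red", "blue", "silver", "black", "white"}
VEHICLE_TYPES = {"sedan", "suv", "truck", "van"}


def to_hour(token):
    """Convert '9 PM', 'noon', or 'midnight' to an hour on a 0-24 scale."""
    token = token.strip().lower()
    if token == "midnight":
        return 24  # treated as end of day; a start-of-day midnight would be 0
    if token == "noon":
        return 12
    m = re.fullmatch(r"(\d{1,2})\s*(am|pm)", token)
    if not m:
        raise ValueError(f"unrecognized time: {token!r}")
    hour = int(m.group(1)) % 12
    return hour + 12 if m.group(2) == "pm" else hour


def parse_query(query):
    """Extract color, vehicle type, location, and hour window from a query."""
    q = query.lower()
    words = set(re.findall(r"[a-z0-9]+", q))
    filters = {
        "color": next(iter(words & COLORS), None),
        "vehicle_type": next(iter(words & VEHICLE_TYPES), None),
    }
    loc = re.search(r"near (.+?)(?: between|$)", q)
    if loc:
        filters["location"] = loc.group(1)
    win = re.search(r"between (.+?) and (.+)$", q)
    if win:
        filters["hours"] = (to_hour(win.group(1)), to_hour(win.group(2)))
    return filters


filters = parse_query("red sedan near 4th and Main between 9 PM and midnight")
print(filters)
```

Each key in the resulting dictionary corresponds to one index named in the text: color and vehicle type go to computer vision detections, location to geolocated items, and the hour window to evidence metadata.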
Tracking a person of interest across a network of cameras is one of the most time-consuming tasks in investigative work — and one of the most politically sensitive. Facial recognition, while effective, raises civil liberties concerns that have led to bans or restrictions in multiple jurisdictions. An investigator may need to track a suspect from a crime scene through a transit system, across a commercial district, and into a residential area — across dozens of cameras operated by different entities — without using technology that triggers regulatory prohibitions. Person Re-Identification (RE-ID) solves this by tracking individuals using non-biometric features. Instead of analyzing facial characteristics, RE-ID models analyze gait patterns (the unique way a person walks), body proportions (height, shoulder width, torso-to-leg ratio), clothing appearance (color, texture, pattern, layering), and accessories (bags, hats, umbrellas). These features are combined into a re-identification embedding — a mathematical representation of the person's appearance that can be matched across cameras without requiring a face to be visible. The investigator selects a person of interest in one camera feed and asks the system to find the same person across the entire camera network. The RE-ID engine compares the target's embedding against every person detected in every other camera feed, returning matches ranked by confidence with the camera location and timestamp. The technology was pioneered by Queen Mary University of London's Computer Vision Group and has been internationally recognized for its ability to track subjects across distributed camera networks without using any private data, facial imagery, or person-specific biometrics. 
For jurisdictions that have banned or restricted facial recognition but still need to track suspects across multi-camera environments, RE-ID provides a privacy-preserving alternative that achieves the same investigative objective without the regulatory and civil liberties concerns.
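Matching a re-identification embedding across cameras is commonly done with a similarity measure such as cosine similarity. The sketch below assumes that approach; the three-dimensional vectors, camera names, timestamps, and the 0.8 cutoff are made up for illustration, and real embeddings would have hundreds of dimensions produced by a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_matches(target, gallery, min_score=0.8):
    """Score every gallery entry against the target, best match first."""
    scored = [(cosine(target, emb), cam, ts) for cam, ts, emb in gallery]
    return sorted((s for s in scored if s[0] >= min_score), reverse=True)


# Hypothetical embedding of the person selected by the investigator.
target = [0.9, 0.1, 0.4]

# (camera, timestamp, embedding) for people detected in other feeds.
gallery = [
    ("cam_03", "02:16:10", [0.9, 0.1, 0.4]),
    ("cam_07", "02:19:42", [0.1, 0.9, 0.2]),
    ("cam_11", "02:25:03", [0.8, 0.2, 0.5]),
]

matches = rank_matches(target, gallery)
for score, cam, ts in matches:
    print(f"{cam} at {ts}: similarity {score:.3f}")
```

The cutoff controls the trade-off between missed sightings and false matches, which is why results are returned as a ranked list for human confirmation rather than as a single answer.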
Surveillance footage is mostly nothing. A camera watching a parking lot for 24 hours captures 23 hours and 40 minutes of an empty parking lot and 20 minutes of activity that matters. An investigator reviewing that footage in real time spends an entire day watching nothing. Multiply this across 50 cameras covering a crime scene perimeter, and the review task is 50 person-days of mostly empty footage. Video Synopsis technology transforms this calculus. The engine analyzes every frame, identifies periods of activity (people moving, vehicles entering or exiting, objects appearing or disappearing), and compresses the footage by eliminating dead time: the hours of empty frames where nothing relevant occurs. The result is a synopsis video where only the moments of activity are preserved, presented in their original chronological order but with the empty intervals removed. A 24-hour recording from a single camera compresses to 20-40 minutes of event-dense content. Fifty cameras covering a crime scene for 24 hours compress from 1,200 hours of footage to approximately 17-33 hours of reviewable content (50 cameras at 20-40 minutes each), a reduction that transforms an impossible task into a manageable one. Beyond simple compression, the Synopsis engine clusters activity by type and location: all vehicle movements grouped together, all pedestrian activity grouped, all interactions between people highlighted. Key moments, such as sudden movements, confrontations, object transfers, and entries and exits through specific doors or gates, are flagged with attention markers that guide the investigator to the most relevant segments first. The investigator does not watch 412 hours of footage. They watch 4.7 hours of everything that happened.
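The dead-time removal step can be approximated as merging per-frame activity into intervals and keeping only those intervals. The sketch below is an assumption about how such a step might look, not a description of the Synopsis engine itself: activity intervals are given in seconds, and the 60-second `gap` that joins nearby bursts into one event is a hypothetical setting.

```python
def merge_intervals(intervals, gap=0.0):
    """Merge activity intervals that overlap or lie within `gap` seconds."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= gap:
            # Close enough to the previous burst: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def total_seconds(intervals):
    """Total duration covered by a list of non-overlapping intervals."""
    return sum(end - start for start, end in intervals)


# Hypothetical activity detected in one 24-hour recording (seconds from start).
activity = [(3600, 3900), (3890, 4200), (50000, 50600), (4230, 4500)]

synopsis = merge_intervals(activity, gap=60)
kept_minutes = total_seconds(synopsis) / 60
full_minutes = 24 * 60

print(synopsis)
print(f"{kept_minutes:.0f} of {full_minutes} minutes kept")
```

Because each merged interval keeps its original start and end times, a synopsis built this way can still point the investigator back to the exact moment in the unmodified recording.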
After the AI has processed 847 evidence items — detecting 184,291 objects, transcribing 2.1 million words, discovering 1,847 cross-modal links, and building a comprehensive evidence graph — the investigator needs a starting point. Not a wall of raw detections, but a structured summary that answers: What happened? When? Who was involved? What evidence supports each element of the narrative? And where should I focus my attention first? The Evidence Summary engine generates this starting point automatically. From the complete AI analysis, the engine produces a chronological event timeline reconstructing the sequence of events across all evidence sources, with each event linked to the specific evidence items that support it. Witness statement comparisons cross-reference transcribed statements against each other and against the physical evidence, flagging consistencies and contradictions. The evidence inventory report catalogs every item in the corpus with its AI analysis summary — what was detected, what was transcribed, what links were discovered. The investigative priority ranking identifies the evidence items most likely to be critical to the case: items with high cross-modal link density (appearing in connections across multiple evidence sources), items containing contradictions with other evidence, items flagged by the computer vision engine as containing weapons or violent interactions, and items where the AI confidence is low enough that human review is essential. This triage function is critical for large cases: instead of reviewing 847 items sequentially, the investigator starts with the 47 items the AI has identified as most likely to contain case-critical information — and works outward from there. Every generated summary is clearly marked as AI-generated, with links to the underlying evidence and detection data that produced each statement.
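The investigative priority ranking can be thought of as a scoring function over the four signals listed above. The following sketch makes that concrete under loud assumptions: the `ItemSummary` fields, the weights, and the 0.5 low-confidence cutoff are invented for illustration and are not the engine's actual scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemSummary:
    item_id: str
    link_count: int            # cross-modal links touching this item
    has_contradiction: bool    # conflicts with other evidence
    weapon_flag: bool          # weapon or violent interaction detected
    min_confidence: float      # lowest AI confidence on this item


def priority(item, low_conf=0.5):
    """Hypothetical score: higher means review sooner."""
    score = float(item.link_count)
    if item.has_contradiction:
        score += 5
    if item.weapon_flag:
        score += 5
    if item.min_confidence < low_conf:
        score += 3  # low confidence means human review is essential
    return score


def triage(items, top_n):
    """Return the `top_n` items with the highest priority score."""
    return sorted(items, key=priority, reverse=True)[:top_n]


items = [
    ItemSummary("A", 12, False, False, 0.90),
    ItemSummary("B", 2, True, True, 0.40),
    ItemSummary("C", 0, False, False, 0.95),
    ItemSummary("D", 4, True, False, 0.80),
]

first = triage(items, top_n=2)
print([i.item_id for i in first])
```

Note that low confidence raises priority rather than lowering it, which matches the stated goal of routing uncertain AI output to a human first.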
AI-generated analysis is not evidence. It is an analytical overlay on evidence — and the distinction is legally critical. When the computer vision engine detects a weapon in a body-cam frame, that detection is a machine's interpretation, not a fact established by the original recording. When the transcription engine converts audio to text, the transcript is a machine-generated approximation, not a verbatim record. When the cross-modal linking engine connects a face in CCTV to a voice in a 911 call, that connection is a probabilistic correlation, not a proven identity. If any of these AI outputs are presented in court without clear provenance documentation — without making explicit that they are machine-generated, what model produced them, what confidence threshold was applied, and what error rate the model exhibits — the defense will challenge them under Daubert, and the challenge may succeed. Vault's Provenance engine ensures that every AI output carries complete documentation of its origin. Every computer vision detection is tagged with the model name and version, the detection confidence score, the training data characteristics, and the known error rates for that object type in that environment. Every transcription is tagged with the speech-to-text model version, the estimated word error rate for the audio conditions present, and the language detection confidence. Every cross-modal link is tagged with the correlation method, the confidence threshold, and the specific detections that produced the link. All AI-generated content is visually and structurally distinguished from original evidence — annotations appear in a separate layer, transcripts are labeled as machine-generated, summaries carry explicit provenance headers. This separation ensures that no jury, no judge, and no opposing counsel can mistake an AI interpretation for an established fact. The original evidence remains pristine. The AI analysis enhances understanding. 
The provenance documentation ensures that the enhancement is transparent, auditable, and defensible.
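A provenance tag of the kind described can be modeled as a small immutable record stored alongside each annotation and kept separate from the evidence bytes. This is a minimal sketch with assumed names: the `Provenance` fields, the `make_annotation` helper, and the model name and numbers are hypothetical. Including a SHA-256 hash of the original evidence is an assumption added for the example, not something stated in the text above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Provenance:
    model_name: str
    model_version: str
    confidence: float         # score for this specific detection
    threshold: float          # threshold applied when it was produced
    known_error_rate: float   # documented error rate for this object type
    source_sha256: str        # assumption: hash tying the tag to the original
    ai_generated: bool = True


def make_annotation(evidence, label, *, model_name, model_version,
                    confidence, threshold, known_error_rate):
    """Build an annotation in a separate layer; the evidence is only read."""
    prov = Provenance(
        model_name=model_name,
        model_version=model_version,
        confidence=confidence,
        threshold=threshold,
        known_error_rate=known_error_rate,
        source_sha256=hashlib.sha256(evidence).hexdigest(),
    )
    return {"label": label, "provenance": asdict(prov)}


evidence = b"raw body-cam frame bytes"  # stand-in for an original file
annotation = make_annotation(
    evidence, "handgun",
    model_name="example-detector", model_version="1.4.2",
    confidence=0.93, threshold=0.90, known_error_rate=0.04,
)
print(json.dumps(annotation, indent=2))
```

Keeping the tag in its own structure, rather than writing into the file, is one way to honor the separation between original evidence and machine interpretation.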
Three investigations. Three evidence mountains conquered. Every connection the AI found was verified by humans and held in court.
Every face. Every voice. Every vehicle. Every connection. Every second of footage — searchable, linked, and understood.