AI Predicts Pedestrians’ Next Move

A new multimodal AI called OmniPredict uses a GPT-4o–style large model to anticipate pedestrian actions in real time, outperforming traditional vision systems on standard benchmarks. Researchers say it could change how autonomous vehicles—and other machines—plan around humans, but the claim that the system is "reading minds" demands careful scrutiny.

On city streets the safest split-second decision is often the one you never have to make. This week researchers at Texas A&M and collaborators in Korea unveiled OmniPredict, an AI system that does more than spot a person in the road: it tries to infer what that person will do next. Described in a peer-reviewed article in Computers & Electrical Engineering, OmniPredict blends scene images, close-up views, bounding boxes, vehicle telemetry and simple behavioural cues to forecast a pedestrian’s likely action in real time.

A model that anticipates, not just detects

Traditional autonomous-vehicle stacks separate perception from planning: cameras and lidar detect objects, then downstream modules decide how to brake or steer. OmniPredict replaces that rigid pipeline with a multimodal large language model (MLLM) architecture that fuses visual and contextual inputs and produces a probabilistic prediction about human behaviour—whether someone will cross, pause in an occluded area, glance toward the vehicle, or perform another action. In laboratory tests the team reports roughly 67% prediction accuracy on established pedestrian-behaviour benchmarks, a gain of about ten percentage points over recent state-of-the-art methods.
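
To make the interface concrete, here is a minimal sketch, not drawn from the paper, of how a downstream planner might consume the kind of probabilistic behaviour forecast described above. The class names, thresholds and the plan_margin function are assumptions for illustration, not OmniPredict's actual API.

    # Hypothetical sketch (not the paper's API): how a planner might consume a
    # probabilistic behaviour forecast of the kind OmniPredict is described as producing.
    from dataclasses import dataclass

    @dataclass
    class BehaviourForecast:
        crossing: float   # probability the pedestrian will step into the road
        occlusion: float  # probability they pause in, or emerge from, an occluded area
        action: float     # probability of another salient action (waving, running, stopping)
        gaze: float       # probability they are looking toward the vehicle

    def plan_margin(forecast: BehaviourForecast, base_gap_m: float = 5.0) -> float:
        """Widen the planned safety gap when a crossing looks likely (illustrative thresholds)."""
        if forecast.crossing > 0.5 and forecast.gaze < 0.3:
            return base_gap_m * 2.0   # likely crossing without eye contact: largest margin
        if forecast.crossing > 0.5:
            return base_gap_m * 1.5
        return base_gap_m

    print(plan_margin(BehaviourForecast(crossing=0.72, occlusion=0.10, action=0.20, gaze=0.15)))  # -> 10.0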

The researchers frame the advance as a shift from reactive automation toward anticipatory autonomy. "Cities are unpredictable. Pedestrians can be unpredictable," said the project lead, noting that a car which anticipates a likely step into the road can plan earlier and more smoothly, potentially reducing near-misses. The result is not a human mind-reading oracle but a statistical engine that converts visual cues—pose, head direction, occlusion, vehicle speed—into a short-term forecast of movement.

How OmniPredict reads the scene

At the technical core, OmniPredict uses an MLLM—the kind of architecture increasingly used for chat and image tasks—adapted to interpret video frames and structured contextual signals. Inputs include a wide-angle scene image, zoomed crops of individual pedestrians, bounding-box coordinates, and simple sensor data such as vehicle velocity. The model processes these multimodal streams together and maps them to four behaviour categories the team found useful for driving contexts: crossing, occlusion, actions and gaze.
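
As a rough illustration of what such a multimodal input bundle could look like, the sketch below groups the kinds of signals the paper lists into one query object. The field names and the to_prompt rendering are invented for this example and are not OmniPredict's real interface.

    # Illustrative only: grouping the inputs the article lists (scene image, pedestrian
    # crop, bounding box, vehicle speed) into a single multimodal query. The field names
    # and to_prompt() rendering are assumptions, not OmniPredict's real interface.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class PedestrianQuery:
        scene_image: bytes                    # wide-angle frame, e.g. an encoded JPEG
        pedestrian_crop: bytes                # zoomed crop of one pedestrian
        bbox_xyxy: Tuple[int, int, int, int]  # bounding box in pixel coordinates
        vehicle_speed_mps: float              # ego-vehicle telemetry
        categories: List[str] = field(
            default_factory=lambda: ["crossing", "occlusion", "actions", "gaze"])

        def to_prompt(self) -> str:
            """Render the structured context as text to accompany the two images."""
            return (f"Pedestrian bbox: {self.bbox_xyxy}. "
                    f"Vehicle speed: {self.vehicle_speed_mps:.1f} m/s. "
                    f"Classify the pedestrian on: {', '.join(self.categories)}.")

    q = PedestrianQuery(scene_image=b"", pedestrian_crop=b"",
                        bbox_xyxy=(120, 40, 180, 320), vehicle_speed_mps=8.3)
    print(q.to_prompt())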

Two properties matter. First, the MLLM’s cross-modal attention lets the model link a distant body orientation to a local gesture—someone turning their torso while looking down at a phone, for example—without bespoke hand-coded rules. Second, the system appears to generalise: the researchers ran OmniPredict on two challenging public datasets for pedestrian behaviour (JAAD and WiDEVIEW) without dataset-specific training and still saw results above the current state of the art. That generalisation is the headline claim, and it’s why the group describes OmniPredict as a "reasoning" layer sitting above raw perception.
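
For readers unfamiliar with the mechanism, here is a toy numpy rendering of cross-modal attention in general; the shapes and features are made up, and this is not the paper's architecture.

    # Toy numpy rendering of cross-modal attention (not the paper's model):
    # tokens from one stream (the pedestrian crop) attend over tokens from another
    # (the wide scene), so each crop feature can gather relevant scene context.
    import numpy as np

    def cross_attention(queries, keys, values):
        """Scaled dot-product attention with queries from one modality, keys/values from another."""
        d = queries.shape[-1]
        scores = queries @ keys.T / np.sqrt(d)           # similarity across modalities
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the other modality's tokens
        return weights @ values                          # context gathered from the scene

    rng = np.random.default_rng(0)
    crop_tokens = rng.standard_normal((4, 16))    # features from the zoomed pedestrian crop
    scene_tokens = rng.standard_normal((10, 16))  # features from the wide-angle scene
    fused = cross_attention(crop_tokens, scene_tokens, scene_tokens)
    print(fused.shape)  # (4, 16): each crop token now carries scene context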

Benchmarks, limits and the realism gap

Benchmarks tell one part of the story. The reported 67% accuracy and the roughly ten-percentage-point improvement over recent baselines are meaningful in academic comparisons, but they do not automatically translate into roadworthy safety. Benchmarks contain many repeated patterns and a narrower distribution of scenarios than live city driving; rare events, adversarial behaviour and unusual weather often overwhelm a model's assumptions once systems leave the lab.
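
A small worked example, with invented numbers, shows why a single headline accuracy figure can flatter a model on an imbalanced benchmark:

    # Invented numbers, for intuition only: on an imbalanced benchmark, a strong-looking
    # overall accuracy can coexist with poor recall on the rare, safety-critical class.
    def accuracy(preds, labels):
        return sum(p == t for p, t in zip(preds, labels)) / len(labels)

    labels = ["no"] * 90 + ["cross"] * 10              # 90 non-crossing events, 10 crossings
    preds  = ["no"] * 90 + ["no"] * 7 + ["cross"] * 3  # the model misses most crossings

    print(f"overall accuracy: {accuracy(preds, labels):.2f}")  # 0.93, looks strong
    crossing_recall = sum(p == "cross" for p, t in zip(preds, labels) if t == "cross") / 10
    print(f"recall on crossings: {crossing_recall:.2f}")       # 0.30, the number that matters on the road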

Critics are quick to point out that the language "reading human minds" risks overstating the result. The model’s predictions derive from statistical associations learned from past data: similar visual contexts in the training set led to similar outcomes. That’s powerful, but it is not the same as access to human intent or internal mental states. In practice, pedestrians are influenced by local culture, street design and social signalling; an AI that doesn’t account for those layers can make confident but wrong predictions.

Safety, privacy and behavioural feedback

If a vehicle plans around what it expects you to do, human behaviour may change in response—a point sometimes called the behavioural feedback loop. People who know cars will anticipate them might take more risks, or conversely become more wary; either dynamic can change the statistical relationships the model depends on. That makes continuous in‑field validation essential.

The system’s reliance on visual and contextual cues also raises privacy and equity questions. Models trained on urban footage often inherit the biases and blind spots of their datasets: who was recorded, under which conditions, and with what cameras. Weaknesses in detection for certain skin tones, clothing types or body shapes could translate into different prediction quality across populations. Engineering teams must therefore prioritise dataset diversity, transparency about model failure modes, and procedures to audit and mitigate biased behaviour.
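
One concrete form such an audit could take is slicing prediction quality by recording condition or population group, as in the hypothetical sketch below; the groups and records are illustrative only.

    # Hypothetical audit sketch: slice prediction quality by group or recording condition
    # to surface disparities. The groups and records below are invented for illustration.
    from collections import defaultdict

    def accuracy_by_group(records):
        """records: iterable of (group, predicted_label, true_label) tuples."""
        hits, totals = defaultdict(int), defaultdict(int)
        for group, pred, true in records:
            totals[group] += 1
            hits[group] += int(pred == true)
        return {g: hits[g] / totals[g] for g in totals}

    audit = accuracy_by_group([
        ("daylight", "cross", "cross"), ("daylight", "no", "no"),
        ("night",    "no",    "cross"), ("night",    "no", "no"),
    ])
    print(audit)  # e.g. {'daylight': 1.0, 'night': 0.5} -> a gap worth investigating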

From multimodal LLMs to brain-inspired architectures

Comparisons to brain-inspired, neuromorphic architectures come up naturally when a model fuses specialised streams of information, but the parallel is conceptual rather than literal. Current AI does not replicate human consciousness or the mechanisms of real intention. But taking inspiration from neural organisation—how networks route information and form specialised modules—can help engineers design systems that better balance speed, robustness and adaptability on chaotic city streets.

What needs to happen before deployment

OmniPredict is a research prototype, not a finished autonomy stack. Before deployment in vehicles, it needs long-term field trials, rigorous safety validation under corner cases, and integration tests that show how behavioural predictions should influence motion planning. Regulators and manufacturers will also have to decide standards for acceptable false-positive and false-negative rates when a system predicts human actions—trade-offs that carry clear safety implications.
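
The threshold trade-off can be made tangible with a small, invented example: lowering the decision threshold for "crossing" reduces missed crossings (false negatives) but increases unnecessary interventions (false positives). The scores and labels below are made up for illustration.

    # Invented scores and labels, to show the threshold trade-off regulators would weigh:
    # a lower "crossing" threshold cuts missed crossings (false negatives) but raises
    # unnecessary interventions (false positives).
    def fp_fn_rates(scores, labels, threshold):
        preds = [s >= threshold for s in scores]
        fp = sum(p and not t for p, t in zip(preds, labels))
        fn = sum((not p) and t for p, t in zip(preds, labels))
        return fp / max(labels.count(False), 1), fn / max(labels.count(True), 1)

    scores = [0.9, 0.7, 0.55, 0.4, 0.3, 0.2]          # model confidence that a crossing will occur
    labels = [True, True, False, True, False, False]  # True = the pedestrian actually crossed
    for t in (0.3, 0.5, 0.7):
        fp, fn = fp_fn_rates(scores, labels, t)
        print(f"threshold {t:.1f}: false-positive rate {fp:.2f}, false-negative rate {fn:.2f}")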

Finally, the project underscores a recurring truth of applied AI: accuracy on curated tests is necessary but not sufficient. Real-world systems must be auditable, fair and robust to distribution shifts; they must degrade gracefully when uncertain. The prospect of machines that "anticipate" human movement is attractive for safety and flow in urban transport, but it brings technical, ethical and legal questions that should be resolved before cars make irreversible decisions based on those predictions.

The work from Texas A&M and partners points to a near future in which perception, context and behavioural reasoning are inseparable components of autonomous systems. That future will be safer only if it combines the new predictive layer with conservative safety design, careful testing and clear rules for transparency and accountability.

Sources

  • Computers & Electrical Engineering (research paper on OmniPredict)
  • Texas A&M University College of Engineering
  • Korea Advanced Institute of Science and Technology (KAIST)
  • Nature Machine Intelligence (research on neuromorphic networks)
  • McGill University / The Neuro (Montreal Neurological Institute-Hospital)
Mattias Risberg

Cologne-based science & technology reporter tracking semiconductors, space policy and data-driven investigations.

University of Cologne (Universität zu Köln) • Cologne, Germany

Readers Questions Answered

Q What is OmniPredict and what does it do?
A OmniPredict is a multimodal AI system that uses a large language model architecture to fuse visual inputs with contextual signals and forecast a pedestrian's likely next move in real time. It accepts wide-angle scene images, close-up crops of pedestrians, bounding-box coordinates, and vehicle telemetry, and outputs probabilistic predictions about actions such as crossing, pausing in occluded areas, or shifting gaze.
Q How does OmniPredict classify pedestrian behavior?
A OmniPredict maps its multimodal inputs to four behavior categories relevant for driving: crossing, occlusion, actions, and gaze. It uses cross-modal attention to link a distant body orientation with a local gesture, enabling predictions without hand-coded rules and allowing the model to infer short-term movement from the combination of pose, head direction, and context.
Q How well does it perform on benchmarks, and what are the caveats?
A In lab tests, OmniPredict achieved about 67% prediction accuracy on the JAAD and WiDEVIEW benchmarks, roughly 10 percentage points higher than recent baselines. Yet benchmark performance does not automatically translate to road safety: these datasets have narrower scenario distributions than live traffic, and real-world driving presents rare events and adversarial conditions that can challenge the model. The researchers' headline claim is that the system generalises to these datasets without dataset-specific training.
Q What needs to happen before deployment and what concerns exist?
A Before deployment, OmniPredict remains a research prototype requiring long-term field trials, rigorous safety validation under corner cases, and integration tests showing how predictions influence motion planning. The work also calls for standards on acceptable false-positive and false-negative rates, plus ongoing auditing for bias, privacy, and the potential for a behavioural feedback loop where people change how they act around anticipatory systems.
Q Does OmniPredict read minds or access internal mental states?
A No. The researchers emphasize that the system does not access internal intent or consciousness; it converts visual cues and contextual data into statistical forecasts of near-term movement learned from past data, and those forecasts can be confident yet incorrect when a situation differs from the training patterns.
