AirControl

A gesture-driven computer control system built around webcam-based hand tracking, modular event handling, and configurable gesture-to-action mapping.

Overview

The idea for this project started from a simple moment: while trying to watch a movie, I found myself wishing I could pause, play, or adjust the volume using just my hands, without reaching for a keyboard or remote. That small frustration became the starting point for this project, which I call AirControl.

AirControl is a gesture-driven control system that allows a user to interact with their computer through hand movements captured by a webcam. It is built as a modular system in which perception, event handling, and command execution are cleanly separated, making it easy to extend with new gestures and actions.

Description

AirControl is organised as a real-time gesture-processing pipeline. Hand landmarks are first extracted from a webcam feed using Google’s MediaPipe framework, which tracks the positions of key points on the hand, particularly the fingertips and joints.
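A minimal sketch of this stage, assuming the standard MediaPipe Hands Python API and an OpenCV webcam capture (AirControl's actual capture code may be organised differently):

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

cap = cv2.VideoCapture(0)
with mp_hands.Hands(max_num_hands=1,
                    min_detection_confidence=0.7,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            landmarks = results.multi_hand_landmarks[0].landmark
            # Each landmark carries normalised x/y (plus relative z); index fingertip shown here.
            tip = landmarks[mp_hands.HandLandmark.INDEX_FINGER_TIP]
            print(f"index fingertip at ({tip.x:.2f}, {tip.y:.2f})")
cap.release()
```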

These landmarks are then interpreted by gesture detectors to produce discrete gesture events. This step is important because raw landmark data is continuous and often noisy; converting it into higher-level events makes the system better able to reason about user intent in a structured and stable way. Instead of reacting directly to small fluctuations in finger position, the system can respond to clearer actions such as a pinch, hold, or wave.
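As an illustration of that conversion, here is a hedged sketch of a pinch detector that uses two thresholds (hysteresis) so the event does not flicker when the fingertip distance hovers near a single cut-off. The event names and threshold values are hypothetical, not the project's actual ones:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GestureEvent:
    name: str  # e.g. "pinch_start" or "pinch_end"

class PinchDetector:
    """Turns a noisy, continuous fingertip distance into discrete pinch events."""

    def __init__(self, start_threshold: float = 0.05, end_threshold: float = 0.08):
        # Entering and leaving thresholds differ so small jitter cannot
        # toggle the gesture on and off every frame.
        self.start_threshold = start_threshold
        self.end_threshold = end_threshold
        self.active = False

    def update(self, thumb_tip, index_tip) -> Optional[GestureEvent]:
        dist = ((thumb_tip.x - index_tip.x) ** 2 +
                (thumb_tip.y - index_tip.y) ** 2) ** 0.5
        if not self.active and dist < self.start_threshold:
            self.active = True
            return GestureEvent("pinch_start")
        if self.active and dist > self.end_threshold:
            self.active = False
            return GestureEvent("pinch_end")
        return None
```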

These events are passed through a configurable mapping layer, where each recognised gesture is associated with a particular action such as mouse clicking, media control, scrolling, or window navigation. The use of an external configuration layer allows gesture–action bindings to be changed without modifying the underlying detection logic, while the priority system ensures that if multiple gestures are recognised at once, the most important valid action is chosen consistently.
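A sketch of what such an external binding layer can look like, assuming a simple JSON config and a registry of action callables. The file name, schema, and action names below are hypothetical rather than AirControl's actual ones:

```python
import json

# Hypothetical gestures.json:
# { "pinch": "left_click", "fist": "mute_toggle", "vulcan_salute": "quit" }

ACTIONS = {
    "left_click":  lambda: print("left click"),   # a mouse-control call in the real system
    "mute_toggle": lambda: print("toggle mute"),
    "quit":        lambda: print("quit"),
}

def load_bindings(path: str = "gestures.json"):
    """Map gesture names to action callables from an external config file."""
    with open(path) as f:
        raw = json.load(f)
    # Rebinding a gesture only requires editing the config, not the detectors.
    return {gesture: ACTIONS[action] for gesture, action in raw.items()}
```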

The core architecture is now complete, and the system supports a stable set of gestures for clicking, dragging, scrolling, volume control, media playback, cursor mode switching, window navigation, muting, screenshots, calling, and quitting. Recent improvements focused on reducing gesture overlap, refining smoothing, and introducing mode-like interaction locks for drag and scroll so that once those actions begin, competing gestures are suppressed until the interaction ends.
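The smoothing mentioned above amounts to low-pass filtering the tracked fingertip before it drives the cursor; a simple exponential moving average is sketched below (the filter AirControl actually uses may differ):

```python
class CursorSmoother:
    """Exponential moving average over normalised fingertip coordinates.
    A higher alpha follows the hand more quickly; a lower alpha suppresses jitter."""

    def __init__(self, alpha: float = 0.35):
        self.alpha = alpha
        self.x = None
        self.y = None

    def smooth(self, x: float, y: float):
        if self.x is None:
            self.x, self.y = x, y
        else:
            self.x = self.alpha * x + (1 - self.alpha) * self.x
            self.y = self.alpha * y + (1 - self.alpha) * self.y
        return self.x, self.y
```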

System Pipeline

In outline, each frame flows through four stages: the webcam image is converted into hand landmarks by MediaPipe, gesture detectors turn the landmark stream into discrete gesture events, the mapping layer resolves those events to an action according to priority, and the selected command is executed.

Design Details

Separation of perception and action

A key design decision was to separate perception from action. The logic that detects a gesture is kept distinct from the logic that decides what that gesture should do. This separation matters because it decouples how input is interpreted from how it is used, making the system easier to extend, maintain, and experiment with. Without this separation, adding a new gesture or changing an existing behaviour would require changes throughout the codebase rather than in a single layer.
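One way to express that boundary in code, using hypothetical interface names (the project's actual classes may differ):

```python
from typing import Optional, Protocol

class GestureDetector(Protocol):
    """Perception side: consumes hand landmarks, emits named gesture events."""
    def update(self, landmarks) -> Optional[str]: ...

class Command(Protocol):
    """Action side: executes one mapped behaviour and knows nothing about detection."""
    def execute(self) -> None: ...

# A new gesture means a new GestureDetector; a new behaviour means a new Command.
# Neither side needs to change when the other is extended.
```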

Priority-based conflict resolution

Because gesture-based input is inherently noisy and ambiguous, multiple gestures may occasionally appear valid at the same time. To handle this, AirControl uses a priority-based conflict resolution mechanism. Each gesture is assigned a priority level, and when competing inputs occur, the system executes only the highest-priority valid action. In addition, stateful interaction locks are used for drag and scroll: once one of these interactions begins, only the events required to continue or end that interaction are allowed through. This makes the interface behave like a controlled system rather than a collection of independent detectors.
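A hedged sketch of how such an arbiter can combine priorities with interaction locks; the event names, priority values, and locking rules below are illustrative, not the project's actual table:

```python
class EventArbiter:
    """Highest-priority event wins; an active drag or scroll locks out
    unrelated gestures until the matching end event arrives."""

    PRIORITIES = {
        "drag_start": 90, "drag_move": 90, "drag_end": 90,
        "scroll_start": 80, "scroll_move": 80, "scroll_end": 80,
        "pinch": 50, "open_palm_hold": 30,
    }

    def __init__(self):
        self.lock = None  # None, "drag" or "scroll"

    def choose(self, events):
        if self.lock is not None:
            # While locked, only events that continue or end the interaction pass.
            events = [e for e in events if e.startswith(self.lock)]
        if not events:
            return None
        winner = max(events, key=lambda e: self.PRIORITIES.get(e, 0))
        prefix, _, suffix = winner.partition("_")
        if suffix == "start":
            self.lock = prefix
        elif suffix == "end":
            self.lock = None
        return winner
```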

Continuous integration and maintainability

To support ongoing development, the project also includes a continuous integration workflow that runs automated checks on each push and pull request. In practice, this means linting and tests are executed automatically in a clean environment, helping catch regressions early and making the codebase easier to maintain as new gestures and interaction rules are added. This was particularly useful once the system grew beyond a simple demo and began to involve multiple detectors, command plugins, and stateful interaction policies.

Gesture Set

Gesture                              Action
Pinch                                Left click
Middle pinch                         Right click
Ring pinch start / move / end        Drag start / drag move / drag end
Pinky pinch (thumb + pinky)          Toggle cursor mode
Index–middle pinch (hold + move)     Scroll mode (continuous up/down scrolling)
Middle pinch swipe left / right      Switch windows / spaces
Fist                                 Mute toggle
Volume up hold                       Increase volume
Volume down hold                     Decrease volume
Open palm hold                       Play / pause media
Horizontal yo                        Take screenshot
Call sign (thumb + pinky extended)   Start call
Vulcan salute                        Quit application

References

  1. Google, MediaPipe Solutions Guide, available at ai.google.dev/edge/mediapipe/solutions/guide.