
Is human motion tracking enough to build digital poka yoke systems?

Posted by Zeeshan Zia, PhD

Most assembly processes involve a wide range of "freewheeling steps" that cannot be detected by motion tracking alone.

Motion capture (mocap) technology has been used by Hollywood for decades to transfer real human motion onto CG characters. Microsoft commoditized this technology in 2010 with its Kinect sensor, often bundled with Xbox gaming consoles. Our founding team made fundamental contributions to the science behind capturing 3D wireframes from ordinary cameras in the 2000s and early 2010s, and won several awards for it, including a Microsoft Research Best Paper Award at the prestigious IEEE Workshop on 3D Representation and Recognition 2011 (3dRR-11). We applied this technology to precise hand tracking in Microsoft HoloLens, and were granted patents on it [1] [2].

In the industrial context, vendors such as LightGuide and Invisible AI have used motion tracking technology for worker training on simple table top processes. You can see such a demonstration in the following video.

Play Video

Note how parts have to be placed at precise locations on the table top in this demonstration. The system tracks steps in the process by checking whether the worker's hand reaches a precise location on the surface.

Such hard constraints are acceptable for training purposes in limited situations, but most assembly processes have a more freewheeling nature: parts are attached in the “air”, at a fast pace, often out of order, and sometimes simultaneously, e.g. four screws picked up at the same time in one hand while the other hand runs the torque gun to drive them in.

Meeting takt time, i.e. the assembly time available per unit if production is to keep pace with demand, is of paramount importance on the assembly line, and placing rigid constraints on how the worker performs a process does more harm than good.
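For readers less familiar with the term, takt time is conventionally computed as the available production time divided by customer demand. The short sketch below illustrates the arithmetic; the shift length and demand figures are hypothetical, and the function name is purely illustrative.

```python
def takt_time_seconds(available_seconds: float, units_demanded: int) -> float:
    """Takt time = available production time / customer demand."""
    if units_demanded <= 0:
        raise ValueError("units_demanded must be positive")
    return available_seconds / units_demanded


# Hypothetical shift: 7.5 hours of net working time, demand of 450 units.
shift_seconds = 7.5 * 3600
print(takt_time_seconds(shift_seconds, 450))  # 60.0 seconds per unit
```

With only 60 seconds per unit in this example, any extra constraint on how the worker moves eats directly into the time budget.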

Now contrast the assembly process above with an actual process on the line, being tracked by our Pathfinder platform.

Play Video

Ours is a universal solution that can capture subtle details of a manual activity and is insanely simple to set up.

Even in this “table top” assembly process, which takes place in a Shenzhen factory, you see parts being unwrapped above the table surface, and the unit being moved freely on the table and cleaned by blowing air on it. Notice how naturally the worker is working and how subtle several of the steps are, yet our solution is able to track them, as shown in bold on the left-hand side.

We estimate that about 80% of manufacturing processes are like this one, where motion tracking alone is insufficient.

This is precisely our raison d'être: to provide a universal solution that can capture subtle details of a manual activity and is insanely simple to set up [3]. In fact, one of the most sophisticated automotive manufacturers on the planet evaluated the motion tracking approach against ours, and found us to be far ahead of the competition.

So how do we do it? What’s special about our approach?

Our past two decades of applied research have led us to three key technical ideas that allow us to build universal “digital poka yoke” systems. These are:

1. Context is important: Human motion alone isn’t enough to understand a complex activity. The same motion cues can mean different things depending on the objects the worker is manipulating. We have invented technology that “discovers” the objects that the worker interacts with (without requiring any object-level labels for setup), while interpreting the precise style of “grasping”. Here’s a video that showcases the concept (using a third-party tool for visualization only).
Play Video

2. Avoid hard rules: Another lesson from our experience building robust computer vision systems is to always avoid fixed constraints. We have developed novel technology for aligning a live video against the entire set of demonstration (“training”) videos in a “soft” manner, i.e. within a neural network, where our system internally holds hundreds of probabilities that help it cover all possible ways in which a certain step could be performed. The core ideas have been peer-reviewed and published at the top conference in the field [4]. We summarize the concept in the following video.

Play Video

3. Continuously improve models: Machine learning models are never perfect on day 1. Those who claim to have perfect models are lying to you! Our emphasis is on quick deployment that gets you 90-95% of the way in a week, with a system that then continuously learns and improves beyond 99.9999% accuracy in the following weeks and months as it gets the chance to observe the same process further.

A key requirement for self-improving models is to create algorithms that automatically understand the structure of a task and can transfer labels from previously annotated examples to new data, so that the models can be continuously re-trained. We achieve this through unsupervised video learning, described at length in our technical paper [5].

Play Video
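To make ideas 2 and 3 a little more concrete, here is a toy sketch of what "soft" matching and label transfer can look like. It is not our production system, which does this inside a neural network. It assumes each video frame has already been turned into an embedding vector, and all names and numbers below (`step_probabilities`, `transfer_label`, the 2-D vectors, the 0.9 threshold) are hypothetical choices for illustration. Instead of a hard rule, each live frame gets a probability for every step, and a label is transferred to new data only when the match is confident.

```python
import math


def softmax(scores, temperature):
    """Turn similarity scores into probabilities; lower temperature is sharper."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def step_probabilities(live_frame, demo_frames, demo_steps, temperature=0.1):
    """Softly assign one live frame to process steps.

    demo_frames: embeddings of frames taken from demonstration videos.
    demo_steps:  the annotated step for each demonstration frame.
    """
    # Dot-product similarity between the live frame and every demo frame.
    scores = [sum(a * b for a, b in zip(live_frame, d)) for d in demo_frames]
    probs = {}
    for weight, step in zip(softmax(scores, temperature), demo_steps):
        probs[step] = probs.get(step, 0.0) + weight
    return probs


def transfer_label(live_frame, demo_frames, demo_steps, min_confidence=0.9):
    """Return a pseudo-label for a new frame, or None if the match is uncertain."""
    probs = step_probabilities(live_frame, demo_frames, demo_steps)
    step, confidence = max(probs.items(), key=lambda kv: kv[1])
    return step if confidence >= min_confidence else None


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
demo_frames = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
demo_steps = ["pick_screw", "pick_screw", "drive_screw"]

print(step_probabilities((0.8, 0.2), demo_frames, demo_steps))
print(transfer_label((0.8, 0.2), demo_frames, demo_steps))  # pick_screw
print(transfer_label((0.5, 0.5), demo_frames, demo_steps))  # None (ambiguous)
```

The ambiguous frame is left unlabeled rather than forced into a step, so in a scheme like this only confident matches would be fed back as new training data.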

An additional benefit of modeling manual processes with the above three insights is that we are able to incorporate corner cases in a consistent framework. This is a strategy that our team learned from shipping AI systems in a domain fraught with “long tail” problems, i.e. self-driving.

To summarize, human motion tracking is a valuable tool in many domains. We were amongst the pioneers of democratizing that technology. But unfortunately, it's far from sufficient when it comes to guiding assembly workers in 80% of processes, due to the large number of variations between workers. We have a superior solution that not only accommodates but thrives on these variations.