Sixth Gen Person Following Robot

Paper and Code
Below is a broad overview of my thesis. To read the full paper, click here.

This thesis tested the viability and effectiveness of using stereoscopic imaging with cheap components for person detection and following. Stereoscopic cameras can produce a depth map, much like active imaging systems such as the Microsoft Kinect, but are not subject to the same environmental limitations and can be configured to work in all types of lighting. A stereoscopic imaging rig was built from two wide-angle USB cameras and was driven by a low-power compute platform.

The Histogram of Oriented Gradients algorithm was used as the primary means of person detection due to its low false positive rate, invariance to color and lighting, and ability to detect humans in various poses. Each frame was further processed with a circle Hough transform to detect the head position, which was used to fine-tune the person's location and aid in removing false positives. Finally, a feature vector describing the subject's shirt and pants color was constructed, which was used to identify the primary tracking subject within a group.

The robot was tested in indoor and outdoor environments, under varying lighting conditions, and with a varying number of people in the scene. Testing revealed that the robot was able to work in both environments with a simple lens swap and without changes to the software. Differences in terrain type proved to have no effect on performance. The results showed that stereoscopic imaging can be a cheap, robust and effective solution for person following.

A person following robot primarily requires two things: person detection and person identification. Person detection (segmentation) is the process of imaging and analyzing the scene for the presence of humans. Person identification is the process of figuring out who is who and, in the case of a person following robot, which human is the person that should be followed.

As the title of the page implies, this project follows five prior generations of person following robot, each of which employed a different strategy for person detection and identification. The first three generations attempted to segment humans from the rest of the scene by clothing color. Essentially, the robot would expect to see someone with a particular color of shirt and threshold the image for that color. Any large, relatively shirt-shaped blob was assumed to be the person that the robot would follow. The problems with this strategy were numerous: multiple people with the same shirt color would confuse the robot; changes in lighting would make the shirt appear to be a color the robot did not expect; and a background similar in color to the shirt would cause the thresholding to fail.

The fourth generation robot attempted to overcome the limitations of the first three by employing shape detection, which would be invariant to background colors and variable lighting. The software used Hough transforms in an attempt to detect human heads by looking for circular objects in the scene. However, it was quickly discovered that false positives would be a problem, as there are many circular objects in the world which are not human heads. To improve reliability, the software also looked beneath each circular object for a rectangular torso of a specific color. Testing revealed that torso detection did indeed greatly improve reliability, but the software was never tested on a physical robot chassis: it was simply too slow, running on the order of frames per minute rather than frames per second.

The fifth generation robot utilized the Microsoft Kinect to track humans in three dimensions. Since the Microsoft Kinect came with proprietary libraries for human detection and tracking, it seemed like an ideal solution. However, testing showed that the Kinect was tailored for use in a very specific environment. The Kinect failed to detect humans outdoors or near natural light, as infrared from the sun would interfere with the Kinect's own infrared beam, rendering the robot blind. The proprietary libraries also appeared to be tailored for detecting humans in a specific orientation: upright and facing toward the Kinect sensor. The Kinect often failed to detect people in profile or facing away from the robot. The fifth generation robot also used the iRobot Create (Roomba) platform for mobility, which turned out to be less than ideal. Since the chassis was small and lightweight, the robot could not carry anything more than its computer and imaging hardware. Furthermore, when traveling over concrete, carpet, or any rough terrain, the robot would induce judder into the video stream, preventing the Kinect libraries from detecting humans.

Problem Space
The sixth generation person following robot attempted to overcome all of the problems of the prior generations. The goals of the sixth generation person following robot project were to create a robot that:
  • Can detect people in variable lighting conditions and against complex or similarly colored backgrounds.
  • Can detect people in all orientations.
  • Can detect people that are partially occluded by other objects.
  • Can detect and identify multiple people in the scene.
  • Can automatically recover from tracking loss.
  • Works on any type of terrain.
  • Is power efficient and can run on mobile hardware.
  • Is cheap to build as the project was self-funded.

The Segway RMP chassis was chosen as the robot chassis as it is large, heavy, and powerful enough to carry the imaging and compute hardware, along with any other "baggage." Since it is a two wheeled chassis it also has the benefit of having a very small turn radius, making it ideal for indoor environments with tight corridors and sharp angles where a four wheeled chassis would have difficulty.

The Segway RMP came with a small computer that was underpowered for the task of both driving the chassis and running the computer vision software. Thus, the NVIDIA Jetson TK1 board was selected as the main compute hardware. The TK1 contains a quad-core ARM Cortex-A15 CPU running at 2.3 GHz and, more importantly, a 192-core CUDA-capable GPU. The board is ideal for power-efficient computer vision tasks, as it draws little power under load (~10 watts) while providing a relatively powerful GPU for image processing and CV tasks.

Finally, a custom stereoscopic imaging rig was built from two wide-angle USB camera modules mounted to an aluminum channel.

The software was modular and contained several stages of processing, not all of which were guaranteed to run depending on the circumstances. The first stage was acquiring images from the stereoscopic imaging rig and processing them for depth information. Once an image was acquired from the left and right cameras, a standard stereo block-matching algorithm was used to construct a depth map, from which the distance to any object in the scene could be obtained. The depth map allowed the robot to understand its environment in three dimensions, and to track people in three-dimensional space without the need to actively image the scene by projecting a laser or infrared beam.
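The thesis names standard stereo block matching; a naive version of that step, plus the disparity-to-depth conversion, might look like the sketch below. The focal length and baseline values are illustrative assumptions, not the rig's actual calibration.

```python
import numpy as np

def disparity_map(left, right, block=5, max_disp=16):
    """Naive stereo block matching: for each left-image pixel, find the
    horizontal shift of the best-matching block in the right image by
    minimizing the sum of absolute differences (SAD)."""
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(half, h - half):
        for x in range(half + max_disp, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1].astype(np.int32)
            best_sad, best_d = None, 0
            for d in range(max_disp):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1].astype(np.int32)
                sad = np.abs(patch - cand).sum()
                if best_sad is None or sad < best_sad:
                    best_sad, best_d = sad, d
            disp[y, x] = best_d
    return disp

def depth_from_disparity(disp, f_px=700.0, baseline_m=0.12):
    """Pinhole stereo model: Z = f * B / d; zero disparity maps to infinity."""
    return np.where(disp > 0, f_px * baseline_m / np.maximum(disp, 1), np.inf)
```

A production version would run on a rectified image pair and use an optimized matcher (e.g. OpenCV's StereoBM), but the geometry is the same: nearer objects produce larger disparities.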

Next, the Histogram of Oriented Gradients algorithm, a human-shape detection technique, was run on the left camera's output to discover any humans in the scene. Each detected human was entered into a tracker that tracked not only their location in three dimensions, but also attributes such as clothing color, head position, trajectory, and bounding box size. If the tracker was already populated after the image acquisition phase, the HOG algorithm would only check the expected locations of previously detected humans in the newly acquired image, saving precious compute time.
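HOG's invariance to color and absolute brightness comes from describing each window by gradient orientations rather than raw pixel values. A minimal sketch of the feature computation is below; this is not the full detector, which pairs these features with a trained SVM and a sliding window, and the cell/bin sizes are just the common defaults.

```python
import numpy as np

def hog_features(img, cell=8, bins=9):
    """Simplified Histogram of Oriented Gradients for a grayscale image:
    per-cell histograms of gradient orientation, weighted by gradient
    magnitude and L2-normalized (real HOG normalizes over blocks of cells)."""
    gx = np.zeros(img.shape)
    gy = np.zeros(img.shape)
    gx[:, 1:-1] = img[:, 2:].astype(float) - img[:, :-2]   # central differences
    gy[1:-1, :] = img[2:, :].astype(float) - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180             # unsigned orientation
    h, w = img.shape
    feats = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            a = ang[cy:cy + cell, cx:cx + cell].ravel()
            m = mag[cy:cy + cell, cx:cx + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))
    return np.concatenate(feats)
```

Because the histogram is built from gradient directions and normalized, a person in a dark shirt against a bright wall and the same person in shadow produce similar feature vectors.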

As the HOG algorithm sometimes produced false positives, the circle Hough transform was run on the upper third of each human's bounding box to detect the presence of a head. If a head was detected, the person's tracker entry was updated with the size and location of the head. Another technique used to remove false positives was a plausibility filter in the tracker: it automatically rejected any submission that was large in the frame but distant (such as a building), or too small in the frame (either a distant human or probably not a human at all).
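The plausibility filter follows from the pinhole model: a detection's real-world height is roughly box height × distance ÷ focal length. A sketch of that check, plus the head search region, is below; the focal length and height thresholds are illustrative assumptions, not the thesis's values.

```python
def head_region(box):
    """Upper third of a person's bounding box (x, y, w, h): the region
    searched with the circle Hough transform for a head."""
    x, y, w, h = box
    return (x, y, w, h // 3)

def plausible_person(box_h_px, dist_m, f_px=700.0, min_h_m=1.0, max_h_m=2.2):
    """Reject detections whose apparent size is inconsistent with a human
    at the measured depth: large-but-distant objects (buildings) or tiny
    blobs (noise, or a human too far away to track)."""
    est_height_m = box_h_px * dist_m / f_px
    return min_h_m <= est_height_m <= max_h_m
```

With f_px = 700, a 400-pixel-tall detection at 3 m implies a ~1.7 m person (accepted), while the same box at 30 m implies a 17 m "person" (rejected).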

Finally, a color-matching routine was run on every tracker entry to determine the tracking target. The robot included a training program in which the software sampled color data from the intended tracking target's pants and shirt. At each iteration, a color-matching routine determined which person's clothing colors matched the training data most closely. Heavy bias was also given to the person's trajectory and to whether they were previously selected as the tracking target, minimizing the chance that a similarly colored person crossing paths with the target would confuse the robot. In effect, the robot remembered whom it was following, what colors they wore, and where they were going in three-dimensional space. The software also alpha-blended the training data with color information from newly acquired frames to mitigate the effects of lighting changes as the person moved from one lighting environment to another.
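The color feature vector, nearest-match selection, and alpha-blended template update might be sketched as follows. The region split, color space, and blend factor are illustrative assumptions, and the real routine also weighted trajectory and prior selection, which this sketch omits.

```python
import numpy as np

def clothing_features(img, box):
    """Mean color of the torso (shirt) and lower half (pants) of a
    person's bounding box in an H x W x 3 image, concatenated into
    one 6-element vector."""
    x, y, w, h = box
    shirt = img[y + h // 5 : y + h // 2, x : x + w]
    pants = img[y + h // 2 : y + h, x : x + w]
    return np.concatenate([shirt.reshape(-1, 3).mean(axis=0),
                           pants.reshape(-1, 3).mean(axis=0)])

def match_target(candidates, template):
    """Index of the tracker entry whose clothing colors are nearest the
    trained template (Euclidean distance in color space)."""
    return int(np.argmin([np.linalg.norm(c - template) for c in candidates]))

def update_template(template, observed, alpha=0.05):
    """Alpha-blend the template toward the latest observation so gradual
    lighting changes do not break the color match."""
    return (1 - alpha) * template + alpha * observed
```

A small alpha lets the template drift with slow lighting changes (walking from indoors to sunlight) without letting one bad frame overwrite the trained colors.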

Once the correct tracking target was identified, the robot would calculate a direction and speed based on the person's horizontal offset in the frame and their distance as read from the depth map, and signal the Segway's computer with motion commands. The Segway's computer would translate the high level motion commands into low level motor commands, which it passed along to the Segway chassis to put the robot into physical motion.
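The high-level motion command can be read as a simple proportional controller on the target's horizontal offset and distance. The gains and follow distance below are illustrative, not the thesis's tuned values.

```python
def motion_command(target_cx_px, frame_w_px, dist_m,
                   follow_dist_m=1.5, k_turn=1.0, k_speed=0.5, max_speed=1.0):
    """Turn rate proportional to how far the target sits from frame
    center; forward speed proportional to the distance error, clamped
    so the robot stops at the follow distance and never reverses."""
    offset = (target_cx_px - frame_w_px / 2) / (frame_w_px / 2)  # -1 .. 1
    turn = k_turn * offset
    speed = max(0.0, min(max_speed, k_speed * (dist_m - follow_dist_m)))
    return speed, turn
```

These high-level (speed, turn) commands are what the Segway's onboard computer would translate into low-level motor commands.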

If the robot lost tracking of the target, the robot would turn in the person's last known direction. So, if the person went out of the robot's field of view by running left, the robot would turn to the left to reacquire a lock.
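That recovery behavior amounts to remembering the sign of the target's last horizontal offset; a minimal sketch follows, with the turn rate being an assumed value.

```python
def recovery_command(last_offset_px, turn_rate=0.5):
    """On tracking loss, stop forward motion and rotate in place toward
    the side of the frame where the target was last seen (negative
    offset = left of center, so turn left)."""
    return 0.0, (turn_rate if last_offset_px > 0 else -turn_rate)
```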

Tests were performed in indoor and outdoor environments, with single and multiple people. The results showed that the robot was:
  • Able to track people with almost zero false positives.
  • Able to function on tile, concrete or grass, with no discernible effect on detection and tracking performance.
  • Able to track a single target even when multiple people were present.
  • Able to adequately recover from tracking loss, and worked against complex backgrounds (bushes, distant buildings) and similarly colored backgrounds (target wore a white shirt against white walls).

The software was also very efficient, running at 15 to 25 frames per second on mobile hardware, and the robot was cheap to build, with the components (excluding the Segway chassis) costing approximately $350. The sixth generation person following robot was able to achieve all of its goals.