Wednesday, December 17, 2008

Stereo Vision System ("SVS") by Surveyor Inc.

Surveyor Inc. presents a truly low-cost (approx. USD 550) stereo vision module. Here is an article (containing the press release) and the link at Surveyor.com.

Monday, June 9, 2008

The 2004 paper by Ong and Bowden ("A Boosted Classifier Tree for Hand Shape Detection")


Introduction

This is my summary of the 2004 paper by Eng-Jon Ong and Richard Bowden, "A Boosted Classifier Tree for Hand Shape Detection". In this paper, the authors propose a method that does two things:
  • detects the presence of human hands within an image, and
  • classifies the hand shape (when detected within an image).

Preparing image datasets


The authors first collected a set of various video sequences. From this set, 5013 images containing human hands were automatically extracted (segmented out) by modeling skin tones with a single Gaussian and isolating regions of high skin probability.

This set of 5013 images was then split into two subsets:
  • Set A: 2504 images for training, and
  • Set B: 2509 images for testing.

Training the general hand detector


Using the set of 2504 images for training, a general hand detector was trained using FloatBoost. A cascade of 11 stages, with a total of 634 weak classifiers, was obtained. This general hand detector was then tested on the set of 2509 images for testing, and the detection error was 0.2%.


Training the specialized shape detectors

Both sets, A (2504 training images) and B (2509 testing images), are now combined into set C. This set C of 5013 images containing human hands is then clustered (grouped) into 300 sets of images ("hand image clusters") that share a common appearance (see the figure below, which depicts 10 example hand image clusters). Again, this clustering was an automatic process. Hands were represented with a set of so-called shape context features. These features have the advantage that they are robust to noise and alignment.

Figure: an example of ten automatically extracted hand image clusters,
using shape context features and K-medoid clustering.

From these 300 clusters, 900 images were then randomly selected (on average, 3 images from each cluster). These 900 images constitute the hand shape test database. Thus 5013 - 900 = 4113 images remained for FloatBoost training (4113/300 = 13.71 images per cluster on average).

A cascade of strong classifiers was then FloatBoost-trained on the images of each of these 300 clusters. For each cluster, the images in the remaining 299 clusters provided the "non-hand-shape" images. The authors found that the average error for these detectors on the shape test database was 2.6%.


Combining the general hand detector and specialized shape detectors

Having now both the general hand detector, and specialized shape detectors, we combine them together as depicted in the figure below:

Thursday, June 5, 2008

Width/height ratios of training samples in Viola-Jones method?

Do the width/height ratios of training samples in the Viola-Jones method have to follow the natural ratios of the photographed objects? For example, let's say that the bounding rectangle of a certain hand gesture has a ratio of 1.0. Does it matter if the training samples are defined to have a ratio of 1.2? Does this ratio of 1.2 affect detection rates?

Friday, March 21, 2008

Motorola concept phone with a stereo camera

Motorola has developed an interesting concept phone featuring a stereo camera (link 1, link 2), presumably to be able to compute disparity maps?

Saturday, March 15, 2008

Explaining output from haartraining.exe

This is the table header printed by haartraining.exe for the current stage of a cascade:

+----+----+-+---------+---------+---------+---------+
|  N |%SMP|F|  ST.THR |    HR   |    FA   | EXP. ERR|
+----+----+-+---------+---------+---------+---------+
  • N = number of the current feature in this stage (seq->total),
  • %SMP = percentage of samples used, if trimming is enabled (v_wt); written as %%SMP in the printf format string,
  • F = '+' if flipped, when symmetry is specified (v_flipped), '-' otherwise,
  • ST.THR = stage threshold,
  • HR = hit rate based on the stage threshold (v_hitrate / numpos),
  • FA = false alarm rate based on the stage threshold (v_falsealarm / numneg),
  • EXP. ERR = strong classification error of the AdaBoost algorithm,
    based on threshold = 0 (v_experr).

Friday, March 14, 2008

Some notes on Viola-Jones detection method

Viola-Jones detection method:
  • A feature (or feature instance) is an instance of feature prototype (feature template)
  • A weak classifier is based on one feature
  • A strong classifier (also called boosted classifier, or stage) is composed of a number of weak classifiers
  • A cascade of boosted classifiers is, then, composed of several strong classifiers, or stages
  • In a cascade:
    • 1st strong classifier (has two features only) detects almost 100% of subwindows containing the object, but also passes 40% of all subwindow candidates that do not contain the object
    • 2nd strong classifier (five features) detects almost 100% of subwindows containing the object, but passes only 20% of non-object subwindows
    • 3rd strong classifier (twenty features) detects almost 100% and passes merely 10% of non-object subwindows
    • ... and so on
  • Therefore, in a cascade, strong classifiers become more and more complex, and more and more "picky", and it gets progressively more difficult for a candidate subwindow to pass through the entire cascade, thus "passing the test"
  • Feature templates (prototypes) can be scaled independently horizontally and vertically, in order to obtain feature instances:
    • the minimum sizes are defined by the template's original resolution (e.g. 25x25 pixels) and
    • the maximum sizes by the frame (window) size (e.g. 640x480 pixels)
  • If the false positive rate is 0.01 (10^-2), we falsely detect the object in 1 out of 100 frames; at 30 FPS, we have one false positive every 100/30 = 3.3 seconds
  • If the false positive rate is 0.0001 (10^-4), we falsely detect the object in 1 out of 10,000 frames; at 30 FPS, we have one false positive every 10,000/30 = 333.3 seconds = 5 min 33.3 s
  • If the false positive rate is 0.000001 (10^-6), we falsely detect the object in 1 out of 1,000,000 frames; at 30 FPS, we have one false positive every 1,000,000/30 = 33,333.3 seconds, or about 9 hours 15 min
  • For a cascade of classifiers, the false positive rate F of the cascade is

    F = f1*f2*...*fK

    where K is the number of classifiers, and fi is the false positive rate of the i-th classifier on the examples that get through to it.
  • For a cascade of classifiers, the detection rate D of the cascade is

    D = d1*d2*...*dK

    where K is the number of classifiers, and di is the detection rate of the i-th classifier on the examples that get through to it.
  • For a strong classifier, the proportion of sub-windows that are labelled as potentially containing the object is effectively equal to the false positive rate, because the overwhelming majority of sub-windows in an image do not contain the object.
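
The two product formulas and the frames-per-second arithmetic in the notes above can be tried out with a few lines of C++. This is only a sketch: the function names cascade_rate and seconds_between_false_positives are made up for illustration, and the second function relies on the same simplification as the notes, namely that the false positive rate is counted per frame.

```cpp
#include <cassert>
#include <vector>

// Overall rate of a cascade: the product of the per-stage rates.
// The same product serves for F (false positive rate) and D (detection rate).
double cascade_rate(const std::vector<double>& stage_rates) {
    double rate = 1.0;
    for (double r : stage_rates) {
        assert(r >= 0.0 && r <= 1.0);  // each stage rate is a probability
        rate *= r;
    }
    return rate;
}

// Average number of seconds between two false positives, assuming
// (as in the notes) that the false positive rate is counted per frame.
double seconds_between_false_positives(double false_positive_rate, double fps) {
    assert(false_positive_rate > 0.0 && fps > 0.0);
    return 1.0 / (false_positive_rate * fps);
}
```

For example, cascade_rate({0.4, 0.2, 0.1}) multiplies the three stage rates from the list above and gives F = 0.008, while ten stages with a detection rate of 0.99 each give D of roughly 0.904. This shows why every single stage must keep its detection rate very close to 100%.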

Monday, December 17, 2007

3DV Systems' ZCam depth-sensing camera

An interesting development from 3DV Systems, an Israeli company... They've developed ZCam, a depth-sensing monocular camera (which is also allegedly cheap --- how cheap is not yet known, however). What's better, it operates with infrared, so all the known problems with natural illumination are circumvented in a potential application.




Wednesday, September 12, 2007

Extracting (x, y) coordinates from a contour

After one extracts the contour of an object by calling

num_contours = cvFindContours(img, storage, contour ... )

the question frequently arises: how to extract the list of all (x, y) pixel coordinates defining the contour? The answer:

// Walk every contour in the tree (down to maxLevel) and print its points.
int maxLevel = 3;
CvTreeNodeIterator iterator;
CvPoint pt;
cvInitTreeNodeIterator( &iterator, contour, maxLevel );
while( (contour = (CvSeq*)cvNextTreeNode( &iterator )) != 0 )
{
    CvSeqReader reader;
    int count = contour->total;   // number of points stored in this contour
    cvStartReadSeq( contour, &reader, 0 );
    for( int i = 0; i < count; i++ )
    {
        CV_READ_SEQ_ELEM( pt, reader );
        // print the coordinates with separators so that they stay readable
        std::cout << "(" << pt.x << ", " << pt.y << ")" << std::endl;
    }
}

Monday, July 16, 2007

Effects of cvFlip()

As per OpenCV documentation, the function cvFlip flips an image in one of three different ways, depending on the value of its parameter flip_mode:
  • = 0 flip image around x-axis
  • < 0 flip image around both axes, x-axis and y-axis
  • > 0 flip image around y-axis

Interestingly, the effect I am getting is:
  • = 0 no effect
  • < 0 flip image around y-axis
  • > 0 flip image around both axes, x-axis and y-axis

Examples:

Figure: flip_mode = -1. Instead of flipping around both x and y axes, it flips around y only.



Figure: flip_mode = 0. Instead of flipping around the x axis, it has no effect.



Figure: flip_mode = +1. Instead of flipping around the y axis, it flips around both x and y axes.


The answer to this apparent paradox is the following: the first photo (in all three pairs of photos above) was grabbed by a camera, and it happens that in this case OpenCV returns an image with a bottom-left (BL) origin (which is different from the default in OpenCV, where all images have top-left (TL) origin).

Later, when I copy the grabbed BL image into another TL image, OpenCV automatically flips the image around the x-axis, and when I later flip the resulting image with cvFlip(0), the result is that these two operations cancel each other.

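The same, of course, holds for flags > 0 and < 0: the implicit x-axis flip combined with a y-axis flip (flag > 0) looks like a flip around both axes, and combined with a flip around both axes (flag < 0) it leaves only the y-axis flip.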
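
The explanation above can be checked on a tiny 2x2 "image" without OpenCV. The sketch below is an illustration only: flip mimics the documented flip_mode convention on a vector of rows, and copy_bl_to_tl is a hypothetical stand-in for the implicit x-axis flip that happens when the bottom-left-origin frame is copied into a top-left-origin image.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Image = std::vector<std::vector<int>>;  // rows of pixel values

// Mimics the documented flip_mode convention:
//   0  -> flip around the x-axis (reverse the order of the rows),
//  >0  -> flip around the y-axis (reverse each row),
//  <0  -> flip around both axes.
Image flip(Image img, int flip_mode) {
    if (flip_mode <= 0) std::reverse(img.begin(), img.end());
    if (flip_mode != 0)
        for (auto& row : img) std::reverse(row.begin(), row.end());
    return img;
}

// Hypothetical stand-in for the implicit flip during the BL -> TL copy.
Image copy_bl_to_tl(const Image& img) { return flip(img, 0); }

// Reproduces the three observed effects from the post.
void check_observed_effects() {
    const Image img = {{1, 2}, {3, 4}};
    assert(flip(copy_bl_to_tl(img), 0) == img);             // "no effect"
    assert(flip(copy_bl_to_tl(img), -1) == flip(img, 1));   // y-axis only
    assert(flip(copy_bl_to_tl(img), 1) == flip(img, -1));   // both axes
}
```

Calling check_observed_effects() passes all three assertions, which matches the table of observed effects: the hidden x-axis flip is simply composed with whatever cvFlip is asked to do.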

Friday, July 13, 2007

How does OpenCV detect the orientation of a calibration pattern?

As we all know, OpenCV draws the red line first when using the function cvDrawChessboardCorners() (see also this post). To make things clearer w.r.t. the detected orientation of the calibration pattern (in our case, an NxM chessboard where N is odd (even) and M is even (odd)), I ran a test with various configurations and took screenshots. (Click on the figure below to download the full-resolution version.)

Figure: nine different NxM chessboard combinations, with chessboard corners drawn


From the image above, we can draw the following conclusions: if we feed OpenCV an NxM chessboard, where there are N-1 (M-1) corners in the horizontal (vertical) direction, so that the leftmost-topmost square is black, then the N side (horizontal side) of the board is always in the X direction (in other words, corners along the N edge will always have the same color). In particular, the first row of corners in the X direction will be drawn in red, because this is how cvDrawChessboardCorners() works.

Sunday, July 1, 2007

The new Logitech QuickCam Pro 9000

Technological progress marches on... The new Logitech QuickCam Pro 9000 webcam has some impressive specifications: Carl Zeiss optics, HD video at 960x720, a true 2-megapixel sensor... All that for a measly 100 bucks. Check out the Logitech page.

Figure: Logitech QuickCam Pro 9000

Wednesday, June 27, 2007

OpenCV FAQ

(Note: this OpenCV FAQ page is a constant work in progress.)

Q: Can extrinsic parameters and the world's origin be determined uniquely with respect to an MxN checkerboard? What about an NxN checkerboard?
A: Yes, in the case of MxN checkerboard calibration patterns, extrinsic parameters (and consequently, the world's = pattern's origin) can be defined uniquely, no matter how your camera looks at the pattern. However, this is not so for an NxN checkerboard.

Sunday, June 24, 2007

Camera's coordinate system in OpenCV

A couple of notes on the orientation of the camera's coordinate system in OpenCV. Let's say that we used a number of 320x240 photos (containing a calibration pattern, for example a checkerboard) to calibrate a camera. Here is an example of such a photo:



This is an 8x7 square pattern, where each square measures 24.1 x 24.1 millimeters. OpenCV analyzes this photo and detects the corners, with the pattern's (= world's) origin denoted by O in the following figure (green cross):



The coordinates of the origin are (0, 0, 0)world = (-271.6, 111.2, 489.9)camera = (254, 103)image.

The red cross represents the detected principal point (part of the intrinsics data). For the set of calibration photos in question, OpenCV detected the principal point at (177.1, 118.32), which rounds to (177, 118). This is close to the middle of the photo (160, 120), but not quite. This happens frequently due to camera imperfections.

Now let's take another look at the photo above. If x grows to the right and y grows up, one would expect both coordinates x, y of the origin to be positive in the camera's system, right? Wrong. With OpenCV, in the camera's system, x grows to the left and y grows up, which keeps the camera coordinate system right-handed (because +z points away from the camera, into the photo).

Thus, the coordinates of the origin in the camera's system, for this particular case, happen to be x = -271.6 and y = 111.2. By definition, coordinate z is always positive, and in this case z = 489.9.

Figure: Camera's coordinate system (axis z goes into the image)


Interestingly, when we reverse the image, we get the following:
Figure: Camera's coordinate system (axis z goes into the image)


Finally, when we draw in the world's (=pattern's) coordinate system, we get the following (yellow axes):

Therefore, the +z axis of the world coordinate system also points INTO the pattern, so we "can't see the +z axis" because it is obstructed by the pattern itself. Here is the same image, only turned upside down:

Friday, June 22, 2007

Webcams with large field of view (FOV)

Webcams with large field of view (FOV) are useful in computer-vision setups. The following two photos demonstrate the difference in FOV between older and newer webcams (original shots can be found here).


Figure: older webcam (approx. 52º FOV)



Figure: newer webcam (Creative Live! Ultra with 76º FOV)


Obviously not only is the FOV significantly larger, colors are also more vivid and faithful (which can also be taken advantage of in a computer-vision application).

Here are some newer webcams (with FOV in degrees):
... compared with some older (pre-2006) webcams:
  • Creative NX Pro / 52º
  • Logitech ClickSmart 420 / 43º
  • Logitech QuickCam Communicate STX / 42º

Wednesday, June 20, 2007

Printing out camera parameters (intrinsics & extrinsics) after call to cvCalibrateCamera2()

This is one way to print out the intrinsics matrix obtained after a call to cvCalibrateCamera2():

for (int x = 0; x < 3; x++)
{
    printf("\n");
    for (int y = 0; y < 3; y++)
    {
        printf( "K[%i, %i]=%10.6f ", x, y, cvmGet(camera_matrix, x, y) );
    }
}

The result of the loop above is ([fx 0 cx; 0 fy cy; 0 0 1]):

K[0, 0]=498.286392 K[0, 1]= 0.000000 K[0, 2]=153.187943
K[1, 0]= 0.000000 K[1, 1]=605.857215 K[1, 2]= 49.275781
K[2, 0]= 0.000000 K[2, 1]= 0.000000 K[2, 2]= 1.000000

The result above has been obtained after calling the function on the image in here.


The same loop can be used to print out the matrix R. Before you do that, however, you have to transform the rotation vector (as computed by cvCalibrateCamera2()) into matrix form:

CvMat* rot_matrix;
rot_matrix = cvCreateMat( 3, 3, CV_32FC1 );
cvRodrigues2(&rot_vects, rot_matrix);

Now simply use the loop above (just substitute camera_matrix with rot_matrix).
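
For readers curious about what the rotation-vector-to-matrix conversion computes, here is a minimal standalone sketch of the Rodrigues rotation formula, R = cos(theta) I + (1 - cos(theta)) k k^T + sin(theta) [k]x, where theta is the length of the rotation vector and k is its direction. It is not OpenCV's implementation; the names Vec3, Mat3 and rodrigues are chosen here for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Converts a rotation vector r (axis * angle) into a 3x3 rotation matrix.
Mat3 rodrigues(const Vec3& r) {
    const double theta = std::sqrt(r[0] * r[0] + r[1] * r[1] + r[2] * r[2]);
    assert(std::isfinite(theta));  // the rotation vector must be finite

    Mat3 R{};  // zero-initialized
    if (theta < 1e-12) {
        // (Almost) no rotation: return the identity matrix.
        R[0][0] = R[1][1] = R[2][2] = 1.0;
        return R;
    }

    const Vec3 k = {{r[0] / theta, r[1] / theta, r[2] / theta}};  // unit axis
    const double c = std::cos(theta);
    const double s = std::sin(theta);
    // Cross-product (skew-symmetric) matrix of k.
    const double K[3][3] = {{0.0, -k[2], k[1]},
                            {k[2], 0.0, -k[0]},
                            {-k[1], k[0], 0.0}};

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            R[i][j] = c * (i == j ? 1.0 : 0.0) + (1.0 - c) * k[i] * k[j] + s * K[i][j];
    return R;
}
```

As a quick sanity check, a rotation vector of (0, 0, pi/2) describes a quarter turn about the z axis, so the resulting matrix maps the x axis onto the y axis.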

To print out translation vector:

printf("\n translation vector: ");
double T[3];
CvMat _T = cvMat( 1, 3, CV_64F, T );
cvConvert( &trans_vects, &_T );
printf( "\n T[0]=%10.6f ", T[0] );
printf( "\n T[1]=%10.6f ", T[1] );
printf( "\n T[2]=%10.6f ", T[2] );

cvDrawChessboardCorners(...)

In case the checkerboard pattern has been found/detected in the input image, function cvDrawChessboardCorners() (whose implementation can be found in cvcalibinit.cpp) will traverse the detected corners like this:

int x, y;
CvPoint prev_pt = {0, 0};
const int line_max = 7;
static const CvScalar line_colors[line_max] =
{
    {{0,0,255}},
    {{0,128,255}},
    {{0,200,200}},
    {{0,255,0}},
    {{200,200,0}},
    {{255,0,0}},
    {{255,0,255}}
};

for( y = 0, i = 0; y < pattern_size.height; y++ )
{
    CvScalar color = line_colors[y % line_max];
    if( cn == 1 )
        color = cvScalarAll(200);
    color.val[0] *= scale;
    color.val[1] *= scale;
    color.val[2] *= scale;
    color.val[3] *= scale;

    for( x = 0; x < pattern_size.width; x++, i++ )
    {
        CvPoint pt;

        pt.x = cvRound(corners[i].x*(1 << shift));
        pt.y = cvRound(corners[i].y*(1 << shift));

        if( i != 0 )
            cvLine( image, prev_pt, pt, color, 1, line_type, shift );

        cvLine(
            image,
            cvPoint(pt.x - r, pt.y - r),
            cvPoint(pt.x + r, pt.y + r),
            color,
            1,
            line_type,
            shift );

        cvLine(
            image,
            cvPoint(pt.x - r, pt.y + r),
            cvPoint(pt.x + r, pt.y - r),
            color,
            1,
            line_type,
            shift );

        cvCircle( image, pt, r+(1 << shift), color, 1, line_type, shift );

        prev_pt = pt;
    }
}

Therefore, the function traverses the corners one horizontal row at a time. Furthermore, the following colors will be used to render the rows (note that OpenCV stores color data in BGR format, instead of RGB):
  • 0,0,255 (BGR) = 255,0,0 (RGB) = red
  • 0,128,255 (BGR) = 255,128,0 (RGB) = orange
  • 0,200,200 (BGR) = 200,200,0 (RGB) = light olive
  • 0,255,0 (BGR) = 0,255,0 (RGB) = green
  • 200,200,0 (BGR) = 0,200,200 (RGB) = light blue
  • 255,0,0 (BGR) = 0,0,255 (RGB) = blue
  • 255,0,255 (BGR) = 255,0,255 (RGB) = pink
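
The color lookup can be reproduced in a few lines. The sketch below copies the seven BGR triples from the excerpt above and swaps the first and last channel to obtain RGB; the names Rgb, kLineColorsBgr and row_color_rgb are invented for this illustration and do not exist in OpenCV.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Rgb { int r, g, b; };

// The seven per-row colors from the excerpt, in OpenCV's B, G, R order.
const std::array<std::array<int, 3>, 7> kLineColorsBgr = {{
    {{0, 0, 255}},   {{0, 128, 255}}, {{0, 200, 200}}, {{0, 255, 0}},
    {{200, 200, 0}}, {{255, 0, 0}},   {{255, 0, 255}}
}};

// Color used for a given row of corners, converted to R, G, B order.
// Rows beyond the seventh wrap around, as in line_colors[y % line_max].
Rgb row_color_rgb(int row) {
    assert(row >= 0);
    const auto& bgr =
        kLineColorsBgr[static_cast<std::size_t>(row) % kLineColorsBgr.size()];
    return Rgb{bgr[2], bgr[1], bgr[0]};
}
```

row_color_rgb(0) yields (255, 0, 0), that is red, and so does row_color_rgb(7), because the palette repeats every seven rows.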

Therefore, the red line designates the X axis of the pattern. Here is an example:



The same image, with corners drawn:



The following corners (total 7*6=42 of them) have been detected by OpenCV:

(0, 0): x, y = 115.00, 181.50
(1, 0): x, y = 112.00, 168.00
(2, 0): x, y = 109.00, 153.50
(3, 0): x, y = 105.50, 138.50
(4, 0): x, y = 101.50, 122.50
(5, 0): x, y = 98.00, 105.50
(6, 0): x, y = 93.00, 86.50
(0, 1): x, y = 136.50, 181.50
(1, 1): x, y = 134.00, 167.50
(2, 1): x, y = 132.00, 153.50
(3, 1): x, y = 129.00, 138.50
(4, 1): x, y = 126.00, 122.50
(5, 1): x, y = 123.00, 105.00
(6, 1): x, y = 120.00, 86.00
(0, 2): x, y = 158.00, 181.50
(1, 2): x, y = 156.50, 167.50
(2, 2): x, y = 155.00, 153.50
(3, 2): x, y = 153.00, 138.50
(4, 2): x, y = 151.00, 122.50
(5, 2): x, y = 149.00, 104.50
(6, 2): x, y = 146.50, 85.50
(0, 3): x, y = 180.50, 181.50
(1, 3): x, y = 180.00, 168.50
(2, 3): x, y = 178.00, 153.50
(3, 3): x, y = 177.50, 138.50
(4, 3): x, y = 176.00, 122.50
(5, 3): x, y = 175.00, 104.50
(6, 3): x, y = 173.50, 85.50
(0, 4): x, y = 202.50, 182.00
(1, 4): x, y = 202.00, 168.50
(2, 4): x, y = 201.50, 153.50
(3, 4): x, y = 201.50, 139.00
(4, 4): x, y = 201.50, 122.00
(5, 4): x, y = 201.50, 104.50
(6, 4): x, y = 201.50, 84.50
(0, 5): x, y = 225.50, 182.50
(1, 5): x, y = 225.50, 168.50
(2, 5): x, y = 226.50, 154.00
(3, 5): x, y = 227.00, 138.50
(4, 5): x, y = 227.50, 121.50
(5, 5): x, y = 228.50, 103.50
(6, 5): x, y = 230.00, 84.00

NOTE: if the size of a field (square, whether white or black) is for example 34x34mm, then corner:
  • (0, 0) has 3D coordinates (0mm, 0mm, 0mm)
  • (1, 0) has 3D coordinates (34mm, 0mm, 0mm)
  • (2, 0) has 3D coordinates (68mm, 0mm, 0mm)
  • ...
  • (0, 1) has 3D coordinates (0mm, 34mm, 0mm)
  • (1, 1) has 3D coordinates (34mm, 34mm, 0mm)
  • (2, 1) has 3D coordinates (68mm, 34mm, 0mm)
  • ...
  • ...
  • ...
  • (0, 5) has 3D coordinates (0mm, 5*34mm, 0mm)
  • (1, 5) has 3D coordinates (34mm, 5*34mm, 0mm)
  • ...
  • (6, 5) has 3D coordinates (6*34mm, 5*34mm, 0mm)
and so on.
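
The list above follows a simple rule: corner (i, j) lies at (i * square, j * square, 0). A minimal sketch that generates all 3D corner coordinates in the same order as the corner listing (first index running fastest) could look as follows; Point3 and chessboard_object_points are hypothetical names, not OpenCV functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point3 { double x, y, z; };

// 3D coordinates (in millimeters) of the inner corners of a flat chessboard.
// corners_x and corners_y are the numbers of corners along the two pattern
// axes, and square_mm is the side length of one square.
std::vector<Point3> chessboard_object_points(int corners_x, int corners_y,
                                             double square_mm) {
    assert(corners_x > 0 && corners_y > 0 && square_mm > 0.0);
    std::vector<Point3> points;
    points.reserve(static_cast<std::size_t>(corners_x) * corners_y);
    for (int j = 0; j < corners_y; j++)        // second corner index
        for (int i = 0; i < corners_x; i++)    // first corner index, runs fastest
            points.push_back(Point3{i * square_mm, j * square_mm, 0.0});
    return points;
}
```

With chessboard_object_points(7, 6, 34.0) the result has 42 entries; entry 1 is (34, 0, 0), entry 7 is (0, 34, 0), and the last entry is (204, 170, 0), i.e. (6*34mm, 5*34mm, 0mm).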

What follows now is the image with corner indexes (i, j) indicated:


Here with coordinate axes:

Axis orientation in OpenCV

If you're going to use a checkerboard to calibrate a camera using OpenCV and its function cvFindExtrinsicCameraParams_64d(), note that OpenCV expects a checkerboard with right-handed orientation. This means that the x axis goes from left to right, the y axis from bottom to top, and the z axis from below up (that is, out of the picture).

Cameras

Abstracted camera: screen - optics - aperture.

Pinhole camera: screen - aperture. (That is, the optics is an identity.) The problem is that the aperture (= pinhole) must be small, and thus the exposure time must be long.

Optical camera: screen - optics - aperture. Aims at producing the same picture as a pinhole camera, but by means of a much larger aperture. It can do that because the optics correctly concentrates all the incoming light rays. Because the aperture is larger, the exposure time is shorter.

Tuesday, June 19, 2007

On Homographies

A homography (also called a collineation) is an invertible mapping h from P^n to P^n such that:
three points x1, x2, x3 lie on the same line <=> h(x1), h(x2), h(x3) also lie on one (possibly different) line.

In 2-space, a mapping h: P^2 -> P^2 is a homography <=> there exists a non-singular 3x3 matrix H such that h(x) = Hx for all points x in P^2. Note that H can be multiplied by any non-zero factor without altering the transformation; thus there are eight independent ratios amongst the nine elements of H, and H has eight DoF.
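
The remark about the scale factor is easy to verify numerically. The sketch below applies a 3x3 matrix to a 2D point through homogeneous coordinates and provides a helper to scale the matrix; Mat3, Point2, apply_homography and scaled are illustrative names, and any matrix values used with them are arbitrary.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;
struct Point2 { double x, y; };

// Applies h(x) = Hx to the point (x, y, 1) and divides by the third
// coordinate to return to ordinary 2D coordinates.
Point2 apply_homography(const Mat3& H, Point2 p) {
    const double u = H[0][0] * p.x + H[0][1] * p.y + H[0][2];
    const double v = H[1][0] * p.x + H[1][1] * p.y + H[1][2];
    const double w = H[2][0] * p.x + H[2][1] * p.y + H[2][2];
    assert(std::fabs(w) > 1e-12);  // w == 0 would be a point at infinity
    return Point2{u / w, v / w};
}

// Returns s * H. For any non-zero s this represents the same homography,
// because the factor cancels in the division by w.
Mat3 scaled(const Mat3& H, double s) {
    assert(s != 0.0);
    Mat3 out = H;
    for (auto& row : out)
        for (double& value : row) value *= s;
    return out;
}
```

Applying H and scaled(H, 5.0) to the same point gives the same result up to rounding, which is why only eight of the nine entries of H are independent.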

Homogeneous coordinates

There are two distinct advantages of using homogeneous coordinates in computer vision & graphics:
  1. perspective projection, which is a non-linear transformation, can be represented by a system of LINEAR equations.
  2. points at infinity can be easily dealt with in all computations (instead of computing limits at infinity).
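
Both advantages can be illustrated with an idealized pinhole camera of focal length f. In ordinary coordinates the projection is x = f*X/Z, y = f*Y/Z, which is non-linear because of the division; in homogeneous coordinates it becomes a multiplication by the matrix [f 0 0 0; 0 f 0 0; 0 0 1 0], followed by one final division. The sketch below assumes this simplified camera model; the names Vec3, Vec4, project and dehomogenize are made up for the example.

```cpp
#include <array>
#include <cassert>

using Vec3 = std::array<double, 3>;  // homogeneous image point (u, v, w)
using Vec4 = std::array<double, 4>;  // homogeneous space point (X, Y, Z, W)
struct Point2 { double x, y; };

// Pinhole projection written as the linear map
// [f 0 0 0; 0 f 0 0; 0 0 1 0] applied to (X, Y, Z, W).
Vec3 project(double f, const Vec4& p) {
    return Vec3{{f * p[0], f * p[1], p[2]}};
}

// The only non-linear step: divide by the last coordinate.
Point2 dehomogenize(const Vec3& p) {
    assert(p[2] != 0.0);
    return Point2{p[0] / p[2], p[1] / p[2]};
}
```

A finite point such as (1, 2, 4, 1) with f = 2 projects to (0.5, 1). A point at infinity, for example the direction (1, 0, 1, 0), needs no special treatment: the same two functions return (2, 0), the vanishing point that finite points travelling in that direction approach in the image.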