Computer vision research has seen dramatic advances in the past few years. Many people have come to think that, after many years in which computer vision did not really work, it is now time to make it useful for big applications. I asked myself in 2012: what will be the next big thing in computer vision? I saw two directions: 1) fine-grained image recognition, for bridging the physical world and the information world (people sometimes call this search by image, but please do not confuse it with the image similarity search offered by Google, Bing, or Baidu), and 2) visual 3D sensing for autonomous driving.

We have been very focused on these two directions since 2012. I am very lucky to have the privilege of working with a group of brilliant researchers to solve some fundamental research challenges in both.

A slide I made in early 2013 describing the two directions we have been focusing on.

1. Fine-grained Image Recognition

The top-5 classification error on ImageNet has gone below 5%, which reportedly already outperforms humans. Is image recognition a solved problem for the computer vision community? In my personal view, this is not the end of image recognition research. Rather, the exciting part is just getting started: now that image recognition is beginning to work, let's try to solve some real (but challenging) problems.

How can image recognition be most useful? As the head of an industrial research group, I have spent some effort studying why some famous image recognition startups, and projects in big companies, did not take off in the end. The reasons are not obvious; in fact, I still see some companies and startups repeating those failed paths. Our philosophy is not to build a universal image recognizer. Rather, we need to go deep into vertical domains one by one, while keeping the algorithms we research as generic as possible. Let me give two examples to illustrate what I mean by “deep into vertical domains”.

Example 1: food recognition. Given a food image, we wish to recognize “which restaurant, which dish” it shows. Why this is useful: it connects the dish in front of you (which you photograph with your smartphone) to the internet (for example, online reviews, recipes, or nutrition facts for that dish). Or, if you wish to share the food photo with your friends on social media, you do not need to type in a tag yourself. Image recognition tags it automatically, and your friends will know exactly which dish you are referring to!

An example of our food recognition result. The left image is the query image; it is “Green onion pancake” from the “Shanghai Dumpling” restaurant. The rows on the right are the classification results. Each number in red indicates how likely the query image is to belong to that class. The text below each red number is the class name, in the form “which restaurant, which dish”. The images in each row are exemplar images from that class (e.g., to show what “Green onion pancake” from the “Shanghai Dumpling” restaurant looks like).
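To make the ranked list in this figure concrete, here is a minimal sketch of how per-class scores could be turned into a sorted list of “which restaurant, which dish” candidates with a likelihood attached to each. It is only an illustration and not the system described here: the softmax step, the function name `top_k_classes`, the raw scores, and every class label other than the Shanghai Dumpling one are assumptions made up for the example.

```python
import math


def top_k_classes(scores, k=5):
    """Return the k most likely (label, probability) pairs.

    `scores` maps a class label to a raw classifier score. This input
    format is hypothetical; the source does not say how scores are produced.
    """
    # Softmax, subtracting the maximum score for numerical stability.
    highest = max(scores.values())
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}
    # Sort by probability, highest first, and keep the first k entries.
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical raw scores for one query photo.
example_scores = {
    "Shanghai Dumpling / Green onion pancake": 4.1,
    "Restaurant A / Scallion flatbread": 2.3,   # made-up class
    "Restaurant B / Potato pancake": 0.7,       # made-up class
}

for label, prob in top_k_classes(example_scores, k=3):
    print(f"{prob:.2f}  {label}")
```

Each class label here combines the restaurant and the dish, which is what makes the problem fine-grained: two visually similar dishes from different restaurants are separate classes.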

Example 2: flower recognition. Given a flower photo, we recognize it against a flower shop’s catalog. Why this is useful: if you see a beautiful flower and photograph it with your smartphone, we can recognize the flower for you, and you can buy it directly from your smartphone: Crinum Cape Lilly Powellii Album, at £5.90 per pack!

From these two examples, you can get a sense of how I think image recognition is useful. Image recognition is the enabling technology for easily connecting the physical world (the world that you see) and the information world. For that, we need fine-grained image recognition rather than big-class classification as in PASCAL or ImageNet (do not get me wrong: these benchmark datasets are very useful for developing technologies, but they are not easy to use directly in real applications). Fine-grained image recognition is a more challenging research problem than big-class classification.

To solve fine-grained image recognition one day, we have been developing a very rich technology portfolio. Deep learning is very important, but we also develop strong algorithms for metric learning, boosting, object-centric feature learning, and more. Please refer to my papers for more details.

An example of our flower recognition result. The left image is the query image; it is a “CRINUM_Cape_Lilly_Powellii_Album” flower.

2. Visual 3D Sensing for Autonomous Driving

Like it or not, every automaker needs to start researching and developing its own autonomous driving solutions. Autonomous driving is now the biggest topic in the auto industry. Of course, automakers are taking a step-by-step (or, more precisely, generation-by-generation) approach rather than a moonshot approach in Google’s style. On the way to autonomous driving, automakers will gradually roll out their driving safety functions and hope eventually to have mature technologies for autonomous driving. Sensing is an important part of autonomous driving: you have to sense the environment before you can drive. I am very interested in computer-vision-based sensing for autonomous driving. The logic is simple: human driving relies on human vision, so autonomous driving should rely on artificial-intelligence vision, which is what we call computer vision.

The following figure illustrates what I mean by visual 3D sensing. Building such a visual 3D sensing system requires many computer vision technologies to work in concert: object detection, object tracking, structure-from-motion (SFM), 3D scene understanding, and so on. While I oversee the whole project, as a researcher (not as a department head) I am most interested in solving the object detection problem, especially object detection in videos. Right now, 99% of the time, object detection in videos is done through frame-by-frame detection followed by tracking. I wish to research radically new ways: for example, using temporal information as early as possible in the pipeline, or using some primitive 3D information (from monocular SFM) to help. Please see my papers for more details.
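For readers unfamiliar with the conventional pipeline mentioned above (detect in every frame, then track), the sketch below links per-frame detections into tracks by greedy overlap matching. It is a simplified illustration and not our method: the detector is assumed to exist already and to output boxes as `(x1, y1, x2, y2)`, and the names `iou` and `track` and the 0.3 threshold are hypothetical choices for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track(frames, iou_threshold=0.3):
    """Link per-frame detections into tracks by greedy IoU matching.

    `frames` is a list with one list of boxes per video frame.
    Returns a dict: track id -> list of (frame_index, box).
    """
    tracks = {}
    active = {}   # track id -> box from the previous frame
    next_id = 0
    for t, boxes in enumerate(frames):
        unmatched = dict(active)   # previous-frame tracks still available
        new_active = {}
        for box in boxes:
            # Pick the previous-frame track that overlaps this box the most.
            best_id, best_iou = None, iou_threshold
            for tid, prev in unmatched.items():
                overlap = iou(prev, box)
                if overlap >= best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:
                # No sufficient overlap: start a new track.
                best_id = next_id
                next_id += 1
                tracks[best_id] = []
            else:
                del unmatched[best_id]
            tracks[best_id].append((t, box))
            new_active[best_id] = box
        # A track with no detection in this frame is dropped here.
        active = new_active
    return tracks
```

Note that temporal information enters only at the very end, in the association step, and a single missed detection ends a track. That late use of time is exactly the limitation that motivates looking for alternatives.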

My slide illustrating what I mean by visual 3D sensing. Given the video sequence (left), we want to detect the objects in the scene and then estimate the 3D coordinates of each object. With this 3D information, we hope to construct a bird’s-eye view of the traffic participants.
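Once each object has an estimated 3D position, the final step described in this caption is mostly geometry. The sketch below is a minimal illustration under assumptions of my own choosing: positions are in a camera frame with X pointing right, Y down, and Z forward (in meters), and the function name `to_birds_eye`, the ranges, and the cell size are hypothetical. It is not the actual system.

```python
def to_birds_eye(objects, x_range=(-20.0, 20.0), z_range=(0.0, 40.0), cell=1.0):
    """Place objects with camera-frame 3D centers onto a top-down grid.

    `objects` is a list of (label, (x, y, z)) with x right, y down,
    z forward. Returns a list of (label, row, col), where row 0 is the
    farthest row from the camera and col 0 is the leftmost column.
    """
    rows = int((z_range[1] - z_range[0]) / cell)
    placed = []
    for label, (x, y, z) in objects:
        # Skip objects outside the area covered by the grid.
        if not (x_range[0] <= x < x_range[1] and z_range[0] <= z < z_range[1]):
            continue
        # The top-down view ignores height (y) and keeps x and z only.
        col = int((x - x_range[0]) / cell)
        row = rows - 1 - int((z - z_range[0]) / cell)
        placed.append((label, row, col))
    return placed


# Hypothetical 3D estimates for two traffic participants.
scene = [("car", (0.0, 1.5, 10.0)), ("pedestrian", (-3.0, 1.0, 6.5))]
print(to_birds_eye(scene))
```

The hard part, of course, is estimating the 3D coordinates from video in the first place; the projection itself is simple once they are available.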