Gesture Viewport: Interacting with Media Content Using Finger Gestures on Any Surface


At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.

Demo Video for Gesture Viewport

Many people who saw the demo were impressed. They thought the idea behind it, the proposed occlusion-pattern-based finger localization method, was very clever. That is probably a big reason why we won the Best Demo Award at ICME 2014. For more details on the demo, please refer to this paper.
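To give a feel for occlusion-based localization, here is a minimal Python sketch. It is not the method from the paper: the brightness grids, the `threshold` value, and the rule that the fingertip is the topmost occluded cell (assuming the finger enters from the bottom edge of the viewport) are all assumptions made for illustration.

```python
def occluded_cells(expected, observed, threshold=0.3):
    """Return (row, col) of sensor cells whose observed brightness deviates
    from the rendered brightness by more than `threshold` (assumed value)."""
    cells = []
    for r, (exp_row, obs_row) in enumerate(zip(expected, observed)):
        for c, (e, o) in enumerate(zip(exp_row, obs_row)):
            if abs(e - o) > threshold:
                cells.append((r, c))
    return cells


def locate_fingertip(expected, observed, threshold=0.3):
    """Estimate the fingertip as the topmost occluded cell.

    Assumption: the finger enters from the bottom edge (largest row index),
    so the fingertip is in the occluded row closest to the top. Ties in that
    row are resolved by averaging the column indices.
    """
    cells = occluded_cells(expected, observed, threshold)
    if not cells:
        return None  # nothing is covering the sensor grid
    top_row = min(r for r, _ in cells)
    cols = [c for r, c in cells if r == top_row]
    return (top_row, sum(cols) / len(cols))
```

Because only the grid cells are inspected rather than every camera pixel, the per-frame cost in this sketch grows with the number of cells, which is one plausible reading of why a grid-based approach is computationally cheap.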

Do Topic-Dependent Models Improve Microblog Sentiment Estimation?


When estimating the sentiment of movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance.  But when estimating the sentiment in microblogs, topic-independent sentiment models are commonly used.

We examined whether topic-dependent models improve performance when a large number of training tweets are available. We collected tweets with emoticons for six months and then created two types of topic-dependent polarity estimation models: models trained on tweets containing a target keyword and models trained on an enlarged set of tweets containing terms related to a topic. We also created a topic-independent model trained on a general sample of tweets. When we compared the performance of the models, we noted that for some topics, topic-dependent models performed better, although for the majority of topics, there was no significant difference in performance between a topic-dependent and a topic-independent model.
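The difference between the training sets can be sketched in a few lines of Python. This is an illustration only: it assumes the emoticons serve as noisy polarity labels, and the emoticon lists, the rule of dropping tweets with mixed emoticons, and the function names are hypothetical rather than taken from the paper.

```python
import re

# Assumed emoticon lists; the paper's actual lists are not given in this post.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def emoticon_label(tweet):
    """Label a tweet by its emoticons; return None if absent or mixed."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


def training_sets(tweets, topic_terms):
    """Split labeled tweets into a general set and a topic-dependent subset.

    `topic_terms` holds the target keyword alone, or the keyword plus
    related terms for the enlarged topic-dependent variant.
    """
    general, topical = [], []
    for tweet in tweets:
        label = emoticon_label(tweet)
        if label is None:
            continue
        general.append((tweet, label))
        words = set(re.findall(r"[a-z']+", tweet.lower()))
        if words & set(topic_terms):
            topical.append((tweet, label))
    return general, topical
```

Either list can then be fed to the same classifier, so any difference in performance comes from the choice of training tweets rather than from the model.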

We then proposed a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used. This method also identifies terms and contexts for which the term polarity often differs from the expected polarity. For example, ‘charge’ is generally positive, but in the context of ‘phone’, it is often negative. Details can be found in our ICWSM 2014 paper.

Introducing cemint


At FXPAL we have long been interested in how multimedia can improve our interaction with documents, from using media to represent and help navigate documents on different display types to digitizing physical documents and linking media to documents.


In an ACM interactions piece published this month we introduce our latest work in multimedia document research. Cemint (for Component Extraction from Media for Interaction, Navigation, and Transformation) is a set of tools to support seamless intermedia synthesis and interaction. In our interactions piece we argue that authoring and reuse tools for dynamic, visual media should match the power and ease of use of their static textual media analogues. Our goal with this work is to allow people to use familiar metaphors, such as copy-and-paste, to construct and interact with multimedia documents.

Cemint applications will span a range of communication methods. Our early work focused on support for asynchronous media extraction and navigation, but we are currently building a tool using these techniques that can support live, web-based meetings. We will present this new tool at DocEng 2014 — stay tuned!

To cluster or to hash?


Visual search has developed a basic processing pipeline in the last decade or so on top of the “bag of visual words” representation based on local image descriptors.  You know it’s established when it’s in Wikipedia.  There’s been a steady stream of work on image matching using the representation in combination with approximate nearest neighbor search and various downstream geometric verification strategies.

In practice, the most computationally daunting stage can be the construction of the visual codebook which is usually accomplished via k-means or tree structured vector quantization.  The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated when there are any significant changes to the image database.  Each descriptor cluster is represented by one element in a visual vocabulary (codebook).  In turn, each image is represented by a bag (vector) of these visual words (quantized descriptors).

Building on previous work on high accuracy scalable visual search, a recent FXPAL paper at ACM ICMR 2014 proposes Vector Quantization Free (VQF) search using projective hashing in combination with binary valued local image descriptors. Recent years have seen the development of binary descriptors such as ORB or BRIEF that improve efficiency with negligible loss of accuracy in various matching scenarios. Rather than clustering the descriptors harvested globally from the image database, the codebook is implicitly defined via projective hashing. Subsets of the elements of ORB descriptors are hashed by projection (i.e., all but a small number of bits are discarded) to form an index table.


By creating multiple different tables, image matching is implemented by a voting scheme based on the number of collisions (i.e. partial matches) between the descriptors in a test image and those in a database image.
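The projection-and-voting idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation from the paper: descriptors are encoded as Python integers, the bit positions are chosen at random, and `BITS_PER_TABLE`, `NUM_TABLES`, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

DESCRIPTOR_BITS = 256  # ORB descriptors are 256 bits long
BITS_PER_TABLE = 16    # assumed number of bits kept by each projection
NUM_TABLES = 8         # assumed number of hash tables


def make_projections(num_tables=NUM_TABLES, bits=BITS_PER_TABLE, seed=0):
    """Pick a fixed random subset of bit positions for each table."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(DESCRIPTOR_BITS), bits))
            for _ in range(num_tables)]


def project(descriptor, positions):
    """Keep only the selected bits of an integer-encoded descriptor."""
    key = 0
    for pos in positions:
        key = (key << 1) | ((descriptor >> pos) & 1)
    return key


def build_index(database, projections):
    """database maps image_id -> list of integer-encoded descriptors."""
    tables = [defaultdict(set) for _ in projections]
    for image_id, descriptors in database.items():
        for d in descriptors:
            for table, positions in zip(tables, projections):
                table[project(d, positions)].add(image_id)
    return tables


def query(descriptors, tables, projections):
    """Rank database images by the number of hash collisions (votes)."""
    votes = Counter()
    for d in descriptors:
        for table, positions in zip(tables, projections):
            for image_id in table.get(project(d, positions), ()):
                votes[image_id] += 1
    return votes.most_common()
```

No clustering step appears anywhere: adding an image only inserts its descriptors into the tables, which is why the codebook never needs to be rebuilt in this scheme.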

The paper presents experimental results on image databases that validate the expected significant increase in efficiency and scalability using the VQF approach.  The results also show improved performance over some competitive baselines in near duplicate image search.  There remain some interesting questions for future work to understand tradeoffs around the size of the hash tables (governed by the number of bits projected) and the number of tables required to deliver a desired level of performance.

SearchPanel: supporting exploratory search in regular search engines


People often use more than one query when searching for information. We revisit search results to re-find information and build an understanding of our search need through iterative explorations of query formulation. Unfortunately, these tasks are not well supported by search interfaces and web browsers. The only indication of our search process we get is a different colored link to pages we already have visited. In our previous research, we found that a simple query preview widget helped people formulate more successful queries and more efficiently explore the search results. However, the query preview widget would not work with regular search engines since it required back-end support. To bring support for exploratory search to common search engines, such as Google, Bing or Yahoo, we designed and built a Chrome browser plug-in, SearchPanel.

SearchPanel collects and visualizes information about the retrieved web pages in a small panel next to the search results. At a glance, a searcher can see which web pages have been previously retrieved, visited, and bookmarked. Each search result is represented as a bar in SearchPanel. If a web page has a favicon, it is included in the bar (2) to help with scanning and navigating the search results. The color of the bar (3) indicates retrieval status (teal = new, light blue = previously retrieved but not viewed, and dark blue = previously retrieved and viewed). The length of the bar (5) indicates how many times a web page has been visited; a shorter bar indicates more visits. If a web page in the results list has previously been bookmarked, a yellow star is shown next to the bar (6). Users can easily re-run the same query with a different search engine by selecting one of the search engine buttons (1). When the user navigates to a web page linked in the search results, a white circle (4) is shown next to the bar representing that search result. This circle persists even if the user continues to follow links away from the web page linked in the search results.

When moving away from the search page, SearchPanel stays put and provides a shortcut for accessing the search results. The search result being explored is indicated in SearchPanel by a circle. Moving the mouse over a bar in SearchPanel when not on the search page displays the search result snippet.
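The visual encoding described above amounts to a small mapping from a result's history to its appearance. The Python sketch below is purely illustrative and is not how the plug-in is written; the `ResultState` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ResultState:
    """History of one search result (hypothetical structure)."""
    previously_retrieved: bool = False
    viewed: bool = False
    bookmarked: bool = False


def bar_color(state):
    """Map retrieval status to the bar colors named in the post."""
    if not state.previously_retrieved:
        return "teal"  # new result
    return "dark blue" if state.viewed else "light blue"


def shows_star(state):
    """A yellow star marks results that were previously bookmarked."""
    return state.bookmarked
```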


We evaluated SearchPanel in a real-world deployment and found that it appears to have been used primarily for complex information needs, in search sessions with long durations and high numbers of queries. For search sessions with single queries, we found very little use of SearchPanel. Based on our evaluation, we conclude that SearchPanel appears to be used in the way it was designed: when it is not needed, it is out of the way and not used, but when one simple query does not answer the search need, SearchPanel is used to support the information-seeking process. More details about SearchPanel can be found in our SIGIR 2014 paper.

New look for FXPAL web sites


Our Home Page and this Blog have a new look. It is less blue, more orange, and doesn’t have a picture of our old building on it. In theory, it also works much better on mobile devices. The home page seems to behave quite nicely. Webmaster is still working on the blog here – searching, author and category links, and the like will get better soon. But you can see posts, and our brilliant researchers can create posts. Thanks for your patience.

AirAuth: Authentication through In-Air Gestures Instead of Passwords


At the CHI 2014 conference, we demonstrated a new prototype authentication system, AirAuth, that explores the use of in-air gestures for authentication purposes as an alternative to password-based entry.

Previous work has shown that passwords or PINs as an authentication mechanism have usability issues that ultimately lead to a compromise in security. For instance, as the number of services to authenticate to grows, users use variations of basic passwords, which are easier to remember, thus making their accounts susceptible to attack if one is compromised.

On mobile devices, smudge attacks and shoulder surfing attacks pose a threat to authentication, as finger movements on a touch screen are easy to record visually and to replicate.

AirAuth addresses these issues by replacing password entry with a gesture. Motor memory makes it a simple task for most users to remember their gesture. Furthermore, since we track multiple points on the user’s hands, we obtain tracking information that is unique to the physical appearance of the legitimate user, so there is an implicit biometric built into AirAuth. Smudge attacks are averted by the touchless gesture entry, and a user study we conducted shows that AirAuth is also quite resistant to camera-based shoulder surfing attacks.

Our demo at CHI showed the enrollment and authentication phases of our system. We gave attendees the opportunity to enroll in our system and check AirAuth’s capabilities to recognize their gestures. We got great responses from the attendees and obtained enrollment gestures from a number of them. We plan to use these enrollment gestures to evaluate AirAuth’s accuracy in field conditions.

Improving the Expressiveness of Touch Input


Touch input is now the preferred input method on mobile devices such as smartphones or tablets. Touch is also gaining traction in the desktop segment and is also common for interaction with large table or wall-based displays. At present, the majority of touch displays can detect solely the touch location of a user input. Some capacitive touch screens can also report the contact area of a touch, but usually, no further information about individual touch inputs is available to developers of mobile applications.

It would, however, be beneficial to capture further properties of the user’s touch, for instance the finger’s rotation around the vertical axis (i.e., the axis orthogonal to the plane of the touch screen) as well as its tilt (see images above). Obtaining rotation and tilt information for a touch would allow for expressive localized input gestures as well as new types of on-screen widgets that make use of the additional local input degrees of freedom.

Having finger pose information together with touches adds local degrees of freedom of input at each touch location. This, for instance, allows the user interface designer to remap established multi-touch gestures such as pinch-to-zoom to other user interface functions, or to free up screen space by allowing input (e.g., adjusting a slider value, scrolling a list, panning a map view, enlarging a picture) that usually requires (multi-)touch gestures covering a significant amount of screen space to be performed at a single touch location. New graphical user interface widgets that make use of finger pose information, such as rolling context menus, hidden flaps, or occlusion-aware widgets, have also been suggested.

Our PointPose prototype performs finger pose estimation at the location of touch using a short-range depth sensor viewing the touch screen of a mobile device. We use the point cloud generated by the depth sensor for finger pose estimation. PointPose estimates the finger pose of a user touch by fitting a cylindrical model to the subset of the point cloud that corresponds to the user’s finger. We use the spatial location of the user’s touch to seed the search for the subset of the point cloud representing the user’s finger.
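As a rough illustration of the seeding and pose-extraction steps, the Python sketch below swaps the cylinder fit for a simpler technique: it takes the dominant direction of the points near the touch (found by power iteration on their covariance matrix) as the finger axis. This is not PointPose's algorithm. The coordinate frame (screen in the plane z = 0), the `radius` value, and the function names are assumptions.

```python
import math


def finger_points(cloud, touch, radius=30.0):
    """Keep points within `radius` (assumed units: mm) of the touch location,
    which is taken to lie on the screen plane z = 0."""
    tx, ty = touch
    return [p for p in cloud if math.dist(p, (tx, ty, 0.0)) <= radius]


def principal_axis(points, iterations=50):
    """Dominant direction of a 3D point set via power iteration."""
    if len(points) < 2:
        raise ValueError("need at least two points to estimate an axis")
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(q[i] * q[j] for q in centered) / n for j in range(3)]
           for i in range(3)]
    # Fixed start vector; a sketch-level choice that fails only if the true
    # axis happens to be orthogonal to it.
    v = [1.0 / math.sqrt(3.0)] * 3
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Orient the axis so that it points away from the screen.
    return v if v[2] >= 0 else [-x for x in v]


def finger_pose(cloud, touch, radius=30.0):
    """Return (rotation, tilt) in degrees: rotation about the screen normal
    and elevation of the finger axis above the screen plane."""
    axis = principal_axis(finger_points(cloud, touch, radius))
    rotation = math.degrees(math.atan2(axis[1], axis[0]))
    tilt = math.degrees(math.asin(max(-1.0, min(1.0, axis[2]))))
    return rotation, tilt
```

A full cylinder fit would also recover the finger's radius and be less sensitive to points from the rest of the hand; the principal-axis shortcut only gives the direction.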

One advantage of our approach is that it does not require complex external tracking hardware (as in related work), and external computation is unnecessary as the finger pose extraction algorithm is efficient enough to run directly on the mobile device. This makes PointPose ideal for prototyping and developing novel mobile user interfaces that use finger pose estimation.

Painted on a cathedral ceiling or it didn’t happen


My kids are home-schooled.  One of the many consequences is that they are sheltered from bureaucracy more than the average kid.

One of my teenagers is involved with a not-quite-local high school, because, well, why should the public school community be denied the joy of sharing in his perceived infallibility?  In order for me to volunteer to drive him and some classmates to an event, I needed to fill out a Form.  A teenager is not the best communication medium, but it only took a week of back-and-forth to determine that no electronic version existed, and to actually get the Form into my hands.  His last text about it was “I have the forms”.   And indeed, when he handed it to me, he said, “You have to fill it out in Quadruplicate!”

The Form, of course, was a carbonless paper form, with a white, orange, pink and yellow sheet.   I replied, “It’s okay.  It’s like carbon paper.  You just fill out the top.”   His whiny cry of “But how do you Know?”, was less a doubting of my knowledge than a complaint that there was no “About this form” link at the bottom of the paper.  Though I still doubt it, he claimed to never have seen the like (remember that infallibility?).   (I also find it amusing that although Zingerman’s has gone digital and gotten rid of the carbonless ordering forms, they still say “Yellow copy” and “Pink copy” on the interim white receipts.)

A bit more questioning and discussion with my colleagues revealed that our kids really believe that there are only 3 generations of a technology:  What they and their peers use, what their parents use (now, not in their youth), and the original invention.

Thus text documents are either shared in the cloud, stored locally on a laptop/desktop, or painstakingly hand-duplicated by monastic scribes.  Personal music is streamed, parents listen to satellite radio and MP3s that came from old CDs, and people used to listen to rocks and sticks played around the communal cooking fire pit.   Vinyl LPs aren’t music at all.  As my kids said at a friend’s party a few years ago, “Why do you have those plastic things we make art bowls out of in your closet?”   We found a 20+ year old AAA Triptik for a cross-country drive and one of the kids asked how we updated that.  Might as well have been runes on dragon skin.

There are lots of other examples, and I’m resisting the urge to write about them.  But I’m thinking of all those intermediate technologies that are disappearing like so many 5 1/4″ floppies.

P.S. This post sat as a draft for about a year, and I’m only putting it out because I hear Gene’s voice asking me to put out content.  Which I intend to do.


Sorry for the down time and happy anniversary


We moved into our new building about 2 years ago.  Long enough ago that we have quite a few energetic new employees that don’t know that we were ever anywhere else.   But the “new” place is nice, and getting better, and worthy of celebrating, at least in a little way.

I was thinking of bringing in donuts on Monday to celebrate, in order to follow one of Gene’s bagel rules:  If you want donuts, you have to get them yourself.   However, hard drives play by their own rules.

The FXPAL Blog was one of the few web servers we have that ran directly on server hardware, given that it started before “clouds”.   When the disk sneezed over the weekend, the site went down.  So I skipped the donut pickup to pick up the pieces of our blog.   We took this as an opportunity to virtualize and update the underlying infrastructure.   I expect there are a few plugins not quite right, and the title bar is messed up – sorry, Tony.

Once I get it all right, I’ll bring the donuts.