# mixMeet

At FXPAL, we build and evaluate systems that make multimedia content easier to capture, access, and manipulate. In the Interactive Media group we are currently focusing on remote work, and distributed meetings in particular. Meetings can be inefficient at best and a flat-out boring waste of time at worst. However, they do have some key benefits, especially when they are ad hoc and driven by specific, concrete goals. More and more meetings are held with remote workers via multimedia-rich interfaces (such as HipChat and Slack). These systems augment web-based communication with lightweight content sharing to reduce communication overhead while helping teams focus on immediate tasks.

We are developing a tool, mixMeet, to make lightweight, multimedia meetings more dynamic, flexible, and hopefully more effective. mixMeet is a web-based collaboration tool designed to support content interaction and extraction in both live, synchronous meetings and asynchronous group work. mixMeet is a pure web system that uses the WebRTC framework to create video connections. It supports live keyframe archiving and navigation, content-based markup, and the ability to copy-and-paste content to personal or shared notes. Each meeting participant can flexibly interact with all other clients’ shared screen or webcam content. A backend server can be configured to archive keyframes as well as record each user’s stream.

Our vision for mixMeet is to make it easy to mark up and reuse content from meetings, and to make collaboration over visual content a natural part of web-based conferencing. As you can see from the video below, we have made some progress toward this goal. However, we know there are many issues with remote, multimedia-rich work that we don’t yet fully understand. To that end, we are currently conducting a study of remote videoconferencing tools. If your group uses any remote collaboration tools with distributed groups, please fill out our survey.

# on automation and tacit knowledge

We hear a lot about how computers are replacing even white collar jobs. Unfortunately, often left behind when automating these kinds of processes is tacit knowledge that, while perhaps not strictly necessary to generate a solution, can nonetheless improve results. In particular, many professionals rely upon years of experience to guide designs in ways that are largely invisible to non-experts.

One of these areas of automation is document layout or reflow in which a system attempts to fit text and image content into a given format. Usually such systems operate using templates and adjustable constraints to fit content into new formats. For example, the automated system might adjust font size, table and image sizes, gutter size, kerning, tracking, leading, etc. in different ways to match a loosely defined output style. These approaches can certainly be useful, especially for targeting output to devices with arbitrary screen sizes and resolutions. One of the largest problems, however, is that these algorithms often ignore what might have been a considerable effort by the writers, editors, and backshop designers to create a visual layout that effectively conveys the material. Often designers want detailed control over many of the structural elements that such algorithms adjust.

For this reason I was impressed with Hailpern et al.’s work at DocEng 2014 on document truncation and pagination for news articles. In this work, the authors’ systems analyze the text of a news article to determine pagination and truncation breakpoints that correspond to natural boundaries between high-level, summary content and more detailed content. This derives from an observation that journalists tend to write articles in “inverted pyramid” style, in which the most newsworthy, summary information appears near the beginning, with details toward the middle and background info toward the end. This is a critical observation in no small part because it means that popular newswriting bears little resemblance to academic writing. (Perhaps what sets this work apart from others is that the authors employed a basic tenet of human-computer interaction: the experiences of the system developer are a poor proxy for the experiences of other stakeholders.)

Foundry, which Retelny et al. presented at UIST 2014, takes an altogether different approach. This system, rather than automating tasks, helps bring diverse experts together in a modular, flexible way. The system helps the user coordinate the recruitment of domain experts into a staged workflow toward the creation of a complex product, such as an app or training video. The tool also allows rapid reconfiguration. One can imagine that this system could be extended to take advantage of not only domain experts but also people with different levels of expertise — some “stages” could even be automated. This approach is somewhat similar to the basic ideas in NudgeCam, in which the system incorporated general video guidelines from video-production experts, templates designed by experts in the particular domain of interest, novice users, and automated post hoc techniques to improve the quality of recorded video.

The goal of most software is to improve a product’s quality as well as the efficiency with which it is produced. We should keep in mind that this is often best accomplished not by systems designed to replace humans but rather by those developed to best leverage people’s tacit knowledge.

# video text retouch

Several of us just returned from ACM UIST 2014 where we presented some new work as part of the cemint project.  One vision of the cemint project is to build applications for multimedia content manipulation and reuse that are as powerful as their analogues for text content.  We are working towards this goal by exploiting two key tools.  First, we want to use real-time content analysis to expose useful structure within multimedia content.  Given some decomposition of the content, which can be spatial, temporal, or even semantic, we then allow users to interact with these sub-units or segments via direct manipulation.  Last year, we began exploring these ideas in our work on content-based video copy and paste.

As another embodiment of these ideas, we demonstrated video text retouch at UIST last week.  Our browser-based system performs real-time text detection on streamed video frames to locate both words and lines.  When a user clicks on a frame, a live cursor appears next to the nearest word.  At this point, users can alter text directly using the keyboard.  When they do so, a video overlay is created to capture and display their edits.

Because we perform per-frame text detection, as the position of edited text shifts vertically or horizontally in the course of the original (unedited source) video, we can track the corresponding line’s location and update the overlaid content appropriately.

By leveraging our familiarity with manipulating text, this work exemplifies the larger goal of bringing interaction metaphors rooted in content creation to both the consumption and reuse of live multimedia streams. We believe that integrating real-time content analysis and interaction design can help us create improved tools for multimedia content usage.

# Ego-Centric vs. Exo-Centric Tracking and Interaction in Smart Spaces

In a recent paper published at SUI 2014, “Exploring Gestural Interaction in Smart Spaces using Head-Mounted Devices with Ego-Centric Sensing”, co-authored with Barry Kollee and Tony Dunnigan, we studied a prototype head-mounted device (HMD) that lets users interact with external displays through spatial gestures.

In the paper, one of our goals was to expand the scope of interaction possibilities on HMDs, which are currently severely limited if we consider Google Glass as a baseline. Glass only has a small touch pad, which is placed at an awkward position on the device’s rim, at the user’s temple. The other input modalities Glass offers are eye blink input and voice recognition. While eye blink can be effective as a binary input mechanism, in many situations it is rather limited and could be considered socially awkward. Voice input suffers from recognition errors for non-native speakers of the input language and has considerable lag, as current Android-based devices, such as Google Glass, perform speech-to-text in the cloud. These problems were also observed in the main study of our paper.

We thus proposed three gestural selection techniques in order to extend the input capabilities of HMDs: (1) a head nod gesture, (2) a hand movement gesture and (3) a hand grasping gesture.

The following mock-up video shows the three proposed gestures used in a scenario depicting a material selection session in a (hypothetical) smart space used by architects:

We discounted the head nod gesture after a preliminary study showed a low user preference for such an input method. In a main study, we found that the two gestural techniques achieved performance similar to a baseline technique using the touch pad on Google Glass. However, we hypothesize that the spatial gestural techniques using direct manipulation may outperform the touch pad for larger numbers of selectable targets (in our study we had 12 targets in total), as secondary GUI navigation activities (i.e., scrolling a list view) are not required when using gestures.

In the paper, we also present some possibilities for ad-hoc control of large displays and automated indoor systems:

Ambient light control using spatial gestures tracked via an HMD.

Considering the larger picture, our paper touches on the broader question of ego-centric vs. exo-centric tracking: past work in smart spaces has mainly relied on external (exo-centric) tracking techniques, e.g., using depth sensors such as the Kinect for user tracking and interaction. As wearable devices become increasingly powerful and depth sensor technology shrinks, it may in the future become more practical for users to bring their own sensors to a smart space. This has advantages in scalability: more users can be tracked in larger spaces without additional investment in fixed tracking systems. Also, a larger number of spaces can be made interactive, as users carry their sensing equipment from place to place.

# Information Interaction in Context 2014

I asked FXPAL alumnus Jeremy Pickens to contribute a post on the best paper award at IIiX, which is named after our late colleague Gene Golovchinsky. For me, the episode Jeremy recounts exemplifies Gene’s willingness and generosity in helping others work through research questions. The rest of this post is written by Jeremy.

I recently had the opportunity to attend the Information Interaction in Context conference at the University of Regensburg.  It is a conference which attempts to bring together the systems and the user perspective on information retrieval and information seeking.  In short, it was exactly the type of conference at which our colleague Gene Golovchinsky was quite at home.  In fact, Gene had been one of the chairs of the conference before his passing last year.  The IIiX organizers made him an honorary chair.  During his time as chair, Gene secured FXPAL’s sponsorship of the conference including the honorarium that accompanied the Best Paper award.  The conference organizers decided to officially give the award in Gene’s memory, and as a former FXPAL employee, I was asked to present the award and to say a few words about Gene.

I began by sharing who I knew Gene to be through the lens of our first meeting.  It was 1998.  Maybe 1999.  Let’s say 1998.  I was a young grad student in the Information Retrieval lab at UMass Amherst.  Gene had recently convinced FXPAL to sponsor my advisor’s Industrial Advisory Board meeting.  This meant that once a year, the lab would put together a poster session to give IAB members a sneak preview of the upcoming research results before they appeared anywhere else.

Well, at that time, I was kinda an odd duck in the lab because I had started doing music Information Retrieval when most of my colleagues were working on text.  So there I am at the IAB poster session, with all these commercial, industry sponsors who have flown in from all over the country to get new ideas about how to improve their text search engines…and I’m talking about melodies and chords.  Do you know that look, when someone sees you but really does not want to talk with you?  When their eyes meet yours, and then keep on scanning, as if to pretend that they were looking past you the whole time?  For the first hour that’s how I felt.

Until Gene.

Now, I’m fairly sure that he really was not interested in music IR.  But not only did Gene stop and hear what I had to say, but he engaged.  Before I knew it, half an hour (or at least it felt like it) had passed by, and I’d had one of those great engaging Gene discussions that I would, a few years later when FXPAL hired me, have a whole lot more of.  Complete with full Gene eye twinkle at every new idea that we batted around.  Gene had this way of conducting a research discussion in which he could both share (give) ideas to you, and elicit ideas from you, in a way that I can only describe as true collaboration.

After the conference dinner and presentation had concluded, there were a number of people that approached me and shared very similar stories about their interactions with Gene.  And a number of people who expressed the sentiment that they wished they’d had the opportunity to know him.

I should also note that the Best Paper award went to Kathy Brennan, Diane Kelly, and Jaime Arguello, for their paper on “The Effect of Cognitive Abilities on Information Search for Tasks of Varying Levels of Complexity”. Neither I nor FXPAL had a hand in deciding who the best paper recipient was to be; that task went to the conference organizers. But in what I find to be a touching coincidence, one of the paper’s authors, Diane Kelly, was actually Gene’s summer intern at FXPAL back in the early 2000s. He touched a lot of people, and will be sorely missed. I miss him.

# LoCo: a framework for indoor location of mobile devices

Last year, we initiated the LoCo project on indoor location.  The LoCo page has more information, but our central goal is to provide highly accurate, room-level location information to enable indoor location services to complement the location services built on GPS outdoors.

Last week, we presented our initial results on the work at Ubicomp 2014. In our paper, we introduce a new approach to room-level location based on supervised classification. Specifically, we use boosting in a one-versus-all formulation to enable highly accurate classification based on simple features derived from Wi-Fi received signal strength (RSSI) measures. This approach offloads the bulk of the complexity to an offline training procedure, and the resulting classifier is sufficiently simple to be run on a mobile client directly. We use a simple and robust feature set based on pairwise RSSI margin to address Wi-Fi RSSI volatility.

$h_m(X) = \begin{cases} 1 & X(b_m^{(1)}) - X(b_m^{(2)}) \geq \theta_m \\ 0 & \text{otherwise} \end{cases}$

The equation above shows an example weak learner, which simply looks at two elements in an RSSI scan and compares their difference against a threshold. The final strong classifier for each room is a weighted combination of a set of weak learners greedily selected to discriminate that room. The feature is designed to express the ordering of RSSI values observed for specific access points, with a flexible reliance on the difference between them; the threshold $\theta_m$ is determined in training. An additional benefit of this choice is that processing only the subset of the RSSI scan used by the selected weak learners further reduces the required computation. Compared against the kNN matching approach used in RedPin [Bolliger, 2008], our results show competitive performance with substantially reduced complexity. The table below shows cross-validation results from the paper for two data sets collected in our office. The classification time appears in the rightmost column.
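To make the weak learner and the one-versus-all combination concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from the paper: the access point names, thresholds, and weights are made up, the `MISSING_RSSI` fill value for unseen access points is our own choice, and picking the room with the highest raw score is one simple way to resolve the one-versus-all decision.

```python
from dataclasses import dataclass

# Assumed placeholder (in dBm) for access points missing from a scan.
MISSING_RSSI = -100.0


@dataclass(frozen=True)
class WeakLearner:
    ap_first: str     # b_m^(1): first access point in the pair
    ap_second: str    # b_m^(2): second access point in the pair
    threshold: float  # theta_m, which would be learned in training
    weight: float     # weight of this learner in the strong classifier

    def predict(self, scan):
        """Return 1 if X(b1) - X(b2) >= theta_m, otherwise 0."""
        margin = (scan.get(self.ap_first, MISSING_RSSI)
                  - scan.get(self.ap_second, MISSING_RSSI))
        return 1 if margin >= self.threshold else 0


def room_score(learners, scan):
    """Weighted combination of the weak-learner outputs for one room."""
    return sum(learner.weight * learner.predict(scan) for learner in learners)


def classify(room_models, scan):
    """One-versus-all decision: return the room with the highest score."""
    return max(room_models, key=lambda room: room_score(room_models[room], scan))


# Hypothetical, hand-picked models; real ones come from offline boosting.
room_models = {
    "kitchen": [WeakLearner("ap_a", "ap_b", 5.0, 0.7),
                WeakLearner("ap_c", "ap_a", -20.0, 0.3)],
    "lab": [WeakLearner("ap_b", "ap_a", 5.0, 0.8),
            WeakLearner("ap_b", "ap_c", 0.0, 0.2)],
}
scan = {"ap_a": -48.0, "ap_b": -70.0, "ap_c": -80.0}
print(classify(room_models, scan))  # kitchen
```

Note that each room's score only touches the access points named by its weak learners, which is where the reduced computation on the client comes from.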

We are excited about the early progress we’ve made on this project and look forward to building out our indoor location system in several directions in the near future.  But more than that, we look forward to building new location driven applications exploiting this technique which can leverage existing infrastructure (Wi-Fi networks) and devices (cell phones) we already use.

# Gesture Viewport: Interacting with Media Content Using Finger Gestures on Any Surface

At ICME 2014 in Chengdu, China, we presented a technical demo called “Gesture Viewport,” which is a projector-camera system that enables finger gesture interactions with media content on any surface. In the demo, we used a portable Pico projector to project a viewport widget (along with its content) onto a desktop and a Logitech webcam to monitor the viewport widget. We proposed a novel and computationally efficient finger localization method based on the detection of occlusion patterns inside a virtual “sensor” grid rendered in a layer on top of the viewport widget. We developed several robust interaction techniques to prevent unintentional gestures from occurring, to provide visual feedback to a user, and to minimize the interference of the “sensor” grid with the media content. We showed the effectiveness of the system through three scenarios: viewing photos, navigating Google Maps, and controlling Google Street View. Click on the following link to watch a short video clip that illustrates these scenarios.

Many people who saw the demo were impressed. They thought that the idea behind it, namely the proposed occlusion-pattern-based finger localization method, was very clever. That is probably a big reason why we won the Best Demo Award at ICME 2014. For more details of the demo, please refer to this paper.

# Do Topic-Dependent Models Improve Microblog Sentiment Estimation?

When estimating the sentiment of movie and product reviews, domain adaptation has been shown to improve sentiment estimation performance.  But when estimating the sentiment in microblogs, topic-independent sentiment models are commonly used.

We examined whether topic-dependent models improve performance when a large number of training tweets are available. We collected tweets with emoticons for six months and then created two types of topic-dependent polarity estimation models:  models trained on Twitter tweets containing a target keyword and models trained on an enlarged set of tweets containing terms related to a topic. We also created a topic-independent model trained on a general sample of tweets. When we compared the performance of the models, we noted that for some topics, topic-dependent models performed better, although for the majority of topics, there was no significant difference in performance between a topic-dependent and a topic-independent model.

We then proposed a method for predicting which topics are likely to have better sentiment estimation performance when a topic-dependent sentiment model is used. This method also identifies terms and contexts for which the term polarity often differs from the expected polarity. For example, ‘charge’ is generally positive, but in the context of ‘phone’, it is often negative. Details can be found in our ICWSM 2014 paper.

# Introducing cemint

At FXPAL we have long been interested in how multimedia can improve our interaction with documents, from using media to represent and help navigate documents on different display types to digitizing physical documents and linking media to documents.

In an ACM interactions piece published this month we introduce our latest work in multimedia document research. Cemint (for Component Extraction from Media for Interaction, Navigation, and Transformation) is a set of tools to support seamless intermedia synthesis and interaction. In our interactions piece we argue that authoring and reuse tools for dynamic, visual media should match the power and ease of use of their static textual media analogues. Our goal with this work is to allow people to use familiar metaphors, such as copy-and-paste, to construct and interact with multimedia documents.

Cemint applications will span a range of communication methods. Our early work focused on support for asynchronous media extraction and navigation, but we are currently building a tool using these techniques that can support live, web-based meetings. We will present this new tool at DocEng 2014 — stay tuned!

# To cluster or to hash?

Visual search has developed a basic processing pipeline in the last decade or so on top of the “bag of visual words” representation based on local image descriptors.  You know it’s established when it’s in Wikipedia.  There’s been a steady stream of work on image matching using the representation in combination with approximate nearest neighbor search and various downstream geometric verification strategies.

In practice, the most computationally daunting stage can be the construction of the visual codebook which is usually accomplished via k-means or tree structured vector quantization.  The problem is to cluster (possibly billions of) local descriptors, and this offline clustering may need to be repeated when there are any significant changes to the image database.  Each descriptor cluster is represented by one element in a visual vocabulary (codebook).  In turn, each image is represented by a bag (vector) of these visual words (quantized descriptors).
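As a rough illustration of the quantization step, the sketch below assigns each local descriptor to its nearest codebook entry and counts the assignments to form the bag-of-words vector. It assumes the codebook has already been produced by an offline clustering step such as k-means, and it uses toy 2-D descriptors for readability. The function names are ours, not from any particular library.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the codebook entry (cluster centre) closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def bag_of_words(descriptors, codebook):
    """Histogram of visual-word counts representing one image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(codebook))]


# Toy codebook with three visual words, assumed to come from offline clustering.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
image_descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.9)]
print(bag_of_words(image_descriptors, codebook))  # [2, 1, 1]
```

The expensive part in practice is building `codebook`, not this lookup, which is why the clustering stage dominates when the image database changes.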

Building on previous work on high accuracy scalable visual search, a recent FXPAL paper at ACM ICMR 2014 proposes Vector Quantization Free (VQF)  search using projective hashing in combination with binary valued local image descriptors.   Recent years have seen the development of binary descriptors such as ORB or BRIEF that improve efficiency with negligible loss of accuracy in various matching scenarios.   Rather than clustering the collected descriptors harvested globally from the image database, the codebook is implicitly defined via projective hashing.  Subsets of the elements of ORB descriptors are hashed by projection (i.e. all but a small number of bits are discarded) to form an index table, as below.

By creating multiple different tables, image matching is implemented by a voting scheme based on the number of collisions (i.e. partial matches) between the descriptors in a test image and those in a database image.
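The following Python sketch illustrates the general idea of projective hashing with collision voting. It is a toy under our own assumptions rather than the paper's implementation: descriptors are 16-bit integers instead of full-length ORB descriptors, the bit subsets in `projections` are fixed arbitrarily, and every colliding descriptor pair contributes one vote.

```python
from collections import defaultdict


def project(descriptor, bit_positions):
    """Hash key: keep only the selected bits of a binary descriptor (an int)."""
    key = 0
    for pos in bit_positions:
        key = (key << 1) | ((descriptor >> pos) & 1)
    return key


def build_tables(database, projections):
    """Build one index table per projection: hash key -> list of image ids."""
    tables = [defaultdict(list) for _ in projections]
    for image_id, descriptors in database.items():
        for desc in descriptors:
            for table, bits in zip(tables, projections):
                table[project(desc, bits)].append(image_id)
    return tables


def vote(query_descriptors, tables, projections):
    """Count collisions between query descriptors and each database image."""
    votes = defaultdict(int)
    for desc in query_descriptors:
        for table, bits in zip(tables, projections):
            for image_id in table.get(project(desc, bits), ()):
                votes[image_id] += 1
    return dict(votes)


# Three arbitrary 4-bit projections of a toy 16-bit descriptor.
projections = [(0, 1, 2, 3), (4, 5, 6, 7), (8, 9, 10, 11)]
database = {
    "a": [0b1010110011110000, 0b0000111100001111],
    "b": [0b0101001100001111],
}
tables = build_tables(database, projections)

# The query differs from the first descriptor of image "a" in a single bit,
# so it still collides with it in two of the three tables.
query = [0b1010110011110001]
print(vote(query, tables, projections))  # {'a': 2}
```

No clustering is involved: adding an image to the database only appends entries to the tables, and a noisy descriptor can still match as long as the flipped bits fall outside at least one projection.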

The paper presents experimental results on image databases that validate the expected significant increase in efficiency and scalability using the VQF approach.  The results also show improved performance over some competitive baselines in near duplicate image search.  There remain some interesting questions for future work to understand tradeoffs around the size of the hash tables (governed by the number of bits projected) and the number of tables required to deliver a desired level of performance.