;; -*- mode: outline -*-

CRAM - a Contextual Relational Augmented Memory for Personal Data Mining
KDL Challenge Grant Proposal

Matthew Cornell
Knowledge Discovery Laboratory
Computer Science Department
University of Massachusetts Amherst
2004-06-14

1. ROADMAP

The MAIN GOAL of this work is to create self-structuring storage that integrates disparate information sources (e.g., web pages visited, email sent/received, RSS feeds subscribed to) in order to infer relationships between entities, then utilize context to focus the user's attention on input that is important, novel, or interesting.

End users would obtain TANGIBLE BENEFITS from self-organizing data storage because it would directly address the worsening problem of information overload. The techniques developed would allow users to save time by focusing attention where most needed, and to efficiently find previously seen items that are currently lost in the flood of incoming data. Areas of application would include widely-used information sources such as web pages visited, email sent/received, RSS feeds subscribed to, news groups read, and instant messages sent/received.

The CRITICAL TECHNICAL BARRIERS that have stood in the way are due primarily to two failures of previous approaches: they do not consider the interconnectedness of information, and they do not utilize relatively new techniques emerging from the field of Relational Knowledge Discovery.

The MAIN ELEMENTS of our approach involve research in three related aspects: 1) identifying relevant entities from disparate information sources, relying on research from Information Extraction, 2) inferring the interrelationships between entities, relying on work in Link Detection, and 3) learning and recognizing user context in order to focus attention on desired input. Additionally, we will need to continue our work in relational data storage and access that facilitates this type of knowledge discovery.
The RATIONALE for our work in this direction is based on our lead in research into the techniques of Relational Knowledge Discovery, especially in the creation of models that leverage relational data without succumbing to newly-discovered biases inherent in such networks of data. Additionally, we will take advantage of our work in creating a high-performance data store supporting a) heterogeneous types of items and their relationships, and b) the new kinds of operations required for learning and applying these models.

Our EXPECTED RESULTS include the development of a class of algorithmic techniques to support creating attention focusing models, and the creation of end-user tools to integrate these models with a small number of diverse information streams.

The RISKS ENTAILED by not doing this work are that our vision of an integrated relational knowledge store won't come to pass, and that the techniques and lessons of Relational Knowledge Discovery won't be considered for this important problem area. This could cost significant time for the many of us who must continue to manage an increasingly large influx of daily information.

We plan to EVALUATE PROGRESS and capabilities on two levels: a) applying the techniques to small standardized test datasets (which we will have to develop) to assess scientific correctness and accuracy, and b) conducting qualitative user analysis and user studies of the effectiveness of tools that use these models.

2. DESCRIPTION

Our proposal seeks to address the increasingly important problem of information overload, which often results in 'digital deja vu' - trying to recall the location of an important piece of information that was seen before but is now lost. As many have pointed out, at least three emerging trends are dramatically exacerbating the problem: massive (essentially unlimited) personal digital storage, increasing numbers of digital information sources, and diverging quality of those sources.
From [2]:

    ...it's not difficult to imagine saving, on your PC's disks - all interlinked at least in terms of who, what, where, when, and why; and subsets of which will be replicated across many devices,

    o all of your email, forever
    o all of your personal and business calendars, forever
    o all of your class and meeting notes, and scribblings and annotations on everything - including audio and maybe even other media, forever
    o all of your documents and presentations, forever, as well as the transcripts of meetings that you attend
    o all of your contacts and address books
    o all of your pictures - hundreds of thousands of digital snapshots at at least a megabyte a pop, as sensors become ever more dense, and as every phone becomes a camera phone
    o innumerable voice messages, as cell phones begin to have "record this & send it to me" buttons on the side
    o of course, huge collections of audio, as well as video - some personally created, some licensed and cached

As a result, tools to help manage human attention will be critical. From [1]:

    ... human attention has become the limiting resource of the information age. Informational complexity has exploded, making attention -- the resource needed to make sense of information -- exceedingly valuable. Consequently, an organization's ability to focus attention effectively has become a key determinant of its success.

Our goal is to create self-structuring storage that accepts input from diverse information sources, structures it by finding relationships between (possibly inferred) entities, and supports trained 'attention focus' models that prioritize information for the user. Put another way, we seek to develop techniques that support a free-form database with automated analytic tools for personal data mining of the kinds of information listed above.
Most current approaches to solving this problem use either manual hierarchical organization (e.g., outliner programs), manual keyword tagging (e.g., browser bookmarks with keywords), manual network organization (hypertext and personal wiki programs), or automated text indexing ('Googling your data') via Information Retrieval methods (e.g., Google's Gmail, Zoe). We believe using these alone or in combination will fail because they either a) aren't amenable to automation, b) do not take advantage of the inherent relational characteristics of the data, or c) don't allow mixed-initiative (manual) intervention that captures the very high quality advice and organization people excel at.

Our proposed approach, summarized here and discussed below in detail, can be broken into three parts: incremental data representation and storage, relationship and entity inference, and context-based attention focus modeling. Underlying all three parts is a network-based data representation that explicitly models relationships between data items. In addition to being the richest representation, it closely mimics relationships that people reason about. On top of this data representation (which must scale and support incremental updates) we propose applying Machine Learning techniques from the field of Relational Knowledge Discovery to help users focus attention.

From the user's perspective, the system would work like a black box that is blindly fed data from existing information manipulation tools (such as email and web browsing) and produces recommendations for attention, using training information obtained from the user. It would also support other powerful analysis tools such as network-based ones (e.g., centrality measures, hubs and authorities), and graphical browsing of the data.

Next we discuss the three parts of the proposed approach in detail.
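Before turning to the details, the network-based representation summarized above can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the proposed data store: the class ItemStore, its methods, the item identifiers, and the example data are all hypothetical, and degree centrality stands in for the richer network measures mentioned above.

```python
from collections import defaultdict


class ItemStore:
    """Toy in-memory network of typed items and typed relationships (hypothetical)."""

    def __init__(self):
        self.items = {}                    # item id -> {"type": ..., other attributes}
        self.links = []                    # (source id, relation name, target id)
        self.neighbors = defaultdict(set)  # item id -> ids of directly linked items

    def add_item(self, item_id, item_type, **attrs):
        # Incremental update: re-importing a known item merges its attributes.
        record = self.items.setdefault(item_id, {"type": item_type})
        record.update(attrs)

    def add_link(self, source, relation, target):
        if source not in self.items or target not in self.items:
            raise KeyError("both endpoints must be added before linking")
        self.links.append((source, relation, target))
        self.neighbors[source].add(target)
        self.neighbors[target].add(source)

    def degree_centrality(self):
        # Fraction of the other items that each item is directly linked to.
        n = len(self.items)
        if n < 2:
            return {item_id: 0.0 for item_id in self.items}
        return {i: len(self.neighbors[i]) / (n - 1) for i in self.items}


# Made-up example data: a person, an email, and a web page.
store = ItemStore()
store.add_item("person:alice", "person", name="Alice")
store.add_item("email:1", "email", subject="CRAM draft")
store.add_item("page:kdl", "web_page", url="http://example.org/kdl")
store.add_link("person:alice", "sent", "email:1")
store.add_link("email:1", "mentions", "page:kdl")
centrality = store.degree_centrality()
```

In this toy network the email links to both other items, so it receives the highest centrality. A real store would need persistent, indexed storage to reach the scale discussed below.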
Data Representation and Storage

The first part of our proposal is a flexible network representation that is general enough to handle disparate information sources such as web pages browsed, email sent and received, and RSS feeds subscribed to. We plan to use an 'overlay' approach to integration with existing data-specific tools (e.g., web browsers, email clients) in which frequent incremental updates import data saved in those applications into the store, easing adoption. The storage will support millions of data items and relationships, and will allow for very fast graphical querying.

Relationship and Entity Inference

Because we will accept data from disparate information sources and will import data (rather than trying to force users to adopt a 'kitchen sink' tool like [3]), we must use existing data formats. These formats are very different, and most do not support explicit representations of data items and relationships. Thus, an important and challenging step of organizing the store is to detect entities and the relationships between them. We will adopt approaches from Information Extraction for this purpose.

Attention Focus Modeling

The final part of our proposal is the most challenging and the most important - the ability to learn where a user's attention should be focused. The central idea is that the user's work context is crucial to determining what information is relevant to her. For example, when a user is investigating an idea, she will perform a number of activities using information tools, such as searching for relevant web pages and sending email. Taken collectively during her investigation, these information sources are related only by the user's goal in accessing them. The proposed solution should be able to learn these contexts, so that later when the user wants to re-activate one, the analysis tools should be able to indicate new (and old) relevant information.
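The idea of learning a work context and later using it to flag relevant input can be illustrated with a deliberately simple sketch. It is not the relational model this proposal aims to develop: it assumes entities have already been extracted, represents each context as a bag of entity names, and scores an incoming item by the share of its entities already seen in that context. The class ContextModel, its methods, the threshold value, and the example data are all hypothetical.

```python
from collections import Counter


class ContextModel:
    """Toy 'attention focus' model: one bag of entities per named context (hypothetical)."""

    def __init__(self):
        self.contexts = {}  # context name -> Counter of entity occurrences

    def observe(self, context, entities):
        # Record entities seen while the user worked in this context.
        self.contexts.setdefault(context, Counter()).update(entities)

    def score(self, context, entities):
        # Share of the item's distinct entities already seen in the context.
        seen = self.contexts.get(context, Counter())
        distinct = set(entities)
        if not distinct:
            return 0.0
        return sum(1 for e in distinct if e in seen) / len(distinct)

    def recommend(self, context, items, threshold=0.5):
        # Ids of items scoring at least the threshold, best first, ties by id.
        scored = [(self.score(context, ents), item_id) for item_id, ents in items.items()]
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        return [item_id for s, item_id in ranked if s >= threshold]


# Made-up training observations for two work contexts.
model = ContextModel()
model.observe("grant-proposal", ["kdl", "relational learning", "alice"])
model.observe("vacation", ["flights", "alice"])

# Made-up incoming items, each with its already-extracted entities.
incoming = {
    "email:7": ["alice", "kdl"],
    "page:3": ["flights", "hotels"],
    "feed:9": ["relational learning", "kdl", "bob"],
}
picked = model.recommend("grant-proposal", incoming)
```

With the 'grant-proposal' context active, the email and the feed item are recommended and the travel page is not. The same email would score lower under the 'vacation' context, which is the context sensitivity the text describes.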
Note that users may encounter similar resources (from the same web site or related to the same email user) while performing different activities; resources that belong to a different activity should be excluded from their attention in the current context.

Another way to think of such a model is as a 'digital fovea' [4] - a trainable artifact that is sensitive to a particular need. Once trained, it will activate (call for attention) when it sees something of interest.

3. FUTURE APPLICATIONS

A natural extension to this proposal is adding personal information management (PIM) data, including to-do items, calendar entries, and personal notes/stickies (which would link to and mark up any other kind of data entity). Additionally, many other data input sources could be integrated, including various forms of media (photos, phone calls, video), as well as location/GPS information.

Finally, an intriguing application of the proposal to collaboration would be to share trained attention models. The thought would be to give a trained model to a colleague who would then allow the model to see her incoming data, but without sharing the data itself. If the shared model indicated the owner's interest, the colleague could (possibly automatically) send the data (or a reference to it) to the model's owner.

4. PROPOSAL STRENGTHS AND CHALLENGES

Scope: We envision the proposal will see results in 2-3 years.

Collaboration opportunities: As mentioned above, this problem would require integrating research from Information Retrieval, Information Extraction, and Machine Learning (possibly drawing on several kinds of models, such as Reinforcement Learning).

Ease of evaluation: Evaluation is a challenge for this proposal because initial ideas tend to involve human judgement to determine the quality of attention recommendations. We suspect more thinking would provide alternative evaluation techniques.
Readily available data: Because we would be using personal information that each individual would control and not share, data would be readily available (say, from email archives and browser histories) and its privacy would be maintained.

5. REFERENCES

[1] Staying on Point The Science and Art of Managing Attention, Andrew Norman,
    http://www.imakenews.com/techyvent/e_article000214657.cfm?x=a2sSp5g,aLBqYML

[2] '640KB ought to be enough for anyone', Ray Ozzie's Weblog,
    http://www.ozzie.net/blog/stories/2003/11/14/640kbOughtToBeEnoughForAnyone.html

[3] Universal information client projects:
    Haystack (http://haystack.lcs.mit.edu/),
    OmniaMea (http://www.jetbrains.com/omea/index.html), and
    Chandler (http://www.osafoundation.org/)

[4] http://www.stlukeseye.com/anatomy/Fovea.asp