ECA - eDSTC Project Proposal - Eklund (Revision)

  1. Project Description
  2. Project Name  DISTRIBUTED WAREHOUSING AND KNOWLEDGE DISCOVERY 
    Project Leader  Professor Peter Eklund 
    Project Duration  3-4 Years
    Project Sponsors 
     
    Defence Intelligence 
    Defence Science and Technology Organisation

    Project definition

    Email is already an extremely important tool for enterprise communication, with individual users commonly retaining gigabyte collections of email. Although much of the technology required to index and manipulate these collections effectively exists, there are few systems capable of leveraging this information at an enterprise level.
    This project seeks to address these issues by constructing a distributed email data warehouse and knowledge repository for markup and indexing of email collections at both the enterprise and the individual level. It seeks to leverage experience already gained through research on fine-grained knowledge markup of web documents and scalable visualisation techniques for document collections to construct a warehouse suited to a co-operative distributed architecture for email-based knowledge and data discovery.

    The project has been formulated on scientific advice from defense intelligence contacts. Companies already exist that explore and restore email. One example is Electronic Evidence Discovery (EED), a US company that deals with email discovery and restoration. EED state on their homepage:

      [for email] discovery in a timely and effective manner, .. corporations must fundamentally change the way they store, manage and retain digital information.
    (see http://www.wired.com/wired/archive/7.05/email.html for an article on Electronic Evidence Discovery).
    Corporations can change the way they store, manage and retain digital information and these changes may obviate the need for specialist data discovery companies.

    Project execution

    The project team will comprise 3 Ph.D. Students and 2 Research Fellows at Griffith University Gold Coast Campus, plus systems administration and software support personnel. The Ph.D. Students and Research Fellows will contribute by conducting research activities in the areas outlined in this proposal and of interest to eDSTC.
    The team will be self-contained and administered by a research leader with at least a Ph.D.-level qualification in Computer Science and demonstrated software engineering leadership.

    Project review

    The project establishes an email data warehouse for explorations in visualization, information and knowledge retrieval and distributed systems technology. eDSTC participants will evaluate the technologies and adapt them to their own special circumstances. The project should be demonstrated to participants annually. Success involves outcomes migrating across eDSTC researchers and participants.

    Project exploitation

    The project has substantial exploitation pathways, both by the eDSTC participants and in more general terms, by any organization involved with the analysis and exploitation of electronic free-text document collections.
    Please identify (as best you can) any or all of:
    1. Potential industry/industries that may be an audience for the work and results
    2. Potential company/organisation(s) that may be an audience for the work and results
    3. Relevant company/organisation(s) contact point that may be an audience for the work and results:

    Any organization involved with the information and knowledge retrieval aspects of corporate services, and with an interest in the historical recovery and analysis of email. This embraces Legal, Government, Defense and corporate financial services. The data discovery technologies are generic and universal, and the distributed environment is a core eDSTC demonstrator.

    Research Theme

    1. Knowledge and Resource Management
    2. Enterprise Processes and Work Practice Support
     
    The project falls under the first theme because it involves knowledge and data discovery through the visualization and extraction of knowledge from free-text document collections. It falls under the second, enterprise processes and work practice support, because the framework is intended to be delivered using generic distributed systems technology.

    eDSTC Target Domain

    1. Defense
    2. Health
    3. Statement of Aims

    Objectives

    A data warehouse of email documents will be used to demonstrate the principles of knowledge and data discovery from free-text sources. Tools for automatically extracting "knowledge" from free-text will be provided along with a suitable distributed workflow computing environment allowing groups of individuals to work together in the process of the analysis, collection, navigation and maintenance of large-scale free-text document collections.
    Although the aim is to develop new knowledge and data discovery tools for free-text, more traditional information retrieval tools will also be included in the analyst's desktop (such as search engines, HiB, WebKB, HiBKB and existing ontology editors).

    Visualization aids for navigating free-text document collections are goals based on our existing research agenda. Current thinking places distributed system technology at the focus of a federated approach to the collection, maintenance and exploitation of free-text sources contained in a data warehouse.

    More advanced support and system administration tools are also required and these efforts should be supported.

    Scientific Foundation

    Identify what is novel and/or challenging:
    1. Innovation
      There are currently no federated free-text and knowledge extraction environment research projects that we know of.
    2. Key ideas
      A federated free-text document data warehouse, created automatically according to a general-purpose distributed computing environment, e.g. ODSI.
    3. Scientific challenges
      Managing the group discipline in work practices that make the framework universal, relevant and customizable to all members of the work group.
      The scalability of the analysis and visualisation tools.
    4. Contribution to the research community
      Publications in the distributed computing and groupware literature, artificial intelligence, information & knowledge retrieval forums, library sciences, ontology formation and knowledge formulation, acquisition and retrieval literature.

    Motivation

    Indicate why this project should be undertaken.
    This project develops a demonstrator for a generic distributed groupware environment that is both immediately valuable to defense and government participants in the context of email free-text document warehousing and analysis as well as customizable to special purpose requirements such as library science, health, legal applications and so on.

    The project addresses points 1., 2., 3., 4. and 5. of the key research challenges identified in section 4.3, Knowledge Resource and Management, of the eDSTC business plan. Furthermore, it proposes to engineer the major outcomes identified by points b., c., d., and e. of section 4.3 of the eDSTC business plan.

    What is the perceived benefit for:

    1. Participant
      Defense Intelligence would have access to an environment that is immediately deployable in the context of free-text analysis and warehousing.
      Document visualization, metaphors for visual exploration, knowledge and data discovery from text, integrated OLAP, RDF distributed framework for free-text document collection and analysis, application of formal concept analysis and other symbolic concept clustering and visualisation techniques, federated knowledge-oriented environment, scalable distributed knowledge and data discovery architecture.
     
  3. Plan
  4. Approach and Background

    The objectives will be achieved by creating a common, but customizable, distributed work environment for each of the research group members. Each team member will have access to a thin client running the most recent version of the ODSI. In the first instance, this thin client will be used for email reading and searching. This necessitates a customized email reader within the environment. The reader will be ECA (Email Concept Analysis), a program developed by Richard Cole and Peter Eklund.
    Figure 1: Visualisation of an Email "Theme"
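
    ECA is named for formal concept analysis, which groups objects (here, emails) by the attributes they share (here, index terms). The following is a minimal sketch of that grouping and is not the ECA implementation. The toy context, the message names and the function names are all hypothetical, and the brute-force enumeration is exponential in the number of terms, so it is suitable for illustration only.

```python
from itertools import combinations

# Hypothetical toy context: which index terms occur in which emails.
CONTEXT = {
    "msg1": {"budget", "meeting"},
    "msg2": {"budget", "travel"},
    "msg3": {"budget", "meeting", "travel"},
    "msg4": {"meeting"},
}


def extent(attrs, context):
    """Return the emails that contain every term in attrs."""
    return frozenset(o for o, a in context.items() if attrs <= a)


def intent(objs, context):
    """Return the terms shared by every email in objs (all terms if objs is empty)."""
    shared = set().union(*context.values())
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def concepts(context):
    """Enumerate all formal concepts as (extent, intent) pairs.

    Each subset of terms is closed by taking its extent and then the
    intent of that extent; duplicates collapse in the result set.
    """
    attrs = sorted(set().union(*context.values()))
    found = set()
    for r in range(len(attrs) + 1):
        for subset in combinations(attrs, r):
            e = extent(set(subset), context)
            found.add((e, intent(e, context)))
    return found


if __name__ == "__main__":
    # Print concepts from the most general (largest extent) downwards.
    for e, i in sorted(concepts(CONTEXT), key=lambda c: (-len(c[0]), sorted(c[1]))):
        print(sorted(e), sorted(i))
```

    Each printed pair is one node of the concept lattice: a maximal set of emails together with exactly the terms they all share. A lattice of such nodes is the kind of structure a "theme" visualisation would draw.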

    A server will be established to warehouse email. The specific source of email has yet to be determined. From this core environment additional tools will be created and incorporated into the distributed environment.
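
    As a rough illustration of what warehousing email for search could involve, the sketch below parses raw messages with Python's standard `email` module and builds an inverted index from terms to Message-IDs. It is an assumption-laden toy and not the proposed warehouse design: the sample messages, the function names and the tokenisation rule are hypothetical, and multipart messages are simply skipped.

```python
import re
from collections import defaultdict
from email import message_from_string

# Hypothetical sample messages in RFC 822 style (headers, blank line, body).
SAMPLE = [
    "Message-ID: <1@example.org>\nSubject: Budget meeting\n\nPlease review the travel budget.\n",
    "Message-ID: <2@example.org>\nSubject: Lunch\n\nMeeting moved to noon.\n",
]


def index_messages(raw_messages):
    """Build an inverted index mapping each lowercase term to a set of Message-IDs."""
    index = defaultdict(set)
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = msg.get("Message-ID", "").strip()
        body = msg.get_payload()
        if not msg_id or not isinstance(body, str):
            continue  # this sketch skips multipart or unidentified messages
        text = f"{msg.get('Subject', '')} {body}"
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(msg_id)
    return index


def search(index, *terms):
    """Return the Message-IDs of messages that contain all of the given terms."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    idx = index_messages(SAMPLE)
    print(search(idx, "budget", "travel"))
```

    The term-to-message mapping built here is also the kind of object/attribute table that a concept analysis tool would take as input.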

    Success depends on sufficient system administration and programmer resources being attached to support the text-document system.

    The major risk to the project is that the distributed environment is not sufficiently supported, either in staff or in hardware terms, to make its use appropriate to the research team. The warehouse server, and its administrators, can be located at eDSTC HQ, locally, or elsewhere.

    Past experience suggests that, after about 5 years, research groups reach a point where outcomes no longer accumulate to the long-term benefit of either the research group or the basic technologies the research has created. In other words, knowledge loss occurs across the research base and these gaps are filled by other projects or individual skill-bases. This is true both of software authoring in research and of general scientific know-how. It is also a statement that rings true of The Enterprise in general.

    The opportunity is to leverage existing core technologies, such as ODSI and ECA, to create the kernel of an email and text document distributed environment that aims to improve, augment and refine a research group's capability.

    List of Outcomes and Schedule

    Please list the outcomes intended for the project. Example outcomes are:
     
    1. Publication papers
      WWW, IEEE Intelligent Systems, ACM SIGIR, Distributed Systems Engineering Journal etc.
    2. Training and consulting
      Expertise for the DSTC in distributed groupware and text data warehousing. Any distributed groupware that emerges from the research described on free-text document analysis.
    3. National and International Standards
      Exercises OLAP and RDF.
    4. Expertise and skill development
      Practical application of RDF and XML, and practical experience with OLAP.
    5. Impact on education
      Ph.D. outcomes
       
       
    List the outcomes of the projects with availability dates, including some intermediate milestones to indicate/verify progress.
    1. Port the ODSI to a new server machine that will become the focus of the development effort.
    2. Port the ECA software to the ODSI.
    3. Create appropriate scripts and pathways for the email warehouse to collect email and take shape (1 year for points 1., 2. and 3.).
    4. Tools that render and visualize the email collection have been developed and experimented with by the present research team. These are reported in the literature as scalable, and this claim should be tested.
    5. Knowledge extraction and acquisition from the free-text sources is required. The problems of shared ontologies, forming coherence mappings between them, and resolving ambiguities and redundancy in shared ontologies are research topics (years 2 and 3 for tasks 3, 4, 5 and 6).

    Indicate exploitation or commercialisation strategies for the outcomes defined above. Give an indication of target audience and methods of reaching that audience.

       
    1. A stable distributed groupware environment with one or two text analysis and visualization tools as a desktop is of immediate interest to defense intelligence. This can be achieved in year 1, so eDSTC could have a return on the 3-year investment as little as 1 year into the project.
    2. The additional visualization, process, knowledge extraction and shared ontology tools naturally follow from a successful implementation of 1. Further software licensing can occur at this point.
    3. The final results of the 3-4 year project can be transformed into a commercial software product for non-participants in the final (4th) year of project funding.

    Project Resources

    Please list team members:
    Prof. Peter Eklund, Project Leader, Griffith University  50% 
    Dr. Philippe Martin, Research Fellow, Griffith University  50% 
    Dr. Francois Modave, Postdoctoral Research Fellow, Griffith University  50%
    Number, roles and percentage availability of research staff to be recruited (where possible, identify any key skill requirements)

    Chief Programmer and Project Coordinator (100%) - Ph.D. in Computer Science, strong C++ programming experience and software engineering know-how.

    Any additional support staff required from non-research resources - e.g. project manager, software engineering or sysadmin resources to complete the project. Identify roles and percentage time required.

       
    1. Systems Administrator (50%) - Unix
    2. System Programmer (50%) - C++
    3. 2 postdoctoral research fellows (one in data warehousing and another in visualisation)

    Associated students (existing or planned) with level of study identified and time period of involvement.
    Richard Cole, Ph.D. Candidate, Griffith University,  50% - 1 year
    Bernd Groh, Ph.D. Candidate, Griffith University,  50% - 2 years
    Thomas Tilley, Ph.D. Candidate, Griffith University,  80% - 3 years

    Identify all significant travel costs for the proposed period of the project:
       
    1. Domestic or international destination(s)
      It is important for at least one member of the project team to attend the Knowledge and Data Discovery Conference and the International Conf. on the WWW, and to give annual presentations to US-based defense intelligence and private organizations. Interstate trips to Canberra, Sydney and Adelaide will be made as per DSTO and other related participant requests.

    2. Purpose of travel
      To present peer-reviewed papers at conferences. Presentations to defense and industry sources are likely to generate contract research outcomes for the eDSTC. Demonstrating the software outcomes is an important part of the philosophy of this work.
     
    Identify all equipment required for the proposed project:
    1. New software or hardware that would have to be purchased

      1. A powerful Solaris Sun Enterprise server, possibly midrange with considerable disk storage (terabyte capacity);
      2. ECA (The Email Concept Analysis program) is not owned by DSTC but it may be licensed freely for research purposes.
     
  5. Collaboration
    Highlight potential or proposed research and commercial collaboration:
    1. Internal (with other projects, involvement in integrators/demos, collaboration with participants ...)

      This proposal intersects with a number of eDSTC preliminary proposals. In many respects the general distributed environment we propose to use, the ODSI, can be used to support several of these projects in a similar fashion to our own. These projects are listed below:
      1. Nigel Ward, Records Continuum Research Group, Monash University
      2. Renato Iannella, Digital Resource Management
      3. Peter Bruza, Advanced Retrieval Technologies for Information and Knowledge
      4. Building Knowledge Using Social Information, Tim Mansfield, Nigel Ward
      5. Presentation of Ambient Information, Tim Mansfield
      6. Explain: answering questions with knowledge, Robert McArthur
      7. MatchDetectReveal: finding overlapping documents in digital libraries, Arkady Zaslavsky (Monash Uni)
      8. Automatic Thesaurus Creation, Chris Rowles
      9. Access to Invisible Web, Chris Rowles
      10. Access to Multiple Heterogeneous Databases, Chris Rowles
      11. Managing Information Streams, Stephen Crawley
      12. Open Architecture for Collaboration, Tim Mansfield
       
      We anticipate close collaboration with these groups and individuals. Unique in this proposal is the creation of a data warehouse as a repository for free-text email documents. There is no exclusivity in access to this email data warehouse; it is intended to be re-used for other projects and tasks.
       
    2. External (nominate national or international relationships or organisations)
      In each case, indicate the form and extent of the relationship - and how it will benefit the project and DSTC.

      1. Defense Intelligence: eDSTC has existing contract research with Defense Intelligence.
      2. TH Darmstadt: The Technical University of Darmstadt has been a close collaborator over the years. They are the creators of formal concept analysis and have a substantial group of mathematicians (more than a dozen) working on related theories of lattice and order theory that are of benefit to the foundational aspects of our symbolic learning approach to data and knowledge discovery and visualisation. There are existing IREX (International Researcher Exchange) programmes funded by the ARC with TH Darmstadt, and two of my students have been recipients of DAAD (Deutscher Akademischer Austauschdienst) funding.
      3. CYCorp: The joint maintenance and use of ontologies of terms for describing shared knowledge and information is a problem close to the agenda of CYCorp. We hope to extend that relationship (with CYCorp as a customer) in the years to come.

    3. Related work
      Identify any similar or related work in the research or commercial domains. Indicate how the proposed project differs from such other work. This analysis could and should include related work in the past or new Research Programme of DSTC.

      The idea of CYC is to construct a general knowledge repository containing relevant ontologies and facts about the world in which we live. The project has been going for over 10 years. In many respects CYC is an idea before its time. It is clear that the necessary infrastructure that allows groups of individuals to exchange knowledge in a seamless and distributed way is at least as important as the artifacts that they exchange. This infrastructure was absent in CYC.
      As security, backups and bandwidth become increasingly important to organizations over the next 10 years, the prospect of daily backups of terabyte desktops looms. The resulting explosion of infrastructure costs is economically unsustainable, since the unit cost of ownership of computing devices will skyrocket. The age of the unrestricted fat client will give way to smarter, more customizable thin-client groupware environments that assist in workflow, warehousing and job-related analysis tools.

      The scientific advances that will facilitate this "vision" are scalable visualization and warehousing, distributed workflow environments and general purpose (but customizable) group productivity tools of the type described in this project proposal. Building expertise in these areas gives Australia (and eDSTC) a substantial competitive edge.