-
Project Description
| Project Name | DISTRIBUTED WAREHOUSING AND KNOWLEDGE DISCOVERY |
| Project Leader | Professor Peter Eklund |
| Project Duration | 3-4 Years |
| Project Sponsors | Defence Intelligence; Defence Science and Technology Organisation |
Project definition
Email is already an extremely important tool for enterprise
communication with individual users commonly retaining gigabyte collections
of email. Although much of the technology required to effectively index
and manipulate these collections exists, there are few systems capable
of leveraging this information at an enterprise level.
This project seeks to address these issues by constructing
a distributed email data warehouse and knowledge repository for the markup
and indexing of email collections at both the enterprise and individual level.
The project seeks to leverage experience already gained through research
on fine-grained knowledge markup of web documents and scalable visualisation
techniques for document collections to construct a warehouse suitable for
a co-operative distributed architecture for email-based knowledge and data
discovery.
The project has been formulated on scientific advice from
defense intelligence contacts. Companies already exist that are involved
in exploring and restoring email. For instance, a US company called Electronic
Evidence Discovery (EED) deals with email discovery and restoration.
EED state on their homepage ....
[for email] discovery in a timely and effective manner, .. corporations
must fundamentally change the way they store, manage and retain digital
information.
(see http://www.wired.com/wired/archive/7.05/email.html
for an article on Electronic Evidence
Discovery).
Corporations can change the way they store, manage and retain digital
information and these changes may obviate the need for specialist data
discovery companies.
Project execution
The project team will comprise 3 Ph.D. students and 2 Research
Fellows at Griffith University Gold Coast Campus, plus systems administration
and software support personnel. The Ph.D. students and Research Fellows will
contribute by conducting research activities in the areas outlined in this
proposal and of interest to eDSTC.
The team will be self-contained and administered by a
research leader with at least a Ph.D.-level qualification in Computer Science
and demonstrated software engineering leadership.
Project review
The project establishes an email data warehouse for explorations
in visualization, information and knowledge retrieval and distributed systems
technology. eDSTC participants will evaluate the technologies and adapt
them to their own special circumstances. The project should be demonstrated
to participants annually. Success involves outcomes migrating across
eDSTC researchers and participants.
Project exploitation
The project has substantial exploitation pathways, both
by the eDSTC participants and in more general terms, by any organization
involved with the analysis and exploitation of electronic free-text document
collections.
Please identify (as best you can) any or all of:
-
Potential industry/industries that may be an audience for the work and
results
-
Potential company/organisation(s) that may be an audience for the work
and results
-
Relevant company/organisation(s) contact point that may be an audience
for the work and results:
Any organization involved with the information & knowledge
retrieval aspects of corporate services with an interest in the historical
recovery and analysis of email. This embraces Legal, Government,
Defense and corporate financial services. The data discovery technologies
are generic and universal, and the distributed environment is a core eDSTC
demonstrator.
Research Theme
-
Knowledge and Resource Management
-
Enterprise Processes and Work Practice Support
The project involves, first, knowledge and data discovery through
the visualization and extraction of knowledge from free-text document collections
and, second, enterprise processes and work practice support, since it is
intended that the framework be delivered using generic distributed systems
technology.
eDSTC Target Domain
-
Defense
-
Health
-
Statement of Aims
Objectives
A data warehouse of email documents will be used to demonstrate
the principles of knowledge and data discovery from free-text sources.
Tools for automatically extracting "knowledge" from free-text will be provided
along with a suitable distributed workflow computing environment allowing
groups of individuals to work together in the process of the analysis,
collection, navigation and maintenance of large-scale free-text document
collections.
Although the aim is to develop new knowledge and data
discovery tools for free-text, more traditional information retrieval tools
will also be included in the analyst's desktop (such as search engines,
HiB, WebKB, HiBKB and existing ontology editors).
Visualization aids for navigating free-text document collections
are goals based on our existing research agenda. Current thinking places
distributed system technology at the focus of a federated approach to the
collection, maintenance and exploitation of free-text sources contained
in a data warehouse.
More advanced support and system administration tools are
also required and these efforts should be supported.
Scientific Foundation
Identify what is novel and/or challenging:
-
Innovation
We know of no current research projects on federated free-text and knowledge
extraction environments.
-
Key ideas
Federated free-text document data warehouse, created automatically
according to a general purpose distributed computing environment, e.g.
ODSI.
-
Scientific challenges
Managing the group discipline in work practices that make
the framework universal, relevant and customizable to all members of the
work group.
The scalability of the analysis and visualisation tools.
-
Contribution to the research community
Publications in the distributed computing and groupware
literature, artificial intelligence, information & knowledge retrieval
forums, library sciences, ontology formation and knowledge formulation,
acquisition and retrieval literature.
Motivation
Indicate why this project should be undertaken.
This project develops a demonstrator for a generic distributed
groupware environment that is both immediately valuable to defense and
government participants in the context of email free-text document warehousing
and analysis as well as customizable to special purpose requirements such
as library science, health, legal applications and so on.
The project addresses points 1., 2., 3., 4.
and 5. of the key research challenges identified in the 4.3 Knowledge and Resource
Management section of the eDSTC business plan. Furthermore, it proposes
to engineer major outcomes identified by points b., c., d., and e. of 4.3
of the eDSTC business plan.
What is the perceived benefit for:
-
Participant
Defense Intelligence would have access to an environment
that is immediately deployable in the context of free-text analysis and
warehousing.
Document visualization, metaphors for visual exploration,
knowledge and data discovery from text, integrated OLAP, RDF distributed
framework for free-text document collection and analysis, application of
formal concept analysis and other symbolic concept clustering and visualisation
techniques, federated knowledge-oriented environment, scalable distributed
knowledge and data discovery architecture.
-
Plan
Approach and Background
The objectives will be achieved by creating a common,
but customizable, distributed work environment for each of the research
group members. Each team member will have access to a thin client running
the most recent version of the ODSI. In the first instance, this thin client
will be used for email reading and searching. This necessitates the creation
of a customized email reader within the environment. This will be ECA
(Email Concept Analysis), a program developed by Richard Cole and Peter Eklund.
Figure 1: Visualisation of an Email "Theme"
A server will be established to warehouse email. The specific
source of email has yet to be determined. From this core environment additional
tools will be created and incorporated into the distributed environment.
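As an illustration of the kind of analysis such tools perform, the following is a minimal sketch of formal concept analysis over a tiny email collection. It assumes that "concept analysis" in ECA refers to formal concept analysis, which this proposal lists among its techniques; it is not the ECA implementation. The emails, keywords and function names are hypothetical, and the brute-force enumeration is only practical for toy data.

```python
from itertools import combinations

# Hypothetical toy context: email id -> keywords found in that email.
CONTEXT = {
    "mail1": {"budget", "meeting"},
    "mail2": {"budget", "report"},
    "mail3": {"budget", "meeting", "report"},
    "mail4": {"meeting"},
}


def intent(emails, context):
    """Keywords shared by every email in `emails` (all keywords if empty)."""
    shared = set().union(*context.values())
    for email_id in emails:
        shared &= context[email_id]
    return frozenset(shared)


def extent(keywords, context):
    """Emails that contain every keyword in `keywords`."""
    return frozenset(e for e, kws in context.items() if keywords <= kws)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of emails.

    Exponential in the number of emails, so suitable for illustration only.
    """
    concepts = set()
    email_ids = sorted(context)
    for size in range(len(email_ids) + 1):
        for subset in combinations(email_ids, size):
            shared = intent(subset, context)
            concepts.add((extent(shared, context), shared))
    return concepts


if __name__ == "__main__":
    # Print concepts from the most general (largest extent) to the most specific.
    for ext, itt in sorted(formal_concepts(CONTEXT), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itt))
```

Each concept pairs a set of emails with the keywords they all share. Ordering the concepts by extent inclusion gives the lattice that a visualisation tool could draw.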
Success depends on sufficient system
administration and programmer resources being attached to support the text-document
system.
The major risk to the project is that the distributed environment
is not sufficiently supported, either in staff support or hardware terms,
to make its use appropriate to the research team. The warehouse server,
and its administrators, can be located at eDSTC HQ, locally or anywhere.
Past experience suggests that, after 5 years, research groups reach a point
where outcomes no longer accumulate to the long-term benefit of either
the research group or the basic technologies the research has created.
In other words, knowledge loss occurs across the research base
and these gaps are filled by other projects or individual skill-bases.
This is true both of software authoring in research and of
general scientific know-how. The same holds for
The Enterprise in general.
The opportunity is to leverage existing core technologies,
such as ODSI and ECA, to create the kernel of an email and text document
distributed environment which aims to improve, augment and refine a research
group's capability.
List of Outcomes and Schedule
Please list the outcomes intended for the project.
Example outcomes are:
-
Publication papers
WWW, IEEE Intelligent Systems, ACM SIGIR, Distributed
Systems Engineering Journal etc.
-
Training and consulting
Expertise for the DSTC in distributed groupware and text
data warehousing, including any distributed groupware that emerges from the
described research on free-text document analysis.
-
National and International Standards
Exercises OLAP and RDF.
-
Expertise and skill development
Practical application of RDF and XML, and practical experience with
OLAP.
-
Impact on education
Ph.D. outcomes
List the outcomes of the projects with availability
dates, including some intermediate milestones to indicate/verify progress.
-
Port the ODSI to a new server machine that
will become the focus of the development effort.
-
Port the ECA software to the ODSI.
-
The third task is to create appropriate scripts
and pathways for the email warehouse to collect and take shape (1 year
for points 1., 2. and 3.)
-
Tools that render and visualize the email collection
have been developed and experimented with by the present research team. These
are reported in the literature as scalable, and this should be tested.
-
Knowledge extraction and acquisition from the
free-text sources is required. The problems of sharing ontologies, forming
coherence mappings between them and resolving ambiguities and redundancy
in shared ontologies are research topics (years 2 and 3 for tasks 3, 4, 5 and 6).
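The collection scripts mentioned in the third task could begin from something like the following sketch, which parses raw messages and stores them in a single indexed table. The proposal does not specify a storage engine or schema, so the use of Python's standard `email` and `sqlite3` modules, the `messages` table, the function names and the sample message are all assumptions made for illustration.

```python
import sqlite3
from email import message_from_string

# Hypothetical sample message, used only to exercise the sketch.
SAMPLE = (
    "Message-ID: <1@example.org>\n"
    "From: alice@example.org\n"
    "Subject: Budget meeting\n"
    "\n"
    "Agenda attached.\n"
)


def open_warehouse(path=":memory:"):
    """Open (or create) the warehouse database with one assumed table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, sender TEXT, subject TEXT, body TEXT)"
    )
    return conn


def ingest(conn, raw_message):
    """Parse one raw message and store it; return True if a new row was added."""
    msg = message_from_string(raw_message)
    # Simplification: only single-part messages keep their body text.
    body = "" if msg.is_multipart() else msg.get_payload()
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO messages VALUES (?, ?, ?, ?)",
        (msg["Message-ID"], msg["From"], msg["Subject"], body),
    )
    conn.commit()
    # An ignored duplicate leaves total_changes unchanged.
    return conn.total_changes > before


if __name__ == "__main__":
    warehouse = open_warehouse()
    print(ingest(warehouse, SAMPLE))  # first copy of the message
    print(ingest(warehouse, SAMPLE))  # same Message-ID again
```

Keying on `Message-ID` is one simple way to avoid storing the same message twice when several users' collections overlap. A real warehouse would also need to handle attachments, multipart bodies and messages without an identifier.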
Indicate exploitation or commercialisation strategies for the outcomes
defined above. Give an indication of target audience and methods of reaching
that audience.
-
A stable distributed groupware environment with
one or two text analysis and visualization tools as a desktop is of immediate
interest to defense intelligence. This can be achieved in year 1, so eDSTC
could have a return on the 3 year investment as little as 1 year into
the project.
-
The additional visualization, process, knowledge
extraction and shared ontology tools naturally follow from a successful
implementation of 1. Further software licensing can occur at this point.
-
The final results of the 3-4 year project can
be transformed into a commercial software product for non-participants in
the final (4th) year of project funding.
Project Resources
Please list team members:
| Prof. Peter Eklund, Project Leader, Griffith University | 50% |
| Dr. Philippe Martin, Research Fellow, Griffith University | 50% |
| Dr. Francois Modave, Postdoctoral Research Fellow, Griffith University | 50% |
Number, roles and percentage availability of research
staff to be recruited (where possible, identify any key skill requirements)
Chief Programmer and Project Coordinator (100%) - Ph.D.
in Computer Science, strong C++ programming experience and software engineering
know-how.
Any additional support staff required from non-research resources
- e.g. project manager, software engineering or sysadmin resources to complete
the project. Identify roles and percentage time required.
-
Systems Administrator (50%) - Unix
-
System Programmer (50%) - C++
-
2 postdoctoral research fellows (one in data warehousing
and another in visualisation)
Associated students (existing or planned) with level of study identified
and time period of involvement.
| Richard Cole, Ph.D. Candidate, Griffith University | 50% - 1 year |
| Bernd Groh, Ph.D. Candidate, Griffith University | 50% - 2 years |
| Thomas Tilley, Ph.D. Candidate, Griffith University | 80% - 3 years |
Identify all significant travel costs for the proposed
period of the project:
-
Domestic or international destination(s)
It is important for at least one member of the project
team to attend the Knowledge and Data Discovery Conference and the International
Conf. on the WWW, plus annual presentations to US-based defense intelligence
and private organizations. Trips interstate to Canberra, Sydney and Adelaide
as per DSTO and other related participant requests.
-
Purpose of travel
Present peer-reviewed papers at conferences. Presentations
to defense and industry sources are likely to generate contract research
outcomes for the eDSTC. Demonstration of the software outcomes is an important
philosophy for this work.
Identify all equipment required for the proposed
project:
-
New software or hardware that would have to be purchased
-
A powerful Solaris Sun Enterprise server, possibly
midrange with considerable disk storage (terabyte capacity);
-
ECA (The Email Concept Analysis program) is not owned by DSTC but it may
be licensed freely for research purposes.
-
Collaboration
Highlight potential or proposed research and commercial
collaboration:
-
Internal (with other projects, involvement in integrators/demos, collaboration
with participants ...)
This proposal intersects with a number of eDSTC preliminary
proposals. In many respects the general distributed environment we propose
to use, the ODSI, can be used to support several of these projects in a
similar fashion to our own. These projects are listed below:
-
Nigel Ward, Records Continuum Research Group, Monash University
-
Renato Iannella, Digital Resource Management
-
Peter Bruza, Advanced Retrieval Technologies for Information and Knowledge
-
Building Knowledge Using Social Information, Tim Mansfield, Nigel Ward
-
Presentation of Ambient Information, Tim Mansfield
-
Explain: answering questions with knowledge, Robert McArthur
-
MatchDetectReveal: finding overlapping documents in digital libraries,
Arkady Zaslavsky (Monash Uni)
-
Automatic Thesaurus Creation, Chris Rowles
-
Access to Invisible Web, Chris Rowles
-
Access to Multiple Heterogeneous Databases, Chris Rowles
-
Managing Information Streams, Stephen Crawley
-
Open Architecture for Collaboration, Tim Mansfield
We anticipate close collaboration with these groups and
individuals. Unique in this proposal is the creation of a data warehouse
as a repository for free-text email documents. There is no exclusivity
in access to this email data warehouse; it is intended to be re-used for
other projects and tasks.
-
External (nominate national or international relationships or organisations)
In each case, indicate the form and extent of the relationship
- and how it will benefit the project and DSTC.
-
Defense Intelligence eDSTC has existing contract research with
Defense Intelligence.
-
TH Darmstadt The Technical University of Darmstadt has been a close
collaborator over the years. They are the creators of formal concept analysis
and have a substantial group of mathematicians (more than a dozen) working
on related theories of lattice and order theory that are of benefit to
the foundational aspects of our symbolic learning approach to data and
knowledge discovery and visualisation. There are existing IREX (International
Researcher Exchange) programmes funded by the ARC with TH Darmstadt, and
two of my students have been recipients of DAAD (Deutscher Akademischer
Austauschdienst) funding.
-
CYCorp The problem of jointly maintaining and using ontologies of
terms for describing shared knowledge and information is close
to the agenda of CYCorp. We hope to extend that relationship (with CYCorp
as a customer) in the years to come.
-
Related work
Identify any similar or related work in the research or commercial
domains. Indicate how the proposed project differs from such other work.
This analysis could and should include related work in the past or new
Research Programme of DSTC.
The idea of CYC is to construct a general knowledge repository
containing relevant ontologies and facts about the world in which we live.
The project has been going for over 10 years. In many respects CYC is an
idea before its time. It is clear that the necessary infrastructure that
allows groups of individuals to exchange knowledge in a seamless and distributed
way is at least as important as the artifacts that they exchange. This
infrastructure was absent in CYC.
As security, backups and bandwidth become increasingly
important to organizations over the next 10 years, the prospect of daily
backups of terabyte desktops looms. This explosion of infrastructure
costs is an economic impossibility since the unit cost of ownership of
computing devices will skyrocket. The age of the unrestricted fat client
will give way to smarter, more customizable thin client groupware environments
that assist in workflow, warehousing and job-related analysis tools.
The scientific advances that will facilitate this "vision"
are scalable visualization and warehousing, distributed workflow environments
and general purpose (but customizable) group productivity tools of the
type described in this project proposal. Building expertise in these areas
gives Australia (and eDSTC) a substantial competitive edge.