Mentor: Dr. Ping Chen
Knowledge is critical in building
intelligent systems. Without sufficient knowledge, computers often
exhibit brittle behaviors, and can only carry out tasks that have been
fully foreseen by their designers. However, many real-world applications
require a large amount of high-quality knowledge, which results in a
serious knowledge acquisition bottleneck. Recently, large-scale knowledge
acquisition has attracted considerable interest in Artificial Intelligence,
Computational Linguistics, and Web Mining. The goal of this project is
to acquire and evaluate a large amount of lexical-dependency knowledge.
Lexical-dependency knowledge consists of dependency relations generated
by dependency parsing of text. A dependency relation is an asymmetric
binary relation between two words, one called head or governor, and the
other called dependent or modifier. In dependency grammars a sentence is
represented as a set of dependency relations, which normally form a tree
that connects all the words in a sentence. For example, ``blue sky''
contains one dependency relation: ``sky -> blue'', where ``sky'' is the
head, and ``blue'' is the dependent. Lexical-dependency knowledge can be
used in many lexicon-level Natural Language Processing applications. For
example, parsing ``Colorless green ideas sleep furiously'' will generate
the following dependency relations, ``idea -> green'', ``idea ->
colorless'', ``sleep -> idea'', and ``sleep -> furiously''. We can
conclude that the sentence makes no sense, since none of its
dependency relations is semantically valid. Similarly, dependency
knowledge can also be used to catch spelling errors by checking semantic
dependency relation violations. Another application is word prediction
and automatic completion by identifying connected words. Potentially,
lexical-dependency knowledge can improve the performance of many Natural
Language Processing (NLP) applications. The challenges of this project
lie in collecting sufficient high-quality knowledge and applying it
efficiently to various NLP fields. Participants in this project will
work through a complete research cycle, including problem analysis,
design, implementation, evaluation, and paper publication.
Requirements: Proficient programming skills in Visual C++ and an
understanding of data structures.
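The idea of checking dependency relations against acquired knowledge can be sketched as follows. This is only an illustration, not the project's implementation: the knowledge base contents, the counts, and the function name supported_fraction are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy knowledge base: (head, dependent) -> observed frequency.
# The entries and counts are made up for illustration only.
KNOWLEDGE_BASE = Counter({
    ("sky", "blue"): 120,
    ("idea", "good"): 85,
    ("sleep", "soundly"): 40,
    ("sleep", "baby"): 33,
})

def supported_fraction(relations, kb, min_count=1):
    """Return the fraction of (head, dependent) pairs seen at least min_count times."""
    if not relations:
        return 0.0
    supported = sum(1 for rel in relations if kb[rel] >= min_count)
    return supported / len(relations)

blue_sky = [("sky", "blue")]
chomsky = [
    ("idea", "green"),
    ("idea", "colorless"),
    ("sleep", "idea"),
    ("sleep", "furiously"),
]

print(supported_fraction(blue_sky, KNOWLEDGE_BASE))  # 1.0
print(supported_fraction(chomsky, KNOWLEDGE_BASE))   # 0.0
```

A sentence whose relations are mostly unsupported is a candidate for being nonsensical or misspelled; a real system would need a far larger knowledge base and a tuned threshold.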
(1) Co-reference resolution
Student: Andrew Tran (Midterm-presentation)
The purpose of the
project is to research, develop, and test a method of co-reference
resolution, specifically pronoun resolution. Co-reference resolution
remains an unsolved problem in computational linguistics. Our system
will take a raw document, pass it through a series of analysis steps,
and then determine who or what each pronoun in the document refers to.
(2) Dependency parsing visualization
Student: Antoine W.B. (Midterm-presentation)
The current format of
the minipar output can be confusing and aesthetically displeasing. The
objective of my program is to display the parsed structure of sentences
entered by the user in a tree format. This will be done by using the
functions and information provided by minipar and rendering the result
with custom drawing algorithms to produce the tree structure.
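As a rough illustration of turning dependency relations into a tree display, the sketch below prints an indented text tree. It does not call minipar; the input is assumed to be a list of (head, dependent) pairs already extracted from a parse, and each word is assumed to appear only once in the sentence. The function name render_tree is hypothetical.

```python
from collections import defaultdict

def render_tree(root, relations):
    """Return an indented text rendering of a dependency tree.

    relations is a list of (head, dependent) pairs; words are assumed unique.
    """
    children = defaultdict(list)
    for head, dependent in relations:
        children[head].append(dependent)

    lines = []

    def visit(word, depth):
        # Indent each word by two spaces per level below the root.
        lines.append("  " * depth + word)
        for child in children.get(word, []):
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)

relations = [
    ("sleep", "idea"),
    ("sleep", "furiously"),
    ("idea", "colorless"),
    ("idea", "green"),
]
print(render_tree("sleep", relations))
```

For the relations above, the root "sleep" is printed first, with "idea" and "furiously" indented beneath it and "colorless" and "green" indented beneath "idea". A graphical version would replace the indentation with node coordinates.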
2. Word sense disambiguation
Mentor: Dr. Ping Chen
Student: Max Choly (Midterm-presentation)
Word sense
disambiguation is the process of determining the correct sense/meaning
of a word within a given context. The current WSD method, developed at
UHD's AI lab, employs a strategy called Tree Matching. This algorithm
parses the glosses of to-be-disambiguated words into dependency trees.
Each node/word of these glosses is located in a large context knowledge
base, which itself contains dependency relations collected from web
searches. If the same words that appear in the knowledge base also
appear within the original sentence, a score is generated based on the
frequency of these words within the knowledge base. The Tree Matching
algorithm has been extended to additionally match synonyms of words, and
my work has so far further extended this approach to match hypernyms and
hyponyms of words.
Requirements: Proficient programming skills in Visual C++ and an
understanding of data structures.
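The scoring step described above can be sketched in a heavily simplified form. The sketch below treats each gloss as a flat list of words rather than a parsed dependency tree, and it omits synonym, hypernym, and hyponym matching, so it is not the Tree Matching algorithm itself. The knowledge base entries, counts, glosses, and function names are all assumptions made for the example.

```python
# Hypothetical knowledge base: (head, dependent) -> frequency. Made-up counts.
KB = {
    ("deposit", "money"): 50,
    ("institution", "financial"): 30,
    ("walk", "river"): 20,
}

# Hypothetical glosses for two senses of "bank", reduced to content words.
GLOSSES = {
    "bank (financial)": ["financial", "institution", "money"],
    "bank (river)": ["slope", "land", "river"],
}

def sense_score(gloss_words, context_words, kb):
    """Sum the frequencies of knowledge-base relations that link a gloss
    word to a word of the original sentence, in either direction."""
    score = 0
    for g in gloss_words:
        for c in context_words:
            score += kb.get((c, g), 0) + kb.get((g, c), 0)
    return score

def disambiguate(glosses, context_words, kb):
    """Return the sense whose gloss scores highest against the context."""
    return max(glosses, key=lambda s: sense_score(glosses[s], context_words, kb))

# Content words from "He deposited money in the bank", lemmatized by hand.
context = ["deposit", "money"]
print(sense_score(GLOSSES["bank (financial)"], context, KB))  # 50
print(disambiguate(GLOSSES, context, KB))                     # bank (financial)
```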
3. Time Tagger
This project aims at
finding time-related phrases in natural language text. The program
matches time phrases in text against a large collection of predefined
templates. All time phrases are tagged according to the TimeML
Annotation Guidelines, v. 1.0. The program attempts to extract the
value of the matched phrase and convert it to a standard format. Each
phrase is also assigned a “Type” of “Time”, “Date”, or “Duration”. This
information can be used by automated processes to better understand and
classify material.
Requirements: Proficient programming skills in Visual C++ and an
understanding of data structures.
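Template matching of this kind can be sketched with regular expressions. The three templates below are assumptions chosen for illustration and are far fewer than a real collection would contain; the normalized values follow ISO 8601-style notation, which is assumed here to be an acceptable standard format. The function name tag_times is hypothetical.

```python
import re

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
UNITS = {"day": "D", "week": "W", "month": "M", "year": "Y"}

def _date(m):
    return "%04d-%02d-%02d" % (
        int(m.group("y")), MONTHS[m.group("mon").lower()], int(m.group("d")))

def _time(m):
    hour = int(m.group("h")) % 12          # 12 am -> 0, 12 pm -> 12 after shift
    if m.group("ampm").lower() == "pm":
        hour += 12
    return "T%02d:%02d" % (hour, int(m.group("min")))

def _duration(m):
    return "P%d%s" % (int(m.group("n")), UNITS[m.group("unit").lower()])

# Each template: (type, compiled pattern, function that normalizes a match).
TEMPLATES = [
    ("Date", re.compile(
        r"\b(?P<mon>%s) (?P<d>\d{1,2}), (?P<y>\d{4})\b" % "|".join(MONTHS),
        re.I), _date),
    ("Time", re.compile(
        r"\b(?P<h>\d{1,2}):(?P<min>\d{2}) ?(?P<ampm>am|pm)\b", re.I), _time),
    ("Duration", re.compile(
        r"\b(?P<n>\d+) (?P<unit>day|week|month|year)s?\b", re.I), _duration),
]

def tag_times(text):
    """Return (phrase, type, normalized value) tuples in order of appearance."""
    tags = []
    for type_, pattern, normalize in TEMPLATES:
        for m in pattern.finditer(text):
            tags.append((m.start(), m.group(0), type_, normalize(m)))
    return [tag[1:] for tag in sorted(tags)]

for tag in tag_times("The meeting on July 4, 2008 starts at 2:30 pm and lasts 3 days."):
    print(tag)
# ('July 4, 2008', 'Date', '2008-07-04')
# ('2:30 pm', 'Time', 'T14:30')
# ('3 days', 'Duration', 'P3D')
```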
4. Automatic diagnosis
This project aims at
automatically assigning ICD-9-CM codes to clinical free texts. These
texts are actual doctors' reports on patients, containing clinical
impressions and notes. The purpose of this program is to compare
previously edited and coded clinical free texts with uncoded texts and
to assign the correct ICD-9-CM code to each uncoded text. This is done
with Natural Language Processing algorithms that use word and sentence
comparison to form strong matches between texts.
Requirements: Proficient programming skills in Visual C++ and an
understanding of data structures.
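One simple way to compare an uncoded text with coded ones is nearest-neighbor matching on word overlap. The sketch below uses Jaccard similarity over word sets, which is a stand-in technique and not necessarily the comparison the project uses. The report texts are invented, and the code pairings are illustrative and should be verified against the official ICD-9-CM tables.

```python
import re

def tokens(text):
    """Lowercase a text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def assign_code(report, coded_reports):
    """Return (code, similarity) of the most similar coded report."""
    report_tokens = tokens(report)
    best_text, best_code = max(
        coded_reports, key=lambda pair: jaccard(report_tokens, tokens(pair[0])))
    return best_code, jaccard(report_tokens, tokens(best_text))

# Invented example reports paired with illustrative ICD-9-CM codes.
CODED = [
    ("chronic cough for three weeks no fever", "786.2"),
    ("right lower lobe infiltrate consistent with pneumonia", "486"),
    ("recurrent urinary tract infection", "599.0"),
]

print(assign_code("persistent cough, no fever", CODED))  # ('786.2', 0.375)
```

A practical system would also need to handle negation ("no fever"), abbreviations, and reports that carry several codes.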
5. Semantic Association Rule Analysis
Mentor: Dr. Ping Chen
Student: Walter Garcia (Midterm-presentation)
When applying association mining to
real datasets, a major obstacle is that often a huge number of rules
are generated even with very reasonable support and confidence.
Among these rules, many are trivial, redundant, semantically wrong,
or already known by end-users. Association rule post-processing aims
to remove these undesired rules. Existing work mainly focuses on
reducing redundant rules or finding unexpected ones. In this
paper, we propose an innovative method based on a semantic network. We
semantically divide association rules into five categories: trivial,
known and correct, unknown and correct, known and incorrect, unknown
and incorrect. Our method can be efficiently integrated with
existing rule reduction techniques to construct a concise,
high-quality, and user-specific association rule set. We evaluate
our approach on a real public-health dataset, the Heartfelt study,
and we can prune off 97.81% of association rules that are trivial
or incorrect. The remaining rules are confirmed by either health
science literature or a high-quality biomedical knowledge base.
The Heartfelt Study is a medical study
that examined 383 children aged 11 to 16 years. The results of the
study were saved in a transactional data set. The first goal of
this project was to use the Maximal Frequent Itemset Algorithm
(MAFIA) to find the most frequent subsets from this data set. Once
this was completed, a semantic analyzer (SemAna) was used to filter
out the trivial subsets from the non-trivial. We are currently
checking our results against other studies that have been performed on
the Heartfelt Study to ensure that our method is working properly. Once
this step is completed, we will try to find subsets that are
interesting, useful, correct, and not yet discovered.
Requirements: Proficient programming skills in Visual C++ and an
understanding of basic data mining concepts.
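For readers new to the terms used above, the sketch below computes support and confidence and finds maximal frequent itemsets by brute force on a toy transactional data set. It is not the MAFIA algorithm, which is designed to avoid this exhaustive enumeration, and the item names are invented rather than taken from the Heartfelt Study.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base

def maximal_frequent_itemsets(transactions, min_support):
    """Frequent itemsets that have no frequent proper superset (brute force)."""
    items = sorted(set().union(*transactions))
    frequent = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(c, transactions) >= min_support
    ]
    return [s for s in frequent if not any(s < other for other in frequent)]

# Invented transactions; each is the set of attributes observed for one subject.
transactions = [
    frozenset({"high_bmi", "high_bp", "low_activity"}),
    frozenset({"high_bmi", "high_bp"}),
    frozenset({"high_bmi", "low_activity"}),
    frozenset({"normal_bmi"}),
]

print(support({"high_bmi", "high_bp"}, transactions))       # 0.5
print(confidence({"high_bmi"}, {"high_bp"}, transactions))  # 2/3, about 0.667
print(maximal_frequent_itemsets(transactions, 0.5))
```

With a minimum support of 0.5, the two maximal frequent itemsets are {high_bmi, high_bp} and {high_bmi, low_activity}.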
Mentor: Dr. Hong Lin
Student: Wilberto J. Lopez Perez (Midterm-presentation)
The 10-week
summer research project performed by 2 students in the Grid Computing
Lab of the University of Houston-Downtown (UHD) will focus on completing
the construction of the cluster-centered computing grid that is
supported by the Major Research Instrumentation (MRI) grant
“Acquisition of a Computational Cluster-Grid for Research and Education
in Science and Mathematics”. More computers, networking, and storage
equipment are being purchased with the remaining funds in this grant
and funds from the NSF
CI-TEAM Minority Serving Institutions Cyber-Infrastructure Empowerment
Coalition (MSI-CIEC). The two students supported by
the REU grant will continue the construction of the grid to incorporate
the new equipment and continue to serve faculty members
in their research projects while performing research on the
implementation of a multi-agent e-learning system. They will also
continue to assist students in computer science classes during the
summer with the use of grid computing resources.
The GUI for
the parallel computing lab is a graphical user interface that will permit
students to carry out their lab work online. The lab is designed to
interact with a computer cluster, so that students can experiment
on their own while learning in the process. This particular lab
concentrates on the performance of parallel programs, but other labs can
be built on it because of the way it was implemented.
Requirements:
Understanding of computer organization and architecture concepts,
general computer hardware and operating systems.
7. A high order chemistry model of computation
Mentor: Dr. Hong Lin
Student: Wilfredo Molina (Midterm-presentation)
This project will design and
implement a parser for a high order chemistry model of computation
implemented using IBM TSpaces and run in parallel on the compute
cluster. Data are represented as molecules and contained in solutions,
which are essentially multisets. Molecules can interact with each other
through chemical reactions until they become inert or no more operations
can be performed on them.
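The reaction loop described above can be sketched sequentially in a few lines. The sketch below is only an illustration of multiset rewriting: it does not use TSpaces, does not run in parallel, and does not cover the high order features of the model. The function name react and the two example reactions are assumptions.

```python
def react(solution, condition, action):
    """Apply a pairwise reaction to a multiset until it is inert.

    condition(x, y) says whether molecules x and y can react;
    action(x, y) returns the list of molecules that replace them.
    """
    solution = list(solution)
    changed = True
    while changed:
        changed = False
        for i in range(len(solution)):
            for j in range(len(solution)):
                if i != j and condition(solution[i], solution[j]):
                    products = action(solution[i], solution[j])
                    # Remove the two reactants, then add the products.
                    solution = [m for k, m in enumerate(solution)
                                if k not in (i, j)]
                    solution.extend(products)
                    changed = True
                    break
            if changed:
                break
    return solution

# Maximum: whenever x >= y, the pair is replaced by x alone.
print(react([4, 9, 2, 7], lambda x, y: x >= y, lambda x, y: [x]))  # [9]

# Primes: whenever x divides y, the pair is replaced by x alone.
primes = react(range(2, 21), lambda x, y: y % x == 0, lambda x, y: [x])
print(sorted(primes))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

Because any pair that satisfies the condition may react in any order, independent pairs could react at the same time, which is where the model's parallelism comes from.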
8. Gamma programming language
Mentor: Dr. Hong Lin
Student: Jeremy Kemp (Midterm-presentation)
This research involves
implementing a high-level programming language called Gamma on top of
TSpaces, IBM's cross-platform Java communications platform. Gamma
is based on a programming paradigm called the Chemical Reaction Model,
which is inherently parallel. This allows smaller, simpler
high-level programs to be implemented with a high degree of parallelism
across any platform.
Advances in remote sensing of the Earth and other planets have made
numerous geospatial observations available, including multi-spectral
images and surface topography. From these measurements, environmental
quantities such as mineralogy, land cover, soil properties, vegetation
density, surface temperature, and precipitation are derived as
geospatial datasets. These variables are highly related to each other:
if we choose one of them (the class variable), we can explore the
controlling factors responsible for its distribution by analyzing the
explanatory variables. The purpose of this project is to develop and
apply a tool that automatically analyzes a set of class and explanatory
geospatial variables and provides a complete description of the class
variable's dependence on the explanatory variables. The tool uses a
fusion of techniques, including association analysis, reinforcement
learning, and similarity measurement.
Requirements: Proficient programming skills in Mathematica.