
Project Description

1. Natural Language Processing and Knowledge Acquisition

Mentor: Dr. Ping Chen

Knowledge is critical in building intelligent systems. Without sufficient knowledge, computers often exhibit brittle behavior and can only carry out tasks that have been fully foreseen by their designers. However, many real-world applications require a large amount of high-quality knowledge, which results in a serious knowledge acquisition bottleneck. Recently, large-scale knowledge acquisition has attracted a lot of interest in Artificial Intelligence, Computational Linguistics, and Web Mining. The goal of this project is to acquire and evaluate a large amount of lexical-dependency knowledge. Lexical-dependency knowledge consists of dependency relations generated by dependency parsing of text. A dependency relation is an asymmetric binary relation between two words, one called the head or governor, and the other called the dependent or modifier. In dependency grammars a sentence is represented as a set of dependency relations, which normally form a tree that connects all the words in the sentence. For example, “blue sky” contains one dependency relation, “sky -> blue”, where “sky” is the head and “blue” is the dependent. Lexical-dependency knowledge can be used in many lexicon-level Natural Language Processing (NLP) applications. For example, parsing “Colorless green ideas sleep furiously” generates the dependency relations “idea -> green”, “idea -> colorless”, “sleep -> idea”, and “sleep -> furiously”, and we can conclude that the sentence makes no sense because none of its dependency relations is semantically valid. Similarly, dependency knowledge can be used to catch spelling errors by checking for violations of semantic dependency relations. Another application is word prediction and automatic completion by identifying connected words. Lexical-dependency knowledge can potentially improve the performance of many NLP applications.
The challenges of this project lie in collecting sufficient high-quality knowledge and in applying that knowledge efficiently to various NLP fields. Participants in this project will work through a complete cycle of research, including problem analysis, design, implementation, evaluation, and paper publication.
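
The idea of checking a sentence against lexical-dependency knowledge can be sketched as follows. This is only an illustration: the knowledge base contents, the counts, and the function name unsupported_relations are assumptions made for the example, and the project's actual knowledge base and parser output are not shown here.

```python
from collections import Counter

# Hypothetical knowledge base: how often each (head, dependent) pair
# was observed in parsed text. The entries and counts are made up.
KNOWLEDGE_BASE = Counter({
    ("sky", "blue"): 120,
    ("idea", "new"): 85,
    ("sleep", "soundly"): 40,
    ("sleep", "child"): 33,
})

def unsupported_relations(relations, kb=KNOWLEDGE_BASE, min_count=1):
    """Return the relations seen fewer than min_count times in the knowledge base."""
    return [rel for rel in relations if kb[rel] < min_count]

# Relations for "Colorless green ideas sleep furiously", written as (head, dependent).
sentence_relations = [
    ("idea", "green"),
    ("idea", "colorless"),
    ("sleep", "idea"),
    ("sleep", "furiously"),
]

print(unsupported_relations([("sky", "blue")]))   # []
print(unsupported_relations(sentence_relations))  # all four relations are unsupported
```

A sentence whose relations are all unsupported is a candidate for being semantically invalid; the same lookup could flag a single suspicious relation caused by a misspelled word.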

Requirements: Proficiency in Visual C++ programming and an understanding of data structures.

(1) Co-reference resolution

Student: Andrew Tran (Midterm-presentation)

The purpose of the project is to research, develop, and test a method of co-reference resolution, specifically pronoun resolution. Co-reference resolution is still an unsolved problem in computational linguistics. Our system will take a raw document, pass it through a series of analysis steps, and then determine who or what each pronoun in the document refers to.
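
To make the task concrete, here is a very simple baseline, not the method developed in this project: it picks the nearest preceding mention that agrees with the pronoun in gender and number. The Mention class, the PRONOUN_FEATURES table, and the function resolve_pronoun are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    position: int  # token index in the document
    gender: str    # "male", "female", or "neuter"
    number: str    # "singular" or "plural"

# Hypothetical agreement features for a few English pronouns.
# A gender of None means the pronoun places no constraint on gender.
PRONOUN_FEATURES = {
    "he": ("male", "singular"),
    "him": ("male", "singular"),
    "she": ("female", "singular"),
    "her": ("female", "singular"),
    "it": ("neuter", "singular"),
    "they": (None, "plural"),
}

def resolve_pronoun(pronoun, position, mentions):
    """Return the closest earlier mention that agrees with the pronoun, or None."""
    gender, number = PRONOUN_FEATURES[pronoun.lower()]
    candidates = [
        m for m in mentions
        if m.position < position
        and m.number == number
        and (gender is None or m.gender == gender)
    ]
    return max(candidates, key=lambda m: m.position, default=None)

# "Mary met John before she left the office."
mentions = [
    Mention("Mary", 0, "female", "singular"),
    Mention("John", 2, "male", "singular"),
]
print(resolve_pronoun("she", 4, mentions).text)  # Mary
```

Real documents need much more than agreement and distance (syntax, salience, world knowledge), which is part of why the problem remains open.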
 

(2) Dependency parsing visualization

Student: Antoine W.B. (Midterm-presentation)

The current format of the MINIPAR output can be confusing and aesthetically displeasing. The objective of my program is to display the parsed structure of sentences entered by the user in a tree format. This will be done by using the functions and information provided by MINIPAR and rendering the result with drawing algorithms that lay out the tree structure.
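
As a rough text-only illustration of the goal, the sketch below turns a list of (head, dependent) relations into an indented tree. It does not call MINIPAR; the relation list is typed in by hand, the function name render_tree is an assumption, and the sketch assumes each word occurs once and the relations form a tree.

```python
from collections import defaultdict

def render_tree(relations, root):
    """Render (head, dependent) relations as an indented tree, one word per line."""
    children = defaultdict(list)
    for head, dependent in relations:
        children[head].append(dependent)

    lines = []

    def visit(word, depth):
        lines.append("  " * depth + word)
        for child in children.get(word, []):
            visit(child, depth + 1)

    visit(root, 0)
    return "\n".join(lines)

relations = [
    ("sleep", "idea"),
    ("idea", "green"),
    ("idea", "colorless"),
    ("sleep", "furiously"),
]
print(render_tree(relations, "sleep"))
# sleep
#   idea
#     green
#     colorless
#   furiously
```

A graphical version would replace the indentation step with a layout that assigns screen coordinates to each node.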
 

2. Word sense disambiguation

Mentor: Dr. Ping Chen

Student: Max Choly (Midterm-presentation)

Word sense disambiguation (WSD) is the process of determining the correct sense or meaning of a word within a given context. The current WSD method, developed at UHD's AI lab, employs a strategy called Tree Matching. This algorithm parses the glosses of the words to be disambiguated into dependency trees. Each node (word) of these glosses is looked up in a large context knowledge base, which itself contains dependency relations collected from web searches. If words that appear in the knowledge base also appear in the original sentence, a score is generated based on the frequency of these words within the knowledge base. The Tree Matching algorithm has been extended to also match synonyms of words, and my work so far has further extended this approach to match hypernyms and hyponyms of words.
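
The scoring idea can be sketched in a much simplified form. The sketch below ignores the tree structure and the synonym, hypernym, and hyponym extensions, so it is not the Tree Matching algorithm itself; the glosses, the knowledge base entries, the frequencies, and the names score_sense and disambiguate are all assumptions made for illustration.

```python
# Hypothetical glosses, reduced to their content words.
GLOSSES = {
    "bank#1": ["financial", "institution", "deposit", "money"],
    "bank#2": ["sloping", "land", "beside", "river"],
}

# Hypothetical context knowledge base: word -> {related word: frequency}.
CONTEXT_KB = {
    "money": {"deposit": 50, "withdraw": 30, "loan": 20},
    "institution": {"loan": 10},
    "river": {"fish": 25, "boat": 15, "water": 40},
    "land": {"water": 5},
}

def score_sense(gloss_words, sentence_words, kb=CONTEXT_KB):
    """Sum the frequencies of knowledge-base neighbors that occur in the sentence."""
    sentence = set(sentence_words)
    score = 0
    for gloss_word in gloss_words:
        for related, frequency in kb.get(gloss_word, {}).items():
            if related in sentence:
                score += frequency
    return score

def disambiguate(senses, sentence_words):
    """Pick the sense whose gloss scores highest against the sentence."""
    return max(senses, key=lambda sense: score_sense(GLOSSES[sense], sentence_words))

sentence = ["he", "sat", "on", "the", "bank", "to", "fish", "in", "the", "water"]
print(disambiguate(["bank#1", "bank#2"], sentence))  # bank#2
```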

Requirements: Proficiency in Visual C++ programming and an understanding of data structures.

3. Time Tagger

Mentor: Dr. Ping Chen

Student: Stanley Roberts (Midterm-presentation)

This project aims to find time-related phrases in natural language text. The program matches time phrases in text against a large collection of predefined templates. All time phrases are tagged according to the TimeML Annotation Guidelines, v. 1.0. The program attempts to extract the value of the matched phrase and convert it to a standard format. Each phrase is also assigned a “Type” of “Time”, “Date”, or “Duration”. This information can be used by automated processes to better understand and classify material.
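
A minimal sketch of template matching is shown below, with three hand-written regular expressions standing in for the project's much larger template collection. The templates, the helper names, and the output tuple layout are assumptions; the normalized values follow the ISO 8601 style used by TimeML (for example 2009-06-15, T15:30, P3D).

```python
import re

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: index for index, name in enumerate(MONTH_NAMES, start=1)}

def normalize_date(match):
    month = MONTHS[match.group(1).lower()]
    return f"{match.group(3)}-{month:02d}-{int(match.group(2)):02d}"

def normalize_time(match):
    hour = int(match.group(1)) % 12
    if match.group(3).lower() == "pm":
        hour += 12
    return f"T{hour:02d}:{match.group(2)}"

def normalize_duration(match):
    # "day" -> D, "week" -> W, "month" -> M, "year" -> Y
    return f"P{match.group(1)}{match.group(2)[0].upper()}"

# Hypothetical templates: (type, pattern, normalizer).
TEMPLATES = [
    ("DATE", re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b",
                        re.IGNORECASE), normalize_date),
    ("TIME", re.compile(r"\b(\d{1,2}):(\d{2}) ?(am|pm)\b", re.IGNORECASE), normalize_time),
    ("DURATION", re.compile(r"\b(\d+) (day|week|month|year)s?\b", re.IGNORECASE),
     normalize_duration),
]

def tag_time_phrases(text):
    """Return (phrase, type, value) tuples in order of appearance."""
    found = []
    for timex_type, pattern, normalize in TEMPLATES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(0), timex_type, normalize(match)))
    return [item[1:] for item in sorted(found)]

text = "The workshop on June 15, 2009 starts at 3:30 pm and runs for 3 days."
for phrase in tag_time_phrases(text):
    print(phrase)
# ('June 15, 2009', 'DATE', '2009-06-15')
# ('3:30 pm', 'TIME', 'T15:30')
# ('3 days', 'DURATION', 'P3D')
```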

Requirements: Proficiency in Visual C++ programming and an understanding of data structures.

4. Automatic diagnosis

Mentor: Dr. Ping Chen

Student: Chris Rhodes (Midterm-presentation)

This project aims to automatically assign ICD-9-CM codes to clinical free texts. These texts are actual doctors' reports on patients, containing clinical impressions and notes. The purpose of the program is to build a system that compares previously edited and coded clinical free texts with uncoded texts and accurately assigns the correct ICD-9-CM code. This is done with Natural Language Processing algorithms that use word and sentence comparison to form strong matches between texts.
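
One simple way to compare an uncoded text with coded ones is word overlap. The sketch below assigns the code of the most similar coded report using Jaccard similarity; it is a stand-in for illustration, not the project's matching algorithm. The two sample reports are invented, and the function names are assumptions.

```python
import re

# Invented examples of already coded reports: (text, ICD-9-CM code).
CODED_REPORTS = [
    ("persistent cough and fever, chest x-ray shows pneumonia", "486"),
    ("burning on urination, urinary tract infection suspected", "599.0"),
]

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two word sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def assign_code(text, coded_reports=CODED_REPORTS):
    """Return (code, similarity) of the coded report most similar to the text."""
    words = tokenize(text)
    best_text, best_code = max(
        coded_reports, key=lambda report: jaccard(words, tokenize(report[0]))
    )
    return best_code, jaccard(words, tokenize(best_text))

code, similarity = assign_code("fever with cough, pneumonia seen on chest film")
print(code)  # 486
```

A practical system would also need a similarity threshold below which no code is assigned, since the nearest report may still be a poor match.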

Requirements: Proficiency in Visual C++ programming and an understanding of data structures.

5. Semantic Association Rule Analysis

Mentor: Dr. Ping Chen

Student: Walter Garcia (Midterm-presentation)

When applying association mining to real datasets, a major obstacle is that a huge number of rules are often generated even with very reasonable support and confidence thresholds. Among these rules, many are trivial, redundant, semantically wrong, or already known by end-users. Association rule post-processing aims to remove these undesired rules. Existing work mainly focuses on removing redundant rules or finding unexpected ones. In this paper, we propose an innovative method based on a semantic network. We semantically divide association rules into five categories: trivial, known and correct, unknown and correct, known and incorrect, and unknown and incorrect. Our method can be efficiently integrated with existing rule reduction techniques to construct a concise, high-quality, and user-specific association rule set. We evaluate our approach on a real public-health dataset, the Heartfelt study, and we can prune off 97.81% of association rules that are trivial or incorrect. The remaining rules are confirmed by either health science literature or a high-quality biomedical knowledge base.
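
For readers new to association mining, the two thresholds mentioned above can be computed as follows. The transactions are invented for illustration and are not taken from the Heartfelt study; the item names and function names are assumptions.

```python
# Invented transactions; each is the set of items observed for one subject.
transactions = [
    {"high_bmi", "high_bp"},
    {"high_bmi", "high_bp", "low_activity"},
    {"high_bmi", "low_activity"},
    {"low_activity"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Rule: high_bmi -> high_bp
print(support({"high_bmi", "high_bp"}, transactions))                 # 0.5
print(round(confidence({"high_bmi"}, {"high_bp"}, transactions), 2))  # 0.67
```

Even this tiny dataset yields several rules above modest thresholds, which hints at why post-processing is needed on real data.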

The Heartfelt Study is a medical study that examined 383 children aged 11 to 16 years. The results of the study were saved in a transactional data set. The first goal of this project was to use the Maximal Frequent Itemset Algorithm (MAFIA) to find the most frequent subsets in this data set. Once this was completed, a semantic analyzer (SemAna) was used to separate the trivial subsets from the non-trivial ones. We are currently checking our results against other studies that have been performed on the Heartfelt Study to ensure that our method is working properly. Once this step is completed, we will try to find subsets that are interesting, useful, correct, and not yet discovered.
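
What a maximal frequent itemset is can be shown with a brute-force sketch. This is not MAFIA, which uses a far more efficient search; the code below simply enumerates every itemset and is only practical for a handful of items. The transactions and the function name are invented for illustration.

```python
from itertools import combinations

def maximal_frequent_itemsets(transactions, min_count):
    """Return frequent itemsets that have no frequent proper superset (brute force)."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            # Count the transactions that contain the whole candidate.
            if sum(candidate <= t for t in transactions) >= min_count:
                frequent.append(candidate)
    return [s for s in frequent if not any(s < other for other in frequent)]

# Invented transactions, not taken from the Heartfelt study.
transactions = [
    {"high_bmi", "high_bp"},
    {"high_bmi", "high_bp", "low_activity"},
    {"high_bmi", "low_activity"},
    {"low_activity"},
]
for itemset in maximal_frequent_itemsets(transactions, min_count=2):
    print(sorted(itemset))
# ['high_bmi', 'high_bp']
# ['high_bmi', 'low_activity']
```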
 

Requirements: Proficiency in Visual C++ programming and an understanding of basic data mining concepts.

6. High Performance Computing

Mentor: Dr. Hong Lin

Student: Wilberto J. Lopez Perez (Midterm-presentation)

This 10-week summer research project, carried out by two students in the Grid Computing Lab of the University of Houston-Downtown (UHD), will focus on completing the construction of the cluster-centered computing grid supported by the Major Research Instrumentation (MRI) grant “Acquisition of a Computational Cluster-Grid for Research and Education in Science and Mathematics”. More computers, networking equipment, and storage equipment are being purchased with the remaining funds from this grant and with funds from the NSF CI-TEAM Minority Serving Institutions Cyber-Infrastructure Empowerment Coalition (MSI-CIEC). The two students supported by the REU grant will continue building the grid to incorporate the new equipment. They will also continue to serve faculty members in their research projects, perform research on the implementation of a multi-agent e-learning system, and assist students in computer science classes during the summer with the use of grid computing resources.

The GUI for the parallel computing lab is a graphical user interface that lets students complete their lab online. The lab is designed to interact with a computer cluster, so students can experiment on their own while they learn. This particular lab concentrates on the performance of parallel programs, but its design allows other labs to be built from it.

Requirements: Understanding of computer organization and architecture concepts, general computer hardware and operating systems.

7. A high order chemistry model of computation

Mentor: Dr. Hong Lin

Student: Wilfredo Molina (Midterm-presentation)

This project will design and implement a parser for a high order chemistry model of computation, implemented using IBM TSpaces and run in parallel on the compute cluster. Data are represented as molecules and contained in solutions, which are essentially multisets. Molecules can interact with each other through chemical reactions until they become inert or no more operations can be performed on them.
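
The reaction idea can be illustrated with a small sequential simulation. The sketch below is first-order only, does not use IBM TSpaces, and fires one reaction at a time, whereas the model itself allows independent reactions to proceed in parallel. The function name react and its parameters are assumptions made for this example.

```python
def react(solution, condition, action):
    """Apply a pairwise reaction to a multiset until no pair of molecules can react.

    condition(x, y) says whether molecules x and y react;
    action(x, y) returns the list of molecules that replaces them.
    """
    molecules = list(solution)
    changed = True
    while changed:
        changed = False
        for i in range(len(molecules)):
            for j in range(len(molecules)):
                if i != j and condition(molecules[i], molecules[j]):
                    x, y = molecules[i], molecules[j]
                    # Remove the two reactants and add the reaction products.
                    molecules = [m for k, m in enumerate(molecules) if k not in (i, j)]
                    molecules.extend(action(x, y))
                    changed = True
                    break
            if changed:
                break
    return molecules

# Maximum of a multiset: whenever x >= y, the pair is replaced by x alone.
print(react([4, 9, 2, 7], lambda x, y: x >= y, lambda x, y: [x]))  # [9]

# Sum of a multiset: any two molecules react and are replaced by their sum.
print(react([4, 9, 2, 7], lambda x, y: True, lambda x, y: [x + y]))  # [22]
```

The solution becomes inert when no pair satisfies the condition, which is the termination state described above. Whether a program terminates depends on the reaction chosen.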

8. Gamma programming language

Mentor: Dr. Hong Lin

Student: Jeremy Kemp (Midterm-presentation)

This research involves implementing a high-level programming language called Gamma on top of TSpaces, IBM's cross-platform communications platform for Java. Gamma is based on a programming paradigm called Chemical Reaction model programming, which is inherently parallel. This allows smaller, simpler high-level programs to be implemented with a high degree of parallelism across any platform.

9. Spatial data mining

Mentor: Dr. Tomasz Stepinski

Student: Josue Salazar  (Midterm-presentation)

Advances in remote sensing of the Earth and other planets have made numerous geospatial observations available, including multi-spectral images and surface topography. From these measurements, environmental quantities such as mineralogy, land cover, soil properties, vegetation density, surface temperature, and precipitation are derived as geospatial datasets. These variables are highly related to each other, so if we choose one of them (called the class variable), we can explore the factors that control its distribution by analyzing the others (the explanatory variables). The purpose of this project is to develop and apply a tool that automatically analyzes a set of class and explanatory geospatial variables and provides a complete description of how the class variable depends on the explanatory variables. The tool uses a fusion of techniques, including association analysis, reinforcement learning, and similarity measurement.

Requirements: Proficiency in Mathematica programming.

The UHD REU site (CNS 0851984) is sponsored by the National Science Foundation and Scholar Academy.
University of Houston-Downtown
Department of Computer & Mathematical Sciences | College of Sciences & Technology