· 2014
Massive Open Online Courses (MOOCs) have introduced a completely new style of learning and teaching, and with it a massive amount of student behavioral data, some of which is exclusive to the MOOC environment. This data opens up many possibilities for educators to study a question they have always wanted to answer: how do students solve problems? In this thesis, we present and address some of the numerous challenges one encounters when mining MOOC data to answer this seemingly simple question. Using the data from MITx's 6.002x Spring 2012 course offering, we describe in detail a large-scale, mixed automated and manual process that starts with the reorganization of MOOCdb source data into relevant and retrieval-efficient abstractions we call student resource trajectories and answer type transition matrices. This step must be interleaved with meticulous and painstaking automatic and manual curation of the data to remove errors and irrelevancies while aggregating redundancies, reducing noise, and ensuring meaningful, trustworthy variables. Even so, only an estimate of student resource usage behavior during problem solving is available. With all student trajectories for every problem of 6.002x extracted, we demonstrate some analyses of student behavior for the whole student population. These offer some insight into a problem's level of difficulty and into student behavior around a problem type, such as homework. Next, in order to study how students reached the correct solution to a problem, we categorize problem answers and consider how students move from one incorrect answer to their next attempt. This requires extensive filtering out of irrelevancies and rankings. Detailed knowledge of resources, as would be expected of an instructor, appears to be crucial to understanding the implications of the statistics we derive on frequency of resource usage in general and per attempt.
We also identify solution equivalence and interpretation as significant hurdles in obtaining insights. Finally, we try to describe students' problem-solving process in terms of resource use patterns by using hidden Markov modeling with original variable definitions and three different variable relationships (graphical structures). We evaluate how well these models actually describe the student trajectories and try to use them to predict upcoming student submission events on 24 selected homework problems. The model with the most complex variable relationships proves to be the most accurate.
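As a rough illustration of what an answer type transition matrix could look like, the sketch below counts how often one answer category is followed by another across students' consecutive attempts and normalises each row. The category names, the sample attempts, and the function `transition_matrix` are hypothetical and are not taken from the thesis or from MOOCdb.

```python
from collections import Counter


def transition_matrix(sequences, states):
    """Row-normalised frequencies of moves between consecutive answer categories."""
    counts = Counter()
    for seq in sequences:
        # Pair every attempt with the attempt that directly follows it.
        for current, following in zip(seq, seq[1:]):
            counts[(current, following)] += 1

    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        # A category that is never followed by another attempt gets an all-zero row.
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0) for b in states
        }
    return matrix


# Hypothetical answer categories for one problem, one list per student.
attempts = [
    ["wrong_sign", "wrong_units", "correct"],
    ["wrong_sign", "correct"],
    ["wrong_units", "wrong_units", "correct"],
]
states = ["wrong_sign", "wrong_units", "correct"]
matrix = transition_matrix(attempts, states)

for state in states:
    print(state, matrix[state])
```

Each row then reads as an empirical estimate of where students go after a given kind of answer; a real analysis would first need the answer categorisation and filtering steps the abstract describes.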
· 2014
Ordered labelled trees are trees in which each node has a label and the left-to-right order among siblings is significant. Ordered labelled forests are sequences of ordered labelled trees. Both are useful structures for representing hierarchical data. Given two ordered labelled forests F and G, the local forest similarity problem is to compute two sub-forests F' and G' of F and G, respectively, that are the most similar over all possible F' and G'. Given a target forest F and a pattern forest G, the forest pattern matching problem is to compute a sub-forest F' of F that is the most similar to G over all possible F'. This thesis presents novel, efficient algorithms for the local forest similarity problem and the forest pattern matching problem for sub-forests. One application of these algorithms is locating structural regions in RNA secondary structures, which are necessary data for RNA secondary structure prediction and function investigation. RNA is a chain molecule; mathematically, it is a string over a four-letter alphabet. In computational molecular biology, ordered labelled trees are used to represent RNA secondary structures.
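To make the tree representation concrete, here is a minimal sketch that turns an RNA sequence plus a secondary structure into an ordered labelled forest. It assumes the structure is given in dot-bracket notation (a common convention that the abstract does not mention), labels unpaired bases with one letter and base pairs with two, and uses hypothetical names (`Node`, `parse_forest`, `to_string`). It does not implement the thesis's similarity or pattern matching algorithms.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)  # left-to-right order matters


def parse_forest(sequence: str, structure: str) -> List[Node]:
    """Build an ordered labelled forest from a sequence and a dot-bracket string."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")

    roots: List[Node] = []
    child_lists = [roots]          # child list currently being filled, innermost last
    open_pairs: List[Node] = []    # base-pair nodes still waiting for their closing base

    for base, symbol in zip(sequence, structure):
        if symbol == ".":
            child_lists[-1].append(Node(base))       # unpaired base becomes a leaf
        elif symbol == "(":
            node = Node(base)                         # label is completed on ")"
            child_lists[-1].append(node)
            child_lists.append(node.children)
            open_pairs.append(node)
        elif symbol == ")":
            if not open_pairs:
                raise ValueError("unbalanced structure: unexpected ')'")
            child_lists.pop()
            open_pairs.pop().label += base            # e.g. "G" + "C" -> "GC"
        else:
            raise ValueError(f"unsupported symbol: {symbol!r}")

    if open_pairs:
        raise ValueError("unbalanced structure: missing ')'")
    return roots


def to_string(forest: List[Node]) -> str:
    """Render a forest as nested, space-separated labels."""
    return " ".join(
        n.label + (f"({to_string(n.children)})" if n.children else "") for n in forest
    )


forest = parse_forest("GGAAACCU", "((...)).")
print(to_string(forest))
```

In this encoding, a stem becomes a chain of nested pair nodes and a loop becomes a run of sibling leaves, so sibling order carries the 5'-to-3' order of the strand.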
· 2013
The advances of high-throughput technologies, such as next-generation sequencing and microarrays, have rapidly improved the accessibility of molecular profiles in tumor samples. However, due to the immaturity of relevant theories, analyzing these data and systematically understanding the underlying mechanisms causing diseases, which are essential in the development of therapeutic applications, remain challenging. This dissertation attempts to clarify the effects of DNA copy number alterations (CNAs), which are known to be common mutations in genetic diseases, on steady-state gene expression values, time-course expression activities, and the effectiveness of targeted therapy. Assuming DNA copies operate as independent subsystems producing gene transcripts, queueing theory is applied to model the stochastic processes representing the arrival of transcription factors (TFs) and the departure of mRNA. The copy-number-gene-expression relationships are shown to be generally nonlinear. Based on the mRNA production rates of two transcription models, one corresponding to an unlimited state with prolific production and one corresponding to a restrictive state with limited production, the dynamic effects of CNAs on gene expression are analyzed. Simulations reveal that CNAs can alter the amplitudes of transcriptional bursting and transcriptional oscillation, suggesting the capability of CNAs to interfere with the regulatory signaling mechanism. With this finding, a string-structured Bayesian network that models a signaling pathway and incorporates the interference due to CNAs is proposed. Using mathematical induction, the upstream and downstream CNAs are found to have equal influence on drug effectiveness. Scoring functions for the detection of unfavorable CNAs in targeted therapy are consequently proposed.
Rigorous experiments are key to unraveling the etiology of genetic diseases such as cancer, and the proposed models can be applied to provide theory-supported hypotheses for experimental design. The electronic version of this dissertation is accessible from http://hdl.handle.net/1969.1/149298
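One simple way to see how a queueing view can produce a nonlinear copy-number-to-expression relationship is sketched below. It is an illustrative toy under assumptions that are not taken from the dissertation: each DNA copy is treated as a single-server loss queue (M/M/1/1), a fixed total rate of TF arrivals is split evenly across copies, and a TF that finds its copy busy is lost. The function name and parameter values are hypothetical.

```python
def expression_rate(copies: int, tf_arrival_rate: float, transcription_rate: float) -> float:
    """Total mRNA output rate when every copy is an independent M/M/1/1 loss queue.

    Assumptions (illustrative only): TF arrivals are Poisson with a fixed total
    rate shared evenly by the copies, service times are exponential, and a TF
    arriving at a busy copy is dropped.
    """
    if copies <= 0:
        return 0.0
    per_copy_arrivals = tf_arrival_rate / copies
    offered_load = per_copy_arrivals / transcription_rate
    # For one server with no waiting room, the blocking probability is
    # load / (1 + load), so an arrival is accepted with probability 1 / (1 + load).
    accept_probability = 1.0 / (1.0 + offered_load)
    return copies * per_copy_arrivals * accept_probability


rates = [
    expression_rate(n, tf_arrival_rate=4.0, transcription_rate=1.0)
    for n in range(1, 5)
]
for n, rate in enumerate(rates, start=1):
    print(f"copies={n}  output rate={rate:.3f}")
```

Under these assumptions the output rises with copy number but with shrinking increments, saturating at the TF arrival rate, so doubling the copies does not double expression. Other arrival or service assumptions would give a different curve.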