Data Mining
Learning Analytics is an emerging discipline concerned
with developing methods for exploring the unique types of data that
come from educational settings, and using those methods to better
understand students and the settings in which they learn.
One technique used in learning analytics is educational data mining, which employs
data mining techniques to analyze large corpora gathered from educational software tools.
Using these and other techniques, we are analysing the Grade Grinder corpus. We aim
to develop techniques and approaches that will be subsequently reusable by colleagues wishing to
exploit the other large educational data sets that are likely to become ubiquitous as, for
example, learning management systems become more widely used. The following sections outline
what we're doing with the data and the questions we seek to address.
- The taxonomisation of student errors: Educational data
mining techniques will be used to identify a detailed taxonomy of
errors made by students as they learn to reason in this formal domain.
Through detailed analyses of patterns of error instances across
students, and within individual students across time, we aim to
identify distinct types of error, such as misconceptions and slips. In
pilot work on a small subset of the data (Barker-Plummer, Cox, Dale
and Etchemendy, 2008), we have so far identified three top-level error
types (which we call "structural", "connective" and "atomic") that
account for a significant proportion of the data. We also propose to
discover the mal-rules that students appear to use in producing the
error patterns we observe. The error types and mal-rules will reflect
current cognitive theories of human problem solving, reasoning and
comprehension and will take into account individual differences in
reasoning style.
In pilot work (Barker-Plummer et al., 2008; Dale,
Barker-Plummer and Cox, 2009) we have established that a major source
of difficulty for students stems from the fact that conditionals (e.g.
"if") and quantifiers (e.g. "some") are used quite
differently in natural language compared to their use in logic. This result
mirrors existing work in mathematics education, and indicates the generality of the
results that we will obtain in this work. Although our corpus contains work in
undergraduate logic, the general task of learning to manipulate and use formal expressions
in a careful way underlies all of mathematics, technology, engineering and science. We are
confident that the results that we will obtain carry over into these other domains and
may be used to inform education beyond the undergraduate logic curriculum.
- Informing the design of student learning support: The
results of the error taxonomy analyses will, inter alia,
inform the design of automated diagnostic and remedial extensions to
the current e-assessment system. Our pilot work here has been
promising, with an approach to classification based on
regular expression patterns correctly identifying an
average of 85% of errors. The aim is for the e-assessment system to
ultimately provide highly targeted, personalised support to
learners. We expect that the techniques we develop will also
generalise to a wide array of other domains and subject areas.
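To give a flavour of what pattern-based classification might look like, here is a minimal sketch in Python. It is not the classifier used in the pilot work: the category tests, the ASCII connective syntax (`&`, `|`, `->`, `<->`, `~`) and the function name `classify_error` are all assumptions made for illustration. The sketch compares a submitted sentence with a reference solution, calling the error "atomic" if the two differ only in their atomic formulas, "connective" if they differ only in their connectives, and "structural" otherwise.

```python
import re

# Hypothetical patterns: an atom is a capitalised predicate applied to
# arguments, e.g. Cube(a) or LeftOf(a, b). The ASCII connectives are
# assumed for this sketch only.
ATOM = re.compile(r"[A-Z][A-Za-z]*\([^()]*\)")
CONNECTIVE = re.compile(r"<->|->|&|\||~")


def _normalise(formula: str) -> str:
    """Remove all whitespace so that spacing differences are ignored."""
    return re.sub(r"\s+", "", formula)


def classify_error(submitted: str, reference: str) -> str:
    """Label the difference between a submission and a reference solution.

    The labels reuse the three top-level names from the text, but the
    tests below are illustrative guesses, not the published definitions.
    """
    s, r = _normalise(submitted), _normalise(reference)
    if s == r:
        return "correct"
    # Same skeleton once every atom is masked: only the atoms differ.
    if ATOM.sub("A", s) == ATOM.sub("A", r):
        return "atomic"
    # Same skeleton once every connective is masked: only connectives differ.
    if CONNECTIVE.sub("*", s) == CONNECTIVE.sub("*", r):
        return "connective"
    return "structural"


print(classify_error("Cube(a) -> Small(a)", "Cube(a) -> Large(a)"))
print(classify_error("Cube(a) & Small(a)", "Cube(a) -> Small(a)"))
```

Note that this sketch compares strings, not meanings: a submission that is logically equivalent to the reference but written differently would still be labelled as an error, so a real system would need an equivalence check first.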
- The development of innovative language technologies: We will explore
the use of statistical and symbolic corpus analysis methods from
computational linguistics and language technology for the
purpose of generating appropriate English paraphrases of students'
submitted logic sentences. The goal here is to improve the effectiveness of
e-assessment system feedback, and in so doing to make it possible for more
students to come to grips with this traditionally difficult subject.
- Studying the time-course of student learning: Individual
student submissions are time-stamped. By analysing successive exercise
submissions by individuals, we can examine individual students'
learning trajectories, the time-course of their learning, and learning
impasses. In pilot work (Dale, Barker-Plummer and Cox, 2009) we have
identified a useful measure of learning that we term
stickiness. This is defined as the number of attempts it takes
for a student to determine a correct answer once they have made their
initial mistake. We would like to research this metric further and use
it as an outcome measure in learning evaluation studies.
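The stickiness measure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in the pilot work: the `Submission` record and its fields are hypothetical, and we assume that the count starts after the initial mistake and includes the attempt that finally succeeds. The sketch returns `None` when the student never makes a mistake or never reaches a correct answer.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Submission:
    """One time-stamped attempt at a single exercise (hypothetical record)."""
    timestamp: int   # assumed: any sortable time value, e.g. epoch seconds
    correct: bool


def stickiness(submissions: Iterable[Submission]) -> Optional[int]:
    """Number of attempts after the first mistake up to the first correct answer.

    Assumption: the initial mistake itself is not counted, but the final,
    correct attempt is. Returns None if there is no mistake to recover from
    or if the student never reaches a correct answer.
    """
    attempts: Optional[int] = None
    for sub in sorted(submissions, key=lambda s: s.timestamp):
        if attempts is None:
            if not sub.correct:
                attempts = 0          # initial mistake found; start counting
        else:
            attempts += 1
            if sub.correct:
                return attempts
    return None


history = [Submission(1, False), Submission(2, False), Submission(3, True)]
print(stickiness(history))
```

Under these assumptions, a student who errs once and then answers correctly has a stickiness of 1, and each further failed attempt adds one.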
- Studying the role of diagrams in learning: The corpus
contains diagrams as well as sentences of logic. Students use desktop
applications to build or manipulate "blocks worlds" such that
sentences of natural language or logic are true in them. Hence we are
able to triangulate students' performance in the linguistic domain
(natural language, logic) with their performance in the graphical
(diagrammatic) domain. A preliminary study of a small data subset
(Cox, Dale, Etchemendy and Barker-Plummer, 2008) has revealed
theoretically significant findings. For example, errors in diagramming
sentences such as "not a small cube" are manifested much more
frequently with respect to the object's size than with respect to its
shape.
We investigated this phenomenon further using a human subjects methodology which enabled us to learn more
about the factors influencing this effect. This work is reported in our
Readings and Realizations project.
- Unintended roles of content in exercise difficulty: Our blocks world language involves information
about the size and shape of blocks, as well as spatial relations between them. Previous research suggests that
processing information involving mixed spatial and visual properties is significantly more difficult than processing homogeneous
information, perhaps due to competition for processing resources. That research has focussed on tasks involving reasoning with
mixed information. We data mined the translation exercises, partitioning the
sentences according to the mix of information types that they contain, and determined that this effect holds even in the
case of simple translation from English into first-order logic (FOL). This work is reported in (Barker-Plummer, Dale and Cox, 2011a).
- The construction of an open-access front-end: We would
like to make our corpus of data accessible to the wider academic community. To that end,
we propose to develop OpenFace, a user-friendly web-based
front-end designed to facilitate data filtering, sharing and
re-use. Users will be encouraged to "grow" the resource by submitting
the results of their analyses, and ancillary materials such as copies of publications. A discussion
forum will also be provided. We plan to accommodate interoperability
requirements (e.g. with existing data mining tools). We intend
to structure the corpus in terms of the learning tasks posed to the
learner and in terms of a philosophical logic curriculum (i.e.,
a hierarchy of conceptual pre- and co-requisites).
As a first step toward this goal, we have made available the subcorpus of translation exercises. This subcorpus
contains solution attempts for the translation exercises in Language, Proof and Logic. Here, students are given English sentences
and are asked to translate each sentence into first-order logic. The corpus is described in
(Barker-Plummer, Dale and Cox, 2011b and 2011c), and is available by request.