As educational apps increase in popularity, vast amounts of student learning data become available, which can and should be used to drive personalized instruction. While there have been some recent advances in domains like mathematics, modeling second language acquisition (SLA) is more nuanced, involving the interaction of lexical knowledge, morpho-syntactic processing, and other skills. Furthermore, most work in NLP for second language (L2) learners has focused on intermediate-to-advanced students of English in assessment settings. Much less work has been done involving beginners, learners of languages other than English, or study over time.
This task aims to forge new territory by utilizing student trace data from users of Duolingo, the world's most popular online language-learning platform. Participating teams are provided with transcripts from millions of exercises completed by thousands of students over their first 30 days of learning on Duolingo. These transcripts are annotated for token (word) level mistakes, and the task is to predict what mistakes each learner will make in the future.
Novel and interesting research opportunities in this task:
By accurately modeling student mistake patterns, we hope this task will shed light on both (1) the inherent nature of L2 learning, and (2) effective ML/NLP engineering strategies to build personalized adaptive learning systems.
|Jan 10, 2018||Data release (phase 1): TRAIN and DEV sets (Dataverse)|
|Feb 19, 2018||Data release (phase 2): blind TEST set|
|Mar 19, 2018||Final predictions deadline|
|Mar 21, 2018||Final results announcement|
|Mar 28, 2018||Draft system papers due|
|Apr 16, 2018||Camera-ready system papers due|
|Jun 05, 2018||Workshop at NAACL-HLT in New Orleans!|
We have created a Google Group to foster discussion and answer questions related to this task:
The task organizers are:
Duolingo is a free, award-winning, online language learning platform. Since launching in 2012, more than 200 million students from all over the world have enrolled in one of Duolingo's 80+ game-like language courses, via the website or mobile apps. For comparison, that is more than the total number of students in the entire U.S. school system.
While the Duolingo app includes several interactive exercise formats designed to develop different language skills, this shared task will focus on the three (3) formats linked to written production. The figure below illustrates these formats for an English speaker who is learning French (using the iPhone app).
Exercise (a) is a reverse_translate item, where students read a prompt written in the language they know (e.g., their native language), and translate it into the language they are learning (L2). Exercise (b) is a reverse_tap item, an easier version of this format where students construct an answer from a bank of words and distractors. Exercise (c) is a listen item, requiring students to listen to an utterance in the L2 they are learning, and transcribe it.
Since each exercise can have multiple correct answers (up to thousands each, due to synonyms, homophones, or ambiguities in number, tense, formality, etc.), Duolingo uses finite-state transducers (FSTs) to align the student's answer to the most similar correct answer in the exercise's very large set of acceptable answers. Figure (a) above, for example, shows corrective feedback based on such an alignment.
In these exercises, students construct answers in the L2 they are learning, and make various mistakes along the way. The goal of this task is to predict future mistakes that learners of English, Spanish, and French will make based on a history of the mistakes they have made in the past. More specifically, the data set contains more than 2 million tokens (words) from answers submitted by more than 6,000 Duolingo students over the course of their first 30 days.
We provide token-level labels and dependency parses for the most similar correct answer to each student submission. For example, the figure below shows a parse tree for the correct answer, the aligned student's answer, and the resulting labels (for a Spanish speaker learning English):
This student seems to be struggling with "my", "mother", and "father." Perhaps she has trouble with possessive pronouns? Or the orthography of English "th" sounds? A successful SLA modeling system should be able to pick up on these trends, predicting which words give the student trouble in a personalized way (that evolves over time).
Most tokens (about 83%) are perfect matches and are given the label 0 for "OK." Tokens that are missing or spelled incorrectly (ignoring capitalization, punctuation, and accents) are given the label 1 denoting a mistake.
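The labeling rule above can be illustrated with a small sketch. The function names `normalize` and `label_token` are hypothetical, and the exact normalization used to produce the official labels is not specified here; this version assumes lowercasing, Unicode accent stripping, and removal of ASCII punctuation. Since student responses are not part of the release, it is meant only to make the rule concrete.

```python
import string
import unicodedata
from typing import Optional

# Translation table that deletes ASCII punctuation (an assumed definition
# of "punctuation" for this sketch).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(token: str) -> str:
    """Lowercase a token, strip accents, and drop ASCII punctuation."""
    decomposed = unicodedata.normalize("NFD", token.lower())
    # Combining marks (category "Mn") carry the accents after NFD decomposition.
    no_accents = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return no_accents.translate(_PUNCT_TABLE)


def label_token(correct: str, student: Optional[str]) -> int:
    """Return 0 ("OK") for a match, 1 for a missing or misspelled token."""
    if student is None:  # the aligned student token is missing
        return 1
    return 0 if normalize(correct) == normalize(student) else 1
```

Under these assumptions, `label_token("mother", "Mother")` yields 0, while a missing or misspelled token yields 1.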
Note: For this task, we provide labels but not actual student responses. We intend to release a more comprehensive version of the data set after the workshop, including student answers and other metadata.
The data format is inspired by the Universal Dependencies CoNLL-U format. Each student exercise is represented by a group of lines separated by a blank line: one token per line, preceded by a line of exercise-level metadata. Here are some examples (you may need to scroll horizontally to see all columns):
# user:D2inSf5+ countries:MX days:1.793 client:web session:lesson format:reverse_translate time:16
8rgJEAPw1001 She PRON Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs|fPOS=PRON++PRP nsubj 4 0
8rgJEAPw1002 is VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++VBZ cop 4 0
8rgJEAPw1003 my PRON Number=Sing|Person=1|Poss=Yes|PronType=Prs|fPOS=PRON++PRP$ nmod:poss 4 1
8rgJEAPw1004 mother NOUN Degree=Pos|fPOS=ADJ++JJ ROOT 0 1
8rgJEAPw1005 and CONJ fPOS=CONJ++CC cc 4 0
8rgJEAPw1006 he PRON Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs|fPOS=PRON++PRP nsubj 9 0
8rgJEAPw1007 is VERB Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++VBZ cop 9 0
8rgJEAPw1008 my PRON Number=Sing|Person=1|Poss=Yes|PronType=Prs|fPOS=PRON++PRP$ nmod:poss 9 1
8rgJEAPw1009 father NOUN Number=Sing|fPOS=NOUN++NN conj 4 1

# user:D2inSf5+ countries:MX days:2.689 client:web session:practice format:reverse_translate time:6
oMGsnnH/0101 When ADV PronType=Int|fPOS=ADV++WRB advmod 4 1
oMGsnnH/0102 can AUX VerbForm=Fin|fPOS=AUX++MD aux 4 0
oMGsnnH/0103 I PRON Case=Nom|Number=Sing|Person=1|PronType=Prs|fPOS=PRON++PRP nsubj 4 1
oMGsnnH/0104 help VERB VerbForm=Inf|fPOS=VERB++VB ROOT 0 0
The first line of each exercise group (beginning with #) contains the following metadata about the student, session, and exercise:
user: a B64 encoded, 8-digit, anonymized, unique identifier for each student (may include
countries: a pipe (|) delimited list of 2-character country codes from which this user has done exercises
days: the number of days since the student started learning this language on Duolingo
client: the student's device platform (one of:
session: the session type (one of: lesson, practice, or test; explanation below)
format: the exercise format (one of: reverse_translate, reverse_tap, or listen; see figures above)
time: the amount of time (in seconds) it took for the student to construct and submit their whole answer (note: for some exercises, this can be null due to data logging issues)
These fields are separated by whitespace on the same line, and key:value pairs are denoted with a colon (:).
The lesson sessions (about 77% of the data set) are where new words or concepts are introduced, although lessons also include a lot of previously-learned material (e.g., each exercise tries to introduce only one new word or tense, so all other tokens should have been seen by the student before). The practice sessions (22%) should contain only previously-seen words and concepts. The test sessions (1%) are quizzes that allow a student to "skip" a particular skill unit of the curriculum (i.e., the student may have never seen this content before in the Duolingo app, but may well have had prior knowledge before starting the course).
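As a rough illustration of the metadata line described above, the following sketch parses one `#` line into a dictionary. The function name `parse_metadata` and the choice of Python types (a list for countries, floats for days and time, `None` for a null time) are assumptions made for this example, not part of the official starter code.

```python
from typing import Any, Dict


def parse_metadata(line: str) -> Dict[str, Any]:
    """Parse a '# key:value key:value ...' exercise metadata line."""
    if not line.startswith("#"):
        raise ValueError("metadata lines must begin with '#'")

    fields: Dict[str, Any] = {}
    for pair in line.lstrip("#").split():
        # Split on the first colon only, in case a value contains one.
        key, _, value = pair.partition(":")
        fields[key] = value

    # countries is a pipe-delimited list of 2-character country codes.
    countries = fields.get("countries")
    fields["countries"] = countries.split("|") if countries else []

    # days is fractional; time may be the literal string "null".
    if "days" in fields:
        fields["days"] = float(fields["days"])
    raw_time = fields.get("time")
    fields["time"] = None if raw_time in (None, "null") else float(raw_time)
    return fields
```

Keeping a null time as `None` rather than 0 leaves it to the model to decide how to treat missing response times.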
The remaining lines in each exercise group represent each token (word) in the correct answer that is most similar to the student's answer, one token per line, arranged into seven (7) columns separated by whitespaces:
All dependency features (columns 3-6) are generated by the Google SyntaxNet dependency parser using the language-agnostic Universal Dependencies tagset. (In other words, these morpho-syntactic features should be comparable across all three tracks in the shared task. Note that SyntaxNet isn't perfect, so parse errors may occur.)
The only difference between TRAIN and DEV/TEST set formats is that the final column (labels) will be omitted from the DEV/TEST set files. The first column (unique instance IDs) is also used for the submission output format.
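A minimal reader for this format might look like the sketch below. It is not the official starter code: the `Token` field names (`pos`, `morph`, `dep_label`, `dep_head`) are names assumed from the example rows, and the reader treats a missing seventh column as an unlabeled DEV/TEST token.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    instance_id: str       # column 1: unique instance ID
    token: str             # column 2: the word itself
    pos: str               # column 3 (assumed name): part of speech
    morph: str             # column 4 (assumed name): morphological features
    dep_label: str         # column 5 (assumed name): dependency edge label
    dep_head: int          # column 6 (assumed name): index of the head token
    label: Optional[int]   # column 7: 0/1 label, absent in DEV/TEST files


def read_exercises(
    lines: Iterable[str],
) -> Iterator[Tuple[Optional[str], List[Token]]]:
    """Yield (metadata_line, tokens) for each blank-line-separated exercise."""
    header: Optional[str] = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current exercise group.
            if tokens:
                yield header, tokens
            header, tokens = None, []
        elif line.startswith("#"):
            header = line
        else:
            cols = line.split()
            label = int(cols[6]) if len(cols) > 6 else None
            tokens.append(
                Token(cols[0], cols[1], cols[2], cols[3], cols[4],
                      int(cols[5]), label)
            )
    if tokens:  # the final group may not be followed by a blank line
        yield header, tokens
```

For example, `for header, tokens in read_exercises(open(path, encoding="utf-8"))` would iterate over one exercise at a time, where `path` is whichever track file has been downloaded.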
The data for this task are organized into three tracks:
en_es: English learners (who already speak Spanish)
es_en: Spanish learners (who already speak English)
fr_en: French learners (who already speak English)
The TRAIN, DEV, and TEST sets will be written to separate files for each of these tracks. Each track will have its own leaderboard for predictions (see "Submission & Evaluation" section below). Some teams may want to focus on a particular track/language; however, participation in all three tracks is encouraged!
Data for the task will be released in two phases:
After the workshop, there will be an updated third release of the data (including labels and additional metadata for all splits) to support ongoing research in the area of SLA modeling.
The track data sets (data_*.tar.gz) and baseline code (starter_code.tar.gz) are hosted on Dataverse:
The primary evaluation metric will be area under the ROC curve (AUROC). F1 score (with a threshold of 0.5) will also be reported and analyzed. As such, some teams may wish to attempt combined classification and ranking methods. Note that the label 1 (denoting a mistake) is considered the "positive" class for both metrics.
The official AUROC and F1 metrics (plus a few others) are implemented in eval.py alongside the baseline code.
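For intuition about the two metrics, here is a small dependency-free sketch. It is not the official eval.py: the function names are hypothetical, AUROC is computed through the rank-sum (Mann-Whitney U) formulation with average ranks for ties, and treating a score of exactly 0.5 as a predicted mistake is an assumption.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, with label 1 as the positive class."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def f1_at_threshold(labels: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> float:
    """F1 for the positive class, predicting 1 when score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

AUROC depends only on how the predictions are ranked, whereas F1 depends on where they fall relative to the threshold, which is why a model can score well on one metric and poorly on the other.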
This is an "open" evaluation, meaning teams are allowed (and encouraged!) to experiment with additional features beyond those provided with the data release. Features that lead to interpretable/actionable insights about individual learning (e.g., "this user struggles with adjective-noun word order") are particularly encouraged.
See the "Tips & Related Work" section below for some ideas of where to start with additional feature engineering. Teams should thoroughly describe any new features in their system papers for the workshop proceedings (see below), and/or release source code to support replication and ongoing research.
All system submissions and evaluation will be done via CodaLab.
Once the TEST data is released (in phase 2), you can submit your predictions to CodaLab and they will appear on the shared task "leaderboard." More details on this after the phase 2 launch.
The submission file format is similar to that of the files generated by the provided baseline model.
A submission should be a whitespace-delimited file with 1 row per instance and no header. The first column must be the instance ID, and the last column must be the prediction (consistent with the first and last columns of the TRAIN data). Other columns, blank lines, or lines beginning with # will be ignored.
The prediction should be in the range [0.0, 1.0], and will be interpreted as "the probability that the student makes a mistake," i.e., p(mistake|instance). The output file must be placed in a ZIP archive prior to submission.
Here is an example submission file (only the first few lines are shown):
DRihrVmh0901 0.025
DRihrVmh0902 0.08
DRihrVmh0903 0.454
DRihrVmh0904 0.044
TOeLHxLS0401 0.067
TOeLHxLS0402 0.03
TOeLHxLS0403 0.806
TOeLHxLS0404 0.066
xqtN1I5c0901 0
xqtN1I5c0902 0.074
xqtN1I5c0903 0.053
xqtN1I5c0904 0.016
...
Your predictions on the TEST set should follow this format.
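One possible way to produce such a file and wrap it in a ZIP archive is sketched below. The function names, the three-decimal formatting, and the archive member name `predictions.txt` are assumptions for illustration; the task description only requires an instance ID in the first column, a prediction in [0.0, 1.0] in the last column, and a ZIP archive.

```python
import zipfile
from typing import Mapping


def format_submission(predictions: Mapping[str, float]) -> str:
    """Render {instance_id: p(mistake)} as whitespace-delimited rows."""
    rows = []
    for instance_id, prob in predictions.items():
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"prediction out of range for {instance_id}: {prob}")
        rows.append(f"{instance_id} {prob:.3f}\n")
    return "".join(rows)


def write_submission_zip(predictions: Mapping[str, float], zip_path: str,
                         member_name: str = "predictions.txt") -> None:
    """Write the formatted predictions into a new ZIP archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, format_submission(predictions))
```

Validating the range before writing catches out-of-range model outputs (for example, raw scores that were never passed through a sigmoid) before they reach the leaderboard.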
All shared task participants will be eligible — and strongly encouraged — to submit a system paper describing their approach and results, to be presented at the BEA Workshop at the NAACL-HLT conference in June 2018. This paper will be published in the workshop proceedings and available through the ACL Anthology website.
Following the evaluation, we will provide LaTeX/Word templates for typesetting your papers, as well as citation information for referring to the shared task report (you must use this reference to cite the data set).
Note that there are two paper deadlines: one for preliminary draft papers (shortly after final results are announced), and a final camera-ready deadline 2.5 weeks later. It is important for teams to meet both deadlines.
SLA modeling is a rich and complex task, and presents an opportunity to synthesize methods from various subfields in computational linguistics, machine learning, educational data mining, and psychometrics.
This page contains a few suggestions and pointers to related research areas, which could be useful for feature engineering or other modeling decisions.