This challenge is held in conjunction with the 13th BEA Workshop and the NAACL-HLT 2018 conference.
As educational apps increase in popularity, vast amounts of student learning data become available, which can and should be used to drive personalized instruction. While there have been some recent advances in domains like mathematics, modeling second language acquisition (SLA) is more nuanced, involving the interaction of lexical knowledge, morpho-syntactic processing, and other skills. Furthermore, most work in NLP for second language (L2) learners has focused on intermediate-to-advanced students of English in assessment settings. Much less work has been done involving beginners, learners of languages other than English, or study over time.
This task aims to forge new territory by utilizing student trace data from users of Duolingo, the world's most popular online language-learning platform. Participating teams are provided with transcripts from millions of exercises completed by thousands of students over their first 30 days of learning on Duolingo. These transcripts are annotated for token (word) level mistakes, and the task is to predict what mistakes each learner will make in the future.
This task opens up novel and interesting research opportunities.
By accurately modeling student mistake patterns, we hope this task will shed light on both (1) the inherent nature of L2 learning, and (2) effective ML/NLP engineering strategies to build personalized adaptive learning systems.
(Updated April 23, 2018.) The task attracted strong participation from 15 teams (11 of which submitted system papers), representing both academia and industry, from fields including cognitive science, linguistics, and machine learning. The best systems were able to improve upon a simple logistic regression baseline by 12%. Check out the organizers' task overview paper and system results below.
Read the 2018 SLAM Task Overview Paper »
Team | English AUC | English F1 | Spanish AUC | Spanish F1 | French AUC | French F1 | Avg. Rank |
---|---|---|---|---|---|---|---|
SanaLabs ♢♣ | 0.861 | 0.561 | 0.838 | 0.530 | 0.857 | 0.573 | 1.0 |
singsound ♢ | 0.861 | 0.559 | 0.835 | 0.524 | 0.854 | 0.569 | 1.7 |
NYU ♣‡ | 0.859 | 0.468 | 0.835 | 0.420 | 0.854 | 0.493 | 2.3 |
TMU ♢‡ | 0.848 | 0.476 | 0.824 | 0.439 | 0.839 | 0.502 | 4.3 |
CECL ‡ | 0.846 | 0.414 | 0.818 | 0.390 | 0.843 | 0.487 | 4.7 |
Cambridge ♢ | 0.841 | 0.479 | 0.807 | 0.435 | 0.835 | 0.508 | 6.0 |
UCSD ♣ | 0.829 | 0.424 | 0.803 | 0.375 | 0.823 | 0.442 | 7.0 |
LambdaLab ♣ | 0.821 | 0.389 | 0.801 | 0.344 | 0.815 | 0.415 | 7.6 |
Grotoco | 0.817 | 0.462 | 0.791 | 0.452 | 0.813 | 0.502 | 9.0 |
nihalnayak | 0.821 | 0.376 | 0.790 | 0.338 | 0.811 | 0.431 | 9.0 |
jilljenn | 0.815 | 0.329 | 0.788 | 0.306 | 0.809 | 0.406 | 10.7 |
SLAM_baseline | 0.774 | 0.190 | 0.746 | 0.175 | 0.771 | 0.281 | 14.7 |
A summary of key findings from the organizers' meta-analyses is:
Date | Milestone |
---|---|
June 05, 2018 | Workshop at NAACL-HLT in New Orleans! (Link, Registration) |
April 16, 2018 | Camera-ready system papers due (Instructions) |
April 10, 2018 | System paper reviews returned |
March 28, 2018 | Draft system papers due |
March 21, 2018 | Final results announcement |
March 19, 2018 | Final predictions deadline (CodaLab) |
March 9, 2018 | Data release (phase 2): blind TEST set (Dataverse) |
January 10, 2018 | Data release (phase 1): TRAIN and DEV sets (Dataverse) |
We have created a Google Group to foster discussion and answer questions related to this task:
The task organizers are:
Duolingo is a free, award-winning online language-learning platform. Since launching in 2012, more than 200 million students from all over the world have enrolled in one of Duolingo's 80+ game-like language courses, via the website or mobile apps. For comparison, that is more than the total number of students in the entire U.S. school system.
While the Duolingo app includes several interactive exercise formats designed to develop different language skills, this shared task will focus on the three (3) formats linked to written production. The figure below illustrates these formats for an English speaker who is learning French (using the iPhone app).
Exercise (a) is a `reverse_translate` item, where students read a prompt written in the language they know (e.g., their native language), and translate it into the language they are learning (L2). Exercise (b) is a `reverse_tap` item, an easier version of this format where students construct an answer from a bank of words and distractors. Exercise (c) is a `listen` item, requiring students to listen to an utterance in the L2 they are learning, and transcribe it.
Since each exercise can have multiple correct answers (up to thousands each, due to synonyms, homophones, or ambiguities in number, tense, formality, etc.), Duolingo uses finite-state transducers (FSTs) to align the student's answer to the most similar correct answer in the exercise's very large set of acceptable answers. Figure (a) above, for example, shows an example of corrective feedback based on such an alignment.
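As a rough illustration of the idea (not Duolingo's actual FST machinery), the sketch below uses Python's `difflib` to pick the most similar acceptable answer and mark which reference tokens the student matched; the function names and data structures are hypothetical.

```python
import difflib

def closest_correct_answer(student_tokens, acceptable_answers):
    """Pick the acceptable answer most similar to the student's answer.

    `acceptable_answers` is a hypothetical list of token lists; Duolingo's
    production system does this alignment with FSTs, not edit distance.
    """
    def similarity(reference_tokens):
        return difflib.SequenceMatcher(a=student_tokens, b=reference_tokens).ratio()
    return max(acceptable_answers, key=similarity)

def align_tokens(student_tokens, reference_tokens):
    """Mark each reference token as matched (True) or not (False).

    Unmatched reference tokens correspond to missing or incorrect words,
    i.e., the tokens that end up labeled 1 in the task data.
    """
    matcher = difflib.SequenceMatcher(a=student_tokens, b=reference_tokens)
    matched = set()
    for block in matcher.get_matching_blocks():
        matched.update(range(block.b, block.b + block.size))
    return [(token, i in matched) for i, token in enumerate(reference_tokens)]
```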
In these exercises, students construct answers in the L2 they are learning, and make various mistakes along the way. The goal of this task is to predict future mistakes that learners of English, Spanish, and French will make based on a history of the mistakes they have made in the past. More specifically, the data set contains more than 2 million tokens (words) from answers submitted by more than 6,000 Duolingo students over the course of their first 30 days.
We provide token-level labels and dependency parses for the most similar correct answer to each student submission. For example, the figure below shows a parse tree for the correct answer, the aligned student's answer, and the resulting labels (for a Spanish speaker learning English):
This student seems to be struggling with "my", "mother", and "father." Perhaps she has trouble with possessive pronouns? Or the orthography of English "th" sounds? A successful SLA modeling system should be able to pick up on these trends, predicting which words give the student trouble in a personalized way (that evolves over time).
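One very simple way to capture such personalized, evolving trends (purely illustrative; this is not the task baseline) is an exponentially-decayed error rate per (user, token) pair:

```python
class PersonalErrorRates:
    """Toy model of per-student, per-word mistake rates that evolve over time."""

    def __init__(self, decay=0.9, prior=0.17):
        self.decay = decay    # how quickly old observations are forgotten
        self.prior = prior    # assumed global mistake rate before any evidence
        self.rates = {}

    def predict(self, user, token):
        return self.rates.get((user, token.lower()), self.prior)

    def update(self, user, token, label):
        """label is 0 ("OK") or 1 (mistake) for this token instance."""
        key = (user, token.lower())
        previous = self.rates.get(key, self.prior)
        # newer observations count more, so the estimate drifts over time
        self.rates[key] = self.decay * previous + (1 - self.decay) * label
```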
Most tokens (about 83%) are perfect matches and are given the label `0` for "OK." Tokens that are missing or spelled incorrectly (ignoring capitalization, punctuation, and accents) are given the label `1`, denoting a mistake.
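A minimal sketch of how such labels could be assigned under the normalization described above (lowercasing, ignoring punctuation and accents); it is an illustration, not the official labeling code:

```python
import string
import unicodedata

def normalize(token):
    """Lowercase and strip accents/punctuation, mirroring the rules above
    (the official preprocessing may differ in details)."""
    decomposed = unicodedata.normalize("NFKD", token.lower())
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch)
                   and ch not in string.punctuation)

def label_token(reference_token, student_token):
    """Return 0 ("OK") if the student produced this token, else 1 (mistake).

    A token the student omitted entirely is represented here as None.
    """
    if student_token is None:
        return 1
    return 0 if normalize(student_token) == normalize(reference_token) else 1
```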
Note: For this task, we provide labels but not actual student responses. We intend to release a more comprehensive version of the data set after the workshop, including student answers and other metadata.
The data format is inspired by the Universal Dependencies CoNLL-U format. Each student exercise is represented by a group of lines, separated from the next exercise by a blank line: a line of exercise-level metadata followed by one line per token. Here are some examples (you may need to scroll horizontally to see all columns):
    # user:D2inSf5+  countries:MX  days:1.793  client:web  session:lesson  format:reverse_translate  time:16
    8rgJEAPw1001  She     PRON  Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs|fPOS=PRON++PRP    nsubj      4  0
    8rgJEAPw1002  is      VERB  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++VBZ    cop        4  0
    8rgJEAPw1003  my      PRON  Number=Sing|Person=1|Poss=Yes|PronType=Prs|fPOS=PRON++PRP$              nmod:poss  4  1
    8rgJEAPw1004  mother  NOUN  Degree=Pos|fPOS=ADJ++JJ                                                 ROOT       0  1
    8rgJEAPw1005  and     CONJ  fPOS=CONJ++CC                                                           cc         4  0
    8rgJEAPw1006  he      PRON  Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs|fPOS=PRON++PRP   nsubj      9  0
    8rgJEAPw1007  is      VERB  Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|fPOS=VERB++VBZ    cop        9  0
    8rgJEAPw1008  my      PRON  Number=Sing|Person=1|Poss=Yes|PronType=Prs|fPOS=PRON++PRP$              nmod:poss  9  1
    8rgJEAPw1009  father  NOUN  Number=Sing|fPOS=NOUN++NN                                               conj       4  1

    # user:D2inSf5+  countries:MX  days:2.689  client:web  session:practice  format:reverse_translate  time:6
    oMGsnnH/0101  When  ADV   PronType=Int|fPOS=ADV++WRB                                   advmod  4  1
    oMGsnnH/0102  can   AUX   VerbForm=Fin|fPOS=AUX++MD                                    aux     4  0
    oMGsnnH/0103  I     PRON  Case=Nom|Number=Sing|Person=1|PronType=Prs|fPOS=PRON++PRP    nsubj   4  1
    oMGsnnH/0104  help  VERB  VerbForm=Inf|fPOS=VERB++VB                                   ROOT    0  0
The first line of each exercise group (beginning with `#`) contains the following metadata about the student, session, and exercise:
- `user`: a B64-encoded, 8-character, anonymized, unique identifier for each student (may include `/` or `+` characters)
- `countries`: a pipe (`|`) delimited list of 2-character country codes from which this user has done exercises
- `days`: the number of days since the student started learning this language on Duolingo
- `client`: the student's device platform (one of: `android`, `ios`, or `web`)
- `session`: the session type (one of: `lesson`, `practice`, or `test`; explanation below)
- `format`: the exercise format (one of: `reverse_translate`, `reverse_tap`, or `listen`; see figures above)
- `time`: the amount of time (in seconds) it took for the student to construct and submit their whole answer (note: for some exercises, this can be `null` due to data logging issues)
These fields appear on the same line, separated by whitespace, with each key:value pair denoted by a colon (`:`).
The `lesson` sessions (about 77% of the data set) are where new words or concepts are introduced, although lessons also include a lot of previously-learned material (e.g., each exercise tries to introduce only one new word or tense, so all other tokens should have been seen by the student before). The `practice` sessions (22%) should contain only previously-seen words and concepts. The `test` sessions (1%) are quizzes that allow a student to "skip" a particular skill unit of the curriculum (i.e., the student may have never seen this content before in the Duolingo app, but may well have had prior knowledge before starting the course).
The remaining lines in each exercise group represent each token (word) in the correct answer that is most similar to the student's answer, one token per line, arranged into seven (7) columns separated by whitespace:

1. the unique instance ID for the token
2. the token (word) itself
3. its part-of-speech tag
4. its morphological features
5. its dependency edge label
6. the index of its dependency edge head within the exercise
7. the label (`0` or `1`)
All dependency features (columns 3-6) are generated by the Google SyntaxNet dependency parser using the language-agnostic Universal Dependencies tagset. (In other words, these morpho-syntactic features should be comparable across all three tracks in the shared task. Note that SyntaxNet isn't perfect, so parse errors may occur.)
The only difference between the TRAIN and DEV/TEST set formats is that the final column (labels) is omitted from the DEV/TEST set files. The first column (unique instance IDs) is also used for the submission output format.
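A minimal reader for this format might look like the following sketch, which assumes the column interpretation described above and reuses the hypothetical `parse_metadata_line` helper from the earlier sketch (it is not the official starter code):

```python
def read_instances(path, labeled=True):
    """Yield (instance_id, features, label) tuples from a SLAM data file.

    For DEV/TEST files (which omit the label column) pass labeled=False,
    and label will be returned as None.
    """
    exercise_meta = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # blank line separates exercise groups
            if line.startswith("#"):
                exercise_meta = parse_metadata_line(line)  # sketch defined earlier
                continue
            columns = line.split()
            instance_id, token, pos, morph, dep_label, dep_head = columns[:6]
            label = int(columns[6]) if labeled else None
            features = dict(exercise_meta,
                            token=token,
                            part_of_speech=pos,
                            morphology=morph,
                            dependency_label=dep_label,
                            dependency_head=int(dep_head))
            yield instance_id, features, label
```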
Note: a handful of instances have negative values in the `time` field. These occur in the TRAIN set for `en_es` (17 cases), and the DEV sets for `en_es` (2 cases) and `es_en` (1 case). These logging errors are due to client-side browser bugs on web (i.e., not the iOS or Android apps) and race conditions caused by slow internet connections. If you plan on using response times as part of your modeling, we advise you to check for negative values and treat them as `null`.
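A tiny guard along these lines (a sketch, not part of the starter code) is enough:

```python
def clean_response_time(time_value):
    """Treat missing or negative response times as null (None), per the advice above."""
    if time_value is None or time_value < 0:
        return None
    return time_value
```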
The data for this task are organized into three tracks:

- `en_es` — English learners (who already speak Spanish)
- `es_en` — Spanish learners (who already speak English)
- `fr_en` — French learners (who already speak English)
The TRAIN, DEV, and TEST sets will be written to separate files for each of these tracks. Each track will have its own leaderboard for predictions (see the "Submission & Evaluation" section below). Some teams may want to focus on a particular track/language; however, participation in all three tracks is encouraged!
Data for the task will be released in two phases: first the TRAIN and DEV sets, and later the blind TEST set (see the important dates above).
After the workshop, there will be an updated third release of the data (including labels and additional metadata for all splits) to support ongoing research in the area of SLA modeling.
The track data sets (`data_*.tar.gz`) and baseline code (`starter_code.tar.gz`) are hosted on Dataverse:
Download Data & Baseline Code »
The primary evaluation metric will be area under the ROC curve (AUC). F1 score (with a threshold of 0.5) will also be reported and analyzed. As such, some teams may wish to attempt combined classification and ranking methods. Note that the label `1` (denoting a mistake) is considered the "positive" class for both metrics. The official AUC and F1 metrics (plus a few others) are implemented in `eval.py` alongside the baseline code.
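For a quick sanity check of your own, a rough scikit-learn equivalent might look like the sketch below (the official numbers come from the provided `eval.py`):

```python
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(true_labels, predicted_probabilities, threshold=0.5):
    """Compute the two reported metrics, treating label 1 (mistake) as positive."""
    auc = roc_auc_score(true_labels, predicted_probabilities)
    hard_predictions = [1 if p >= threshold else 0 for p in predicted_probabilities]
    f1 = f1_score(true_labels, hard_predictions, pos_label=1)
    return {"auc": auc, "f1": f1}
```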
This is an "open" evaluation, meaning teams are allowed (and encouraged!) to experiment with additional features beyond those provided with the data release. Features that lead to interpretable/actionable insights about individual learning (e.g., "this user struggles with adjective-noun word order") are particularly encouraged.
See the "Tips & Related Work" section below for some ideas of where to start with additional feature engineering. Teams should thoroughly describe any new features in their system papers for the workshop proceedings (see below), and/or release source code to support replication and ongoing research.
All system submissions and evaluation will be done via CodaLab.
To participate, follow these steps:
Go to the CodaLab Competition »
Your CodaLab evaluation results will appear on the shared task leaderboards under the "Results" tab. There is a separate leaderboard for each track, reporting three metrics: AUC, F1, and average log-loss.
To submit your team's predictions, upload a single `.zip` archive containing prediction files for all the tracks your team is participating in.
The grading script will automatically look for and evaluate files for each track separately (by matching filenames against `en_es`, `es_en`, or `fr_en`). For example, your submission might contain the following structure:
    my_test_submission_file.zip
        test.en_es.preds
        test.es_en.preds
        test.fr_en.preds
Prediction files should be in the "root" of the `.zip` file (that is, not in a subdirectory). Any files that do not match the three expected track IDs will be ignored, and if multiple files match (e.g., two `en_es` files), only one of them will be evaluated. Be careful to include only one prediction file per track! If you are not participating in all three tracks, simply omit any irrelevant prediction file(s) from the `.zip` archive.
You will be able to submit up to 10 total runs (up to 2 runs per day) during the 10-day TEST Phase. This will enable you to evaluate a few variants of your system or parameters, if you so desire. Evaluation results should be available for review within a few seconds.
The submission file format is similar to the output generated by the provided baseline model.
Each submission should be a whitespace-delimited file with one row per instance and no header. The first column must be the instance ID, and the last column must be the prediction (consistent with the first and last columns of the TRAIN data). Other columns, blank lines, or lines beginning with `#` will be ignored. Values in the prediction column should be in the range `[0.0, 1.0]`, and will be interpreted as "the probability that the student makes a mistake," i.e., p(mistake|instance). The prediction files for all tracks (up to three) must be placed in a single `.zip` archive prior to submission.
Here is an example prediction file (only the first few lines are shown):
    DRihrVmh0901 0.025
    DRihrVmh0902 0.08
    DRihrVmh0903 0.454
    DRihrVmh0904 0.044
    TOeLHxLS0401 0.067
    TOeLHxLS0402 0.03
    TOeLHxLS0403 0.806
    TOeLHxLS0404 0.066
    xqtN1I5c0901 0
    xqtN1I5c0902 0.074
    xqtN1I5c0903 0.053
    xqtN1I5c0904 0.016
    ...
Your predictions should follow this format.
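For illustration, a prediction file and the final archive could be produced along these lines (a sketch; it assumes per-track lists of `(instance_id, probability)` pairs, and the filenames below are only illustrative since they just need to contain the track IDs):

```python
import zipfile

def write_predictions(predictions, path):
    """Write 'instance_id probability' rows, one per line, with no header."""
    with open(path, "w", encoding="utf-8") as handle:
        for instance_id, probability in predictions:
            handle.write(f"{instance_id} {probability:.3f}\n")

def package_submission(predictions_by_track, archive_name="my_test_submission_file.zip"):
    """predictions_by_track maps track IDs (en_es, es_en, fr_en) to prediction lists."""
    with zipfile.ZipFile(archive_name, "w") as archive:
        for track, predictions in predictions_by_track.items():
            filename = f"test.{track}.preds"
            write_predictions(predictions, filename)
            archive.write(filename)  # stored at the root of the .zip
```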
All teams are expected to submit a system paper describing their approach and results, to be published in the workshop proceedings and available through the ACL Anthology website. Please do so even if you are unable to travel to the BEA Workshop at the NAACL-HLT conference in June 2018.
Note that we are interested not only in top-performing systems (i.e., metrics), but also meaningful findings (i.e., insights for language and/or learning). Teams are encouraged to focus on both in their write-ups!
Papers should follow the NAACL 2018 submission guidelines. Teams are invited to submit a full paper (4-8 pages of content, with unlimited pages for references). We recommend using the official style templates:
All submissions must be in PDF format and should not be anonymized. Supplementary files (hyperparameter settings, external features or ablation results too extensive to fit in the main paper, etc.) are also welcome, so long as they follow the NAACL 2018 guidelines. Final camera-ready versions of accepted papers will be given up to one additional page of content (9 pages plus references) to address reviewer comments. Papers must include the following citation:
B. Settles, C. Brust, E. Gustafson, M. Hagiwara, and N. Madnani. 2018. Second Language Acquisition Modeling. In Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications (BEA). ACL.
    @inproceedings{slam18,
      Author = {B. Settles and C. Brust and E. Gustafson and M. Hagiwara and N. Madnani},
      Booktitle = {Proceedings of the NAACL-HLT Workshop on Innovative Use of NLP for Building Educational Applications (BEA)},
      Publisher = {ACL},
      Title = {Second Language Acquisition Modeling},
      Year = {2018}}
We will use the START conference system to manage submissions via the BEA Workshop. Under "Submission Category" be sure to select "SLAM Shared Task Paper" (see screenshot below):
Submit your paper through START »
Make sure that the `\aclfinalcopy` line is uncommented, and that the final PDF includes author names and contact information.
SLA modeling is a rich and complex task, and presents an opportunity to synthesize methods from various subfields in computational linguistics, machine learning, educational data mining, and psychometrics.
This page contains a few suggestions and pointers to related research areas, which could be useful for feature engineering or other modeling decisions.