QTI 3 + AI = Time Savings
QTI (Question and Test Interoperability) has long been the standard behind assessment interoperability. It's also a very complicated specification, because assessment itself is complicated. The latest version, QTI 3, has been in development for around a decade, and it is brilliant work. It handles various types of markup, is properly accessible, and is sensitive to the needs of learners in any language. The breadth and depth of the standard are what make it powerful, but with great power comes great complexity, and QTI 3 is very complex. I believe it is well understood by the brilliant minds who authored it. The challenge now is to encourage adoption of the specification beyond the group of original creators.
At the 2024 1EdTech Learning Impact Conference, Colin Smythe presented “Is AI the Answer for Edtech Interoperability?” I found it very informative, and I think (and hope) that artificial intelligence will make adopting complex standards a much smoother and easier process. It would be very convenient if an instructor could simply ask an LLM to generate a quiz, structure it appropriately for the content and learning objectives, and then output the result as QTI-compliant XML that could easily be imported into an assessment tool. Colin suggested that since the current generation of LLMs haven't been trained on standards-related data, they lack the ability to produce standards-compliant content.
I agree with Colin. After the last Learning Impact Conference, I attempted to build a custom GPT cleverly named "QTI Assessment Architect." Naming things is hard. Generating QTI 3-compliant output is also very hard. Play around with the tool if you have a chance. Here’s some example output:
<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqtiasi_v3p0" identifier="item1" title="Who is Luke Skywalker's father?">
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" shuffle="false" maxChoices="1">
      <prompt>Who is Luke Skywalker's father?</prompt>
      <simpleChoice identifier="A">Han Solo</simpleChoice>
      <simpleChoice identifier="B">Obi-Wan Kenobi</simpleChoice>
      <simpleChoice identifier="C">Darth Vader</simpleChoice>
      <simpleChoice identifier="D">Yoda</simpleChoice>
    </choiceInteraction>
  </itemBody>
  <responseDeclaration identifier="RESPONSE" cardinality="single" baseType="identifier">
    <correctResponse>
      <value>C</value>
    </correctResponse>
  </responseDeclaration>
</assessmentItem>
<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqtiasi_v3p0" identifier="item2" title="What is the name of Han Solo's ship?">
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" shuffle="false" maxChoices="1">
      <prompt>What is the name of Han Solo's ship?</prompt>
      <simpleChoice identifier="A">X-wing</simpleChoice>
      <simpleChoice identifier="B">Star Destroyer</simpleChoice>
      <simpleChoice identifier="C">Millennium Falcon</simpleChoice>
      <simpleChoice identifier="D">TIE Fighter</simpleChoice>
    </choiceInteraction>
  </itemBody>
  <responseDeclaration identifier="RESPONSE" cardinality="single" baseType="identifier">
    <correctResponse>
      <value>C</value>
    </correctResponse>
  </responseDeclaration>
</assessmentItem>
<assessmentItem xmlns="http://www.imsglobal.org/xsd/imsqtiasi_v3p0" identifier="item3" title="Which planet is Princess Leia from?">
  <itemBody>
    <choiceInteraction responseIdentifier="RESPONSE" shuffle="false" maxChoices="1">
      <prompt>Which planet is Princess Leia from?</prompt>
      <simpleChoice identifier="A">Tatooine</simpleChoice>
      <simpleChoice identifier="B">Naboo</simpleChoice>
      <simpleChoice identifier="C">Endor</simpleChoice>
      <simpleChoice identifier="D">Alderaan</simpleChoice>
    </choiceInteraction>
  </itemBody>
  <responseDeclaration identifier="RESPONSE" cardinality="single" baseType="identifier">
    <correctResponse>
      <value>D</value>
    </correctResponse>
  </responseDeclaration>
</assessmentItem>
That’s not valid QTI 3. As Colin said in his presentation, the current generation of LLMs just don’t understand QTI 3.
With some experimentation, we can do a bit better. Let the QTI Assessment Architect know that the output isn’t valid QTI 3 and give it the following example code from the QTI 3 specification:
<?xml version="1.0" encoding="UTF-8"?>
<qti-assessment-item xmlns="http://www.imsglobal.org/xsd/imsqtiasi_v3p0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.imsglobal.org/xsd/imsqtiasi_v3p0
      https://purl.imsglobal.org/spec/qti/v3p0/schema/xsd/imsqti_asiv3p0_v1p0.xsd"
    identifier="data-attribute-example" title="Use of data-attribute"
    adaptive="false" time-dependent="false" xml:lang="en-US">
  <qti-response-declaration identifier="RESPONSE" cardinality="single" base-type="identifier">
    <qti-correct-response>
      <qti-value>B</qti-value>
    </qti-correct-response>
  </qti-response-declaration>
  <qti-outcome-declaration identifier="SCORE" cardinality="single" base-type="float">
    <qti-default-value>
      <qti-value>0</qti-value>
    </qti-default-value>
  </qti-outcome-declaration>
  <qti-item-body>
    <p>Select the error in the following passage of text (or "No Error" if there is none).</p>
    <qti-hottext-interaction response-identifier="RESPONSE" max-choices="1"
        data-group-name="group1" data-deselection-allowed="false" data-dont-word-wrap="false">
      <p>Sponsors of the Olympic Games <qti-hottext identifier="A">who bought</qti-hottext>
        advertising time on United States television
        <qti-hottext identifier="B">includes</qti-hottext>
        <qti-hottext identifier="C">at least</qti-hottext> a dozen international firms
        <qti-hottext identifier="D">whose</qti-hottext> names are familiar to American consumers.<br/>
        <qti-hottext identifier="E">No error.</qti-hottext>
      </p>
    </qti-hottext-interaction>
  </qti-item-body>
  <qti-response-processing
      template="https://purl.imsglobal.org/spec/qti/v3p0/rptemplates/match_correct.xml"/>
</qti-assessment-item>
This time it outputs something that looks vaguely like QTI 3:
<?xml version="1.0" encoding="UTF-8"?>
<qti-assessment-item xmlns="http://www.imsglobal.org/xsd/qti/imsqti_asiv3p0_v1p0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.imsglobal.org/xsd/qti/imsqti_itemv3p0_v1p0 https://purl.imsglobal.org/spec/qti/v3p0/schema/xsd/imsqti_asiv3p0_v1p0.xsd"
    identifier="item1" title="Who is Luke Skywalker's father?" adaptive="false" time-dependent="false" xml:lang="en">
  <qti-response-declaration identifier="RESPONSE" cardinality="single" base-type="identifier">
    <qti-correct-response>
      <qti-value>C</qti-value>
    </qti-correct-response>
  </qti-response-declaration>
  <qti-outcome-declaration identifier="SCORE" cardinality="single" base-type="float"/>
  <qti-item-body>
    <qti-choice-interaction response-identifier="RESPONSE" max-choices="1">
      <qti-prompt>Who is Luke Skywalker's father?</qti-prompt>
      <qti-simple-choice identifier="A">Han Solo</qti-simple-choice>
      <qti-simple-choice identifier="B">Obi-Wan Kenobi</qti-simple-choice>
      <qti-simple-choice identifier="C">Darth Vader</qti-simple-choice>
      <qti-simple-choice identifier="D">Yoda</qti-simple-choice>
    </qti-choice-interaction>
  </qti-item-body>
  <qti-response-processing template="https://purl.imsglobal.org/spec/qti/v3p0/rptemplates/match_correct"/>
</qti-assessment-item>
Does this mean my GPT can now generate correct output for any QTI 3 assessment? No, I don't think so, but it does show that we can make progress even when QTI 3 examples aren't in the LLM's original training data. It's a starting point, and it suggests that LLMs can do much of the heavy lifting required by complex standards.
How can we improve the output further? Giving the LLM more examples will help a great deal. This is a simple first step, but it requires access to examples. In his presentation, Colin mentioned that they have around 1,000 QTI 3 examples. I contacted him to see if it would be possible to gain access to this large pool of examples. Unfortunately, the examples he was talking about are owned by large companies and are not publicly available.
The next step towards improving the output would be to build an agent framework capable of generating QTI, gathering feedback on it, and then generating updated QTI. I was originally headed in this direction. Paul Grudnitski, the CEO of Amp-up.io, a brilliant mind when it comes to assessment, and one of the chairs of the QTI 3 working group, has released an open-source QTI 3 Player. My original hope was to send the QTI 3 from QTI Assessment Architect to the player and then hack into the player to get error output for invalid QTI 3 input. Unfortunately, I didn't have the time to get the player running, so I never finished the experiment.
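If I ever pick the experiment back up, the shape of it would be something like the Python sketch below. generate_qti and get_player_errors are hypothetical placeholders of mine, standing in for the LLM call and for whatever error output we could coax out of the player (or any other validator):

# Rough sketch of the generate -> validate -> regenerate loop.
# generate_qti() and get_player_errors() are hypothetical placeholders for
# the LLM call and for error output from the QTI 3 Player (or a validator).
MAX_ATTEMPTS = 5

def author_item(prompt: str) -> str:
    xml = generate_qti(prompt)                      # ask the LLM for a QTI 3 item
    for _ in range(MAX_ATTEMPTS):
        errors = get_player_errors(xml)             # try to load/validate the item
        if not errors:
            return xml                              # no errors reported, keep it
        # hand the errors back to the LLM and ask for a corrected version
        xml = generate_qti(prompt, feedback=errors, previous=xml)
    raise RuntimeError("Could not produce valid QTI 3 within the retry budget")

The loop is boring on purpose: all of the intelligence lives in the LLM and the validator; the agent just shuttles errors back and forth until the output stops failing.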
I think it is possible to build a successful QTI 3 authoring environment using this approach without the need for OpenAI to include QTI 3-specific data in their training sets (although it wouldn't be a bad thing if they did). We would need to gather a lot more examples of valid QTI 3; Colin claims there are around a thousand of them in existence. Providing those examples to the LLM during a request would improve the quality and accuracy of the generated QTI. If we get really ambitious, and if we can persuade the right people to give us access to enough examples, OpenAI also lets you fine-tune some of their models. Fine-tuning is expensive, so it wouldn't be my first approach, but if a typical retrieval-augmented generation (RAG) setup fails to yield acceptable results, it remains an option.
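To sketch what that could look like: the crudest form of retrieval is simply stuffing a few known-good items into the prompt. The folder name, model choice, and prompt wording below are all placeholders of mine, not anything blessed by 1EdTech:

# Minimal RAG-style sketch: include a few valid QTI 3 items in the prompt.
# Assumes a local "qti3_examples/" folder of valid items; the folder, the
# model name, and the prompt text are illustrative only.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_messages(request: str, example_dir: str = "qti3_examples", k: int = 3) -> list:
    examples = [p.read_text() for p in sorted(Path(example_dir).glob("*.xml"))[:k]]
    system = (
        "You generate QTI 3 assessment items. Match the element names, "
        "attributes, and namespaces of these valid examples exactly:\n\n"
        + "\n\n".join(examples)
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": request}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=build_messages("Write a multiple-choice item asking who Luke Skywalker's father is."),
)
print(response.choices[0].message.content)

A real setup would pick the k examples most similar to the request (for example, other choice interactions when a multiple-choice item is asked for) rather than just the first few files, but even this naive version is essentially what I did by hand when I pasted the spec example into the GPT.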
LLMs are notoriously bad at math. However, give an LLM access to a calculator tool, and suddenly it can become an effective math tutor. Providing an AI with access to tools can dramatically improve its output. If there were a QTI 3 validator that produced reasonable error messages, the LLM could iterate against it until it arrived at accurate QTI 3 output.
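The QTI 3 XSDs are published, so a minimal validator tool is not hard to sketch with lxml. (The local schema path below is an assumption; you would need to download the schema files from 1EdTech first.)

# Sketch of a QTI 3 "validator tool" an LLM could call.
# Assumes the QTI 3 XSDs have been downloaded locally; the path is illustrative.
from lxml import etree

QTI3_XSD = "schemas/imsqti_asiv3p0_v1p0.xsd"

def validate_qti3(xml_text: str) -> list[str]:
    """Return a list of validation error messages (empty if the item is valid)."""
    schema = etree.XMLSchema(etree.parse(QTI3_XSD))
    try:
        doc = etree.fromstring(xml_text.encode("utf-8"))
    except etree.XMLSyntaxError as err:
        return [f"XML syntax error: {err}"]
    if schema.validate(doc):
        return []
    return [f"line {entry.line}: {entry.message}" for entry in schema.error_log]

Wire a function like that up as a tool the LLM can call (or as the get_player_errors placeholder from the earlier sketch), and the feedback loop closes without anyone having to eyeball namespaces again.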
I’ve spent a lot of my professional career reading through specifications in pursuit of building systems that could generate specification-compliant data. I’ve also spent a lot of brain cycles reading through complex data files, trying to figure out why some “specification compliant” system couldn’t read them, and then writing more code to massage the data into working. I hope to never have to do that again. This is an area where artificial intelligence will help us regain our time and our sanity. It’s a space that should be handled by machines capable of understanding vast amounts of information and then translating that understanding into the desired output. I look forward to the day when all of us in the education ecosystem can focus more on the student’s experience and less on the technology needed to help them achieve their goals.