Quantitative Analysis

Measuring Inclusive Teaching Signals Using LLM-Assisted Syllabus Analysis

Summary

I led a quantitative research project examining how instructors communicate inclusive teaching practices through their course syllabi. Using an 88-item evidence-based inventory and a large-language-model (LLM) analysis pipeline, I quantified inclusion signals across 30 syllabi and linked them to students’ perceptions of instructor considerateness. Specifically, I:

  • Developed an automated LLM-based coding pipeline (few-shot + chain-of-thought prompting) to classify the 88 inclusive-practice items, with substantial human–LLM agreement (Cohen’s κ = 0.617).

  • Engineered an Inclusive Practice Score (IPS) and analyzed it with regression modeling, PCA, sentiment analysis, and text-readability metrics.

  • Found a significant relationship: more inclusive syllabi → higher perceived instructor considerateness (p < .01).

  • Created visualizations (Python Matplotlib, Seaborn) to surface actionable insights across 8 pedagogical dimensions.

  • Translated the work into UX value: demonstrated a scalable, objective method for assessing complex, human-centered content with LLMs and quantitative analysis.

The analysis revealed strong engagement and transparency signals across disciplines, but also consistent blind spots in belonging cues and diversity representation—areas shown to matter for student success. Insights from the project informed practical improvement recommendations and showed how AI can support equitable teaching practices at scale.

The Problem

Although inclusive teaching has been emphasized for years, institutions still lack reliable and scalable ways to measure whether instructors actually implement inclusive practices. Most instructors self-report that they teach inclusively, but:

  • inclusive teaching practices are often invisible in classroom materials,

  • existing rubrics require manual review, which does not scale,

  • and there is no quantitative method to analyze inclusion across large numbers of courses.

As a result, departments cannot easily compare teaching practices, track improvement, or identify which inclusive strategies meaningfully shape student experience.

My project addresses this long-standing gap by creating an automated, LLM-driven approach to a wicked measurement problem: quantifying inclusive practices in syllabi with the Inclusive Syllabi Inventory (ISI). This makes it possible to examine real patterns in how instructors communicate inclusion and to test whether inclusive language actually influences how students perceive their instructor.

For this project, the problem becomes:

What patterns of inclusive teaching practices emerge across inclusive instructors?

How does inclusivity shape impressions of the instructor?

Goal

Build and validate a scalable, semi-automated pipeline that uses LLMs to:

  1. Identify inclusive teaching practices embedded in syllabus text.

  2. Quantify those practices across eight research-based pedagogical dimensions (e.g., Engagement Opportunities, Growth Mindset, Cultivating Belonging).

  3. Visualize patterns and gaps in a way that surfaces improvement opportunities.

  4. Assess downstream impact—whether more inclusive syllabi are associated with more positive perceptions of the instructor.

Research Process

  1. Data Collection

    • 30 course syllabi volunteered by instructors across 8 disciplines.

    • Each document treated as a "product touchpoint" conveying expectations, tone, and support resources.

  2. Coding Framework

    • Adapted the Inclusive Syllabi Inventory (ISI) into 88 binary yes/no items.

    • Items mapped to 8 behavioral dimensions aligned with instructional psychology (e.g., transparency, resources, exploration).

    • Constructed a composite Inclusive Practice Score (IPS) for each syllabus from the binary item codes (a minimal scoring sketch follows this step).
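
    A minimal sketch of this scoring step, assuming the 88 binary codes for each syllabus live in a hypothetical isi_codes.csv, an item-to-dimension lookup lives in isi_item_map.csv, and the IPS is the percentage of items marked "yes"; the filenames, column names, and scoring rule are illustrative assumptions rather than the exact pipeline.

    ```python
    import pandas as pd

    # Hypothetical layout: one row per syllabus, one 0/1 column per ISI item.
    codes = pd.read_csv("isi_codes.csv", index_col="syllabus_id")     # 30 x 88 binary matrix (assumed file)
    item_map = pd.read_csv("isi_item_map.csv", index_col="item")      # item -> dimension lookup (assumed file)

    # Overall Inclusive Practice Score: percent of the 88 items marked "yes".
    ips = codes.mean(axis=1) * 100

    # Dimension-level sub-scores: percent of items endorsed within each of the 8 dimensions.
    dim_scores = (
        codes.T.join(item_map)        # rows = items, columns = syllabi + "dimension"
             .groupby("dimension")
             .mean()                  # mean of the 0/1 codes per dimension, per syllabus
             .T * 100
    )

    print(ips.describe())             # spread of IPS across the 30 syllabi
    print(dim_scores.round(1))
    ```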

  3. LLM-Based Analysis Pipeline

    • Used GPT-4o mini for yes/no classification of each item via:

      • Few-shot prompting

      • Chain-of-thought reasoning

      • Prompt optimization to increase classification accuracy

    • Human–LLM inter-rater reliability: Cohen’s κ = 0.617, substantial agreement and adequate for exploratory quantitative research (a minimal classification sketch follows this step).
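
    A minimal sketch of the classification step, assuming the OpenAI Python SDK and the gpt-4o-mini model; the few-shot examples, prompt wording, and the classify_item helper and kappa check are illustrative stand-ins for the optimized prompts and validation set actually used.

    ```python
    from openai import OpenAI
    from sklearn.metrics import cohen_kappa_score

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Illustrative few-shot block; the real prompts were iteratively optimized.
    FEW_SHOT = """Example 1:
    Item: "Syllabus points students to free or low-cost course materials."
    Excerpt: "All readings are available at no cost through the library."
    Reasoning: The excerpt explicitly says readings are free. Answer: YES

    Example 2:
    Item: "Syllabus invites students to share their pronouns."
    Excerpt: "Office hours are Tuesdays 2-4 pm."
    Reasoning: Nothing about pronouns appears. Answer: NO
    """

    def classify_item(item_text: str, syllabus_text: str) -> int:
        """Return 1/0 for one ISI item using few-shot + chain-of-thought prompting (illustrative)."""
        prompt = (
            "You are coding a course syllabus against one inclusive-practice item.\n"
            + FEW_SHOT
            + f'\nItem: "{item_text}"\nExcerpt: "{syllabus_text[:4000]}"\n'
            + "Think step by step, then finish with 'Answer: YES' or 'Answer: NO'."
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return int("ANSWER: YES" in response.choices[0].message.content.upper())

    # Reliability check against a human-coded subset (hypothetical 0/1 label lists):
    # kappa = cohen_kappa_score(human_labels, llm_labels)
    ```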

  4. Text Quality & Sentiment Analysis

    Computed additional document-level metrics (see the sketch after this step):

    • Readability indicators

      • Flesch Reading Ease

      • SMOG Grade Level

    • Sentiment & tone measures

      • VADER

      • TextBlob polarity & subjectivity
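
    A minimal sketch of these metrics, assuming the textstat, vaderSentiment, and textblob packages; the document_metrics wrapper and the exact set of fields are illustrative rather than the precise implementation.

    ```python
    import textstat
    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    def document_metrics(text: str) -> dict:
        """Readability and tone indicators computed per syllabus (illustrative wrapper)."""
        vader = SentimentIntensityAnalyzer()
        blob = TextBlob(text)
        return {
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
            "smog_grade": textstat.smog_index(text),
            "vader_compound": vader.polarity_scores(text)["compound"],
            "textblob_polarity": blob.sentiment.polarity,
            "textblob_subjectivity": blob.sentiment.subjectivity,
            "word_count": textstat.lexicon_count(text),
        }
    ```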

  5. Statistical Modeling

    • Ran multiple regressions to test whether inclusive practices predict positive instructor impressions.

    • Instructor impressions were LLM-rated (Accessible, Generous, Lenient, Tries-to-Engage, Serious).

    • PCA + Cronbach’s alpha used to derive a latent Considerateness metric (a minimal modeling sketch follows this step).
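
    A minimal sketch of this modeling step, assuming a hypothetical per-syllabus DataFrame df that holds the five impression ratings, the IPS, and the document metrics; the column names, the use of the first principal component as Considerateness, and the choice of control variables are assumptions for illustration.

    ```python
    import pandas as pd
    import statsmodels.api as sm
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    IMPRESSIONS = ["accessible", "generous", "lenient", "tries_to_engage", "serious"]

    def cronbach_alpha(items: pd.DataFrame) -> float:
        """Internal consistency of the impression items (standard alpha formula)."""
        k = items.shape[1]
        return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

    def fit_models(df: pd.DataFrame):
        """Derive latent Considerateness via PCA, then regress it on IPS plus text controls."""
        df = df.copy()
        z = StandardScaler().fit_transform(df[IMPRESSIONS])
        df["considerateness"] = PCA(n_components=1).fit_transform(z)[:, 0]

        X = sm.add_constant(df[["ips", "word_count", "smog_grade", "flesch_reading_ease"]])
        model = sm.OLS(df["considerateness"], X).fit()
        return cronbach_alpha(df[IMPRESSIONS]), model
    ```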

  6. Visualization (exploratory data storytelling)

    • Built visualizations (with Python Seaborn and Matplotlib):

      • Category-level bar charts

      • Full-hierarchy sunburst charts

      • Simplified per-category views

    • Designed to help users (instructors) interpret signals and pinpoint gaps (a minimal bar-chart sketch follows this step).
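
    A minimal sketch of one category-level view, reusing the dim_scores frame from the scoring sketch above; the styling is illustrative, and the sunburst views are omitted here because they typically rely on an additional plotting library.

    ```python
    import matplotlib.pyplot as plt
    import seaborn as sns

    def plot_dimension_coverage(dim_scores):
        """Horizontal bar chart of mean item coverage per ISI dimension across all syllabi."""
        means = dim_scores.mean().sort_values()
        fig, ax = plt.subplots(figsize=(8, 4))
        sns.barplot(x=means.values, y=means.index, color="#4C72B0", ax=ax)
        ax.set_xlabel("Mean % of items present")
        ax.set_ylabel("ISI dimension")
        ax.set_title("Inclusive-practice coverage by dimension (n = 30 syllabi)")
        fig.tight_layout()
        return fig
    ```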

Findings & Insights

  1. Inclusive Practices Vary Widely Across Courses

    Across 30 syllabi, Inclusive Practice Scores (IPS) ranged from ~32% to ~78%.

    This means most instructors demonstrate some inclusive behaviors, but the depth and consistency vary by individual, not discipline.

    Implication:

    Inclusion-supportive signals are not embedded systematically; instructors need clearer frameworks and support.

  2. Some Inclusion Dimensions Are Strongly Represented; Others Are Missing

    High-frequency dimensions:

    • Engagement Opportunities

    • Exploration

    • Requirement Transparency

    • Growth Mindset signals

    Low-frequency dimension:

    • Diversity Scholarship (almost entirely absent)

    Implication:

    Inclusion “hot spots” and “blind spots” exist. Any toolkit or intervention should prioritize the gaps that matter most for students (belonging, representation).

  3. More Inclusive Syllabi Are Perceived as More “Considerate”

    Regression showed IPS significantly predicts Considerateness ratings (LLM-coded):

    A 10-point increase in IPS → ~0.19 increase in Considerateness (p < .002); the arithmetic is worked through after this finding.

    Other predictors (word count, SMOG, Flesch score) were weaker or inconsistent.

    Implication for design:

    Inclusive language and structure actually shape early student impressions of the instructor—before class even begins.
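
    In coefficient terms, and assuming IPS is on the 0–100 percentage scale reported above, this implies a slope of roughly 0.19 / 10 ≈ 0.019 Considerateness units per IPS point; a syllabus moving from, say, 45% to 65% would therefore be expected to gain about 20 × 0.019 ≈ 0.38, holding the other predictors constant.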

  4. Inclusion Isn’t Discipline-Dependent

    STEM vs. Humanities differences were small.

    High-inclusivity and low-inclusivity syllabi existed in all fields.

    Implication:

    Interventions can be standardized across departments rather than discipline-specific.

Outcome & Impact

  1. Provided a Scalable Way to Audit Course Inclusivity

    Manual review of 30 syllabi would require 40–60+ hours.

    The LLM-based pipeline produced quantitative findings in minutes, enabling curriculum leaders to audit large departments at scale.

  2. Demonstrated That Inclusive Practices Are Linked to Perceived Instructor Considerateness

    This identifies an actionable behavioral outcome:

    → More inclusive syllabi → a more considerate instructor impression → better student engagement.

    This creates a business case for adoption.

  3. Produced Evidence-Based Recommendations for Instructor Professional Development

    Insights highlighted three high-impact improvements:

    • Increase belonging cues

    • Reduce reading complexity

    • Add diversity scholarship signals

    These recommendations can inform faculty development programs or AI-supported instructor support tools.

  4. Created a Framework That Can Generalize to Other Educational Content

    The approach can extend to:

    • assignment sheets

    • discussion prompts

    • LMS pages

    • lecture slides

    This opens a new direction for data-driven teaching analytics.

What I Learned

Through this project, I learned how to take an ambiguous, human-centered problem—“how do we measure inclusive teaching?”—and translate it into a rigorous, scalable analytical pipeline. I deepened my ability to design measurable constructs, engineer features from unstructured text, and validate automated coding using inter-rater reliability.

I strengthened my quantitative toolkit by applying exploratory data analysis, readability metrics, sentiment analysis, PCA, and regression modeling to uncover meaningful behavioral patterns.

Most importantly, I learned how to turn complex statistical findings into clear, actionable insights for nontechnical stakeholders, and how to prototype an AI-supported evaluation tool that demonstrates both technical execution and product thinking.