"Hierarchical Modeling of Human Language Processing with Large Language Model for Multimodal Depression Detection (Coming soon)"
Depression severely impacts quality of life, making accurate early diagnosis crucial for mitigating long-term consequences. Previous studies have leveraged audio and text but typically treat them in isolation, overlooking the cognition-interpretation-expression processes underlying conversation. We present the Cognition-Interpretation-Expression Depression Network (CIEDep-Net), a multimodal framework that functionally models this three-stage structure. Each stage analyzes inputs with features tailored to its role, and a strategy combining score-conditioned fusion and cross-attention captures inter-stage interactions. In the interpretation stage, a large language model equipped with Chain-of-Thought prompting and a self-consistency strategy generates depression scores and an inner summary from the interview transcript. The generated outputs show statistically significant agreement with the ground-truth data (Pearson correlation coefficient r = 0.69, p < 0.01; BERTScore = 0.8), supporting the predictive validity of the proposed approach. CIEDep-Net achieves an MAE of 1.98 and a CCC of 0.884 on DAIC-WOZ, and an MAE of 2.72 and a CCC of 0.78 on E-DAIC, a 4.56% MAE reduction over the strongest prior multimodal baseline. Ablations confirm that removing any stage degrades performance, underscoring the complementary contributions of cognition, interpretation, and expression. By embedding human language processing mechanisms into multimodal learning, CIEDep-Net delivers reliable, consistent depression-severity prediction across datasets. The approach suggests a pathway toward clinically meaningful and scalable assessment through the integration of linguistic and paralinguistic cues.
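As a rough illustration of the interpretation stage described above, the sketch below shows one way Chain-of-Thought prompting with self-consistency could produce a stage-level depression score and retain the reasoning chains as an "inner summary". The prompt wording, the `sample_llm` callable, the PHQ-style 0-24 score scale, and the median aggregation rule are assumptions made for this example; the paper itself only states that CoT prompting plus self-consistency is used, not the exact prompt or aggregation.

```python
import re
import statistics
from typing import Callable, List, Tuple


def self_consistent_interpretation(
    transcript: str,
    sample_llm: Callable[[str], str],  # hypothetical: returns one temperature-sampled CoT completion per call
    n_samples: int = 5,
) -> Tuple[float, List[str]]:
    """Sample several Chain-of-Thought completions for the same transcript and
    aggregate them: the median of the extracted scores serves as the stage-level
    depression score, and the reasoning texts are kept as the inner summary."""
    prompt = (
        "You are analysing a clinical interview transcript.\n"
        "Think step by step about the participant's mood, interests, sleep, "
        "energy, and self-evaluation, then end with a line of the form "
        "'SCORE: <0-24>' estimating depression severity.\n\n"
        f"Transcript:\n{transcript}\n"
    )
    scores, summaries = [], []
    for _ in range(n_samples):
        completion = sample_llm(prompt)  # one independent reasoning chain
        match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", completion)
        if match:
            scores.append(float(match.group(1)))
            summaries.append(completion)
    if not scores:
        raise ValueError("No parsable score in any sampled completion")
    return statistics.median(scores), summaries


# Toy usage with a stub sampler; a real system would call an LLM with temperature > 0
# so that the sampled reasoning chains differ and self-consistency is meaningful.
fake_llm = lambda _prompt: "The participant reports low mood and poor sleep... SCORE: 12"
score, chains = self_consistent_interpretation(
    "Ellie: How have you been feeling lately?", fake_llm, n_samples=3
)
print(score)  # 12.0
```

The aggregated score and summaries would then feed the score-conditioned fusion and cross-attention with the cognition and expression stages; that fusion step is not sketched here.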