【CV】Professor Mochen Yang's CV
【Speaker】Mochen Yang, Assistant Professor, Kelley School of Business, Indiana University
【Topic】Generating Instrumental Variables via Random Forest to Address Endogeneity due to Prediction (Measurement) Error in Data-Mined Variables
【Time】Thursday, Nov. 29, 2018, 15:00-16:30
【Venue】Room 513, Weilun Building, Tsinghua SEM
【Language】English
【Organizer】Department of Management Science and Engineering
【Abstract】Combining machine learning with econometric analysis is increasingly common in both research and practice. In this work, we address one common example: the use of predictive modeling techniques to "mine" variables of interest from unstructured data, e.g., predicting sentiment from text, followed by the inclusion of those variables in an econometric framework, with the objective of making statistical inferences. We consider recent work highlighting that, because the predictions from machine learning models are inevitably imperfect, econometric analyses involving the predicted variables will suffer from biases and endogeneity deriving from measurement error. We propose a novel approach that mitigates these biases, leveraging instrumental variables generated from an ensemble learning technique known as the random forest. The random forest algorithm performs best when composed of a set of trees that are individually accurate in their predictions and that make "different" mistakes, i.e., have weakly correlated prediction errors. A key observation is that these properties are close analogs of the relevance and exclusion requirements for a valid instrumental variable. We design a data-driven procedure to select tuples of individual trees from a random forest, in which one tree serves as the endogenous covariate and the other trees as its instruments. Simulation experiments demonstrate the efficacy of the proposed approach in mitigating estimation biases, and its superior performance relative to an alternative method (simulation-extrapolation) proposed in prior work for addressing this problem.
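
The intuition behind the abstract can be illustrated with a small simulation. The sketch below is not the speaker's procedure: it does not train a random forest or select trees. It assumes two hypothetical "trees" (`tree_a`, `tree_b`) whose predictions equal a latent variable plus independent noise, and compares a naive regression on one tree's prediction with a simple instrumental-variable estimate that uses the other tree as the instrument. All names, the noise levels, and the true coefficient `beta` are assumptions chosen for illustration.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n=20000, beta=2.0, seed=0):
    """Return (naive OLS slope, IV slope) for a simulated data set."""
    rng = random.Random(seed)
    # Latent variable of interest (e.g., true sentiment); never observed directly.
    x_true = [rng.gauss(0, 1) for _ in range(n)]
    # Two hypothetical tree predictions: each is informative about x_true
    # (relevance) and their errors are independent of each other (exclusion).
    tree_a = [x + rng.gauss(0, 1) for x in x_true]
    tree_b = [x + rng.gauss(0, 1) for x in x_true]
    # Outcome depends on the latent variable, not on the predictions.
    y = [beta * x + rng.gauss(0, 1) for x in x_true]
    # Naive regression of y on tree_a: biased toward zero by measurement error.
    ols = cov(tree_a, y) / cov(tree_a, tree_a)
    # IV estimate: tree_a is the endogenous covariate, tree_b its instrument.
    iv = cov(tree_b, y) / cov(tree_b, tree_a)
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate()
    print(f"naive OLS slope: {ols:.3f}")
    print(f"IV slope:        {iv:.3f}")
```

With equal variance in the latent variable and in each tree's error, the naive slope should shrink toward roughly half of `beta`, while the IV slope should land near `beta`. Real trees from the same forest have correlated errors, which is why the abstract describes a data-driven selection step rather than pairing trees arbitrarily.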