Location: Haines A25
Yanhui Wu, USC Marshall School of Business
The rise of social media enables millions of citizens to generate information on sensitive political issues and social events. Such information is scarce in authoritarian countries and is tremendously valuable for surveillance and social studies. In efforts to use social media information, however, censorship stands as a formidable obstacle to informative description and accurate statistical inference. Likewise, in medical research, the proportions of disease types in a sample might not represent the proportions in the general population. To solve the information distortion problem caused by data distortion such as unpredictable censorship and non-representative sampling, we propose a new distortion-invariant statistical approach to parse data, based on the Neyman-Pearson (NP) classification paradigm. Under general conditions, we derive explicit formulas for the after-distortion oracle classifier in terms of the distortion rates β0 and β1 on Class 0 and Class 1, respectively, and show that the NP oracle classifier is independent of the distortion scheme. We illustrate the method by combining the recently developed NP umbrella algorithm with topic modeling to automatically detect posts related to strikes and corruption in samples of randomly selected posts extracted from Sina Weibo, the Chinese equivalent of Twitter. In situations where type I errors are unacceptably large under the classical classification framework, our proposed approach keeps type I errors below a desired upper bound.
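To make the type I error control concrete, the following is a minimal sketch of the threshold-selection step commonly described for the NP umbrella algorithm; it is not taken from the talk itself. It assumes a scoring function has already been trained, and that a held-out set of class-0 scores is available. The function names `np_threshold_rank` and `np_threshold` are hypothetical, and the tolerance `delta` (the allowed probability that the type I error exceeds `alpha`) is an assumption not mentioned in the abstract.

```python
from math import comb


def np_threshold_rank(n, alpha, delta):
    """Return the smallest 1-indexed rank k such that thresholding at the
    k-th smallest of n held-out class-0 scores gives
    P(type I error > alpha) <= delta, or None if no rank qualifies.

    Assumption: the violation probability for rank k is bounded by
    sum_{j=k}^{n} C(n, j) * (1 - alpha)^j * alpha^(n - j).
    """
    for k in range(1, n + 1):
        violation = sum(
            comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
            for j in range(k, n + 1)
        )
        if violation <= delta:
            return k
    return None


def np_threshold(class0_scores, alpha=0.05, delta=0.05):
    """Pick a score threshold from held-out class-0 scores.

    A new observation would be labeled Class 1 when its score is
    strictly greater than the returned threshold.
    """
    scores = sorted(class0_scores)
    k = np_threshold_rank(len(scores), alpha, delta)
    if k is None:
        raise ValueError("too few class-0 scores for this alpha and delta")
    return scores[k - 1]
```

Because the threshold is chosen from class-0 scores alone, this step does not use the class proportions in the sample, which is consistent with the abstract's claim that the NP approach is insensitive to how the two classes are distorted relative to each other.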