In every election, after the polls close and the votes are counted, there comes a time for reflection. Pundits appear on cable news to offer theories, columnists pen op-eds with warnings and advice for the winners and losers, and parties conduct postmortems.
The 2020 U.S. presidential election, in which Donald Trump lost to Joe Biden, was no exception.
For Caltech undergrad Sreemanti Dey, the election offered a chance to do her own sort of reflection. Dey, who is majoring in computer science, has a particular interest in using computers to better understand politics. Working with Michael Alvarez, professor of political and computational social science, Dey used machine learning and data collected during the 2020 election to find out what actually motivated people to vote for one presidential candidate over another.
In December, Dey presented her work on the topic at the fourth-annual International Conference on Applied Machine Learning and Data Analytics, which was held remotely. The organizers recognized her paper as the best paper at the conference.
We recently chatted with Dey and Alvarez, who is co-chair of the Caltech-MIT Voting Project, about their research, what machine learning can offer to political scientists, and what it is like for undergrads doing research at Caltech.
Sreemanti, you are majoring in computer science and data science. So why research an election? What is interesting to you about this particular topic?
Sreemanti Dey: I think that how elections are run has become a really salient issue in the past couple of years. Politics is at the forefront of people's minds because things have gotten so, I guess, strange and chaotic recently. That, along with a lot of factors in 2020, made people care a lot more about voting. That makes me think it's really important to study how elections work and how people choose candidates in general.
What did you hope to learn by using machine learning to study the election results of 2020? What was your goal?
Sreemanti: I've learned from Mike that a lot of social science studies are deductive in nature. So, you pick a hypothesis and then you pick the data that would best help you understand the hypothesis that you've chosen. We wanted to take a more open-ended approach and see what the data itself told us. And, of course, that's precisely what machine learning is good for.
What makes machine learning well suited to this research? Why can't people just look at the data themselves?
Sreemanti: In this particular case, it was a matter of working with a large amount of data that you can't filter through yourself without introducing a lot of bias. And that could be just you choosing to focus on the wrong issues. Machine learning and the model that we used are a good way to reduce the amount of information you're looking at without bias.
That model you used is called the Fuzzy Forest algorithm. Can you explain how it works?
Sreemanti: Basically, it's a way of reducing high-dimensional data sets to the most important factors in the data set. So it goes through a couple of steps. It first groups all the features of the data into modules so that the features within a module are very correlated with each other, but there is not much correlation between modules. Then, since each module represents the same type of features, it reduces how many features are in each module. And then at the very end, it combines all the modules together and takes one last pass to see if the set can be reduced any further.
Mike: This technique was developed by Christina Ramirez (MS '96, PhD '99), a PhD graduate of our program now at UCLA. Christina is someone who I've collaborated with quite a bit. Sreemanti and I were meeting pretty regularly with Christina and getting some advice from her along the way about this project and some others that we're thinking about.
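For readers who want to see the three steps described above in concrete form, here is a minimal Python sketch. It is a simplified illustration and not the authors' implementation: it assumes ordinary hierarchical clustering on feature correlations as a stand-in for the module-grouping step, and uses scikit-learn random forests for the elimination passes. The function names (group_into_modules, eliminate, fuzzy_forest_sketch), the thresholds, and the synthetic data are all hypothetical choices made for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.ensemble import RandomForestClassifier


def group_into_modules(X, max_distance=0.5):
    """Step 1: group columns so features within a module are strongly correlated.

    Assumption: average-linkage clustering on 1 - |correlation| stands in for
    the module-detection step of the actual method.
    """
    corr = np.corrcoef(X, rowvar=False)
    dist = 1.0 - np.abs(corr)
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=max_distance, criterion="distance")
    return [np.flatnonzero(labels == k) for k in np.unique(labels)]


def eliminate(X, y, candidates, keep, drop_fraction=0.25, seed=0):
    """Steps 2 and 3: repeatedly refit a forest and drop the least important columns."""
    candidates = np.asarray(candidates)
    while len(candidates) > keep:
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        forest.fit(X[:, candidates], y)
        n_drop = max(1, int(drop_fraction * len(candidates)))
        n_keep = max(keep, len(candidates) - n_drop)
        order = np.argsort(forest.feature_importances_)[::-1]
        candidates = candidates[order[:n_keep]]
    return candidates


def fuzzy_forest_sketch(X, y, keep_per_module=2, keep_final=2):
    """Screen each module separately, pool the survivors, then take one last pass."""
    modules = group_into_modules(X)
    survivors = np.concatenate(
        [eliminate(X, y, module, keep_per_module) for module in modules]
    )
    return eliminate(X, y, survivors, keep_final)


# Synthetic stand-in for survey data: two groups of four correlated features.
# Only the first group (columns 0-3) is related to the outcome.
rng = np.random.default_rng(0)
a = rng.normal(size=400)
b = rng.normal(size=400)
X = np.column_stack(
    [a + 0.3 * rng.normal(size=400) for _ in range(4)]
    + [b + 0.3 * rng.normal(size=400) for _ in range(4)]
)
y = (a > 0).astype(int)

modules = group_into_modules(X)
selected = fuzzy_forest_sketch(X, y)
print("modules:", [m.tolist() for m in modules])
print("selected feature columns:", sorted(selected.tolist()))
```

Screening within each module first means that a large block of highly correlated features has to compete internally before it can crowd out features from other modules in the final pass, which is the motivation for the grouping step described in the interview.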
You found that partisan polarization was the strongest factor that influenced why people vote for a candidate. Was that surprising at all?
Sreemanti: I think we got pretty much what we expected, except for what the most partisan-coded issues are. Those I found a little bit surprising. The most partisan questions turned out to be about filling the Supreme Court seats. I thought that it was interesting.
Sreemanti, what is it like to do this level of research as an undergrad?
Sreemanti: It's really incredible. I find it astonishing that a person like Professor Alvarez has the time to focus so much on the undergraduates in lab. I did research in high school, and it was an extremely competitive environment trying to get attention from professors or even your mentor.
It's a really nice feature of Caltech that professors are very involved with what their undergraduates are doing. I would say it's a really incredible opportunity.
Mike, it sounds like it is not unusual for undergraduates to be doing this kind of work in your group. Why do you feel it is important to give undergrads these opportunities?
Mike: I and most of my colleagues work really hard to involve the Caltech undergraduates in a lot of the research that we do. A lot of that happens in the SURF [Summer Undergraduate Research Fellowship] program in the summers. But it also happens throughout the course of the academic year.
What's a little bit unusual here is that undergraduate students typically take on smaller projects. They typically work on things for a quarter or a summer. And while they do a good job on them, they don't usually reach the point where they produce something that's potentially publication quality.
Sreemanti started this at the beginning of her freshman year and we worked on it through her entire freshman year. That gave her the opportunity to really learn the tools, read the political science literature, read the machine learning literature, and take this to a point where at the end of the year, she had produced something that was of publication quality.
Was it at all challenging to present at an online conference?
Sreemanti: It was a little bit strange, first of all, because of the time zone issue. This conference was in a completely different time zone, so I ended up waking up at 4 a.m. for it. And then I had an audio glitch halfway through that I had to fix, so I had some very typical Zoom-era problems and all that.
Mike: This is a pandemic-era story with how we were all working to cope and trying to maintain the educational experience that we want our undergraduates to have. We were all trying to make sure that they had the experience that they deserved as a Caltech undergraduate and trying to make sure they made it through the freshman year.
We have the most amazing students imaginable, and to be able to help them understand what the research experience is like is just an amazing opportunity. Working with students like Sreemanti is the sort of thing that makes being a Caltech faculty member very special. And it's a large part of the reason why people like myself like to be professors at Caltech.
Sreemanti, is there anywhere that you hope to take this particular line of research? Where do you want your research to go from here?
Sreemanti: I think I would want to continue studying how people make their choices about candidates but maybe in a slightly different way with different data sets. Right now, from my other projects, I think I'm learning how to not rely on surveys and rely on more organic data, for example, from social media. I would be interested in trying to find a way to study people's candidate choice from their more organic interactions with other people.
Sreemanti's paper, titled "Fuzzy Forests for Feature Selection in High-Dimensional Survey Data: An Application to the 2020 U.S. Presidential Election," was presented in December at the fourth-annual International Conference on Applied Machine Learning and Data Analytics, where it won the best paper award.