Analyzing Data, Finding Hidden Structure
by Jackie Swift
“Statistics is unique; it’s like a language of science,” says Sumanta Basu, Statistics and Data Science/Computational Biology. “It gives researchers a rigorous framework to clearly present the logic behind their algorithms and analyses, as well as the assumptions they are making, and helps them communicate their findings to their peers in the scientific community.”
Basu focuses on high-dimensional time series analysis, which looks at nonlinear interaction through time. He works with colleagues from a range of disciplines whose research demands the analysis of large amounts of data. “My research is primarily about using data to understand the structure of a complex system and predicting what will happen to the system at a future time point,” Basu explains.
Looking for Strong Associations in Financial Markets
Basu first began looking at systemic risk, or the interconnectivity of players in financial markets, in graduate school at the University of Michigan. With collaborators from the areas of statistics and economics, he analyzed the connections between the monthly stock returns of a host of firms around the time of the 2008 financial crash. “We asked, ‘When one firm’s stock went down, which other firms also went down?’ The goal is to understand not a single bank’s risk or a single firm’s risk, but the negative impact posed by one firm on another,” he says.
This system-wide view of the financial network allows regulators and policymakers to make decisions that affect the whole network, Basu explains. In the case of the 2008 crash, policymakers had to decide which financial firms to bail out without a clear picture of how the firms were interconnected. “If I’m looking at whether to bail out Lehman Brothers, for instance, and I see that nine out of ten times when Lehman stock moves up or down, two other firms’ stocks move in tandem with it, then I need to think about the impact on those other two firms if Lehman goes under,” he says.
This type of macroprudential policymaking depends on analysis that searches for strong associations among hundreds of firms. The researchers developed a method to do this, which they call the Lasso Penalized Vector Autoregressive Model. “This kind of analysis requires a new type of statistical methodology because you can’t know anything for sure,” Basu says. “You have to account for uncertainty and devise algorithms that can give you results reasonably fast. You also need to understand the properties of your algorithms — when they will work and when they might fail.”
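To make the idea concrete, here is a minimal sketch of a lasso-penalized vector autoregression of order one, fit by coordinate descent. It is an illustration only, not the researchers' implementation: the function names, the four-series toy system, the penalty value, and the simulated data are all assumptions made for this example. Each series is regressed on the previous values of every series, and the lasso penalty shrinks weak associations to exactly zero so that only the strong ones remain.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; the core lasso update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_regression(X, y, lam, n_sweeps=50):
    """Coordinate descent for (1/2n) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_scale = [sum(X[t][j] ** 2 for t in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue
            # Covariance of column j with the residual that ignores column j.
            rho = sum(X[t][j] * (residual[t] + coef[j] * X[t][j])
                      for t in range(n)) / n
            new_coef = soft_threshold(rho, lam) / col_scale[j]
            delta = new_coef - coef[j]
            if delta != 0.0:
                for t in range(n):
                    residual[t] -= delta * X[t][j]
                coef[j] = new_coef
    return coef

def fit_lasso_var1(series, lam):
    """Estimate A in x_t = A x_{t-1} + noise, one lasso regression per row."""
    lagged = series[:-1]
    current = series[1:]
    p = len(series[0])
    return [lasso_regression(lagged, [row[i] for row in current], lam)
            for i in range(p)]

def simulate_var1(A, n_steps, seed=0):
    """Generate toy data from a known VAR(1) with unit-variance Gaussian noise."""
    rng = random.Random(seed)
    p = len(A)
    x = [0.0] * p
    out = []
    for _ in range(n_steps):
        x = [sum(A[i][j] * x[j] for j in range(p)) + rng.gauss(0.0, 1.0)
             for i in range(p)]
        out.append(x)
    return out

# Hypothetical system: series 0 influences series 1 one step later.
true_A = [
    [0.5, 0.0, 0.0, 0.0],
    [0.4, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.5],
]
series = simulate_var1(true_A, n_steps=1000)
estimated_A = fit_lasso_var1(series, lam=0.15)

for row in estimated_A:
    print([round(value, 2) for value in row])
```

In this toy setting, the entry in row 1, column 0 of the estimated matrix plays the role of "firm 0 moves, then firm 1 follows." The penalty also biases the surviving coefficients toward zero, which is one of the properties an analyst would need to understand before relying on the output.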
Basu is currently working with David Easley, Economics, to look at high-frequency financial market data. “What happens if we look at data that is more high-resolution?” Basu asks. “We want to understand to what extent the patterns we’re seeing in the monthly stock return analysis will hold up if we dive deeper into intraday stock prices and trading activity because people make financial decisions on a much finer time scale. If you see something happening in the market, you don’t wait a month to make a decision; you probably change your investing strategy in a matter of days.”
Exploring the Fruit Fly Immune System
The mathematical machinery Basu has developed to analyze the financial markets can also be applied to other types of questions in other disciplines. In collaboration with Andrew Clark, Molecular Biology and Genetics, and Martin Wells, Statistics and Data Science/Social Statistics/Computational Biology, Basu is modifying his core work to answer questions regarding the structure of the immune system in Drosophila melanogaster, the common fruit fly. “There are genetic similarities between Drosophila and humans,” Basu says. “If we can better understand how the fly’s immune system works, this may help us in the long term with our understanding of the human immune system as well.”
The Clark lab injects fruit flies with a bacteria-like compound and then, every 30 minutes over many hours, measures the expression of approximately 15,000 genes present in the insects’ DNA. “At the end of the experiment, they have the time series data of the expressions of those genes spanning, say, five days,” Basu says. “So from my perspective, I’m looking at the time series of a number of gene expressions in the fly body rather than the time series of many banks’ stock returns.”
Basu and his colleagues plan to analyze this data and generate new hypotheses regarding which genes are cross-talking with each other. “A gene usually does not work on its own,” Basu says. “They interact with each other to trigger a biological response in the body. We can’t prove anything without experimentation, but we hope that if we look for these types of lead/lag patterns among gene trajectories, we can come up with testable hypotheses.”
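A lead/lag pattern of the kind Basu describes can be illustrated with a lagged correlation: compare one gene's trajectory with a later stretch of another's and see which delay lines them up best. The sketch below is a simplified stand-in, not the team's method. The gene names, the two-step delay, and the simulated trajectories are assumptions made for this example.

```python
import math
import random

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(leader, follower, lag):
    """Correlation between leader[t] and follower[t + lag], for lag >= 0."""
    if lag == 0:
        return correlation(leader, follower)
    return correlation(leader[:-lag], follower[lag:])

# Simulated trajectories: gene_b echoes gene_a two time steps later.
rng = random.Random(7)
n_steps = 500
gene_a = [rng.gauss(0.0, 1.0) for _ in range(n_steps)]
gene_b = [
    (0.9 * gene_a[t - 2] if t >= 2 else 0.0) + rng.gauss(0.0, 0.3)
    for t in range(n_steps)
]

# Scan a few candidate delays and keep the one with the strongest association.
lag_scores = {lag: lagged_correlation(gene_a, gene_b, lag) for lag in range(6)}
best_lag = max(lag_scores, key=lambda lag: abs(lag_scores[lag]))

for lag, score in lag_scores.items():
    print(lag, round(score, 2))
print("best lag:", best_lag)
```

With measurements taken every 30 minutes, a lag of two steps would correspond to a one-hour delay. A pairwise scan like this only suggests candidates; as Basu notes, any pattern it surfaces is a hypothesis to be tested experimentally.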
Mapping Genes Responsible for Metabolism
In another project, Basu and graduate student Kara J. Karpman, PhD ’20 Applied Mathematics, joined with researchers in the lab of Kivanc Birsoy at Rockefeller University to search for new genes that are responsible for different types of metabolism in the human body. To do that, they looked at the essentiality scores — the importance of a gene for core processes and cell viability — of thousands of genes in the human genome across many different cell lines, trying to map a network of gene connections.
“If two genes are moving together consistently across different cell lines, then it is reasonable to assume there is some connection between them,” says Basu. “The challenge here is reminiscent of the problem with financial markets: If you only analyze correlations between two genes, you will find many genes connected to each other and you will be lost in a maze. You won’t know which connections to go after first. The key lesson is that if two genes are connected, they may be connected directly or they may be connected through a third gene which you haven’t seen yet. The idea is to set up an algorithm where you look for gene association but account for what is going on with all the other genes. You will see many of the earlier associations start to disappear because they are actually spurious. Only a few will remain.”
Basu and his collaborators applied a statistical technique known as graphical modeling to search for such strong, direct connections among the genes. In a paper forthcoming in the journal Nature Metabolism, the researchers used their algorithm to discover an essential regulatory role for the gene C12orf49 in cholesterol and fatty acid metabolism.
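The difference between a direct connection and one that runs through a third gene can be shown with a partial correlation, the basic building block behind many graphical models. The sketch below is a three-variable illustration under stated assumptions, not the algorithm from the paper. The gene names and the simulated scores are hypothetical: gene_a and gene_b are each driven by gene_c and have no direct link to each other.

```python
import math
import random

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(a, b, c):
    """Correlation of a and b after accounting for c (first-order formula)."""
    r_ab = correlation(a, b)
    r_ac = correlation(a, c)
    r_bc = correlation(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Simulated scores across cell lines: gene_c drives both gene_a and gene_b.
rng = random.Random(1)
n_cell_lines = 2000
gene_c = [rng.gauss(0.0, 1.0) for _ in range(n_cell_lines)]
gene_a = [c + rng.gauss(0.0, 0.5) for c in gene_c]
gene_b = [c + rng.gauss(0.0, 0.5) for c in gene_c]

pairwise = correlation(gene_a, gene_b)
direct = partial_correlation(gene_a, gene_b, gene_c)

print("pairwise correlation:", round(pairwise, 2))
print("after accounting for gene_c:", round(direct, 2))
```

The pairwise correlation between gene_a and gene_b is strong, yet it shrinks toward zero once gene_c is accounted for, which is the sense in which an association can be spurious. With thousands of genes, the same idea has to be applied while conditioning on all other genes at once, which is where graphical modeling methods come in.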
Working with statistics is particularly enjoyable for Basu because it allows him to connect with colleagues across many disciplines. “My days are filled with interesting conversations with many amazing, smart people,” he says. “They’re thinking about such important and exciting questions in their fields. I learn a lot from them, and they’re wonderful company.”