
A New Methodology That Improves the Accuracy of Causal Relationship Estimation

Time series data, which are recorded over time, are used in fields such as weather forecasting, economics, and medicine. As wearable devices like smartwatches make it easy to collect health data in daily life, time series analysis is becoming increasingly important in medicine. The Center for Mathematical and Computational Sciences within the Institute for Basic Science has developed a new methodology that improves the accuracy of causal relationship estimation in time series data. The research is expected to offer a new paradigm for causal relationship studies. To convey its significance, the IBS researchers explain the new concepts, as well as past methodologies, in their own words.

What is causality?

The world is composed of numerous elements that interact with each other and give rise to various phenomena. Causality refers to a relationship in which one factor directly influences another, in other words, a cause-and-effect relationship. For instance, as the temperature rises, ice cream consumption tends to increase, which is a clear cause-and-effect relationship. As the temperature rises, crime rates also tend to increase. As a result, ice cream consumption and crime rates show a similar trend, yet there is no direct causal relationship between them. The temperature acts as a common cause of both factors, and that is what makes them move together.
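The common-cause effect described above can be illustrated with a small simulation. This is a minimal sketch on synthetic data: the variable names, coefficients, and noise levels are all assumptions chosen for illustration, not measurements. Temperature drives both quantities, so they are strongly correlated, but the correlation largely disappears once the effect of temperature is removed from each.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 2000

# Synthetic data (assumed numbers): temperature drives both quantities,
# but ice cream consumption and crime never influence each other.
temperature = rng.normal(20.0, 8.0, n_days)
ice_cream = 2.0 * temperature + rng.normal(0.0, 5.0, n_days)
crime = 1.5 * temperature + rng.normal(0.0, 5.0, n_days)

def residual_after(y, x):
    """Part of y that a straight-line fit on x cannot explain."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Raw correlation between the two effects of the common cause.
raw_corr = np.corrcoef(ice_cream, crime)[0, 1]

# Correlation after removing the contribution of temperature from both.
partial_corr = np.corrcoef(
    residual_after(ice_cream, temperature),
    residual_after(crime, temperature),
)[0, 1]

print(f"raw correlation:     {raw_corr:.2f}")
print(f"partial correlation: {partial_corr:.2f}")
```

In this setup the raw correlation is high while the partial correlation stays close to zero, which matches the point that a shared trend alone does not establish a direct causal link.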

As this example shows, working out how things in the world actually operate is challenging. Finding accurate causality can be the first step in unraveling the mechanism behind a specific phenomenon. That is why, in various fields, phenomena are meticulously recorded, transformed into data, and analyzed to estimate causality. Time-series data, which are recorded over time, are particularly valuable for estimating causality, not only in weather forecasting and economics but also in medicine. A classic example is identifying the direct cause of a heart attack from electrocardiogram (ECG) measurements of hospitalized patients. Recently, the importance of time-series analysis in medicine has grown further because wearable devices like smartwatches make it easy to collect health data in daily life.


[Figure 1] Causal Inference of Time-Series Data
When presented with time-series data from different sources, estimating whether there is a causal relationship among them is a crucial problem that has been extensively researched across various fields in both social and natural sciences.


Nobel Prize in Economics, Granger Causality Test

The Granger causality test is a prominent method for estimating causality in time-series data. It was proposed by Professor Clive GRANGER of UC San Diego, who was awarded the Nobel Prize in Economics in 2003. The main idea is straightforward. Suppose we have time-series data recording the daily average temperature and greenhouse gas concentrations on Earth. How accurately can we predict tomorrow's temperature using only past temperature data? Now suppose we predict tomorrow's temperature using both past temperature data and past greenhouse gas concentration data. Is the prediction more accurate than before? If there is a genuine causal relationship between greenhouse gases and Earth's temperature, using both sets of data should improve the prediction. If there is no causal relationship, adding greenhouse gas data should not improve it significantly. Testing whether the accuracy of a statistical model changes significantly when a piece of information is added or withheld is the core idea behind the Granger causality test. The test has been applied in various fields, including predicting future economic indicators, analyzing factors in diseases, and understanding the causes of global warming. Since its development, various information theory-based methods for estimating causality have also been introduced.
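The comparison between "own past only" and "own past plus the other series" can be sketched with ordinary least squares. This is a minimal illustration under stated assumptions: the `gas` and `temp` series are synthetic (generated so that `gas` influences `temp` but not the other way around), the helper names `lagged_design` and `granger_f_stat` are hypothetical, and the lag order of 2 is an arbitrary choice. A full test would also convert the F statistic into a p-value.

```python
import numpy as np

def lagged_design(series_list, n_lags, n_obs):
    """Intercept column plus lags 1..n_lags of every series in series_list."""
    cols = [np.ones(n_obs - n_lags)]
    for s in series_list:
        for lag in range(1, n_lags + 1):
            cols.append(s[n_lags - lag : n_obs - lag])
    return np.column_stack(cols)

def granger_f_stat(target, candidate, n_lags=2):
    """F statistic: does the past of `candidate` improve prediction of `target`?"""
    n_obs = len(target)
    y = target[n_lags:]
    x_restricted = lagged_design([target], n_lags, n_obs)       # own past only
    x_full = lagged_design([target, candidate], n_lags, n_obs)  # plus candidate
    rss = []
    for x in (x_restricted, x_full):
        coef, *_ = np.linalg.lstsq(x, y, rcond=None)
        rss.append(np.sum((y - x @ coef) ** 2))
    rss_restricted, rss_full = rss
    dof = len(y) - x_full.shape[1]
    return ((rss_restricted - rss_full) / n_lags) / (rss_full / dof)

# Synthetic example (assumed dynamics): gas drives temp, not the reverse.
rng = np.random.default_rng(0)
n = 500
gas = np.zeros(n)
temp = np.zeros(n)
for t in range(1, n):
    gas[t] = 0.5 * gas[t - 1] + rng.normal()
    temp[t] = 0.5 * temp[t - 1] + 0.8 * gas[t - 1] + rng.normal()

f_gas_to_temp = granger_f_stat(temp, gas)
f_temp_to_gas = granger_f_stat(gas, temp)
print(f"F (gas -> temp): {f_gas_to_temp:.1f}")
print(f"F (temp -> gas): {f_temp_to_gas:.1f}")
```

A large F statistic means that adding the candidate's past reduced the prediction error far more than chance would explain. In this synthetic setup the statistic for the gas-to-temperature direction is much larger than for the reverse direction.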


Inherent Issues in Causal Relationship Estimation Methods

However, the existing methods for estimating causal relationships have several inherent problems. First, many of them tend to infer causality incorrectly when time-series data exhibit similar periodic changes. For instance, both temperature and ocean salinity oscillate with roughly a daily cycle, but they are not directly related. Nevertheless, Granger causality tests often report a causal relationship between temperature and ocean salinity. Second, these methods struggle to distinguish between direct and indirect causal relationships. Consider grass as the food source for deer, and deer as the prey of tigers. If the amount of grass increases, the population of deer that feed on it may increase, which in turn may increase the population of tigers that prey on deer. The quantity of grass thus influences the tiger population indirectly, but there is no direct causal relationship between the two. Many existing causal relationship estimation methods nonetheless conclude that the quantity of grass directly affects the tiger population.


The Necessity of Mathematical Models and Their Limitations

To address the challenges mentioned above, namely distinguishing genuine causality from simultaneity and indirect effects, mathematical models can be effectively employed. Let's revisit the example of tigers and deer. As deer are the prey of tigers, the number of deer caught by tigers is proportional to both the tiger population ([Tiger]) and the deer population ([Deer]). Additionally, considering that deer reproduce at a rate proportional to their own population, the rate of change of the deer population ([Deer]’) can be expressed by the following equation:

[Deer]’=a×[Deer]–b×[Deer]×[Tiger]

Similarly, the tiger population changes more rapidly with higher numbers of deer and tigers. Accounting for the decrease in the tiger population due to deaths, the rate of change of the tiger population ([Tiger]’) is given by:

[Tiger]’=c×[Deer]×[Tiger]–d×[Tiger]

These equations, which depict the predator-prey relationship, are called the Lotka-Volterra equations. Now, let's assume we have time-series data for tigers and deer. By adjusting the parameters a, b, c, and d and checking whether the Lotka-Volterra equations adequately describe the time-series data, we can determine whether there is a predatory relationship between tigers and deer. In essence, this methodology allows us to assess causality between the tiger and deer populations. The advantage of such approaches is that, as long as the mathematical model is accurate, they do not confuse simultaneity or indirect effects with causality. In most cases, however, an accurate mathematical model is not known, and even when it is, estimating causality with it requires complex computations.
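The idea of adjusting a, b, c, and d until the equations describe the data can be sketched as follows. This is a minimal sketch, not the procedure used in the research: the "observed" data are generated by simulating the Lotka-Volterra equations with assumed parameter values, and the fit uses a simple least-squares shortcut that relies on [Deer]’/[Deer] = a – b×[Tiger] and [Tiger]’/[Tiger] = c×[Deer] – d being straight lines. The function names and all numbers are hypothetical.

```python
import numpy as np

def lotka_volterra(state, a, b, c, d):
    """Right-hand side of the two equations given in the text."""
    deer, tiger = state
    return np.array([a * deer - b * deer * tiger,
                     c * deer * tiger - d * tiger])

def simulate(initial, params, dt, n_steps):
    """Integrate the equations with the classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((n_steps + 1, 2))
    traj[0] = initial
    for i in range(n_steps):
        s = traj[i]
        k1 = lotka_volterra(s, *params)
        k2 = lotka_volterra(s + 0.5 * dt * k1, *params)
        k3 = lotka_volterra(s + 0.5 * dt * k2, *params)
        k4 = lotka_volterra(s + dt * k3, *params)
        traj[i + 1] = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
    return traj

def fit_parameters(traj, dt):
    """Estimate (a, b, c, d) from a trajectory using two straight-line fits."""
    deer, tiger = traj[:, 0], traj[:, 1]
    # Finite-difference derivatives; endpoints are dropped (less accurate).
    d_deer = np.gradient(deer, dt)[1:-1]
    d_tiger = np.gradient(tiger, dt)[1:-1]
    deer, tiger = deer[1:-1], tiger[1:-1]
    # [Deer]'/[Deer] = a - b*[Tiger]: slope is -b, intercept is a.
    slope_1, intercept_1 = np.polyfit(tiger, d_deer / deer, 1)
    # [Tiger]'/[Tiger] = c*[Deer] - d: slope is c, intercept is -d.
    slope_2, intercept_2 = np.polyfit(deer, d_tiger / tiger, 1)
    return np.array([intercept_1, -slope_1, slope_2, -intercept_2])

true_params = (1.0, 0.5, 0.2, 0.6)  # assumed values of a, b, c, d
dt = 0.01
data = simulate(initial=(4.0, 1.0), params=true_params, dt=dt, n_steps=3000)
estimated = fit_parameters(data, dt)
print("estimated a, b, c, d:", np.round(estimated, 3))
```

If the data really come from a predator-prey system, the two straight-line fits are nearly exact and the recovered parameters have the expected signs. A poor fit would suggest that the assumed model, and therefore the assumed causal structure, does not describe the data.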


A Novel Approach to Causal Inference: GOBI (General ODE-Based Inference)

To address the limitations of model-based methodologies, particularly when an accurate mathematical model is unknown, one can consider new approaches that are applicable even in the absence of such precise models. Let's return to the example of tigers and deer. Since tigers prey on deer, we can assume that the tiger population has a negative impact on the deer population. Conversely, the deer population has a positive impact on the tiger population. To estimate causality, it is sufficient to determine the presence or absence of positive or negative influences. Consider a point in time when the deer population is increasing. If the rate of change in the tiger population is also increasing at that point, we can speculate that deer positively influence tigers. In other words:

Δ[Deer] × Δ[Tiger]’

If this value consistently remains positive, it implies that deer are positively influencing tigers. Conversely, if this value is not always positive, then deer are not positively impacting tigers. A similar approach can be used to check if tigers negatively influence deer. Leveraging these patterns, a theoretical framework can be developed to ascertain if time-series data can be represented by a general form of a mathematical model. Based on this theory, a methodology named GOBI (General ODE-Based Inference) has been devised. GOBI allows for the estimation of causality from time-series data without the need for assumptions specific to a model or intricate computations.
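The sign check described above can be illustrated on synthetic data. This is a simplified sketch of the idea only, not the published GOBI algorithm: it assumes the rate of change of the tiger population depends on the deer population alone through an increasing function (here `tanh`, an arbitrary choice), and it compares every pair of time points by multiplying the change in the deer population by the change in [Tiger]’. The series `unrelated`, the function name `positive_score_fraction`, and all numbers are hypothetical.

```python
import numpy as np

def positive_score_fraction(cause, effect_rate):
    """Fraction of time-point pairs where the change in `cause` and the change
    in the rate of change of the effect have the same sign."""
    d_cause = cause[:, None] - cause[None, :]
    d_rate = effect_rate[:, None] - effect_rate[None, :]
    upper = np.triu_indices(len(cause), k=1)  # count each pair of times once
    return np.mean((d_cause * d_rate)[upper] > 0)

dt = 0.01
t = np.arange(0.0, 40.0, dt)

# Synthetic series (assumed shapes): `deer` drives the tigers, `unrelated` does not.
deer = 3.0 + np.sin(t) + 0.5 * np.sin(2.3 * t)
unrelated = 3.0 + np.cos(1.7 * t)

# Assumed regulation: [Tiger]' increases with [Deer].
true_rate = np.tanh(deer - 3.0)
tiger = np.concatenate(
    ([5.0], 5.0 + np.cumsum(0.5 * (true_rate[1:] + true_rate[:-1]) * dt))
)

# Only the time series are used from here on: estimate [Tiger]' numerically.
tiger_rate = np.gradient(tiger, dt)

# Drop the endpoints and subsample to keep the number of pairs manageable.
idx = slice(1, -1, 10)
frac_deer = positive_score_fraction(deer[idx], tiger_rate[idx])
frac_unrelated = positive_score_fraction(unrelated[idx], tiger_rate[idx])
print(f"deer -> tiger:      {frac_deer:.3f}")
print(f"unrelated -> tiger: {frac_unrelated:.3f}")
```

For the true driver the product is positive for essentially every pair of time points, while for the unrelated series the signs are mixed. When the rate of change depends on several variables at once, as in the Lotka-Volterra equations, this one-variable check is not sufficient on its own and a more general treatment is needed.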

GOBI Enables More Accurate Causal Inference

The GOBI methodology has been applied to diverse datasets, ranging from interactions among molecules within cells and ecological networks to meteorological systems, and it outperformed traditional causal inference methodologies. For instance, among many air pollutants, GOBI identified nitrogen dioxide and respirable particulate matter as the ones that actually affect cardiovascular diseases. Importantly, unlike conventional causal inference methodologies, GOBI successfully infers causality even in time-series data with simultaneity and indirect effects.

[Figure 2] Comparison of Causal Inference Results between Conventional and Proposed Methodologies
(a) The time-series data represents a system combining unrelated predator-prey systems (P and D) and an intracellular protein interaction system (σ28 and TetR). Conventional methodologies such as Granger causality tests (GC) tend to incorrectly infer causality among almost all targets if there is simultaneity in time-series data. In contrast, GOBI accurately estimates only the real causal relationships.
(b) Time-series data representing the number of cardiovascular disease patients in Hong Kong and the concentration of air pollutants. Unlike other methodologies, GOBI correctly estimates that only nitrogen dioxide (NO2) and respirable particulate matter (Rspar) have an impact on cardiovascular diseases, regardless of the length of the time-series data used (2 or 3 years).


Seho Park, graduate student, Department of Mathematics, University of Wisconsin-Madison


Last Update 2023-11-28 14:20