Interpretable Transformer Hawkes Processes: Unveiling Complex Interactions in Social Networks (2024)

Zizhuo Meng (University of Technology Sydney, Sydney, Australia, Zizhuo.Meng@student.uts.edu.au), Ke Wan (University of Illinois Urbana-Champaign, Illinois, United States, kewan2@illinois.edu), Yadong Huang (Beijing Academy of Blockchain and Edge Computing, Beijing, China, huang-yd14@tsinghua.org.cn), Zhidong Li (University of Technology Sydney, Sydney, Australia, Zhidong.Li@uts.edu.au), Yang Wang (University of Technology Sydney, Sydney, Australia, Yang.Wang@uts.edu.au), and Feng Zhou§ (Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China, feng.zhou@ruc.edu.cn)


Abstract.

Social networks represent complex ecosystems where the interactions between users or groups play a pivotal role in information dissemination, opinion formation, and social interactions. Effectively harnessing event sequence data within social networks to unearth interactions among users or groups has persistently posed a challenging frontier within the realm of point processes. Current deep point process models face inherent limitations within the context of social networks, constraining both their interpretability and expressive power. These models encounter challenges in capturing interactions among users or groups and often rely on parameterized extrapolation methods when modeling intensity over non-event intervals, limiting their capacity to capture intricate intensity patterns, particularly beyond observed events. To address these challenges, this study proposes modifications to Transformer Hawkes processes (THP), leading to the development of interpretable Transformer Hawkes processes (ITHP). ITHP inherits the strengths of THP while aligning with statistical nonlinear Hawkes processes, thereby enhancing its interpretability and providing valuable insights into interactions between users or groups. Additionally, ITHP enhances the flexibility of the intensity function over non-event intervals, making it better suited to capture complex event propagation patterns in social networks. Experimental results, on both synthetic and real data, demonstrate the effectiveness of ITHP in overcoming the identified limitations. Moreover, they highlight ITHP's applicability in the context of exploring the complex impact of users or groups within social networks.

copyright: acmlicensed; journal year: 2024; conference: ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 25–29, 2024; Barcelona, Spain. * Equal contributions. § Corresponding author.

1. Introduction

Event sequences are pervasive in social networks (Zhang et al., 2022a; Kong et al., 2023), including platforms such as Stack Overflow, Amazon, and Taobao. Understanding and mining these event sequences to uncover interactions between different users or groups within social networks is a critical research topic (Zipkin et al., 2016; Farajtabar et al., 2015). This analysis can help identify influential users, user groups, and trending topics, offering practical insights for platform optimization and user engagement strategies (Zhao et al., 2015; Zhou et al., 2013). For instance, consider the Stack Overflow platform, where developers ask and answer questions related to programming. Event sequences in this context could consist of events such as question postings, answers, comments, and votes. Analyzing this data can reveal insights into user interactions.

Temporal point processes (TPP) (Daley and Vere-Jones, 2003) play a fundamental role in modeling event sequences. The Poisson process (Daley and Vere-Jones, 2007), a basic temporal point process, assumes that events occur uniformly and independently over time. The Hawkes process (Hawkes, 1971) is an extension of the Poisson process that allows for event dependencies. While these models have been useful in many scenarios, they may not always capture the complexities present in real-world event sequences, which often exhibit more intricate dependencies and interactions. Therefore, more sophisticated and flexible models are needed.

With the advancement of deep learning, deep architectures have demonstrated remarkable performance in modeling sequence data. For example, models utilizing either vanilla RNNs (Du et al., 2016) or long short-term memory (LSTM) networks (Mei and Eisner, 2017) have exhibited improved likelihood fitting and event prediction compared to earlier parameterized models. Moreover, models relying on transformer architectures or self-attention mechanisms (Zuo et al., 2020; Zhang et al., 2020) have shown even better performance. These deep learning approaches have opened up new possibilities for effectively capturing intricate patterns within event sequences, enhancing the overall predictive accuracy and efficiency in various applications.

However, current deep point process models still have some inherent limitations, which restrict their interpretability and expressive power. First, such models are unable to explicitly capture the interactions between different event types. Deep point process models often model interactions between event types implicitly, which may hinder their interpretability due to the lack of explicit representation for these interactions. Understanding the interactions between different event types is crucial in social networks. For example, on Amazon, event sequences encompass a wide range of user activities, which can be considered events of various types, including product searches, purchases, reviews, and recommendations. Analyzing the interactions among these types can yield valuable insights into user-level and product-level interactions, providing Amazon with strategic advantages. Second, most existing deep point process models only perform encoding at the positions where events have occurred. For non-event intervals, intensity functions are modeled using parameterized extrapolation methods; see, for example, Eq. (11) in Du et al. (2016), Eq. (7) in Mei and Eisner (2017), and Eq. (6) in Zuo et al. (2020). This approach introduces a parameterized assumption, which restricts the model's expressive power.

To address the aforementioned issues, we propose a novel interpretable TPP model based on Transformer Hawkes processes (THP). The proposed model aligns THP perfectly with the statistical nonlinear Hawkes processes, greatly enhancing interpretability. Thus, we refer to this enhanced model as interpretable Transformer Hawkes processes (ITHP). In ITHP, the attention mechanism's product of the historical event's key and the subsequent event's query corresponds precisely to a time-varying trigger kernel in the statistical nonlinear Hawkes processes. By establishing a clear correspondence with statistical Hawkes processes, ITHP offers valuable insights into the interactions between different event types. This advancement is significant for enhancing the interpretability of THP in social network applications. Meanwhile, for the intensity function over non-event intervals, we do not adopt a simple parameterized extrapolation method. Instead, we utilize a "fully attention-based mechanism" to express the conditional intensity function at any position. This improvement increases the flexibility of the intensity function over non-event intervals, consequently elevating the model's expressive power. Specifically, our contributions are as follows:

  • ITHP explicitly captures interactions between event types, providing insights into interactions and improving model interpretability;

  • ITHP's fully attention-based mechanism for the conditional intensity function over non-event intervals enhances model flexibility, allowing it to capture complex intensity patterns beyond the observed events;

  • ITHP is validated with synthetic and real social network data, demonstrating its superior ability to interpret event interactions and outperform alternatives in expressiveness.

2. Related Work

Enhancing the expressive power of point process models has long been a challenging endeavor. Currently, mainstream approaches fall into two categories. The first approach entails the utilization of statistical non-parametric methods to augment their expressive capacity. For instance, methodologies grounded in both frequentist and Bayesian nonparametric paradigms are employed to model the intensity function of point processes (Lewis and Mohler, 2011; Zhou et al., 2013; Lloyd et al., 2015; Donner and Opper, 2018; Zhou et al., 2020a, 2021; Pan et al., 2021). The second significant category is deep point process models. These models harness the capabilities of deep learning architectures to infer the intensity function from data, including RNNs (Du et al., 2016), LSTM (Mei and Eisner, 2017; Xiao et al., 2017b), Transformers (Zuo et al., 2020; Zhang et al., 2020, 2022b), normalizing flow (Shchur et al., 2020b), adversarial learning (Xiao et al., 2017a; Noorbakhsh and Rodriguez, 2022), reinforcement learning (Upadhyay et al., 2018), deep kernel (Okawa et al., 2019; Zhu et al., 2021; Dong et al., 2022), and intensity-free frameworks (Shchur et al., 2020a). These architectural choices empower the modeling of temporal dynamics within event sequences and unveil the underlying patterns. However, in contrast to statistical point process models, the enhanced expressive power of deep point process models comes at the cost of losing interpretability, rendering deep point process models akin to "black-box" constructs. To the best of our knowledge, there has been limited exploration into explicitly capturing interactions between event types and enhancing the interpretability of deep point process models (Zhou and Yu, 2023; Wei et al., 2023). This paper introduces an innovative attention-based ITHP model, whose intensity function aligns seamlessly with statistical nonlinear Hawkes processes, substantially enhancing interpretability. Our work serves as a catalyst for advancing the interpretability of deep point process models, greatly promoting their utility in uncovering interactions between different users or groups within social networks.

3. Preliminary Knowledge

In this section, we provide background knowledge on the relevant key concepts.

3.1. Hawkes Process

The multivariate Hawkes process (Hawkes, 1971) is a widely used temporal point process model for capturing interactions among multiple event types. The key feature of the multivariate Hawkes process lies in its conditional intensity function. The conditional intensity function $\lambda_k(t|\mathcal{H}_t)$ for event type $k$ at time $t$ is defined as the instantaneous event rate conditioned on the historical information $\mathcal{H}_t = \{(t_i, k_i) \mid t_i < t\}$:

$$\lambda_k(t|\mathcal{H}_t) = \mu_k + \sum_{t_i < t} \phi_{k,k_i}(t - t_i),$$

where $\mu_k$ is the base rate for event type $k$ and $\phi_{k,k_i}(t - t_i)$ is the trigger kernel representing the excitation effect from event $t_i$ with type $k_i$ to $t$ with type $k$. It expresses the expected number of occurrences of event type $k$ at time $t$ given the past history of events.
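As an illustration, the following is a minimal sketch (in Python/NumPy, with hypothetical parameter values) of how this conditional intensity could be evaluated for a given history, assuming exponential trigger kernels $\phi_{k,k_i}(\tau) = \alpha_{k,k_i} e^{-\beta \tau}$:

```python
import numpy as np

# Hypothetical 2-type example: exponential trigger kernels
# phi_{k, k_i}(tau) = alpha[k, k_i] * exp(-beta * tau)
mu = np.array([0.2, 0.2])                    # base rates mu_k
alpha = np.array([[3.0, 2.0], [1.0, 3.0]])   # excitation alpha[target, source]
beta = 5.0

def hawkes_intensity(t, history, k):
    """Evaluate lambda_k(t | H_t) for a linear multivariate Hawkes process.

    history: list of (t_i, k_i) pairs with t_i < t.
    """
    rate = mu[k]
    for t_i, k_i in history:
        if t_i < t:
            rate += alpha[k, k_i] * np.exp(-beta * (t - t_i))
    return rate

history = [(0.5, 0), (1.2, 1), (1.9, 0)]
print(hawkes_intensity(2.5, history, k=0))
```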

The interpretability of the Hawkes process stems from its explicit representation of event dependencies through the trigger kernel. The model allows us to quantify the impact of past events with different event types on the occurrence of a specific event, providing insights into the interactions between event types. As a result, the multivariate Hawkes process serves as a powerful tool in social network applications where understanding the interactions between event types (users or groups) is of utmost importance.

3.2. Nonlinear Hawkes Process

In contrast to the original Hawkes process, which assumes only non-negative trigger kernels (excitatory interactions) between events to avoid generating negative intensities, the nonlinear Hawkes process (Brémaud and Massoulié, 1996) offers a more flexible modeling framework by incorporating both excitatory and inhibitory effects among events. In the nonlinear Hawkes process, the conditional intensity function for event type $k$ at time $t$ is defined as:

$$\lambda_k(t|\mathcal{H}_t) = \sigma\left(\mu_k + \sum_{t_i < t} \phi_{k,k_i}(t - t_i)\right),$$

where $\sigma(\cdot)$ is a nonlinear mapping from $\mathbb{R}$ to $\mathbb{R}^{+}$, ensuring the non-negativity of the intensity. Hence, the trigger kernel can be positive (excitatory) or negative (inhibitory), thus enabling the modeling of complex interactions between different event types.
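Under the same illustrative setup as in the previous sketch, the nonlinear variant only wraps the linear activation in a link function; a common choice (and the one used later in this paper) is the softplus:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))  # maps R to R^+

def nonlinear_hawkes_intensity(t, history, k):
    # Reuses mu, alpha, beta from the previous sketch; alpha entries may now
    # be negative (inhibitory) since the link function guarantees a
    # non-negative intensity.
    activation = mu[k] + sum(alpha[k, k_i] * np.exp(-beta * (t - t_i))
                             for t_i, k_i in history if t_i < t)
    return softplus(activation)
```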

In the aforementioned models, the trigger kernel depends solely on the relative time $t - t_i$, implying that the trigger kernel is shift-invariant. However, in dynamic Hawkes process models (Zhou et al., 2020b, a; Bhaduri et al., 2021; Zhou et al., 2022), the trigger kernel is further extended to vary with absolute time, denoted as $\phi(t - t_i, t_i)$. By incorporating the absolute time, the trigger kernel becomes capable of capturing time-varying patterns, offering the model more degrees of freedom in its representation.

3.3. Transformer Hawkes Process

Our work is built upon THP (Zuo et al., 2020), so we concisely introduce the framework of THP here. Given a sequence $\mathcal{S} = \{(t_i, k_i)\}_{i=1}^{L}$ where each event is characterized by a timestamp $t_i$ and an event type $k_i$, THP leverages two types of embeddings, namely temporal embedding and event type embedding, to represent these two kinds of information. To encode event timestamps, THP represents each timestamp $t_i$ using an embedding vector $\mathbf{z}(t_i) \in \mathbb{R}^{M}$:

$$z_j(t_i) = \begin{cases} \cos\left(t_i / 10000^{\frac{j-1}{M}}\right) & \text{if } j \text{ is odd}, \\ \sin\left(t_i / 10000^{\frac{j}{M}}\right) & \text{if } j \text{ is even}, \end{cases}$$

where $z_j(t_i)$ is the $j$-th entry of $\mathbf{z}(t_i)$ and $j = 0, \ldots, M-1$. The collection of time embeddings is represented as $\mathbf{Z} = [\mathbf{z}(t_1), \ldots, \mathbf{z}(t_L)]^{\top} \in \mathbb{R}^{L \times M}$. For encoding event types, the model utilizes a learnable matrix $\mathbf{U} \in \mathbb{R}^{M \times K}$, where $K$ is the number of event types. For each event type $k_i$, its embedding $\mathbf{e}(k_i)$ is computed as:

$$\mathbf{e}(k_i) = \mathbf{U}\mathbf{y}_i \in \mathbb{R}^{M},$$

where $\mathbf{y}_i$ is the one-hot encoding of the event type $k_i$. The collection of type embeddings is $\mathbf{E} = [\mathbf{e}(k_1), \ldots, \mathbf{e}(k_L)]^{\top} \in \mathbb{R}^{L \times M}$. The final embedding is the summation of the temporal and event type embeddings:

$$\mathbf{X} = \mathbf{Z} + \mathbf{E} \in \mathbb{R}^{L \times M}, \tag{1}$$

where each row of $\mathbf{X}$ represents the complete embedding of a single event in the sequence $\mathcal{S}$.
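A minimal PyTorch-style sketch of these two embeddings and their summation (the module name `THPEmbedding` and its structure are our assumptions; dimension names follow the paper):

```python
import torch
import torch.nn as nn

class THPEmbedding(nn.Module):
    def __init__(self, num_types: int, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.type_emb = nn.Embedding(num_types, d_model)  # learnable U

    def temporal_encoding(self, t: torch.Tensor) -> torch.Tensor:
        # t: (L,) event timestamps -> (L, M) sinusoidal embedding
        j = torch.arange(self.d_model, dtype=torch.float32)
        freq = 1.0 / (10000.0 ** (2 * (j // 2) / self.d_model))
        angles = t.unsqueeze(-1) * freq             # (L, M)
        z = torch.zeros_like(angles)
        z[:, 0::2] = torch.sin(angles[:, 0::2])     # even entries: sin
        z[:, 1::2] = torch.cos(angles[:, 1::2])     # odd entries: cos
        return z

    def forward(self, t: torch.Tensor, k: torch.LongTensor) -> torch.Tensor:
        Z = self.temporal_encoding(t)               # (L, M)
        E = self.type_emb(k)                        # (L, M)
        return Z + E                                # Eq. (1): summation in THP
```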

After embedding, the model focuses on learning the dependence among events using a self-attention mechanism. The attention output $\mathbf{S}$ is computed as:

$$\mathbf{S} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{M_K}}\right)\mathbf{V} \in \mathbb{R}^{L \times M_V},$$
$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{Q} \in \mathbb{R}^{L \times M_K}, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^{K} \in \mathbb{R}^{L \times M_K}, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^{V} \in \mathbb{R}^{L \times M_V},$$

where $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ are the query, key and value matrices. Matrices $\mathbf{W}^{Q} \in \mathbb{R}^{M \times M_K}$, $\mathbf{W}^{K} \in \mathbb{R}^{M \times M_K}$ and $\mathbf{W}^{V} \in \mathbb{R}^{M \times M_V}$ are the learnable parameters. To preserve causality and prevent future events from influencing past events, we mask out the entries in the upper triangular region of $\mathbf{Q}\mathbf{K}^{\top}$.
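A sketch of the causally masked self-attention step, assuming a single attention head as described above (function name and shapes are our assumptions):

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(X, W_Q, W_K, W_V):
    """X: (L, M); W_Q, W_K: (M, M_K); W_V: (M, M_V)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / math.sqrt(K.shape[-1])                  # (L, L)
    L = scores.shape[0]
    # Mask strictly upper-triangular entries so no event attends to the future.
    causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ V                       # (L, M_V)
```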

Finally, the attention output $\mathbf{S}$ is passed through a two-layer MLP to produce the hidden state $\mathbf{H}$:

$$\mathbf{H} = \mathrm{ReLU}(\mathbf{S}\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2 \in \mathbb{R}^{L \times M},$$

where $\mathbf{W}_1 \in \mathbb{R}^{M \times M_H}$, $\mathbf{W}_2 \in \mathbb{R}^{M_H \times M}$, $\mathbf{b}_1 \in \mathbb{R}^{M_H}$ and $\mathbf{b}_2 \in \mathbb{R}^{M}$ are the learnable parameters. The $k$-type conditional intensity function of THP is designed as:

$$\lambda_k(t|\mathcal{H}_t) = \mathrm{softplus}\left(\alpha_k \frac{t - t_i}{t_i} + \mathbf{w}_k^{\top}\mathbf{h}(t_i) + b_k\right), \tag{2}$$

where $t_i$ is the last event before $t$, $\alpha_k, \mathbf{w}_k, b_k$ are learnable parameters, the nonlinear function is chosen to be the softplus to ensure that the intensity is non-negative, and $\mathbf{h}(t_i)$ is the transpose of the $i$-th row of $\mathbf{H}$, expressing the historical impact on event $t_i$.
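The following sketch makes the extrapolation explicit: between events, the THP intensity at time $t$ is obtained from the hidden state of the last event $t_i$ plus a term that is linear in the elapsed time (parameter names are hypothetical; this is a reading of Eq. (2), not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def thp_intensity(t, t_i, h_i, alpha_k, w_k, b_k):
    """THP-style intensity for type k at time t > t_i (Eq. 2).

    h_i: hidden state h(t_i) of the most recent event, shape (M,).
    The alpha_k * (t - t_i) / t_i term is the parameterized extrapolation
    used over the non-event interval (t_i, t].
    """
    activation = alpha_k * (t - t_i) / t_i + w_k @ h_i + b_k
    return F.softplus(activation)
```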

4. Interpretable Transformer Hawkes Processes

As mentioned earlier, THP has two prominent limitations: (1) THP models the dependency between events implicitly, which hinders the explicit representation of interactions between different event types and makes it challenging to understand the interactions among event types. (2) Like many other deep point process models, THP applies attention encoding only to the event occurrence positions, while using parameterized extrapolation methods to model the intensity on non-event intervals (the extrapolation term $\alpha_k(t - t_i)/t_i$ in Eq. 2). This approach introduces a parameterized assumption restricting the model's expressive power.

To enhance the model's interpretability and expressiveness, our work introduces modifications to the THP model. Specifically, we modify (1) the event embedding, (2) the attention module and (3) the conditional intensity function in THP. Interestingly, the modified THP corresponds perfectly to the statistical nonlinear Hawkes processes. This leads to significantly improved interpretability and a better characterization of the interactions between event types. Additionally, the new design of the conditional intensity function avoids the restrictions imposed by parameterized extrapolation, enabling the model to effectively capture complex intensity patterns beyond the observed events.

In the following sections, we outline the step-by-step process of modifying THP to achieve the aforementioned goals. For each modification, we provide theoretical proofs to demonstrate the rationality and validity of the respective changes.

4.1. Modified Event Embedding

In ITHP, we maintain the same temporal embedding and event type embedding methods as in THP. However, our modification lies in replacing the summation operation in Eq. 1 with concatenation:

$$\mathbf{X} = \mathbf{Z} + \mathbf{E} \in \mathbb{R}^{L \times M} \;\Rightarrow\; \mathbf{X} = [\mathbf{Z}, \mathbf{E}] \in \mathbb{R}^{L \times 2M}. \tag{3}$$

The reason for this modification is that the original summation operation introduces a similarity between timestamps and event types (or vice versa) of the preceding and succeeding events. However, in statistical Hawkes processes, known for their interpretability, the interaction between two events is the magnitude of a kernel determined by the similarity (correlation) between their types and the similarity (distance) between their timestamps. No cross-similarity is introduced. To maintain a similar level of interpretability, we replace summation with concatenation here.

Theorem 4.1.

In Eq. 3, the concatenation operation enables us to explicitly capture the desired temporal and event type similarities, while simultaneously avoiding any cross-similarities between timestamps and event types.

Proof.

Suppose we define $\mathbf{X}$ using concatenation; in the subsequent attention module computation, it is necessary to calculate the product $\mathbf{X}\mathbf{X}^{\top}$ to measure the similarity between different data points. The similarity between the $i$-th point and the $j$-th point can be expressed as follows:

$$\mathbf{X}_i \mathbf{X}_j^{\top} = [\mathbf{Z}_i, \mathbf{E}_i][\mathbf{Z}_j, \mathbf{E}_j]^{\top} = \mathbf{Z}_i \mathbf{Z}_j^{\top} + \mathbf{E}_i \mathbf{E}_j^{\top}.$$

Instead, if we define $\mathbf{X}$ using addition, $\mathbf{X}_i \mathbf{X}_j^{\top}$ is as follows:

$$(\mathbf{Z}_i + \mathbf{E}_i)(\mathbf{Z}_j + \mathbf{E}_j)^{\top} = \mathbf{Z}_i \mathbf{Z}_j^{\top} + \mathbf{E}_i \mathbf{E}_j^{\top} + \mathbf{Z}_i \mathbf{E}_j^{\top} + \mathbf{E}_i \mathbf{Z}_j^{\top}.$$

It is evident that by defining $\mathbf{X}$ through concatenation, temporal and event type similarities are captured separately. Otherwise, the cross-similarities emerge. ∎
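A quick numerical check of this decomposition (purely illustrative, with random embeddings standing in for $\mathbf{Z}$ and $\mathbf{E}$):

```python
import torch

L, M = 5, 8
Z, E = torch.randn(L, M), torch.randn(L, M)

X_cat = torch.cat([Z, E], dim=-1)        # Eq. (3): concatenation
X_sum = Z + E                            # original THP: summation

# Concatenation: similarities decompose into temporal + type parts only.
assert torch.allclose(X_cat @ X_cat.T, Z @ Z.T + E @ E.T, atol=1e-5)

# Summation: two extra cross-similarity terms appear.
assert torch.allclose(X_sum @ X_sum.T,
                      Z @ Z.T + E @ E.T + Z @ E.T + E @ Z.T, atol=1e-5)
```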

4.2. Modified Attention Module

In ITHP, we still use self-attention to capture the influences of historical events on subsequent events. However, unlike THP, the modified attention module drops the query and key projections and uses the embedding itself as the query and key matrices:

$$\mathbf{S} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{M_K}}\right)\mathbf{V} \in \mathbb{R}^{L \times M_V},$$
$$\mathbf{Q} = \mathbf{X}\mathbf{W}^{Q} \in \mathbb{R}^{L \times M_K} \;\Rightarrow\; \mathbf{Q} = \mathbf{X} \in \mathbb{R}^{L \times 2M},$$
$$\mathbf{K} = \mathbf{X}\mathbf{W}^{K} \in \mathbb{R}^{L \times M_K} \;\Rightarrow\; \mathbf{K} = \mathbf{X} \in \mathbb{R}^{L \times 2M},$$
$$\mathbf{V} = \mathbf{X}\mathbf{W}^{V} \in \mathbb{R}^{L \times M_V}, \tag{4}$$

where the $i$-th row of $\mathbf{S}$, $\mathbf{S}_i$, represents the historical influence on the $i$-th event. The calculation of $\mathbf{S}_i$ can be explicitly expressed as the summation over all events preceding event $i$, where the attention weights are normalized by the softmax:

$$\mathbf{S}_i = \sum_{j<i} \underset{j<i}{\mathrm{softmax}}\left(\frac{\mathbf{X}_i \mathbf{X}_j^{\top}}{\sqrt{2M}}\right)\mathbf{V}_j \in \mathbb{R}^{M_V}. \tag{5}$$

The reason for this modification is that after removing $\mathbf{W}^{Q}$ and $\mathbf{W}^{K}$, the attention weights can be simply represented as $\mathbf{X}\mathbf{X}^{\top}$. Compared to the original $\mathbf{Q}\mathbf{K}^{\top}$, $\mathbf{X}\mathbf{X}^{\top}$ has a clearer physical meaning: the entry in its $i$-th row and $j$-th column can be expressed as a shift-invariant function $g_{k_i,k_j}(t_i - t_j)$. This representation allows for a more meaningful interpretation of the relationship between events. In contrast, $\mathbf{Q}\mathbf{K}^{\top}$ does not achieve this clarity.
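A sketch of the modified attention step under these assumptions (Q = K = X with concatenated embeddings; function name and masking details are our choices):

```python
import math
import torch
import torch.nn.functional as F

def ithp_attention(X, W_V):
    """X: (L, 2M) concatenated embeddings; W_V: (2M, M_V).

    Attention weights come from X X^T (no W_Q / W_K), masked so that
    event i only attends to strictly earlier events j < i, as in Eq. (5).
    """
    L, twoM = X.shape
    scores = X @ X.T / math.sqrt(twoM)                          # (L, L)
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=0)
    scores = scores.masked_fill(mask, float("-inf"))            # keep only j < i
    weights = F.softmax(scores, dim=-1)
    weights = torch.nan_to_num(weights)   # first event has no history -> zero row
    return weights @ (X @ W_V)                                  # (L, M_V)
```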

Theorem 4.2.

Assuming that $\mathbf{X}$ is obtained through the concatenation operation in Eq. 3, after omitting $\mathbf{W}^{Q}$ and $\mathbf{W}^{K}$ in Eq. 4, the entry at the $i$-th row and $j$-th column of $\mathbf{X}\mathbf{X}^{\top}$ can be expressed as a shift-invariant function $g_{k_i,k_j}(t_i - t_j)$, where $t_j < t_i$.

Proof.

When we use Eq. 3 to obtain $\mathbf{X}$ and remove $\mathbf{W}^{Q}$ and $\mathbf{W}^{K}$ in Eq. 4, the similarity between the $i$-th and $j$-th points, denoted as $\mathbf{X}_i \mathbf{X}_j^{\top}$, is expressed as $\mathbf{X}_i \mathbf{X}_j^{\top} = \mathbf{Z}_i \mathbf{Z}_j^{\top} + \mathbf{E}_i \mathbf{E}_j^{\top}$. If we assume that $M$ is even, $\mathbf{Z}_i \mathbf{Z}_j^{\top}$ can be further represented as:

$$\begin{aligned}
\mathbf{Z}_i \mathbf{Z}_j^{\top} &= \sum_{m=1}^{\frac{M}{2}+1} \cos(t_i \omega_m)\cos(t_j \omega_m) + \sin(t_i \omega_m)\sin(t_j \omega_m) \\
&= \sum_{m=1}^{\frac{M}{2}+1} \cos\left((t_i - t_j)\omega_m\right),
\end{aligned}$$

where $\omega_m = 1/10000^{2(m-1)/M}$. It is clear that $\mathbf{X}_i \mathbf{X}_j^{\top}$ can be expressed as a shift-invariant function $g_{k_i,k_j}(t_i - t_j)$, with $t_i - t_j$ originating from $\mathbf{Z}_i \mathbf{Z}_j^{\top}$ and the subscripts $k_i, k_j$ arising from $\mathbf{E}_i \mathbf{E}_j^{\top}$. In contrast, retaining $\mathbf{W}^{Q}$ and $\mathbf{W}^{K}$ leads to the following expression:

$$\mathbf{Q}_i \mathbf{K}_j^{\top} = [\mathbf{Z}_i, \mathbf{E}_i]\,\mathbf{W}^{Q}{\mathbf{W}^{K}}^{\top}[\mathbf{Z}_j, \mathbf{E}_j]^{\top},$$

where the introduction of $\mathbf{W}^{Q}{\mathbf{W}^{K}}^{\top}$ can once again introduce undesired cross-similarities and render the temporal similarity term $\mathbf{Z}_i [\mathbf{W}^{Q}{\mathbf{W}^{K}}^{\top}]_{\mathbf{Z}_i\mathbf{Z}_j} \mathbf{Z}_j^{\top}$ unable to be expressed in a shift-invariant form. ∎

Corollary 4.3.

Given Theorem 4.2, we can further simplify Eq. 5 as follows:

$$\mathbf{S}_i = \sum_{j<i} \mathbf{g}^{\top}_{k_i,k_j}(t_i - t_j, t_j) \in \mathbb{R}^{M_V}, \tag{6}$$

where $\mathbf{g}$ is an $M_V$-dimensional vector function.

Proof.

According to Theorem 4.2, $\mathbf{X}_i \mathbf{X}_j^{\top}$ can be expressed as a shift-invariant function $g_{k_i,k_j}(t_i - t_j)$. After normalization through the softmax in Eq. 5, we obtain $\tilde{g}_{k_i,k_j}(t_i - t_j)$ satisfying $\sum_{j<i} \tilde{g}_{k_i,k_j}(t_i - t_j) = 1$. When multiplied by $\mathbf{V}_j$, $\tilde{g}_{k_i,k_j}(t_i - t_j)\mathbf{V}_j$ yields a vectorized, time-varying and non-normalized function $\mathbf{g}^{\top}_{k_i,k_j}(t_i - t_j, t_j)$, where the additional $t_j$ stems from the introduction of $\mathbf{V}_j$. ∎

4.3. Modified Conditional Intensity Function

The form of Eq. 6 naturally reminds us of the trigger kernel summation in statistical Hawkes processes. The only difference is that $\mathbf{g}$ in Eq. 6 is a vector, whereas the trigger kernel in statistical Hawkes processes is a scalar function. Taking inspiration from this, we propose a more interpretable conditional intensity function:

$$\begin{aligned}
\lambda_k(t|\mathcal{H}_t) &= \mathrm{softplus}\left(\sum_{t_i<t} \mathbf{g}^{\top}_{k,k_i}(t - t_i, t_i)\,\mathbf{w}_k + b_k\right) \\
&= \mathrm{softplus}\left(\sum_{t_i<t} \underset{t_i<t}{\mathrm{softmax}}\left(\frac{\mathbf{X}_t \mathbf{X}_i^{\top}}{\sqrt{2M}}\right)\mathbf{V}_i \mathbf{w}_k + b_k\right),
\end{aligned} \tag{7}$$

where $\mathbf{w}_k$ is a learnable parameter used to aggregate the vector $\mathbf{g}$ into a scalar value and $b_k$ is a learnable bias term. The newly designed conditional intensity aligns perfectly with the nonlinear Hawkes processes with a time-varying trigger kernel. The bias term $b_k$ in Eq. 7 corresponds to the base rate $\mu_k$ in Section 3.2, the term $\mathbf{g}^{\top}_{k,k_i}(t - t_i, t_i)\mathbf{w}_k$ in Eq. 7 corresponds to the time-varying trigger kernel $\phi_{k,k_i}(t - t_i, t_i)$ in Section 3.2, and the softplus function serves as a nonlinear mapping ensuring the non-negativity of the intensity. This leads to improved interpretability, as the trigger kernel can be explicitly expressed in our design, in contrast to the original THP.
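The following is a sketch of how Eq. (7) could be evaluated at an arbitrary query time $t$ (not necessarily an event position). The reading that the query embedding is $\mathbf{X}_t = [\mathbf{z}(t), \mathbf{e}(k)]$, as well as the helper names `time_emb` and `type_emb`, are our assumptions:

```python
import math
import torch
import torch.nn.functional as F

def ithp_intensity(t, k, event_times, event_types, time_emb, type_emb, W_V, w_k, b_k):
    """Sketch of lambda_k(t | H_t) from Eq. (7) at an arbitrary time t.

    time_emb(times) -> (N, M) sinusoidal embeddings; type_emb: (K, M) rows e(k).
    W_V: (2M, M_V); w_k: (M_V,); b_k: scalar.
    """
    past = event_times < t
    b = torch.as_tensor(b_k, dtype=torch.float32)
    if not past.any():
        return F.softplus(b)                                      # no history yet
    X_hist = torch.cat([time_emb(event_times[past]),
                        type_emb[event_types[past]]], dim=-1)     # (N, 2M)
    x_t = torch.cat([time_emb(torch.tensor([t]))[0], type_emb[k]])  # query (2M,)
    scores = X_hist @ x_t / math.sqrt(x_t.numel())                # (N,)
    weights = F.softmax(scores, dim=0)                            # softmax over t_i < t
    trigger = weights @ (X_hist @ W_V) @ w_k                      # aggregated kernel sum
    return F.softplus(trigger + b)
```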

4.4. Fully Attention-based Intensity Function

In point process model training with maximum likelihood estimation (MLE), it is vital to compute the intensity integral over the entire time domain, which requires modeling the intensity both at event positions and on non-event intervals. In the RNN-based deep point process models (Du et al., 2016; Mei and Eisner, 2017), due to the limitations of the RNN framework in solely modeling latent representations at event positions, the aforementioned works adopted parameterized extrapolation methods to model the intensity on non-event intervals; see Eq. (11) in (Du et al., 2016) and Eq. (7) in (Mei and Eisner, 2017). THP (Zuo et al., 2020) also adopted the same approach to model the intensity on non-event intervals (the extrapolation term in Eq. 2). However, we emphasize that attention-based deep point process models do not necessarily require parameterized extrapolation methods to model the intensity on non-event intervals. Our design in Eq. 7 employs the attention mechanism to model the intensity function whether or not it is at an event position. Therefore, we refer to it as a "fully attention-based intensity function". The fully attention-based intensity function circumvents the limitations of parameterization and ensures that the model can effectively capture intricate intensity patterns at non-event positions, thus enhancing the model's expressive power.

4.5. Model Training

For a given sequence $\mathcal{S} = \{(t_i, k_i)\}_{i=1}^{L}$ on $[0, T]$, the point process model training can be performed by the MLE approach. The log-likelihood of a point process is expressed in the following form:

$$\mathcal{L}(\mathcal{S}) = \sum_{i=1}^{L} \log \lambda_{k_i}(t_i|\mathcal{H}_{t_i}) - \int_{0}^{T} \lambda(t|\mathcal{H}_t)\,dt, \tag{8}$$

where $\lambda(t\,|\,\mathcal{H}_t)=\sum_{k=1}^{K}\lambda_k(t\,|\,\mathcal{H}_t)$.

For ITHP, we estimate its parameters by maximizing the log-likelihood. Regarding the first term, we only need to compute the intensity function at event positions using Eq. 7. As for the second term, the intensity integral generally lacks an analytical expression. Here, we employ numerical integration by discretizing the time axis into a sufficiently fine grid and calculating the intensity function at each grid point using Eq. 7.
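As an illustration of this training objective, the sketch below approximates the integral term of Eq. 8 on a regular grid. The helper `intensity_fn(k, t)` is a placeholder for a per-type intensity evaluator such as Eq. 7, and the trapezoidal rule is one possible choice for the grid-based approximation.

```python
import numpy as np

def log_likelihood(event_times, event_types, T, intensity_fn, num_types, n_grid=200):
    """Event term minus a grid-based approximation of the intensity integral on [0, T]."""
    # First term: log-intensity of the observed type at each event time.
    event_term = sum(
        np.log(intensity_fn(k, t)) for t, k in zip(event_times, event_types)
    )
    # Second term: trapezoidal approximation of \int_0^T sum_k lambda_k(t) dt.
    grid = np.linspace(0.0, T, n_grid)
    total_intensity = np.array(
        [sum(intensity_fn(k, t) for k in range(num_types)) for t in grid]
    )
    integral_term = np.trapz(total_intensity, grid)
    return event_term - integral_term
```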

Complexity: The utilization of a fine grid does not significantly increase computational time, because the attention mechanism facilitates parallel computation of the attention outputs for all points. This parallelized computation improves the scalability of ITHP. Parallel computation with more grid points does require additional memory; fortunately, for one-dimensional temporal point processes, a large number of grid points is not necessary. In the subsequent experiments, all datasets run smoothly with only 8 GB of memory.

5. Experiment

We assess the performance of ITHP using both synthetic and public datasets. With the synthetic dataset, our objective is to validate the interpretability of our model by accurately identifying the underlying ground-truth trigger kernel. For the public datasets, we conduct a comprehensive evaluation of ITHP by comparing its performance against popular baseline models. The goal here is twofold: to quantitatively demonstrate the superior expressive power of ITHP and to qualitatively analyze its interpretability on real datasets.

5.1. Synthetic Data

We validate the interpretability of ITHP using two sets of 2-variate Hawkes process data. Each dataset is simulated from the 2-variate Hawkes process described in Section 3.1 using the thinning algorithm (Ogata, 1998). Both datasets share a common base rate ($\mu=0.2$), but they possess distinct trigger kernels:

  • Exponential Decay Kernel: This kernel assumes that the influence of historical events decays exponentially as time elapses. The kernel function is $\phi_{ij}(\tau)=\alpha_{ij}\exp(-\beta_{ij}\tau)$ for $\tau>0$, where $j$ is the source type and $i$ is the target type. Specifically, $\alpha_{11}=\alpha_{22}=3$, $\alpha_{12}=2$, $\alpha_{21}=1$, and $\beta_{ij}=5$ for all $i,j$.

  • Half Sinusoidal Kernel: This kernel assumes that the influence of historical events follows a sinusoidal pattern as time elapses and vanishes once the interval exceeds $\pi$. The kernel function is $\phi_{ij}(\tau)=\alpha_{ij}\sin(\tau)$ for $0<\tau<\pi$. Likewise, $j$ is the source type and $i$ is the target type. Specifically, $\alpha_{11}=\alpha_{22}=0.33$, $\alpha_{12}=0.1$, and $\alpha_{21}=0.05$.

Further elaboration on the simulation process and statistical aspects of the synthetic dataset can be found in Appendix A.
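For reference, the two ground-truth trigger kernels and the shared base rate can be written down directly from the parameter values above (a small sketch; the array and function names are our own):

```python
import numpy as np

# Exponential decay kernel: phi_ij(tau) = alpha_ij * exp(-beta * tau) for tau > 0.
ALPHA_EXP = np.array([[3.0, 2.0],
                      [1.0, 3.0]])   # alpha[i, j]: influence from source type j to target type i
BETA_EXP = 5.0

def phi_exp(i, j, tau):
    return ALPHA_EXP[i, j] * np.exp(-BETA_EXP * tau) if tau > 0 else 0.0

# Half sinusoidal kernel: phi_ij(tau) = alpha_ij * sin(tau) for 0 < tau < pi, else 0.
ALPHA_SIN = np.array([[0.33, 0.10],
                      [0.05, 0.33]])

def phi_sin(i, j, tau):
    return ALPHA_SIN[i, j] * np.sin(tau) if 0.0 < tau < np.pi else 0.0

MU = np.array([0.2, 0.2])            # common base rate for both event types
```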

Figure 1. (a) Trigger Kernel Recovery (Exp); (b) Trigger Kernel Recovery (Sin); (c) Intensity Recovery (Exp); (d) Intensity Recovery (Sin).

Results: We validate the interpretability of ITHP by reconstructing the trigger kernel. In ITHP, the trigger kernel is represented as $\mathbf{g}^{\top}_{k,k_i}(t-t_i,t_i)\mathbf{w}_k$, which is time-varying. To uncover the time-invariant trigger kernel inherent in the synthetic dataset, we evaluate the trigger kernels at various time points and compute their mean. This approach enables us to extract the desired time-invariant trigger kernel (Zhang et al., 2020). The results are presented in Figs. 1(a) and 1(b), revealing a noticeable alignment between the learned kernel trends and the patterns exhibited by the ground-truth kernels. Moreover, as depicted in Figs. 1(c) and 1(d), the learned intensity function from ITHP closely resembles the ground-truth intensity function. This observation underscores ITHP's capability to accurately capture the true conditional intensity function for both exponential decay and half sinusoidal Hawkes processes.
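As an illustration of this averaging step, a hypothetical sketch is given below; `learned_kernel(k, k_i, tau, t_i)` stands in for an evaluator of the learned time-varying kernel $\mathbf{g}^{\top}_{k,k_i}(\tau,t_i)\mathbf{w}_k$:

```python
import numpy as np

def recover_stationary_kernel(learned_kernel, k, k_i, taus, anchor_times):
    """Average the learned time-varying kernel phi_{k,k_i}(tau, t_i) over several
    anchor times t_i to estimate a time-invariant kernel phi_{k,k_i}(tau)."""
    values = np.array([[learned_kernel(k, k_i, tau, t_i) for t_i in anchor_times]
                       for tau in taus])
    return values.mean(axis=1)   # one averaged kernel value per lag tau
```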

Figure 2. Attention weight matrix of a testing sequence from the exponential decay Hawkes process data.

We also visualize the learned attention map of ITHP, which provides a deeper insight into the influence patterns. Fig. 2 depicts the attention weight matrix of a testing sequence in the context of the exponential decay Hawkes process data. The sequence encompasses both event timestamps and grids within the non-event intervals. In the matrix, the rows and columns correspond to the events and grids of the sequence (arranged chronologically). The horizontal axis represents the source point, while the vertical axis represents the target point. Only events have the potential to impact subsequent points, whereas grids, lacking actual event occurrences, cannot affect future points. As a result, numerous columns corresponding to grids have values of 0. Due to a masking operation, the upper triangular section, including the diagonal, is set to 0, which prevents events from influencing the past. Moreover, the color of event columns becomes progressively lighter as time advances, which aligns with the characteristics of the ground-truth exponential decay trigger kernel.
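The masking rule behind this matrix can be written compactly; the sketch below is a hypothetical illustration in which the flag array `is_event` marks which chronologically ordered points are real events rather than grid points:

```python
import numpy as np

def attention_mask(is_event):
    """Boolean mask for chronologically ordered points (events and grids).

    mask[i, j] is True iff point j is allowed to influence point i:
    j must be a real event (grid points carry no occurrence) and must
    strictly precede i (no influence on the past or on itself).
    """
    n = len(is_event)
    past = np.tril(np.ones((n, n), dtype=bool), k=-1)      # strictly lower triangle
    return past & np.asarray(is_event, dtype=bool)[None, :]  # zero out grid columns
```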

5.2. Public Data

In this section, we extensively evaluate ITHP by comparing it to baseline models across several public datasets. We select several network-sequence datasets, covering social media (StackOverflow), online shopping (Amazon, Taobao), traffic networks (Taxi), and a widely used public synthetic dataset (Conttime).

5.2.1. Datasets

We investigate five public datasets, each accompanied by a concise description. More details can be found in Appendix B.

  • StackOverflow (https://snap.stanford.edu/data/) (Leskovec and Krevl, 2014): This dataset contains two years of user awards on the question-answering website StackOverflow. Each user received a sequence of badges (Nice Question, Good Answer, ...) and there are $K=22$ kinds of badges.

  • Amazon (https://nijianmo.github.io/amazon/) (Ni et al., 2019): This dataset includes user online shopping behavior events on the Amazon website (browsing, purchasing, ...) and there are in total $K=16$ event types.

  • Taobao (https://tianchi.aliyun.com/dataset/649) (Zhu et al., 2018): This dataset was released for the 2018 Tianchi Big Data Competition and comprises user activities on the Taobao website (browsing, purchasing, ...) and there are in total $K=17$ event types.

  • Taxi (https://chriswhong.com/open-data/foil_nyc_taxi/) (Whong, 2014): While our main focus is social networks, our model can also be applied to other domains. This dataset comprises traffic-network sequences, including taxi pick-up and drop-off incidents across the five boroughs of New York City. Each borough, whether involved in a pick-up or a drop-off event, represents an event type, and there are in total $K=2\times 5=10$ event types.

  • Conttime (Mei and Eisner, 2017): This is a popular public synthetic dataset designed for Hawkes processes, comprising ten thousand event sequences with $K=5$ event types.

5.2.2. Baselines

In the experiments, we conduct a comparative analysis against the following popular baseline models:

  • RMTPP (Du et al., 2016) is an RNN-based model. It learns the representation of influences from historical events and takes event intervals as input explicitly.

  • NHP(Mei and Eisner, 2017) utilizes a continuous-time LSTM network, which incorporates intensity decay, allowing for a more natural representation of temporal dynamics without requiring explicit encoding of event intervals as inputs to the LSTM.

  • SAHP(Zhang etal., 2020) uses self-attention to characterize the influence of historical events and enhance its predictive capabilities by capturing intricate dependencies within the data.

  • THP(Zuo etal., 2020) is another attention-based model that utilizes Transformer to capture event dependencies while maintaining computational efficiency.

Table 1. TLL (↑) and ACC (↑) of all models on the five public datasets (mean ± std; in the original table the best result per column is bold and the second best is underlined).

Model   | StackOverflow TLL / ACC | Amazon TLL / ACC        | Taobao TLL / ACC        | Taxi TLL / ACC         | Conttime TLL / ACC
RMTPP   | -2.87±0.02 / 0.43±0.01  | -2.68±0.03 / 0.30±0.01  | -3.81±0.05 / 0.44±0.03  | 0.17±0.04 / 0.91±0.01  | -1.88±0.03 / 0.38±0.01
NHP     | -2.80±0.01 / 0.43±0.02  | -2.70±0.05 / 0.27±0.01  | -3.10±0.02 / 0.45±0.01  | 0.24±0.04 / 0.93±0.04  | -1.54±0.01 / 0.41±0.03
SAHP    | -1.96±0.02 / 0.45±0.01  | -1.42±0.04 / 0.35±0.01  | -4.70±0.03 / 0.46±0.01  | 0.21±0.03 / 0.94±0.01  | -2.22±0.02 / 0.42±0.01
THP     | -3.41±0.01 / 0.46±0.01  | -3.26±0.21 / 0.34±0.01  | -4.76±0.11 / 0.44±0.05  | 0.22±0.05 / 0.93±0.02  | -3.16±0.19 / 0.34±0.01
ITHP    | -2.50±0.03 / 0.46±0.01  | -2.10±0.02 / 0.36±0.01  | -3.09±0.02 / 0.47±0.01  | 0.25±0.05 / 0.94±0.01  | -1.43±0.01 / 0.38±0.01
Ex-ITHP | -3.58±0.01 / 0.43±0.03  | -4.65±0.02 / 0.33±0.01  | -4.80±0.01 / 0.41±0.02  | 0.18±0.03 / 0.85±0.03  | -3.74±0.04 / 0.31±0.02

5.2.3. Metrics

We assess ITHP and other baseline models using two distinct metrics:

  • TLL: the log-likelihood on the test data which quantifies the model’s ability to capture the underlying data distribution and effectively predict future events.

  • ACC: the event type prediction accuracy on the test data which characterizes the model’s accuracy in predicting the specific types of events, thereby gauging its capacity to discriminate between different event categories.

5.2.4. Quantitative Analysis

We conduct a comparative experiment across five datasets using all baseline models. The results, shown in Table 1, demonstrate that ITHP achieves competitive performance. A more intuitive visualization is presented in Fig. 3(a), where each model's TLL is standardized by subtracting the TLL of ITHP. It is worth noting that, to achieve interpretability, ITHP undergoes a certain degree of simplification, resulting in a reduction of its number of parameters. Interestingly, we observe that ITHP achieves performance comparable to that of models with larger numbers of parameters. ITHP can equivalently be regarded as a non-parametric, time-varying, and nonlinear statistical Hawkes process. The results in Fig. 3(a) prompt some reflection: while deep point processes claim to outperform statistical point processes, a sufficiently flexible (non-parametric, time-varying, and nonlinear) statistical point process can evidently achieve competitive performance as well. Furthermore, ITHP maintains excellent interpretability at both the event level and the event type level. Fig. 3(b) displays the attention weight matrix of a testing sequence from StackOverflow, illustrating the impact between events: the influence from past events tends to decrease as time elapses. Moreover, ITHP can describe the influence functions between event types. Taking StackOverflow as an example, Fig. 3(c) presents the learned influence functions from types 1, 3, 4, 5, 9, and 12 to type 4, which is the most prevalent type. Generally, these influences tend to decay over time.

Figure 3. (a) TLL Comparison; (b) Attention Map (StackOverflow); (c) Estimated $\hat{\phi}(\tau)$ (StackOverflow); (d) Amazon Statistics; (e) Heatmap of StackOverflow; (f) Heatmap of Amazon; (g) Heatmap of Taobao; (h) Heatmap of Taxi.

5.2.5. Qualitative Analysis

Our model can provide useful insights into the interactions among event types. To demonstrate this, we first quantify the magnitude of influence between event types. We compute $\int\phi_{ij}(\tau)\,d\tau$ for each influence function, representing the extent of influence from type $j$ (source type) to type $i$ (target type). Each learned $\int\phi_{ij}(\tau)\,d\tau$ is a scalar and can be displayed in a heat map. In this section, we analyse these datasets by examining their learned heat maps: Figs. 3(e), 3(f), 3(g), and 3(h).
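Concretely, the heat-map entries can be obtained by numerically integrating each learned influence function. The following is a minimal sketch in which `phi_hat(i, j, tau)` is a placeholder for the learned kernel evaluator and the trapezoidal rule is used for the integral:

```python
import numpy as np

def influence_matrix(phi_hat, num_types, tau_max, n_grid=500):
    """Integrate each learned influence function phi_hat(i, j, tau) over [0, tau_max],
    yielding one scalar per (target i, source j) pair for the heat map."""
    taus = np.linspace(0.0, tau_max, n_grid)
    heat = np.zeros((num_types, num_types))
    for i in range(num_types):          # target type
        for j in range(num_types):      # source type
            heat[i, j] = np.trapz([phi_hat(i, j, t) for t in taus], taus)
    return heat
```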

StackOverflow: In this dataset, there are 22 event types corresponding to "badges" awarded to users based on their actions. As depicted in Fig. 3(e), many of these types have a strong positive influence on both type 4 ("Popular Question") and type 9 ("Notable Question"). This observation aligns with the fact that "Popular Question" and "Notable Question" are the two most frequent events. Our model captures this trend and associates a significant positive impact from other types to them. Furthermore, a noticeable link is identified between type 6 ("Nice Answer", awarded when a user's answer first achieves a score of 10) and type 14 ("Enlighten", given when a user's answer reaches a score of 10), which have nearly identical meanings. This mirrors the real-world progression from receiving a "Nice Answer" badge to later earning an "Enlighten" badge. This congruence demonstrates that our model accurately captures the dataset's characteristics and effectively highlights the interplay between different event types.

Amazon, Taobao: Both of these datasets pertain to customer behavior on shopping platforms and share some commonalities. Each event type represents a category of the browsed item (Taobao) or purchased item (Amazon), with Amazon having $K=16$ types and Taobao having $K=17$ types. The learned heat maps are presented in Figs. 3(f) and 3(g). Interestingly, our model uncovers two common insights: (1) The dark diagonals indicate strong self-excitation for each type, suggesting that customers tend to browse items of the same category consecutively within a short period. In Taobao and Amazon, with over 15 types in total, approximately 58.3% and 21.4% of events, respectively, are followed by an event of the same type. This behavior reflects how customers often browse items of the same category in a short period to decide which one to purchase. Additionally, Amazon's subscription purchases exemplify this pattern: vendors offer extra savings to customers who subscribe, and these items are then regularly scheduled for delivery. (2) In Figs. 3(f) and 3(g), rows 1 and 17 appear the darkest, indicating that these two types receive the most significant excitation from others. In reality, these two categories are the most prevalent in their respective datasets, implying that they should also have the highest intensity. What our model learns aligns empirically with the ground-truth patterns in the datasets. Moreover, we conducted a statistical analysis on Amazon in Fig. 3(d), calculating the percentages of various event types ("Total"), the percentages of the next event being of the same type ("Same type follower"), and the percentages of the next event being of type 1 ("Type 1 follower"). The latter two constitute a significant portion (~50%), indicating strong self-excitation effects and a pronounced exciting effect on type 1, which aligns with the learned heat map in Fig. 3(f).

Taxi: In this dataset, there are 10 event types representing taxi pick-ups and drop-offs across the five boroughs of New York City. Types 1-5 denote "drop-off" actions, whereas types 6-10 correspond to "pick-up" actions in the respective boroughs. The learned heat map (Fig. 3(h)) reveals three key insights: (1) Among the "drop-off" actions (types 1-5), type 4 experiences the most significant influence from types 6-10 ("pick-up"). This aligns with the fact that type 4 (drop-off in Manhattan) is the most common drop-off event, accounting for over 40% and thereby possessing the highest intensity. (2) "Pick-up" and "drop-off" events always occur alternately; a driver cannot perform two consecutive pick-ups or two consecutive drop-offs. As Fig. 3(h) shows, types 6-10 ("pick-up") exert much more excitation on types 1-5 ("drop-off") than on themselves, because a "pick-up" action stimulates a subsequent "drop-off" action rather than another "pick-up" action. Likewise, types 1-5 exert much less excitation on themselves. (3) Types 9 and 4, pick-ups and drop-offs in Manhattan, display the most significant mutual influence, as indicated by the two darkest cells in Fig. 3(h). This is consistent with the data: most pick-up (44.61%) and drop-off (42.89%) actions occur in Manhattan. Furthermore, these two types always occur in tandem: 90.8% of passengers picked up in Manhattan are also dropped off there, and 96.2% of drivers who complete a trip in Manhattan pick up their next customer within the same borough. This clear short-term pattern is captured by our model and is evident in the dataset.

5.3. Ablation Study

Our model has a reduced parameter count but still achieves comparable or even better results than THP. This improvement is attributed to the "fully attention-based intensity function" (Section 4.4). THP relies on the parameterized extrapolated intensity, assuming that the intensity function on non-event intervals follows an approximately linear pattern (red term in Eq. 2). However, such an assumption does not align with the actual patterns in real data and can impact the expressive capability of the model. We conduct further ablation studies to illustrate the limitations of the parameterized extrapolation method in Table 1. We implement an additional revised model, Ex-ITHP, which essentially is "interpretable Transformer" + "extrapolated intensity". More details about Ex-ITHP are provided in Appendix C. THP, ITHP, and Ex-ITHP naturally constitute an ablation study. Ex-ITHP has fewer parameters, as it removes the parameters $\mathbf{W}^Q$ and $\mathbf{W}^K$, and uses a less flexible extrapolated intensity. In Table 1, Ex-ITHP exhibits the poorest performance due to its fewer parameters and restricted intensity flexibility. THP performs moderately, having more parameters but still restricted intensity flexibility. Conversely, ITHP, despite having fewer parameters, outperforms THP on most datasets owing to its more flexible intensity expression.

Figure 4. The learned fully attention-based intensity (ITHP) versus the learned extrapolated intensity (THP) on a segment of a half sinusoidal synthetic sequence.

Additionally, we visualize the difference between the learned fully attention-based intensity and the learned extrapolated intensity for a segment of a sequence from the half sinusoidal Hawkes synthetic dataset (Section 5.1). As depicted in Fig. 4, on non-event intervals THP, constrained by the approximately linear extrapolation, struggles to capture the fluctuating intensity patterns and can only learn an intensity that is approximately linear. Moreover, due to the limited variation in intensity on non-event intervals, large jumps are required when a new event occurs to maintain a height similar to the ground-truth intensity. In contrast, our proposed ITHP demonstrates greater flexibility, successfully capturing the fluctuating pattern on non-event intervals and accurately fitting the scale level.

5.4. Hyperparameter Analysis

Our model's configuration primarily involves two dimensions: the encoding dimension $M$ and the value dimension $M_V$. We maintain the skip connection within the encoder implementation, which requires $M_V=2M$. We test the sensitivity of model performance to hyperparameters by using various configurations on one toy dataset and one public dataset: the half-sine and Taxi datasets. The results are shown in Table 2. They indicate that our model is not significantly affected by hyperparameter variation and can achieve reasonably good performance even with fewer parameters.

Table 2. Hyperparameter sensitivity of ITHP on the Taxi and Half-Sine datasets.

Config              | Taxi TLL | Taxi ACC | Half-Sine TLL | Half-Sine ACC
$M=64$, $M_V=128$   | 0.2513   | 0.97     | -0.7714       | 0.58
$M=128$, $M_V=256$  | 0.2501   | 0.97     | -0.7909       | 0.58
$M=256$, $M_V=512$  | 0.2520   | 0.97     | -0.7822       | 0.58
$M=512$, $M_V=1024$ | 0.2498   | 0.97     | -0.7852       | 0.59

6. Conclusion

To model interactions in social networks using event sequence data, we introduce ITHP as a novel approach that enhances the interpretability and expressive power of deep point process models. Specifically, ITHP not only inherits the strengths of Transformer Hawkes processes but also aligns with statistical nonlinear Hawkes processes, offering practical insights into user or group interactions. It further enhances the flexibility of intensity functions over non-event intervals. Our experiments have demonstrated the effectiveness of ITHP in overcoming inherent limitations of existing deep point process models. Our findings open new avenues for research into understanding and modeling the complex dynamics of social ecosystems, ultimately contributing to a broader understanding of these intricate networks.

Acknowledgments

This work was supported by NSFC Project (No. 62106121), the MOE Project of Key Research Institute of Humanities and Social Sciences (22JJD110001), and the Public Computing Cloud, Renmin University of China.

Appendix

Appendix A Toy Data

We simulate two synthetic datasets: the exponential decay Hawkes processes and the half sinusoidal Hawkes processes. For the exponential decay Hawkes processes, we set the maximum observation length to $T=20$; for the half sinusoidal Hawkes processes, it is set to $T=100$. To perform the simulation, we employ the thinning algorithm (Ogata, 1998), outlined below. Note that the kernel function $\phi_{mn}(\tau)$ is defined in Section 5.1, where $m$ is the target type and $n$ is the source type, and $\phi_{m*}(\tau)$ denotes all kernels whose target type is $m$. The statistics of our toy datasets are listed in Table 3. Additionally, we present the attention weight matrix of a testing sequence from the half-sine toy data in Fig. 6.

Thinning algorithm:

Input: $\{\mu_n,\ \phi_{mn}(\cdot)\}$ for $m,n=1,2,\dots,M$, the observation window $[0,T]$
Initialize $\mathcal{T}^1=\dots=\mathcal{T}^M=\emptyset$, $n^1=\dots=n^M=0$, $s=0$;
while $s<T$ do
    Set $\bar{\lambda}=\sum_{m=1}^{M}\lambda^m(s^-)+\max(\phi_{m*}(\cdot))$;
    Generate $w\sim\text{exponential}(1/\bar{\lambda})$;
    Set $s=s+w$;
    Generate $D\sim\text{uniform}(0,1)$;
    if $D\bar{\lambda}\leq\sum_{m=1}^{M}\lambda^m(s)$ then
        $k\sim\text{categorical}([\lambda^1(s),\ldots,\lambda^M(s)]/\bar{\lambda})$;
        $n^k=n^k+1$;
        $t^k_{n^k}=s$;
        $\mathcal{T}^k=\mathcal{T}^k\cup\{t^k_{n^k}\}$;
    end if
end while
if $t^k_{n^k}\leq T$ then
    return $\{\mathcal{T}^m\}$ for $m=1,2,\dots,M$
else
    return $\mathcal{T}^1,\dots,\mathcal{T}^k\setminus\{t^k_{n^k}\},\dots,\mathcal{T}^M$
end if
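The procedure above can be translated into a short Python routine. The sketch below is a simplified rendering under our own naming: the dominating rate adds a user-supplied bound `phi_max`, and the accepted type is drawn from the normalized intensities, which is equivalent to the categorical step in the pseudocode.

```python
import numpy as np

def thinning(mu, phi, phi_max, T, rng=None):
    """Simulate a multivariate Hawkes process on [0, T] by the thinning algorithm.

    mu      : sequence of M base rates.
    phi     : callable phi(m, n, tau), trigger kernel from source type n to target type m.
    phi_max : scalar bound added to the current total intensity to dominate it.
    """
    rng = np.random.default_rng() if rng is None else rng
    M = len(mu)
    events = []                                   # accepted (time, type) pairs

    def lam(m, s):                                # conditional intensity of type m at time s
        return mu[m] + sum(phi(m, k, s - t) for t, k in events if t < s)

    s = 0.0
    while True:
        lam_bar = sum(lam(m, s) for m in range(M)) + phi_max
        s += rng.exponential(1.0 / lam_bar)       # candidate point from the dominating process
        if s > T:
            break
        lam_s = np.array([lam(m, s) for m in range(M)])
        if rng.uniform() * lam_bar <= lam_s.sum():        # thinning (accept/reject) step
            k = int(rng.choice(M, p=lam_s / lam_s.sum()))  # assign an event type
            events.append((s, k))
    return events

# Example (using the kernel sketch from Section 5.1 and a generous bound):
# events = thinning(mu=[0.2, 0.2], phi=phi_exp, phi_max=ALPHA_EXP.sum(), T=20.0)
```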

Table 3. Statistics of the toy datasets.

Dataset           | Split      | # of Events | Seq. Length Max / Min / Mean (Std) | Event Interval Max / Min / Mean (Std)
Exponential-Decay | training   | 70644       | 877 / 47 / 282.58 (150.25)         | 21.86 / 1.91e-06 / 0.28 (150.25)
Exponential-Decay | validation | 36521       | 877 / 47 / 292.17 (153.94)         | 20.90 / 1.91e-06 / 0.27 (153.94)
Exponential-Decay | test       | 33716       | 894 / 59 / 269.73 (144.97)         | 20.31 / 3.81e-06 / 0.29 (144.97)
Half-Sine         | training   | 95714       | 858 / 158 / 382.86 (96.29)         | 21.21 / 3.83e-07 / 0.52 (96.29)
Half-Sine         | validation | 48814       | 858 / 158 / 390.51 (106.12)        | 21.21 / 3.83e-07 / 0.51 (106.12)
Half-Sine         | test       | 50376       | 717 / 223 / 403.01 (102.19)        | 19.57 / 5.48e-06 / 0.49 (102.19)

Appendix B Public Data

B.1. Public Data Statistics

In this section, we cover the main statistics of the five public datasets: StackOverflow, Taobao, Amazon, Taxi, and Conttime, which are listed in Table 4. Note that all the public datasets are multivariate. Visualizations of the event type percentages in each dataset are depicted in Fig. 5. Each subplot in Fig. 5 displays the distribution of event types in the training, validation, and test sets, respectively.

Table 4. Statistics of the public datasets.

Dataset       | Split      | # of Events | Seq. Length Max / Min / Mean (Std) | Event Interval Max / Min / Mean (Std)
STACKOVERFLOW | training   | 90497       | 101 / 41 / 64.59 (20.46)           | 20.34 / 1.22e-4 / 0.88 (20.46)
STACKOVERFLOW | validation | 25313       | 101 / 41 / 63.12 (19.85)           | 16.68 / 1.22e-4 / 0.90 (19.85)
STACKOVERFLOW | test       | 26518       | 101 / 41 / 66.13 (20.77)           | 17.13 / 1.22e-4 / 0.85 (20.77)
TAOBAO        | training   | 75205       | 64 / 40 / 57.85 (6.64)             | 2.00 / 9.99e-05 / 0.22 (6.64)
TAOBAO        | validation | 11497       | 64 / 40 / 57.49 (6.82)             | 1.99 / 9.99e-05 / 0.22 (6.82)
TAOBAO        | test       | 28455       | 64 / 32 / 56.91 (7.82)             | 1.00 / 4.21e-06 / 0.05 (7.82)
AMAZON        | training   | 288377      | 94 / 14 / 44.68 (17.88)            | 0.80 / 0.010 / 0.51 (17.88)
AMAZON        | validation | 40088       | 94 / 15 / 43.48 (16.60)            | 0.80 / 0.010 / 0.50 (16.60)
AMAZON        | test       | 84048       | 94 / 14 / 45.41 (18.19)            | 0.80 / 0.010 / 0.51 (18.19)
TAXI          | training   | 51854       | 38 / 36 / 37.04 (1.00)             | 5.72 / 2.78e-4 / 0.22 (1.00)
TAXI          | validation | 7422        | 38 / 36 / 37.11 (1.00)             | 5.52 / 2.78e-4 / 0.22 (1.00)
TAXI          | test       | 14820       | 38 / 36 / 37.05 (1.00)             | 5.25 / 8.33e-4 / 0.22 (1.00)
CONTTIME      | training   | 479467      | 100 / 20 / 59.93 (23.13)           | 4.03 / 1.91e-06 / 0.24 (23.13)
CONTTIME      | validation | 60141       | 100 / 20 / 60.14 (22.97)           | 3.94 / 2.86e-06 / 0.24 (22.97)
CONTTIME      | test       | 61781       | 100 / 20 / 61.78 (23.21)           | 4.47 / 9.54e-07 / 0.24 (23.21)

Figure 5. Distribution of event types in the training, validation, and test sets of (a) StackOverflow, (b) Taobao, (c) Amazon, (d) Taxi, and (e) Conttime.

B.2. Additional Attention Map

We present additional attention weight matrices for the four public datasets in Fig. 6, which more intuitively demonstrate how events affect each other within a sequence.

Figure 6. Attention weight matrices of testing sequences from (a) Half-sine, (b) Taobao, (c) Amazon, (d) Taxi, and (e) Conttime.

Appendix C Implementation of Ex-ITHP

Ex-ITHP, namely "extrapolation ITHP", is ITHP (with the parameters $\mathbf{W}^Q$ and $\mathbf{W}^K$ removed) combined with the extrapolated intensity. In this section, we introduce the implementation of Ex-ITHP. Given a sequence $\mathcal{S}=\{(t_i,k_i)\}_{i=1}^{L}$ where each event is characterized by a timestamp $t_i$ and an event type $k_i$, Ex-ITHP utilizes the same temporal embedding and type embedding and concatenates them as ITHP does:

(9) $\mathbf{X}=[\mathbf{Z},\mathbf{E}]\in\mathbb{R}^{L\times 2M},$

where $\mathbf{Z}\in\mathbb{R}^{L\times M}$ and $\mathbf{E}\in\mathbb{R}^{L\times M}$ are the temporal encoding and type encoding of $\mathcal{S}$. The encoder output $\mathbf{S}$ is calculated in the same way as Eq. 4. $\mathbf{S}_i$ denotes the $i$-th row of $\mathbf{S}$, which is the representation of event $i$:

(10) $\mathbf{S}_i=\sum_{j<i}\underset{j<i}{\text{softmax}}\left(\frac{\mathbf{X}_i\mathbf{X}_j^{\top}}{\sqrt{2M}}\right)\mathbf{V}_j\in\mathbb{R}^{M_V}.$

The encoder output $\mathbf{S}_i$ is then passed through an MLP to get the final representation of event $i$:

$\mathbf{H}_i=\text{ReLU}(\mathbf{S}_i\mathbf{W}_1+\mathbf{b}_1)\mathbf{W}_2+\mathbf{b}_2\in\mathbb{R}^{M}.$

Then, given a type $k$ and time $t$ ($t_i<t\leq t_{i+1}$), the corresponding intensity is given by the extrapolation method:

(11) $\lambda_k(t\,|\,\mathcal{H}_t)=\text{softplus}\left(\alpha_k\frac{t-t_i}{t_i}+\mathbf{w}_k^{\top}\mathbf{H}_i+b_k\right).$

Finally, the log-likelihood to be optimized is given by:

(12) $\mathcal{L}(\mathcal{S})=\sum_{i=1}^{L}\log\lambda_{k_i}(t_i\,|\,\mathcal{H}_{t_i})-\sum_{k=1}^{K}\int_{0}^{T}\lambda_k(t\,|\,\mathcal{H}_t)\,dt.$

In summary, Ex-ITHP employs identical encoding techniques, specifically temporal encoding and event type encoding, and likewise eliminates the use of $\mathbf{W}^Q$ and $\mathbf{W}^K$, akin to ITHP. However, it utilizes the extrapolation method to formulate the intensity, as THP does.
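To contrast with the fully attention-based intensity of ITHP, a minimal sketch of this extrapolated intensity (Eq. 11) is given below; `H_i` is assumed to be the MLP output for the most recent event before $t$, and `alpha_k`, `w_k`, `b_k` stand for the learned scalar, weight vector, and bias.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus: log(1 + exp(x)).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def extrapolated_intensity(t, t_i, H_i, alpha_k, w_k, b_k):
    """Ex-ITHP / THP-style intensity on a non-event interval (t_i < t <= t_{i+1}):
    the history enters only through H_i, and the time dependence is the
    (approximately linear) term alpha_k * (t - t_i) / t_i."""
    return softplus(alpha_k * (t - t_i) / t_i + w_k @ H_i + b_k)
```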

References

  • Bhaduri etal. (2021)Moinak Bhaduri, Dhruva Rangan, and Anurag Balaji. 2021.Change detection in non-stationary Hawkes processes through sequential testing. In ITM Web of Conferences, Vol.36. EDP Sciences, 01005.
  • Brémaud and Massoulié (1996)Pierre Brémaud and Laurent Massoulié. 1996.Stability of nonlinear Hawkes processes.The Annals of Probability (1996), 1563–1588.
  • Daley and Vere-Jones (2003)DarylJ Daley and David Vere-Jones. 2003.An introduction to the theory of point processes. Vol. I. Probability and its Applications.
  • Daley and Vere-Jones (2007)DarylJ Daley and David Vere-Jones. 2007.An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure.Springer Science & Business Media.
  • Dong etal. (2022)Zheng Dong, Xiuyuan Cheng, and Yao Xie. 2022.Spatio-temporal point processes with deep non-stationary kernels.arXiv preprint arXiv:2211.11179 (2022).
  • Donner and Opper (2018)Christian Donner and Manfred Opper. 2018.Efficient Bayesian inference of sigmoidal Gaussian Cox processes.Journal of Machine Learning Research 19, 1 (2018), 2710–2743.
  • Du etal. (2016)Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. 2016.Recurrent marked temporal point processes: embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1555–1564.
  • Farajtabar etal. (2015)Mehrdad Farajtabar, Yichen Wang, Manuel GomezRodriguez, Shuang Li, Hongyuan Zha, and Le Song. 2015.Coevolve: A joint point process model for information diffusion and network co-evolution.Advances in Neural Information Processing Systems 28 (2015).
  • Hawkes (1971)AlanG Hawkes. 1971.Spectra of some self-exciting and mutually exciting point processes.Biometrika 58, 1 (1971), 83–90.
  • Kong etal. (2023)Quyu Kong, Pio Calderon, Rohit Ram, Olga Boichak, and Marian-Andrei Rizoiu. 2023.Interval-censored transformer hawkes: Detecting information operations using the reaction of social systems. In Proceedings of the ACM Web Conference 2023. 1813–1821.
  • Leskovec and Krevl (2014)Jure Leskovec and Andrej Krevl. 2014.SNAP Datasets: Stanford Large Network Dataset Collection.http://snap.stanford.edu/data.
  • Lewis and Mohler (2011)Erik Lewis and George Mohler. 2011.A nonparametric EM algorithm for multiscale Hawkes processes.Journal of Nonparametric Statistics 1, 1 (2011), 1–20.
  • Lloyd etal. (2015)Chris Lloyd, Tom Gunter, Michael Osborne, and Stephen Roberts. 2015.Variational inference for Gaussian process modulated Poisson processes. In International Conference on Machine Learning. 1814–1822.
  • Mei and Eisner (2017)Hongyuan Mei and Jason Eisner. 2017.The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, HannaM. Wallach, Rob Fergus, S.V.N. Vishwanathan, and Roman Garnett (Eds.). 6754–6764.
  • Ni etal. (2019)Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019.Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
  • Noorbakhsh and Rodriguez (2022)Kimia Noorbakhsh and Manuel Rodriguez. 2022.Counterfactual Temporal Point Processes. In Advances in Neural Information Processing Systems, S.Koyejo, S.Mohamed, A.Agarwal, D.Belgrave, K.Cho, and A.Oh (Eds.), Vol.35. Curran Associates, Inc., 24810–24823.https://proceedings.neurips.cc/paper_files/paper/2022/file/9d3faa41886997cfc2128b930077fa49-Paper-Conference.pdf
  • Ogata (1998)Yosihiko Ogata. 1998.Space-time point-process models for earthquake occurrences.Annals of the Institute of Statistical Mathematics 50, 2 (1998), 379–402.
  • Okawa etal. (2019)Maya Okawa, Tomoharu Iwata, Takeshi Kurashima, Yusuke Tanaka, Hiroyuki Toda, and Naonori Ueda. 2019.Deep Mixture Point Processes: Spatio-temporal Event Prediction with Rich Contextual Information. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, Ankur Teredesai, Vipin Kumar, Ying Li, Rómer Rosales, Evimaria Terzi, and George Karypis (Eds.). ACM, 373–383.
  • Pan etal. (2021)Zhimeng Pan, Zheng Wang, JeffM Phillips, and Shandian Zhe. 2021.Self-adaptable point processes with nonparametric time decays.Advances in Neural Information Processing Systems 34 (2021), 4594–4606.
  • Shchur etal. (2020a)Oleksandr Shchur, Marin Bilos, and Stephan Günnemann. 2020a.Intensity-Free Learning of Temporal Point Processes. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Shchur etal. (2020b)Oleksandr Shchur, Nicholas Gao, Marin Biloš, and Stephan Günnemann. 2020b.Fast and flexible temporal point processes with triangular maps.Advances in Neural Information Processing Systems 33 (2020), 73–84.
  • Upadhyay etal. (2018)Utkarsh Upadhyay, Abir De, and Manuel GomezRodriguez. 2018.Deep reinforcement learning of marked temporal point processes.Advances in Neural Information Processing Systems 31 (2018).
  • Wei etal. (2023)Song Wei, Yao Xie, ChristopherS Josef, and Rishikesan Kamaleswaran. 2023.Granger Causal Chain Discovery for Sepsis-Associated Derangements via Continuous-Time Hawkes Processes. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2536–2546.
  • Whong (2014)Chris Whong. 2014.FOILing NYC’s taxi trip data.FOILing NYCs Taxi Trip Data. Np 18 (2014).
  • Xiao etal. (2017a)Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. 2017a.Wasserstein learning of deep generative point process models.Advances in neural information processing systems 30 (2017).
  • Xiao etal. (2017b)Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and StephenM. Chu. 2017b.Modeling the Intensity Function of Point Process Via Recurrent Neural Networks. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, Satinder Singh and Shaul Markovitch (Eds.). AAAI Press, 1597–1603.
  • Zhang etal. (2022b)Lu-ning Zhang, Jian-wei Liu, Zhi-yan Song, and Xin Zuo. 2022b.Temporal attention augmented transformer Hawkes process.Neural Computing and Applications (2022), 1–15.
  • Zhang etal. (2020)Qiang Zhang, Aldo Lipani, Ömer Kirnap, and Emine Yilmaz. 2020.Self-Attentive Hawkes Process. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol.119). PMLR, 11183–11193.
  • Zhang etal. (2022a)Yizhou Zhang, Defu Cao, and Yan Liu. 2022a.Counterfactual neural temporal point process for estimating causal influence of misinformation on social media.Advances in Neural Information Processing Systems 35 (2022), 10643–10655.
  • Zhao etal. (2015)Qingyuan Zhao, MuratA Erdogdu, HeraY He, Anand Rajaraman, and Jure Leskovec. 2015.Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. 1513–1522.
  • Zhou etal. (2022)Feng Zhou, Quyu Kong, Zhijie Deng, Jichao Kan, Yixuan Zhang, Cheng Feng, and Jun Zhu. 2022.Efficient Inference for Dynamic Flexible Interactions of Neural Populations.Journal of Machine Learning Research 23, 211 (2022), 1–49.
  • Zhou etal. (2020a)Feng Zhou, Zhidong Li, Xuhui Fan, Yang Wang, Arcot Sowmya, and Fang Chen. 2020a.Efficient inference for nonparametric Hawkes processes using auxiliary latent variables.Journal of Machine Learning Research 21, 241 (2020), 1–31.
  • Zhou etal. (2020b)Feng Zhou, Zhidong Li, Xuhui Fan, Yang Wang, Arcot Sowmya, and Fang Chen. 2020b.Fast multi-resolution segmentation for nonstationary Hawkes process using cumulants.International Journal of Data Science and Analytics 10 (2020), 321–330.
  • Zhou etal. (2021)Feng Zhou, Yixuan Zhang, and Jun Zhu. 2021.Efficient Inference of Flexible Interaction in Spiking-neuron Networks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
  • Zhou etal. (2013)Ke Zhou, Hongyuan Zha, and Le Song. 2013.Learning triggering kernels for multi-dimensional Hawkes processes. In International Conference on Machine Learning. 1301–1309.
  • Zhou and Yu (2023)Zihao Zhou and Rose Yu. 2023.Automatic Integration for Fast and Interpretable Neural Point Processes. In Learning for Dynamics and Control Conference. PMLR, 573–585.
  • Zhu etal. (2018)Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018.Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1079–1088.
  • Zhu etal. (2021)Shixiang Zhu, Minghe Zhang, Ruyi Ding, and Yao Xie. 2021.Deep fourier kernel for self-attentive point processes. In International Conference on Artificial Intelligence and Statistics. PMLR, 856–864.
  • Zipkin etal. (2016)JosephR Zipkin, FredericP Schoenberg, Kathryn Coronges, and AndreaL Bertozzi. 2016.Point-process models of social network interactions: Parameter estimation and missing data recovery.European journal of applied mathematics 27, 3 (2016), 502–529.
  • Zuo etal. (2020)Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. 2020.Transformer hawkes process. In International conference on machine learning. PMLR, 11692–11702.