## 1. Introduction

Predicting and understanding the decisions individuals make has a host of applications, including modeling online shopping preferences (Ruiz et al., Reference Ruiz, Athey and Blei2020), forecasting the demand for renewable energy (Axsen & Kurani, Reference Axsen and Kurani2012; Michelsen & Madlener, Reference Michelsen and Madlener2012), and analyzing elections (Dreher et al., Reference Dreher, Gould, Rablen and Vreeland2014; Glasgow, Reference Glasgow2001). These and other scenarios are studied under the umbrella of *discrete choice* (Train, Reference Train2009), which describes any setting where people select items from a set of available alternatives. While discrete choice has its roots in econometrics, machine learning approaches have recently found great success in discrete choice applications (Seshadri et al., Reference Seshadri, Peysakhovich and Ugander2019; Rosenfeld et al., Reference Rosenfeld, Oshiba and Singer2020; Tomlinson & Benson, Reference Tomlinson and Benson2021; Bower & Balzano, Reference Bower and Balzano2020). This recent interest is driven by the increasing importance of Web-based choices (e.g., purchases on Amazon or bookings on Expedia), which provide both motivating applications and benchmark datasets. These new methods extend existing econometric models—most notably the classic *conditional* or *multinomial logit* (CL/MNL) (McFadden, Reference McFadden and Zarembka1973)—by learning more complex effects, such as context-dependent and nonlinear preferences.

One of the crucial aspects of human decision making is that, as fundamentally social creatures, our preferences are strongly influenced by our social context. Viral trends, conformity, word-of-mouth, and signaling all play roles in behavior, including choices (Feinberg et al., Reference Feinberg, Bruch, Braun, Falk, Fefferman, Feit and Small2020; Axsen & Kurani, Reference Axsen and Kurani2012). Additionally, people with similar preferences, beliefs, and identities are more likely to be friends in the first place, a phenomenon known as *homophily* (McPherson et al., Reference McPherson, Smith-Lovin and Cook2001). Together, these factors indicate that social network structure could be very informative in predicting choices. In economics and sociology, there has been growing interest in incorporating social factors into discrete choice models (McFadden, Reference McFadden2010; Maness et al., Reference Maness, Cirillo and Dugundji2015; Feinberg et al., Reference Feinberg, Bruch, Braun, Falk, Fefferman, Feit and Small2020). However, the methods used so far in these fields have largely been limited to simple feature-based summaries of social influence [e.g., what fraction of someone’s friends have selected an item (Goetzke & Rave, Reference Goetzke and Rave2011)].

On the other hand, the machine learning community has developed a rich assortment of graph learning techniques that can incorporate entire social networks into predictive models (Kipf & Welling, Reference Kipf and Welling2017; Jia & Benson, Reference Jia and Benson2021; Wu et al., Reference Wu, Pan, Chen, Long, Zhang and Philip2020), such as graph neural networks (GNNs) and graph-based regularization. These approaches can handle longer-range interactions and are less reliant on hand-crafted features. Because of the large gulf between the discrete choice and machine learning communities, there has been almost no study of the application of graph learning methods to discrete choice, where they have the potential for major impact. Perhaps one factor hindering the use of graph learning in discrete choice is that machine learning methods are typically designed for either regression or classification. Discrete choice has several features distinguishing it from multiclass classification (its closest analog)—for instance, each observation can have a different set of available items. As a concrete example, any image could be labeled as a cat in a classification setting, but people choosing between doctors may have their options dictated by their insurance policy.

Motivated by this need, we adapt graph learning techniques to incorporate social network structure into discrete choice modeling. By taking advantage of phenomena like homophily and social contagion, these approaches improve the performance of choice prediction in a social context. In particular, we demonstrate how GNNs can be applied to discrete choice, derive Laplacian regularization for the multinomial logit model, and adapt label propagation for discrete choice. We show in synthetic data that Laplacian regularization can improve sample complexity by orders of magnitude in an idealized scenario.

To evaluate our methods, we perform experiments on real-world election data and Android app installations, with networks derived from Facebook friendships, geographic adjacency, and Bluetooth pings between phones. We find that such network structures can improve the predictions of discrete choice models in a semi-supervised learning task. For instance, Laplacian regularization improves the mean relative rank (MRR)$^{1}$ of predictions by up to 6.8% in the Android app installation data and up to 2.6% in the 2016 US election data. In contrast with our results on app installations, we find no evidence of social influence in app usage among the same participants: social factors appear to influence the apps people get, but less so the apps they actually use. Instead, we find that app usage is dominated by personal habit. Another interesting insight provided by our discrete choice models in the app installation data is the discovery of two separate groups of participants, one in which Facebook is popular, while the other prefers Myspace.$^{2}$ We further showcase the power of a discrete choice approach by making counterfactual predictions in the 2016 US election data with different third-party candidates on the ballot. While a common narrative is that Clinton's loss was due to spoiler effects from third-party candidates (Chalabi, Reference Chalabi2016; Rothenberg, Reference Rothenberg2019), our results do not support this theory, although we emphasize the likelihood of confounding factors. Our tools enable us to rigorously analyze these types of questions.

## 2. Related work

There is a long line of work in sociology and network science on social behavior, including effects like contagion and herding (Centola & Macy, Reference Centola and Macy2007; Easley & Kleinberg, Reference Easley and Kleinberg2010; Banerjee, Reference Banerjee1992). More recently, there has been interest in the use of discrete choice in conjunction with network-based analysis (Feinberg et al., Reference Feinberg, Bruch, Braun, Falk, Fefferman, Feit and Small2020) enabled by rich data with both social and choice components (Aharony et al., Reference Aharony, Pan, Ip, Khayal and Pentland2011). The traditional econometric approach to discrete choice modeling with social effects is to add terms to an individual’s utility that depend on the actions or preferences of others (Brock & Durlauf, Reference Brock and Durlauf2001; McFadden, Reference McFadden2010; Maness et al., Reference Maness, Cirillo and Dugundji2015). For instance, this approach can account for an individual’s desire for conformity (Bernheim, Reference Bernheim1994). This is done by treating the choices made by a chooser’s community as a feature of the chooser and applying a standard multinomial logit (Páez et al., Reference Páez, Scott and Volz2008; Kim et al., Reference Kim, Rasouli and Timmermans2014; Goetzke & Rave, Reference Goetzke and Rave2011; Walker et al., Reference Walker, Ehlers, Banerjee and Dugundji2011; Kim et al., Reference Kim, Rasouli and Timmermans2018). In contrast, we focus on methods that employ the entire graph rather than derived features. This enables methods to account for longer-range interactions and phenomena such as network clustering without hand-crafting features. We are aware of one econometric paper that uses preference correlations over a full network in a choice model (Leung, Reference Leung2019), but inference under this method requires Monte Carlo simulation. 
Laplacian regularization, on the other hand, allows us to find our model’s maximum likelihood estimator with straightforward convex optimization. Mixture models are another way of incorporating structured preference heterogeneity into discrete choice, such as the mixed logit (McFadden & Train, Reference McFadden and Train2000) and hierarchical Bayes models with mixture priors (Allenby & Rossi, Reference Allenby and Rossi2006; Burda et al., Reference Burda, Harding and Hausman2008). Again, these approaches present significant challenges for inference, requiring Monte Carlo methods, variational approximations, or expectation maximization. Additionally, in positing unknown latent populations, mixture models ignore the key information provided by the structure of the network. Another large area of research in discrete choice concerns models that allow deviations from the axiom of *independence of irrelevant alternatives* (IIA) (Luce, Reference Luce1959). Many of these models, such as the multinomial probit (Hausman & Wise, Reference Hausman and Wise1978), are very challenging to estimate. To keep our focus on incorporating network effects, we use tractable logit models obeying IIA. However, there are recent non-IIA models admitting efficient inference to which we could apply our methods (Seshadri et al., Reference Seshadri, Peysakhovich and Ugander2019; Bower & Balzano, Reference Bower and Balzano2020; Tomlinson & Benson, Reference Tomlinson and Benson2021); this is beyond the scope of the present work, but we expand further on this idea in the discussion.

In another direction, there are many machine learning methods that use network structure in predictive tasks; GNNs (Kipf & Welling, Reference Kipf and Welling2017; Xu et al., Reference Xu, Hu, Leskovec and Jegelka2019; Wu et al., Reference Wu, Pan, Chen, Long, Zhang and Philip2020) are a popular example. Discrete choice is related to classification tasks, but the set of available items (i.e., labels) is specific to each observation—additionally, discrete choice models are heavily informed by economic notions of preference and rationality (McFadden, Reference McFadden and Zarembka1973; Train, Reference Train2009). A more traditional machine learning method of exploiting network structure for classification is label propagation (Zhu & Ghahramani, Reference Zhu and Ghahramani2002), which we extend to the discrete choice setting. Recent work has shown how to combine label propagation with GNNs for improved performance (Jia & Benson, Reference Jia and Benson2020) and presented a unified generative model framework for label propagation, GNNs, and Laplacian regularization (Jia & Benson, Reference Jia and Benson2021). The present work can be seen as an adaptation and empirical study of the methods from (Jia & Benson, Reference Jia and Benson2021) for discrete choice rather than regression.

The idea of applying Laplacian regularization to discrete choice models appeared several years ago in an unpublished draft (Zhang et al., Reference Zhang, Fountoulakis, Cao, Mahoney and Pozdnoukhov2017). However, the draft did not provide experiments beyond binary choices [which reduces to standard semi-supervised node classification (Kipf & Welling, Reference Kipf and Welling2017)]. In contrast, we compare Laplacian regularization with other methods of incorporating social network structure (GNNs and propagation) on real-world multialternative choice datasets.

There is a large body of existing research on predicting app usage and installation, including using social network structure (Baeza-Yates et al., Reference Baeza-Yates, Jiang, Silvestri and Harrison2015; Pan et al., Reference Pan, Aharony and Pentland2011; Xu et al., Reference Xu, Lin, Lu, Cardone, Lane, Chen and Choudhury2013), but our use of network-based discrete choice models for this problem is novel. Our approach has the advantage of being applicable to both usage and installation with minimal differences, allowing us to compare the relative importance of social structure in these settings. Another line of related work applies discrete choice models to networks in order to model edge formation (Overgoor et al., Reference Overgoor, Benson and Ugander2019; Tomlinson & Benson, Reference Tomlinson and Benson2021; Gupta & Porter, Reference Gupta and Porter2020; Overgoor et al., Reference Overgoor, Pakapol Supaniratisai and Ugander2020).

## 3. Preliminaries

In a discrete choice setting, we have a universe of *items* $\mathcal{U}$ and a set of *choosers* $\mathcal{A}$. In each choice instance, a chooser $a\in{\mathcal{A}}$ observes a *choice set* $C \subset{\mathcal{U}}$ and chooses one item $i \in C$. Each item $i\in{\mathcal{U}}$ may be described by a vector of features $\boldsymbol{y}_i \in{\mathbb{R}}^{d_y}$. Similarly, a chooser $a$ may have a vector of features $\boldsymbol{x}_a \in{\mathbb{R}}^{d_x}$. In the most general form, a choice model assigns choice probabilities for $a$ to each item $i\in C$:

$$\Pr (i \mid C, a) = \frac{\exp \big (u_{\boldsymbol{\theta }}(i, C, a)\big )}{\sum _{j \in C} \exp \big (u_{\boldsymbol{\theta }}(j, C, a)\big )}, \tag{1}$$

where $u_{\boldsymbol{\theta }}(i, C, a)$ is the utility of item $i$ to chooser $a$ when seen in choice set $C$, a function with parameters $\boldsymbol{\theta }$. Note that since the utilities in Equation (1) can depend on the choice set, this general form can express choice probabilities that vary arbitrarily across choice sets (this is sometimes called the *universal logit*). When constructing more parsimonious (and therefore more useful) models, the utilities $u_{\boldsymbol{\theta }}(i, C, a)$ can depend on $\boldsymbol{x}_a$, $\boldsymbol{y}_i$, both, or neither. In the simplest case—the traditional logit model—$u_{\boldsymbol{\theta }}(i, C, a) = u_i$ is constant over choosers and sets. This formulation is attractive from an econometric perspective, since it corresponds to a rationality assumption: if we suppose a chooser has underlying utilities $u_1, \dots, u_k$ and observes a perturbation $u_i + \varepsilon _i$ of their utilities (where $\varepsilon _i$ follows a Gumbel distribution) before selecting the item with maximum observed utility, then their resulting choice probabilities take the form of a logit (McFadden, Reference McFadden and Zarembka1973).
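In code, Equation (1) is a softmax over the utilities of the items available in the choice set. A minimal sketch for a plain logit (the item names and utility values are hypothetical):

```python
import numpy as np

def choice_probabilities(utilities, choice_set):
    """Softmax of Equation (1), restricted to the items available in C."""
    u = np.array([utilities[i] for i in choice_set])
    exp_u = np.exp(u - u.max())  # subtract the max for numerical stability
    p = exp_u / exp_u.sum()
    return dict(zip(choice_set, p))

# Plain logit: utilities are constant over choosers and choice sets.
utilities = {"app_a": 1.0, "app_b": 0.0, "app_c": -1.0}
# Probabilities depend only on the items available in this choice set.
probs = choice_probabilities(utilities, ["app_a", "app_b"])
```

Note that the probabilities automatically renormalize over whichever items are available, which is exactly what distinguishes discrete choice from ordinary multiclass classification.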

When we add a linear term in chooser features to the logit model, the result is the *multinomial logit* (MNL) (Hoffman & Duncan, Reference Hoffman and Duncan1988; McFadden, Reference McFadden and Zarembka1973), with utilities $u_{\boldsymbol\theta }(i, C, a) = u_i + \boldsymbol{\gamma}_{\boldsymbol{i}}^T \boldsymbol{x}_{\boldsymbol{a}}$, where $u_i$ are item-specific utilities and $\boldsymbol{\gamma}_{\boldsymbol{i}}$ is a vector of item-specific coefficients capturing interactions with the chooser features $\boldsymbol{x}_{\boldsymbol{a}}$. Similarly, when we add a linear term in item features, the result is a *conditional logit* (CL), with utilities $u_i + \boldsymbol{\varphi }^T\boldsymbol{y}_{\boldsymbol{i}}$. The *conditional multinomial logit* (CML) has both the chooser and item feature terms: $u_i + \boldsymbol{\varphi }^T\boldsymbol{y}_{\boldsymbol{i}} + \boldsymbol{\gamma}_{\boldsymbol{i}}^T\boldsymbol{x}_{\boldsymbol{a}}$. In order to capture heterogeneous preferences among a group of choosers, one natural approach is to allow each chooser $a$ to have different logit utilities. We call this a *per-chooser logit*, which is specified by per-chooser utilities $u_{\boldsymbol \theta }(i, C, a) = u_{ia}$. Similarly, a *per-chooser* CL has varying item feature coefficients $\boldsymbol \varphi _a$ for each chooser $a$, with $u_{\boldsymbol \theta }(i, C, a) = u_{ia} + \boldsymbol{\varphi }_a^T\boldsymbol{y}_{\boldsymbol{i}}$. More generally, we call any choice model parameter that varies across choosers a *per-chooser parameter*.
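To make the utility variants concrete, here is a minimal sketch of the CML utility (the feature values, dimensions, and item names are illustrative, not taken from the paper's datasets):

```python
import numpy as np

# Hypothetical features for illustration only.
x_a = np.array([0.5, 1.0])                        # chooser features, d_x = 2
y = {"i": np.array([2.0]), "j": np.array([0.0])}  # item features, d_y = 1
u = {"i": 0.3, "j": 0.1}                          # item-specific base utilities
gamma = {"i": np.array([1.0, -0.5]),              # per-item chooser-feature coefficients
         "j": np.array([0.0, 0.2])}
phi = np.array([0.4])                             # shared item-feature coefficients

def utility_cml(item, chooser_features):
    """CML utility u_i + phi^T y_i + gamma_i^T x_a.
    Dropping the phi term recovers the MNL; dropping gamma recovers the CL."""
    return u[item] + phi @ y[item] + gamma[item] @ chooser_features
```

Plugging these utilities into Equation (1) yields the corresponding choice probabilities.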

In addition to this standard discrete choice setup, our settings also have a network describing the relationships between choosers. Choosers are nodes in an undirected graph $G = ({\mathcal{A}}, E)$ where the presence of an edge $(a, b)\in E$ indicates a connection between $a$ and $b$ (e.g., a friendship). We assume $G$ is connected. The *Laplacian* of $G$ is $L = D - A$, where $D$ is the diagonal degree matrix of $G$ and $A$ is the adjacency matrix. The Laplacian has a number of useful applications, including graph clustering (Hagen & Kahng, Reference Hagen and Kahng1991) and counting spanning trees (Merris, Reference Merris1994). For our purposes, the key property of the Laplacian is that quadratic forms of $L$ measure how much a node-wise vector differs across edges of the graph (we elaborate on this property below). We use $n = |{\mathcal{A}}|$, $m = |E|$, and $k = |{\mathcal{U}}|$. Finally, $I$ denotes the identity matrix.

## 4. Graph-based methods for discrete choice

We identify three phases in choice prediction where networks can be incorporated: networks can be used (1) to inform model parameters, (2) to learn chooser representations, or (3) to directly produce predictions. In this section, we develop representative methods in each category. We briefly describe each method before diving into more detail.

First, networks can inform inference for a model that already accounts for chooser heterogeneity. This is done by incorporating the correlations in utilities (or other choice model parameters) of individuals who are close to each other in the network; we refer to these as *preference correlations* for simplicity. Our Laplacian regularization approach (described in Section 4.1) does exactly this, and we show that it corresponds to a Bayesian prior on network-based preference correlations. Second, networks can be used to learn latent representations of choosers that are then used as features in a choice model like the MNL. GNNs have been extensively studied as representation-learning tools—in Section 4.2, we focus on how to incorporate them into choice models, using graph convolutional networks (GCNs) (Kipf & Welling, Reference Kipf and Welling2017) as our canonical example. Third, direct network-based methods (such as label propagation (Zhu & Ghahramani, Reference Zhu and Ghahramani2002), which repeatedly averages a node’s neighboring labels) can also be used as a simple baseline for choice predictions. While this approach is simple and efficient, it lacks the principled handling of choice sets provided by the probabilistic approaches above. Nonetheless, we find it a useful and effective baseline, and we adapt label propagation for discrete choice in Section 4.3.

### 4.1. Laplacian regularization

We begin by describing how to incorporate network information in a choice model like MNL through Laplacian regularization (Ando & Zhang, Reference Ando and Zhang2007). Laplacian regularization encourages parameters corresponding to connected nodes to be similar through a loss term of the form $\lambda \boldsymbol{\alpha }^TL \boldsymbol{\alpha }$ , where $L$ is the graph Laplacian (as defined in Section 3), $\boldsymbol{\alpha }$ is the vector of parameter values for each node, and $\lambda$ is the scalar regularization strength. A famous identity is that $\boldsymbol{\alpha }^TL \boldsymbol{\alpha } = \sum _{(i, j) \in E} (\boldsymbol{\alpha }_i - \boldsymbol{\alpha }_j)^2$ , which more clearly shows the regularization of connected nodes’ parameters towards each other. This also shows that the Laplacian is positive semi-definite, since $\boldsymbol{\alpha }^TL \boldsymbol{\alpha } \ge 0$ , which will be useful to preserve the convexity of the multinomial logit’s (negative) log-likelihood.
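The identity above is easy to verify numerically. A small sketch on a toy path graph (the parameter values are arbitrary):

```python
import numpy as np

# A small undirected path graph on 4 nodes, given by its edge list.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A  # graph Laplacian

alpha = np.array([1.0, 2.0, 2.0, 5.0])  # one parameter value per node
quad = alpha @ L @ alpha                # matrix form of the penalty
edge_sum = sum((alpha[i] - alpha[j]) ** 2 for i, j in edges)
# quad == edge_sum: the penalty grows when connected nodes disagree.
```

Here the penalty is $1 + 0 + 9 = 10$: the disagreement across the edge $(2,3)$ dominates, which is exactly the behavior Laplacian regularization discourages.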

The idea of using Laplacian regularization for discrete choice was proposed in (Zhang et al., Reference Zhang, Fountoulakis, Cao, Mahoney and Pozdnoukhov2017) (although they focused on regularizing intercept terms in binary logistic regression). We generalize the idea to be applicable to any logit-based choice model and show that it corresponds to Bayesian inference with a network correlation prior. We then specialize to the models we use in our experiments. Laplacian regularization is simple to implement, can be added to any logit-based choice model with per-chooser parameters, and only requires training one extra hyperparameter. Laplacian regularization also carries a number of advantages over another approach to accounting for structured preference heterogeneity, mixture modeling.

#### 4.1.1. Theory of Laplacian-regularized choice models

Consider a general choice model, as in Equation (1). We split the parameters $\boldsymbol{\theta }$ into two sets $\theta _{\mathcal{A}}$ and $\theta _{G}$ , where parameters $\boldsymbol{\alpha } \in \theta _{\mathcal{A}}, \boldsymbol{\alpha }\in{\mathbb{R}}^n$ vary over choosers and parameters $\beta \in \theta _{G}, \beta \in{\mathbb{R}}$ are constant over choosers. The log-likelihood of a general choice model is

$$\ell (\boldsymbol{\theta };\;{\mathcal{D}}) = \sum _{(i, C, a) \in{\mathcal{D}}} \log \Pr (i \mid C, a). \tag{2}$$
The Laplacian- and $L_2$ -regularized log-likelihood (with $L_2$ regularization strength $\gamma$ ) is then

$$\ell _L(\boldsymbol{\theta };\;{\mathcal{D}}) = \ell (\boldsymbol{\theta };\;{\mathcal{D}}) - \sum _{\boldsymbol{\alpha } \in \theta _{\mathcal{A}}} \left ( \frac{\lambda }{2} \boldsymbol{\alpha }^T L \boldsymbol{\alpha } + \frac{\gamma }{2} \lVert \boldsymbol{\alpha } \rVert _2^2 \right ). \tag{3}$$
We show that regularized maximum likelihood estimation of $\boldsymbol{\theta }$ corresponds to Bayesian inference with a prior on per-chooser parameters that encourages smoothness over the network. In contrast, existing results on priors for semi-supervised regression (Xu et al., Reference Xu, Dyer and Owen2010; Chin et al., Reference Chin, Chen, Altenburger and Ugander2019) typically split the nodes into observed and unobserved, fixing the observed values and only considering randomness over unobserved nodes. In choice modeling, observing choices at a node only updates our beliefs about their preferences, leaving some uncertainty. Our result also allows some parameters of the choice model to be chooser dependent and others to be constant across choosers, allowing it to be fully general over choice models. Finally, we note that $L_2$ regularization can also be applied to the global parameters $\beta$ , which as usual corresponds to a Gaussian prior on these parameters—however, we state the result with uniform priors to emphasize the Laplacian regularization on the per-chooser parameters $\boldsymbol \alpha$ .

Theorem.
*The maximizer $\boldsymbol{\theta }^*_{\text{MLE}}$ of the Laplacian- and $L_2$-regularized log-likelihood $\ell _L(\boldsymbol{\theta };\;{\mathcal{D}})$ is the maximum a posteriori estimate $\boldsymbol{\theta }^*_{\text{MAP}}$ after observing $\mathcal{D}$ under the i.i.d. priors $\boldsymbol{\alpha } \sim \mathcal{N}(0, [\lambda L + \gamma I] ^{-1})$ for each $\boldsymbol{\alpha } \in \theta _{\mathcal{A}}$ and i.i.d. uniform priors for each $\beta \in \theta _{G}$.*

*Proof.* First, recall that $L$ is positive semi-definite, so $\lambda L + \gamma I$ (with $\gamma, \lambda \gt 0$ ) is positive definite and invertible. Now, using Bayes’ Theorem,

$$\Pr (\boldsymbol{\theta } \mid{\mathcal{D}}) = \frac{\Pr ({\mathcal{D}} \mid \boldsymbol{\theta })\, p(\boldsymbol{\theta })}{\Pr ({\mathcal{D}})}.$$

Since $\Pr ({\mathcal{D}})$ is independent of the parameters and $\log$ is monotonic and increasing,

$$\boldsymbol{\theta }^*_{\text{MAP}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta }} \Pr (\boldsymbol{\theta } \mid{\mathcal{D}}) = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta }} \big [ \log \Pr ({\mathcal{D}} \mid \boldsymbol{\theta }) + \log p(\boldsymbol{\theta }) \big ].$$

Notice that the first term is exactly the log-likelihood $\ell (\boldsymbol{\theta };\;{\mathcal{D}})$ . Additionally, the priors of each parameter are independent, so

$$\log p(\boldsymbol{\theta }) = \sum _{\boldsymbol{\alpha } \in \theta _{\mathcal{A}}} \log p(\boldsymbol{\alpha }) + \sum _{\beta \in \theta _{G}} \log p(\beta ).$$

Since the priors $p(\beta )$ are uniform, they do not affect the maximizer:

$$\boldsymbol{\theta }^*_{\text{MAP}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta }} \Big [ \ell (\boldsymbol{\theta };\;{\mathcal{D}}) + \sum _{\boldsymbol{\alpha } \in \theta _{\mathcal{A}}} \log p(\boldsymbol{\alpha }) \Big ].$$

Now consider the Gaussian priors $p(\boldsymbol{\alpha })$ :

$$p(\boldsymbol{\alpha }) = \frac{1}{\sqrt{(2\pi )^n \det \big ([\lambda L + \gamma I]^{-1}\big )}} \exp \left ( -\frac{1}{2} \boldsymbol{\alpha }^T [\lambda L + \gamma I]\, \boldsymbol{\alpha } \right ).$$

Simplifying the term in the $\exp$ reveals the two regularization terms:

$$-\frac{1}{2} \boldsymbol{\alpha }^T [\lambda L + \gamma I]\, \boldsymbol{\alpha } = -\frac{\lambda }{2} \boldsymbol{\alpha }^T L \boldsymbol{\alpha } - \frac{\gamma }{2} \lVert \boldsymbol{\alpha } \rVert _2^2.$$

We thus have for a constant $c$ independent of $\boldsymbol \alpha$ ,

$$\log p(\boldsymbol{\alpha }) = c - \frac{\lambda }{2} \boldsymbol{\alpha }^T L \boldsymbol{\alpha } - \frac{\gamma }{2} \lVert \boldsymbol{\alpha } \rVert _2^2.$$

Plugging this in and dropping the constants not affecting the maximizer yields

$$\boldsymbol{\theta }^*_{\text{MAP}} = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta }} \Big [ \ell (\boldsymbol{\theta };\;{\mathcal{D}}) - \sum _{\boldsymbol{\alpha } \in \theta _{\mathcal{A}}} \Big ( \frac{\lambda }{2} \boldsymbol{\alpha }^T L \boldsymbol{\alpha } + \frac{\gamma }{2} \lVert \boldsymbol{\alpha } \rVert _2^2 \Big ) \Big ] = \mathop{\mathrm{argmax}}_{\boldsymbol{\theta }} \ell _L(\boldsymbol{\theta };\;{\mathcal{D}}) = \boldsymbol{\theta }^*_{\text{MLE}}. \qquad \square$$
Notice that the Gaussian in the theorem above has precision (i.e., inverse covariance) matrix $\lambda L + \gamma I$ . The partial correlation between the per-chooser parameters $\boldsymbol{\alpha }_i$ and $\boldsymbol{\alpha }_j$ , $i \ne j$ (controlling for all other nodes) is therefore

$$\rho _{ij} = \frac{\lambda A_{ij}}{\sqrt{(\lambda d_i + \gamma )(\lambda d_j + \gamma )}},$$

using the standard Gaussian identity relating precision and partial correlation (Liang et al., Reference Liang, Song and Qiu2015) (where $d_i$ is the degree of $i$ ). If both $d_i, d_j \gt 0$ and $\gamma$ is small, then we can approximate

$$\rho _{ij} \approx \frac{A_{ij}}{\sqrt{d_i d_j}}.$$
This is easy to interpret: $\boldsymbol{\alpha }_i$ and $\boldsymbol{\alpha }_j$ have partial correlation 0 when $i$ and $j$ are unconnected ( $A_{ij} = 0$ ) and positive partial correlation when they are connected (larger when they have fewer other neighbors). That is, the Gaussian prior in the theorem assumes neighboring nodes have correlated preferences.
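The precision/partial-correlation relationship can be checked numerically. A sketch on a toy graph, using the standard formula $\rho_{ij} = -P_{ij}/\sqrt{P_{ii}P_{jj}}$ for a precision matrix $P$ (the values of $\lambda$ and $\gamma$ are arbitrary):

```python
import numpy as np

# Toy graph: a triangle on nodes 0, 1, 2 plus a pendant node 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d = A.sum(axis=1)
L = np.diag(d) - A

lam, gam = 1.0, 1e-3
P = lam * L + gam * np.eye(n)  # precision matrix of the prior

def partial_corr(i, j):
    """Partial correlation from the precision matrix: -P_ij / sqrt(P_ii P_jj)."""
    return -P[i, j] / np.sqrt(P[i, i] * P[j, j])

# For small gamma, this approaches A_ij / sqrt(d_i d_j).
approx_01 = A[0, 1] / np.sqrt(d[0] * d[1])
```

As claimed, unconnected pairs (such as nodes 0 and 3) have partial correlation exactly 0, while connected pairs have positive partial correlation close to $A_{ij}/\sqrt{d_i d_j}$.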

#### 4.1.2. Laplacian-regularized logit models

To incorporate Laplacian regularization in our four logit models (logit, MNL, CL, and CML), we add per-chooser utilities $v_{ia}$ for each item $i$ and chooser $a$ to the utility formulations. For instance, this results in the following utility function for a per-chooser MNL: $u_\theta (i, C, a) = u_i + \boldsymbol{x}_{\boldsymbol{a}}^T\boldsymbol{\gamma}_{\boldsymbol{i}} + v_{ia}$ . While we could get rid of the global utilities $u_i$ , $L_2$ regularization enables us to learn a parsimonious model where $u_i$ is the global baseline utility and $v_{ia}$ represents per-chooser deviations. The per-chooser parameters of a Laplacian-regularized logit are $\theta _A = \{\boldsymbol{v}_{\boldsymbol{i}}\}_{i\in U}$ , where the vector $\boldsymbol{v}_{\boldsymbol{i}}$ stacks the values of $v_{ia}$ for each chooser $a\in A$ . All other parameters are global. The Laplacian- and $L_2$ -regularized log-likelihood can then be written down by combining Equations (2) and (3). Crucially, since the Laplacian is positive semi-definite, the terms $-\frac{\lambda }{2} \boldsymbol{v}_{\boldsymbol{i}}^T L \boldsymbol{v}_{\boldsymbol{i}}$ are concave—and since all four logit log-likelihoods are concave (as is the $L_2$ regularization term), their regularized negative log likelihoods (NLLs) are convex. This enables us to easily learn maximum-likelihood models with standard convex optimization methods.
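As a concrete (and deliberately simplified) illustration, the following sketch fits a Laplacian- and $L_2$-regularized per-chooser logit with utilities $u_i + v_{ia}$ by plain gradient descent on the regularized NLL. This is a minimal stand-in for the standard convex optimization we use in practice; all names and hyperparameter values are illustrative:

```python
import numpy as np

def fit_laplacian_logit(data, L, n_items, n_choosers, lam=1.0, gamma=0.1,
                        lr=0.1, steps=500):
    """Gradient descent on the Laplacian- and L2-regularized NLL of a
    per-chooser logit with utilities u_i + v_{ia}.
    data: list of (chosen_item, choice_set, chooser) triples."""
    u = np.zeros(n_items)                 # global item utilities
    V = np.zeros((n_items, n_choosers))   # per-chooser deviations v_{ia}
    for _ in range(steps):
        gu = np.zeros_like(u)
        # Gradients of the regularizers: lambda * L v_i and gamma * v_i per item.
        gV = lam * (V @ L) + gamma * V
        for i, C, a in data:
            Cl = list(C)
            logits = u[Cl] + V[Cl, a]
            p = np.exp(logits - logits.max())
            p /= p.sum()
            for idx, j in enumerate(Cl):
                g = p[idx] - (1.0 if j == i else 0.0)  # d(NLL)/d(utility of j)
                gu[j] += g
                gV[j, a] += g
        u -= lr * gu
        V -= lr * gV
    return u, V
```

On a toy instance with two choosers connected by an edge, where only chooser 0 is observed (always picking item 0), the Laplacian term smooths chooser 1's deviations toward chooser 0's, so the unobserved chooser inherits a preference for item 0.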

### 4.2. Graph neural networks

GNNs (Wu et al., Reference Wu, Pan, Chen, Long, Zhang and Philip2020) use a graph to structure the aggregations performed by a neural network, allowing parameters for neighboring nodes to influence each other. We test the canonical GNN, a graph convolutional network (GCN) (Kipf & Welling, Reference Kipf and Welling2017), where node embeddings are averaged across neighbors before each neural network layer. There are many other types of GNNs (see Wu et al. (Reference Wu, Pan, Chen, Long, Zhang and Philip2020) for a survey)—we emphasize that we do not claim this particular GCN approach to be optimal for discrete choice. Rather, we illustrate how GNNs can be applied to choice data and encourage further exploration.

In a depth- $d$ GCN, each layer performs the following operation, producing a sequence of embeddings $H^{(0)}, \dots, H^{(d)}$ :

$$H^{(i+1)} = \sigma \big ( A' H^{(i)} W^{(i)} \big ),$$

where $H^{(0)}$ is initialized using node features (if they are available—if not, $H^{(0)}$ is learned), $\sigma$ is an activation function, $W^{(i)}$ are parameters, and $A' = (D+2I)^{-{1}/{2}}(A+I)(D+2I)^{-{1}/{2}}$ is the degree-normalized adjacency matrix (with self-loops). Self-loops are added to $G$ to allow a node’s current embedding to influence its embedding in the next layer. We can either use $H^{(d)}$ as the final embeddings or concatenate each layer’s embedding into a final embedding $H$ . In our experiments, we use a two-layer GCN (both layers with output dimension 16) and concatenate the layer embeddings. For simplicity, we fix the dropout rate at 0.5.
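A minimal forward-pass sketch of these layers in NumPy, using ReLU as the activation and the $(D+2I)^{-1/2}(A+I)(D+2I)^{-1/2}$ normalization from the text (the graph, feature dimensions, and weights are toy values for illustration):

```python
import numpy as np

def gcn_forward(A, H0, weights):
    """Stacked GCN layers H^{(i+1)} = relu(A' H^{(i)} W^{(i)}),
    returning the concatenation of each layer's embeddings."""
    n = A.shape[0]
    d = A.sum(axis=1)
    Dn = np.diag(1.0 / np.sqrt(d + 2.0))   # (D + 2I)^{-1/2}
    A_prime = Dn @ (A + np.eye(n)) @ Dn    # normalized adjacency with self-loops
    H, layer_outputs = H0, []
    for W in weights:
        H = np.maximum(A_prime @ H @ W, 0.0)  # ReLU activation
        layer_outputs.append(H)
    return np.concatenate(layer_outputs, axis=1)

# Toy example: 4 nodes, 3 input features, two layers of output dimension 16.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = rng.standard_normal((4, 3))
weights = [rng.standard_normal((3, 16)), rng.standard_normal((16, 16))]
H = gcn_forward(A, H0, weights)  # shape (4, 32): concatenated embeddings
```

In practice the weights are trained end-to-end with the choice model (and dropout applied between layers), which this sketch omits.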

To apply GCNs to discrete choice, we can treat the final node embeddings as chooser features and apply an MNL, modeling utilities as $u_\theta (i, C, a) = u_i + H_a^T\boldsymbol{\gamma}_{\boldsymbol{i}}$ , where $u_i$ and $\boldsymbol{\gamma}_{\boldsymbol{i}}$ are per-item parameters (the intercept and embedding coefficients, respectively). If item features are also available, we add the CL term $\boldsymbol{\theta }^T \boldsymbol{y}_{\boldsymbol{i}}$ . Thanks to automatic differentiation software such as PyTorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan and Chintala2019), we can train both the GCN and MNL/CML weights end-to-end. Again, any node representation learning method could be used for the embeddings $H$ —we use a GCN for simplicity.

In general, GNNs have the advantage of being highly flexible, able to capture complex interactions between the features of neighboring nodes. However, some recent research has indicated that nonlinearity is less helpful for classification in GNNs than in traditional neural network tasks (Wu et al., Reference Wu, Souza, Zhang, Fifty, Yu and Weinberger2019). With the additional modeling power comes significant additional difficulty in training and hyperparameter selection (for embedding dimensions, depth, dropout rate, and activation function).

### 4.3. Choice fraction propagation

We also consider a baseline method that uses the graph to directly derive choice predictions, without a probabilistic model of choice. We extend label propagation (Zhou et al., Reference Zhou, Bousquet, Lal, Weston and Schölkopf2004; Jia & Benson, Reference Jia and Benson2021) to multialternative discrete choice. Three features distinguish the choice setting from standard label propagation: we can observe multiple “labels” (i.e., choices) per chooser, each observation may have had different available labels, and not all labels are available at inference time. Given training data of observed choices of the form $(i, C, a)$, where chooser $a \in{\mathcal{A}}$ chose item $i\in C \subseteq{\mathcal{U}}$, we assign each chooser $a$ a vector $\boldsymbol{z}_{\boldsymbol{a}}$ of size $k = |{\mathcal{U}}|$ with each item’s *choice fraction*. That is, the $i$th entry of $\boldsymbol{z}_{\boldsymbol{a}}$ stores the fraction of times $a$ chose $i$ in the observed data out of all opportunities they had to do so (i.e., the number of times $i$ appeared in their choice set). We use choice fractions rather than counts to normalize by a chooser’s number of observations and to avoid penalizing an item for instances when it was not available.

We then apply label propagation to the vectors $\boldsymbol{z}_{\boldsymbol{a}}$ over $G$ . Let $Z^{(0)}$ be the matrix whose rows are $\boldsymbol{z}_{\boldsymbol{a}}$ . As in standard label propagation, we iterate $ Z^{(i+1)} \gets (1-\rho ) Z^{(0)} + \rho D^{-{1}/{2}}AD^{-{1}/{2}} Z^{(i)}$ until convergence, where $\rho \in [0, 1]$ is a hyperparameter that controls the strength of the smoothing. Let $ Z^{(\infty )}$ denote the stationary point of the iterated map. For inference, we can use the $a$ th row of $Z^{(\infty )}$ (in practice, we will have a matrix arbitrarily close to $Z^{(\infty )}$ ), denoted $\boldsymbol{z}_{\boldsymbol{a}}^{(\boldsymbol\infty )}$ , to make predictions for chooser $a$ . Given a choice set $C$ , we predict $a$ will choose the argmax of $\boldsymbol{z}_{\boldsymbol{a}}^{(\boldsymbol\infty )}$ restricted to items appearing in $C$ . Note that in a semi-supervised setting, we do not observe any choices from the test choosers, so their entries of $Z^{(0)}$ will be zero. The term $(1-\rho )Z^{(0)}$ then acts as a uniform prior, regularizing the test chooser entries of $Z^{(\infty )}$ toward 0. Since choice fraction propagation does not use chooser or item features, it is best suited to scenarios where neither are available.
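The propagation update and argmax prediction can be sketched as follows (the graph and observations are a toy example; in the semi-supervised setting, test choosers start with all-zero rows):

```python
import numpy as np

def propagate_choice_fractions(A, Z0, rho=0.9, tol=1e-8):
    """Iterate Z <- (1-rho) Z0 + rho D^{-1/2} A D^{-1/2} Z until convergence."""
    d = A.sum(axis=1)
    S = np.diag(1.0 / np.sqrt(d)) @ A @ np.diag(1.0 / np.sqrt(d))
    Z = Z0.copy()
    while True:
        Z_next = (1 - rho) * Z0 + rho * S @ Z
        if np.abs(Z_next - Z).max() < tol:
            return Z_next
        Z = Z_next

def predict_choice(Z_inf, a, choice_set):
    """Predict the argmax of chooser a's propagated fractions over the set C."""
    return max(choice_set, key=lambda i: Z_inf[a, i])

# Path graph 0-1-2; chooser 0 was observed always choosing item 0,
# while choosers 1 and 2 are unobserved test choosers (all-zero rows).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
Z0 = np.array([[1.0, 0.0],
               [0.0, 0.0],
               [0.0, 0.0]])
Z_inf = propagate_choice_fractions(A, Z0)
# The test choosers inherit a preference for item 0 from their neighbor.
```

The iteration converges because the spectral radius of $\rho D^{-1/2}AD^{-1/2}$ is at most $\rho \lt 1$.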

## 5. Networked discrete choice data

We now describe several datasets in which we can leverage social network structure for improved choice prediction using the methods we develop. Table 1 shows a summary of our datasets, which are available at https://osf.io/egj4q/.

$^*$ Voters had more than one election (i.e., choice) on their ballots.

### 5.1. Friends and Family app data

The Friends and Family dataset (Aharony et al., Reference Aharony, Pan, Ip, Khayal and Pentland2011) follows over 400 residents of a young-family community in North America during 2010-2011. The dataset is remarkably rich, capturing many aspects of the participants’ lives. For instance, they were given Android phones with purpose-made logging software that captured app installation and usage as well as Bluetooth pings between participants’ phones. We use the installation and usage data to construct two separate choice datasets (app-install and app-usage) and use a network built from Bluetooth pings, as in (Aharony et al., Reference Aharony, Pan, Ip, Khayal and Pentland2011). We ignore uncommon and built-in apps (for instance, we ignore apps whose package names begin with com.android, com.motorola, com.htc, com.sec, and com.google), leaving a universe $\mathcal{U}$ of 127 apps in app-install and 121 in app-usage (e.g., Twitter, Facebook, and Myspace).

To construct app-install, we use scans that checked which apps were installed on each participant’s phone every 10 minutes to 3 hours. Each time a new app $i$ appears in a scan for a participant $a$, we consider that a choice from the set of apps $C$ that were not installed at the time of the last scan. We use a plain logit as the baseline model in app-install, since no item features are readily available. To construct app-usage, we use 30-second resolution scans of running apps. To separate usage into sessions, we select instances where a participant ran an app for the first time in the last hour. We consider such app runs to be a choice $i$ from the set of all apps $C$ installed on participant $a$’s phone at that time. Our discrete choice approach enables us to account for these differences in app availability. In app-usage, we use a CL with a single instance-specific item feature: *recency*, defined as $\log ^{-1}(\text{seconds since last use})$, or 0 if the user has not used the app. While it would be possible to construct more complex sets of features with additional effort (for instance, categorizing different types of apps or tracking down their Android store ratings), a simple baseline suffices to demonstrate how social network structure can benefit choice modeling even in the absence of item and user features.

To form the social network $G$ over participants for both datasets, we use Bluetooth proximity hits—like the original study (Aharony et al., Reference Aharony, Pan, Ip, Khayal and Pentland2011), we only consider hits in the month of April between 7 am and midnight (to avoid coincidental hits from neighbors at night). For each participant $a$ , we form the link $(a, b)$ to each of their 10 most common interaction partners $b$ (we also tested thresholds 2–9, but our methods all performed very similarly). We perform this thresholding because the Bluetooth ping network is extremely dense and contains many edges that are likely not socially meaningful (for instance, nearby phones may ping each other when two strangers shop in the same store). Prior research on this data found that social contacts were useful in predicting app installations, but did not employ a discrete choice approach (Aharony et al., Reference Aharony, Pan, Ip, Khayal and Pentland2011; Pan et al., Reference Pan, Aharony and Pentland2011). Our discrete choice approach allows us to account for multi-hop social connections and the context of each installation (i.e., what apps were already installed).
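The thresholding step can be sketched as follows (a hypothetical sketch: the ping log format, function name, and tie-breaking are our own assumptions, not the original pipeline):

```python
from collections import Counter

def build_top_k_network(pings, k=10):
    """Link each participant to their k most common interaction partners.

    pings: iterable of (a, b) Bluetooth proximity hits (a's phone heard b's).
    Returns an undirected edge set; an edge (a, b) is kept if b is among
    a's top-k partners (or vice versa), matching the paper's construction.
    """
    counts = Counter(pings)  # (a, b) -> number of hits
    partners = {}
    for (a, b), c in counts.items():
        partners.setdefault(a, []).append((c, b))
    edges = set()
    for a, lst in partners.items():
        # Keep the k partners with the most hits
        for _, b in sorted(lst, reverse=True)[:k]:
            edges.add((min(a, b), max(a, b)))
    return edges
```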

As a warm-up data analysis, we show that people are more likely to install an app the more of their friends have it (but not if we randomize friendships). Let $n$ be the total number of people, $n_i$ the number of people who installed application $i$ , $f_a$ the number of friends of person $a$ , and $f_{ai}$ the number of friends of person $a$ who have app $i$ . Suppose app installations are independent of friendships. If we sample some person $a$ uniformly at random and check which of their friends have app $i$ , then the probability that $a$ also has app $i$ is $(n_i - f_{ai})/(n-f_a)$ (simply the remaining fraction of people who have the app, after observing the friends of $a$ ). However, if app installations correlate across friendships, the observed probability would be higher when $f_{ai}/ f_a$ is larger. We measure the empirical probability that a person has an app at different friend-installation fractions. Specifically, we measure

$$\frac{1}{|S_p|}\sum_{(a, i) \in S_p}\left(\boldsymbol{1}_{ai}-\frac{n_i-f_{ai}}{n-f_a}\right), \qquad S_p=\left\{(a, i) : \frac{f_{ai}}{f_a}=p\right\},$$
where $\boldsymbol{1}_{ai}$ is an indicator for whether person $a$ has app $i$ . Notice that if friendships are uncorrelated with app installations, the expectation of the summand is $0$ . Instead of taking the mean over all app pairs, we take the mean at each unique friend-installation fraction to see if having more friends with an app results in stronger deviations from uniform installations. This is exactly what we observe: when people have more friends with an app, they are more likely to install it (Figure 1). In contrast with two null models (a configuration model with the same degree distribution and an Erdős–Rényi graph with the same density), we see an increase in peoples’ installation probabilities as a larger fraction of their friends have an app. This is in line with findings that the probability an individuals joins a social network community increases with the number of their friends in the community (Backstrom et al., Reference Backstrom, Huttenlocher, Kleinberg and Lan2006). However, it is worth emphasizing that this finding is purely correlational—we have no way of knowing whether increased installation rates are due to homophily in the social network, word-of-mouth contagion, or other confounding factors.
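This statistic can be computed directly from an installation matrix and a friendship matrix; the following minimal sketch (variable names are ours) groups the summand by friend-installation fraction:

```python
import numpy as np
from collections import defaultdict

def installation_deviation(install, adj):
    """Mean deviation 1_ai - (n_i - f_ai)/(n - f_a), grouped by f_ai/f_a.

    install: (n, m) 0/1 matrix, install[a, i] = 1 if person a has app i.
    adj:     (n, n) 0/1 symmetric friendship matrix.
    Returns {friend-installation fraction p: mean deviation over the
    (a, i) pairs with f_ai / f_a = p}.
    """
    n, m = install.shape
    n_i = install.sum(axis=0)   # installers per app
    f_a = adj.sum(axis=1)       # friends per person
    f_ai = adj @ install        # friends with each app
    groups = defaultdict(list)
    for a in range(n):
        if f_a[a] == 0 or f_a[a] == n:
            continue            # fraction or null probability undefined
        for i in range(m):
            p = f_ai[a, i] / f_a[a]
            dev = install[a, i] - (n_i[i] - f_ai[a, i]) / (n - f_a[a])
            groups[p].append(dev)
    return {p: float(np.mean(v)) for p, v in groups.items()}
```

Under independent installations, each group's mean is near zero; homophily or contagion shows up as positive deviations at high fractions.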

### 5.2. County-level US presidential election data

US presidential election data is a common testbed for graph learning methods using a county-level adjacency network, but the typical approaches are to treat elections as binary classification or regression problems (predicting the vote shares of one party) (Jia & Benson, Reference Jia and Benson2020; Zhou et al., Reference Zhou, Dong, Wang, Lee, Hooi, Xu and Feng2020; Huang et al., Reference Huang, He, Singh, Lim and Benson2020). However, this ignores the fact that voters have more than two options—moreover, different candidates can be on the ballot in different states. The universe of items $\mathcal{U}$ in our 2016 election data contains no fewer than 31 different candidates (and a “none of these candidates” option in Nevada, which received nearly 4% of the votes in one county). While third-party candidates are unlikely to win in the US, they often receive a nontrivial (and quite possibly consequential) fraction of votes. For instance, in the 2016 election, independent candidate Evan McMullin received 21.5% of the vote in Utah, while Libertarian candidate Gary Johnson and Green Party candidate Jill Stein received 3.3% and 1.1% nationally (the gap between Clinton and Trump was only 2%). A discrete choice approach enables us to include third-party candidates and account for different ballots in different states. As a visual example, in Figure 2, we show the states in which McMullin appeared on the ballot as well as his per-county vote share. By accounting for ballot variation, we can make counterfactual predictions about what would happen if different candidates had been on the ballot, which is difficult without a discrete choice framework. For example, given McMullin’s regional support in Utah, it is possible that he would have fared better in Nevada (Utah’s western neighbor) than in an East Coast state like New York. 
Using the entire ballots also allows us to account for one possible reason why McMullin’s vote share appears not to have spilled over into Colorado, while it did into Idaho: Colorado had fully 22 candidates on the ballot, while Idaho had only 8. A discrete choice approach handles this issue cleanly, while regression on vote shares does not. We note that, due to inherent limitations of observational data, we cannot be sure of the causes of the effects we observe (Tomlinson et al., Reference Tomlinson, Ugander and Benson2021)—nonetheless, a discrete choice approach enables more flexible modeling and can improve prediction performance regardless of the cause of preference correlations.

We gathered county-level 2016 presidential voting data from (Kearney, Reference Kearney2018) and county data from (Jia & Benson, Reference Jia and Benson2020),$^{3}$ which includes a county adjacency network, county-level demographic data (e.g., education, income, birth rates, USDA economic typology,$^{4}$ and unemployment rates), and the Social Connectedness Index (SCI) (Bailey et al., Reference Bailey, Cao, Kuchler, Stroebel and Wong2018), which measures the relative frequency of Facebook friendships between each pair of counties. We aggregate all votes at the county level, treating each county as a chooser $a$ and using county features as $\boldsymbol{x}_a$ [modeling voting choices in aggregate is standard practice (Alvarez & Nagler, Reference Alvarez and Nagler1998)]. For the graph $G$, we tested both the geographic adjacency network and a network formed by connecting each county to the 10 others with which it has the highest SCI. We found almost identical results with both networks, so we only discuss the results using the SCI network. We refer to the resulting dataset as us-election-2016.

### 5.3. California precinct-level election data

The presidential election data is particularly interesting because different ballots have different candidates, all running in the same election. For instance, this is analogous to having different regional availability of goods within a category in an online shopping service. In our next two datasets, ca-election-2016 and ca-election-2020, we highlight a different scenario: when ballots in different locations may have different *elections*. Extending the online shopping analogy, this emulates the case where different users view different recommended categories of items. Although it is beyond the scope of the present work, a discrete choice approach would enable measuring cross-election effects, such as coattail effects (Hogan, Reference Hogan2005; Ferejohn & Calvert, Reference Ferejohn and Calvert1984) where higher-office elections increase excitement for down-ballot races.

To construct these datasets, we used data from the 2016 and 2020 California general elections from the Statewide Database.$^{5}$ This includes per-precinct registration and voting data as well as shapefiles describing the geographic boundaries of each precinct (California has over 20,000 voting precincts). The registration data contain precinct-level demographics (counts for party affiliation, sex, ethnicity, and age ranges), although such data were not available for all precincts. We restrict the data to the precincts for which all three data types were available: voting, registration, and shapefile (99.8% of votes cast are included in our processed 2016 data and 99.0% in our 2020 data). Again, we treat each precinct as a chooser $a$ with demographic features $\boldsymbol{x}_a$.

Our processed California data include elections for the US Senate, US House of Representatives, California State Senate, and California ballot propositions. We set aside presidential votes due to overlap with the previous dataset and state assembly votes to keep the data size manageable. Due to California’s nonpartisan top-two primary system,$^{6}$ there are two candidates running for each office—however, each voter has a different set of elections on their ballot due to differences in US congressional and California state senate districts (the state has 53 congressional districts and 40 state senate districts). A discrete choice approach enables us to train a single model accounting for preferences over all types of candidates. We use the precinct adjacency network $G$ (since SCI is not available at the finer-grained precinct level), which we constructed from the Statewide Database shapefiles using QGIS (https://qgis.org).

## 6. Empirical results

We begin by demonstrating the sample complexity benefit of using network structure through Laplacian regularization on synthetic data. We then apply all three approaches to our datasets, compare their performance, and demonstrate the insights provided by a networked discrete choice approach. See Table 1 in Section 5 for a dataset overview. Our code and instructions for reproducing results are available at https://github.com/tomlinsonk/graph-based-discrete-choice/.

### 6.1. Improved sample complexity with Laplacian regularization

By leveraging correlations between node preferences through Laplacian regularization, we need fewer samples per node in order to achieve the same inference quality. When preferences are smooth over the network, an observation of a choice by one node gives us information about the preferences of its neighbors (and its neighbors’ neighbors, etc.), effectively increasing the usefulness of each observation. In Figure 3, we show the sample complexity benefit of Laplacian regularization in synthetic data with 100-node Erdős–Rényi graphs ( $p=0.1$ ) and preferences over 20 items generated according to the prior from Section 4.1.1. In each of 8 trials, we generate the graph, sample utilities, and then simulate a varying number of choices by each chooser. We repeat this for different homophily strengths $\lambda$ . For each simulated choice, we first draw a choice set size uniformly between 2 and 20, then pick a uniformly random choice set of that size. We then measure the mean-squared error in inferred utilities of observed items (fixing the utility of the first item to 0 for identification). When applying Laplacian regularization, we use the corresponding value of $\lambda$ used to generate the data (in real-world data, this needs to be selected through cross-validation). We train the models for 100 epochs.
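The data-generating process for this experiment can be sketched as follows, assuming the smoothness prior is a Gaussian whose precision combines an identity term with $\lambda$ times the graph Laplacian (the exact prior is specified in Section 4.1.1; this sketch may differ in details such as scaling):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_smooth_utilities(A, n_items, lam, sigma2=1.0):
    """Sample per-chooser utilities U (n x n_items) from a Gaussian with
    precision (1/sigma2) I + lam * L, so utilities correlate along edges.

    A: (n, n) symmetric adjacency matrix; lam: homophily strength.
    """
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A          # combinatorial Laplacian
    prec = np.eye(n) / sigma2 + lam * L
    cov = np.linalg.inv(prec)
    # One correlated sample per item; transpose to (choosers x items)
    return rng.multivariate_normal(np.zeros(n), cov, size=n_items).T

def simulate_choice(u, C):
    """One logit choice from set C given utilities u (Gumbel-max trick)."""
    C = np.asarray(C)
    g = rng.gumbel(size=len(C))
    return C[np.argmax(u[C] + g)]
```

Larger `lam` makes neighbors' utility vectors more similar, which is what lets each observed choice inform estimates for nearby choosers.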

In this best-case scenario, we need orders of magnitude fewer samples per chooser if we take advantage of preference correlations: with Laplacian regularization, estimation error with only one sample per chooser is lower than the estimation error with no regularization and 1000 samples per chooser. The stronger the homophily, the fewer observations are needed to achieve optimal performance, since a node’s neighbor’s choices are more informative.

### 6.2. Prediction performance comparison

We now evaluate our approaches on real-world choice data. In the style of semi-supervised learning, we use a subset of choosers for training and held-out choosers for validation and testing. This emulates a scenario where it is too expensive to gather data from everyone in the network or existing data is not available for all nodes (e.g., perhaps not all individuals have consented to choice data collection). We vary the fraction of training choosers from 0.1 to 0.8 in increments of 0.1, using half of the remaining choosers for validation and half for testing. We perform 8 independent sampling trials at each fraction in the election datasets and 64 in the smaller Friends and Family datasets.

As a baseline, we use standard logit models with no network information. For the election datasets, we use an MNL that uses county/precinct features to predict votes. This approach to modeling elections is common in political science (Dow & Endersby, Reference Dow and Endersby2004). For app-install, we use a simple logit. For app-usage, we use a CL with recency (as defined in Section 5.1). We then compare the three graph-based methods we propose to the baseline choice model: a GCN-augmented MNL (or CML), a Laplacian-regularized logit (or CL/MNL) with per-chooser utilities, and choice fraction propagation. Aside from propagation, we train the other methods with batch Rprop (Riedmiller & Braun, Reference Riedmiller and Braun1993), as implemented in PyTorch (Paszke et al., Reference Paszke, Gross, Massa, Lerer, Bradbury, Chanan and Chintala2019). For each dataset–model pair, we select the hyperparameters that result in the lowest validation loss in a grid search; we tested learning rates $10^{-3}, 10^{-2}, 10^{-1}$ and $L_2$ regularization strengths $10^{-5}$ , $10^{-4}$ , $10^{-3}$ , $10^{-2}$ , $10^{-1}$ (we also tested no $L_2$ regularization in the two app datasets). We similarly select Laplacian $\lambda$ using validation data from $10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}$ in the election datasets (in addition to these, we also test $10^0, 10^{-1}, 10^{-6}, 10^{-7}$ in the app datasets) and propagation $\rho$ from $0.1, 0.25, 0.5, 0.75, 1$ . The smaller hyperparameter ranges in the election datasets were used due to runtime constraints. We train the likelihood-based models for 100 epochs, or until the squared gradient magnitude falls below $10^{-8}$ . For propagation, we perform 256 iterations, breaking if the sum of squared differences between consecutive iterates falls below $10^{-8}$ . 
We note that we did not aggressively fine-tune the GCN beyond learning rate and $L_2$ regularization strength, since it has many more hyperparameters than our other approaches and is more expensive to train. Our GCN results should therefore be interpreted as the performance a discrete choice practitioner should expect to achieve in a reasonable amount of time using the model, which we believe is an important metric.
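As a sketch of the Laplacian-regularized objective, the following minimal NumPy version fits per-chooser utilities by gradient descent on NLL plus $\lambda \, \mathrm{tr}(U^{\top} L U)$ (the actual experiments use Rprop in PyTorch with the hyperparameters above; names here are ours):

```python
import numpy as np

def train_laplacian_logit(choices, L, n_choosers, n_items,
                          lam=1e-4, lr=0.05, epochs=500, tol=1e-8):
    """Fit per-chooser utilities U by minimizing NLL + lam * tr(U^T L U).

    choices: list of (chooser a, chosen item i, choice set C as a list).
    L: (n_choosers, n_choosers) graph Laplacian.
    """
    U = np.zeros((n_choosers, n_items))
    for _ in range(epochs):
        grad = 2 * lam * (L @ U)           # gradient of smoothness penalty
        for a, i, C in choices:
            u = U[a, C]
            p = np.exp(u - u.max())
            p /= p.sum()                   # softmax choice probabilities
            for idx, item in enumerate(C):
                grad[a, item] += p[idx] - (1.0 if item == i else 0.0)
        if np.sum(grad ** 2) < tol:        # stop on small squared gradient
            break
        U -= lr * grad
    return U
```

Even choosers with no observed choices acquire informative utilities, since the penalty pulls their rows toward those of their neighbors.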

In Figure 4, we show results of all four approaches on all five datasets. We evaluate the three likelihood-based methods using their test set NLL and use MRR (Tomlinson & Benson, Reference Tomlinson and Benson2021) to evaluate propagation. For one sample, MRR is defined as the relative position of the actual choice in the list of predictions in decreasing confidence order (where 0 is the beginning of the list and 1 is the end). We then report the mean MRR over the test set. In app-install, both Laplacian regularization and propagation improve prediction performance over the baseline logit model, and the advantage increases with the fraction of participants used for training (up to $6.8\%$ better MRR). However, the GCN performs worse than the logit in terms of likelihood and the same or worse in terms of MRR. In contrast, graph-based methods do not outperform a CL in app-usage. In the three election datasets, Laplacian-regularized MNL consistently outperforms MNL (with up to $2.6\%$ better MRR in us-election-2016; the margin in the California data is small but outside error bars), while the GCN performs on par with MNL in us-election-2016 and worse in the California datasets.
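For concreteness, the MRR metric as described above can be computed as follows (a minimal sketch; the handling of tied scores is our own assumption):

```python
import numpy as np

def mean_relative_rank(scores, choices):
    """Mean relative rank of the true choice (0 = front of the
    decreasing-confidence list, 1 = back). Lower is better.

    scores:  list of 1-D arrays, model confidence over each choice set.
    choices: list of indices of the chosen item within each set.
    """
    rrs = []
    for s, c in zip(scores, choices):
        rank = int(np.sum(s > s[c]))       # items ranked strictly ahead
        rrs.append(rank / (len(s) - 1))    # normalize to [0, 1]
    return float(np.mean(rrs))
```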

These results yield insight into the role networks play in different choice behaviors. In app-usage, we find no benefit from social network structure with any method. Instead, the recency feature appears to dominate, with propagation (which has no access to item features) performing much worse than the three models that do incorporate recency. This indicates that app usage is driven by individual habit rather than by external social factors. On the other hand, our results show that app *installation* has a strong social component: even simple Bluetooth proximity between friends provides a signal that they will install (but not necessarily use) similar apps. This finding highlights how combining a discrete choice approach with network data can illuminate the role social networks play in different choice behaviors. In the election data, especially ca-election-2016, even simple choice propagation performs remarkably well, despite *entirely ignoring demographic features*. This reveals that many of the important predictive demographic features (such as party affiliation, age, and ethnicity) are so strongly correlated over the adjacency network that we do not need any information about you to predict your vote: it suffices to know about your neighbors or your neighbors’ neighbors.

We also compare the runtime of each method. To measure runtime, each model was run on a 50-25-25 train-validation-test split of each dataset four times. Since the hyperparameters are not crucial for runtime measurements (especially because Rprop is not sensitive to initial learning rate as an adaptive method), we fixed the learning rate at 0.01, $L_2$ regularization strength at 0.001, Laplace regularization strength at 0.0001, and propagation $\rho$ at 0.5. For each trial, we trained and tested each model once, shuffling the order of models to avoid systematic bias due to caching. Laplacian regularization has very low overhead over CL/MNL, while GCN is up to $4\times$ slower in the smaller datasets (see Table 2). In the larger datasets, PyTorch’s built-in parallelism reduces this relative gap. Propagation is more than $10\times$ faster than the choice models in every dataset.

### 6.3. Facebook and Myspace communities in app-install

Given that we observed significant improvement in prediction performance in app-install, we take a closer look at the patterns learned by the Laplacian-regularized logit compared to the plain logit. In particular, the Facebook and Myspace apps were in the top 20 most-preferred apps under both models. Since these were competitor apps at the time,$^{7}$ we hypothesized that they might be popular among different groups of participants. This is exactly what we observe in the learned parameters of the Laplacian-regularized logit. Facebook and Myspace are in the top 10 highest-utility apps for 70 and 27 participants, respectively (out of 139 total; we refer to these sets as $F$ and $M$). Intriguingly, the overlap between $F$ and $M$ is only 3 participants. Moreover, looking at the Bluetooth interaction network, we find that the edge densities within each of $F$ and $M$ are more than twice as high as the density between them (Table 3), indicating they are true communities in the social network. In short, the Laplacian-regularized logit learns about two separate subcommunities, one in which Facebook is popular and one in which Myspace is popular.
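The within- versus between-community density comparison can be sketched as follows (a minimal version with hypothetical names; the actual group memberships come from the learned utilities):

```python
import itertools

def edge_densities(adj, F, M):
    """Within- and between-group edge densities for chooser sets F and M
    (as compared for the Facebook and Myspace communities in Table 3).

    adj: nested-list 0/1 adjacency matrix; F, M: lists of node indices.
    """
    def density(group_a, group_b):
        if group_a is group_b:
            # Unordered pairs within one group
            pairs = list(itertools.combinations(group_a, 2))
        else:
            # All cross-group pairs
            pairs = [(a, b) for a in group_a for b in group_b]
        return sum(adj[a][b] for a, b in pairs) / len(pairs)
    return density(F, F), density(M, M), density(F, M)
```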

### 6.4. Counterfactuals in the 2016 US election

One of the powerful uses of discrete choice models is applying them to counterfactual scenarios to predict what might happen under different choice sets [e.g., in assortment optimization (Rusmevichientong et al., Reference Rusmevichientong, Shen and Shmoys2010)]. For instance, we can use our models to make predictions about election outcomes if different candidates had been on ballots in 2016. However, we begin this exploration with a warning: making predictions from observational data is subject to *confounders*, unobserved factors that affected both who was on which ballot and how the states voted. For example, only Nevadans had the option to vote for “None of these candidates,” and Nevada is an outlier in a number of ways that are likely to impact voting, including its reliance on tourism, high level of diversity, and lack of income tax. This makes it less likely that the preferences of Nevadans for “None of these candidates” will neatly generalize to voters in other states. There are causal inference methods for managing confounding in discrete choice models; for instance, our county covariates act as regression controls (Tomlinson et al., Reference Tomlinson, Ugander and Benson2021). If those covariates fully described variation in county voting preferences, then the resulting choice models would be unbiased, even with confounding (Tomlinson et al., Reference Tomlinson, Ugander and Benson2021). However, we do not believe the covariates fully describe voting, since we can improve prediction by using regional or social correlations not captured by the county features. Nonetheless, examining our model’s counterfactual predictions is still instructive, demonstrating an application of choice models, providing insight into the model’s behavior, and motivating randomized experiments to test predictions about the effect of ballot changes.
We note that the MNL we use obeys IIA, preventing relative preferences for candidates changing within a particular county when choice sets change. However, since states contain many counties, they are mixtures of MNLs (which can violate IIA), so their outcomes can change under the model.
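A counterfactual prediction of this kind can be sketched as follows: each county's MNL assigns softmax probabilities over whatever candidates appear on its (counterfactual) ballot, and vote totals aggregate across counties (a simplified sketch with hypothetical names; the paper's model also uses county features and Laplacian smoothing):

```python
import numpy as np

def predicted_vote_counts(utilities, turnout, ballots):
    """Counterfactual vote totals under per-county MNLs.

    utilities: dict county -> dict candidate -> utility.
    turnout:   dict county -> number of voters.
    ballots:   dict county -> list of candidates on that county's
               (counterfactual) ballot.
    Within a county, relative preferences obey IIA when the ballot
    changes; aggregating over counties gives a mixture of MNLs, so
    state-level outcomes can still change.
    """
    totals = {}
    for county, C in ballots.items():
        u = np.array([utilities[county][c] for c in C])
        p = np.exp(u - u.max())
        p /= p.sum()                       # MNL choice probabilities
        for cand, share in zip(C, p):
            totals[cand] = totals.get(cand, 0.0) + turnout[county] * share
    return totals
```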

$^*$ Maine allocates Electoral College votes proportionally—we assume a 3-1 split.

A widespread narrative of the 2016 election is that third-party candidates cost Clinton the election by disproportionately taking votes from her (Chalabi, Reference Chalabi2016; Rothenberg, Reference Rothenberg2019). To test this hypothesis, we examine three counterfactual scenarios: (Scenario 1) all ballots have five options: Clinton, Trump, Johnson, Stein, and McMullin; (Scenario 2) ballots only list Clinton and Trump, and (Scenario 3) ballots are as they were in 2016, but “None of these candidates” is added to every ballot. For each scenario, we take the best (validation-selected) Laplacian-regularized MNL trained on 80% of counties from each of the 8 county sampling trials and average their vote count predictions. Maximum-likelihood outcomes under the model are shown in Table 4. We find no evidence to support the claim that third-party candidates hurt Clinton more than Trump. None of the scenarios changed the two major measures of outcome: Clinton maintained the popular vote advantage, while Trump carried the Electoral College. A few swing states change hands in the predictions. The model places more weight on “None of these candidates” than seems realistic (for instance, predicting it to be the plurality winner in Rhode Island), likely because training data is only available for this option in a single state, leading to confounding. We also note that under the true choice sets, the model’s maximum likelihood state outcomes are the same as in Scenarios 1 and 2. A more complete analysis would examine the full distribution of Electoral College outcomes rather than just the maximum likelihood outcome, but we leave such analysis for future work as it is not our main focus.

## 7. Discussion

As we have seen, social and geographic network structure can be very useful in modeling the choices of a group of connected individuals, since people tend to have preferences more similar to those of their network neighborhood than to those of distant strangers. Several explanations are possible for this phenomenon: people may be more likely to become friends with like-minded individuals (homophily), or trends may spread across existing friendships (contagion). Unfortunately, determining whether homophily or contagion is responsible for similar behavior among friends is notoriously difficult [and often impossible (Shalizi & Thomas, Reference Shalizi and Thomas2011)].

We saw poor performance from the GCN relative to the logit models—as we noted, there are many hyperparameters that could be fine-tuned to possibly improve this performance, although this might not be practical for nonexperts. Additionally, there are a host of other GNNs that could outperform GCNs in a choice task. Our contributions in this area are to demonstrate how GNN models can be adapted for networked choice problems and to encourage further exploration of such problems. However, our findings are consistent with several lines of recent work that show simple propagation-based methods outperforming GNNs (Huang et al., Reference Huang, He, Singh, Lim and Benson2020; Wu et al., Reference Wu, Souza, Zhang, Fifty, Yu and Weinberger2019; He et al., Reference He, Deng, Wang, Li, Zhang and Wang2020).

There are several interesting avenues for future work in graph-based methods for discrete choice. As we noted, much of the recent machine learning interest in discrete choice (Seshadri et al., Reference Seshadri, Peysakhovich and Ugander2019; Bower & Balzano, Reference Bower and Balzano2020; Rosenfeld et al., Reference Rosenfeld, Oshiba and Singer2020; Tomlinson & Benson, Reference Tomlinson and Benson2021) has revolved around incorporating context effects (violations of IIA). Combining our methods with such approaches could answer questions that are to our knowledge entirely unaddressed in the literature (and possibly even unasked): Do context effects have a social component? If so, what kinds of context effects? Can we improve contextual choice prediction with social structure (in terms of accuracy or sample complexity)? Another natural extension of our work is to use a weighted Laplacian when we have a weighted social network. In another direction, choice data could be studied as an extra signal for community detection in networks, building on our identification of the Facebook and Myspace communities in the Friends and Family data.

## Acknowledgments

This research was supported by ARO MURI, ARO Award W911NF19-1-0057, NSF CAREER Award IIS-2045555, and NSF DMS-EPSRC Award 2146079. We thank Marios Papachristou for helpful discussions.

## Competing interests

None.