How to Learn when Data Gradually Reacts to Your Model

Paper by Zachary Izzo, James Zou, Lexing Ying

Presentation by Kimia Kazemian

31/10/2022

what is this about?

  • Problem: training ML models in the performative setting, where deploying a model changes the data distribution.
  • Goal: minimize the performative risk: the model's loss on the distribution it induces.
  • Previous work: assumes the induced data distribution depends only on the currently deployed model.
  • Too simplistic? In practice the distribution also depends on the “state”, i.e. the previous distribution.
  • Example: credit scoring, where the population adapts to a new model gradually rather than all at once.
  • Contribution: a meta-algorithm (stateful PerfGD) for this stateful setting.

Problem setup

  • $D : \Theta \times M(Z) \to M(Z)$: the stateful distribution map.
  • $\Theta$: set of admissible model parameters
  • $Z$: data sample space.
  • $M(Z)$: set of probability measures on $Z$.
  • Assume $\rho_t$ belongs to a parametric family with parameter $\mu_t$ and density $p(\cdot,\mu_t)$.
  • Dynamics: $\rho_t = D(\theta_t, \rho_{t-1})$, or at the parameter level, $\mu_t = m(\theta_t, \mu_{t-1})$.
  • Long-term distribution: $\rho_*(\theta) = \underset{t\to\infty}{\lim}\, \rho_t$ where $\theta_t \equiv \theta$ for all $t$; equivalently $\mu_*(\theta) = \underset{k\to\infty}{\lim}\, m^{(k)}(\theta, \mu_0)$, with $m^{(k)}$ denoting $k$ applications of $m(\theta,\cdot)$.
  • Target: $\theta_{OPT} = \underset{\theta \in \Theta}{\operatorname{argmin}}\, \mathcal{L}^*(\theta)$, where $\mathcal{L}^*(\theta)$ is the long-term performative risk, i.e. the expected loss on $\rho_*(\theta)$.
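To make the state dependence concrete, here is a toy instance (the linear mean dynamics and the constants `delta`, `a`, `b` are illustrative assumptions, not from the paper): holding $\theta$ fixed, the state $\mu_t$ contracts geometrically to a fixed point $\mu_*(\theta)$, matching the definition of $\rho_*(\theta)$ above.

```python
import numpy as np

# Toy state-dependent mean dynamics (illustrative, not the paper's
# exact setting): mu_t = m(theta_t, mu_{t-1}) with
#   m(theta, mu) = (1 - delta) * mu + delta * (a * theta + b).
# For a fixed theta, mu_t converges geometrically to the fixed point
# mu_*(theta) = a * theta + b.
delta, a, b = 0.3, -2.0, 1.0

def m(theta, mu):
    return (1 - delta) * mu + delta * (a * theta + b)

theta = 0.5           # deploy one fixed model
mu = 10.0             # initial state
for _ in range(100):  # iterate mu_t = m(theta, mu_{t-1})
    mu = m(theta, mu)

print(mu, a * theta + b)  # mu has converged to the fixed point
```

Note that a single deployment of $\theta$ does not reveal $\mu_*(\theta)$: after one round the observed state still carries memory of $\mu_{t-1}$, which is exactly what makes the stateful setting harder than the stateless one.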

Problem setup (cont.)

  • Assume $\theta,\mu \in \mathbb{R}^d$
  • $\partial_i f$ denotes derivative wrt $i$th argument
  • $\psi_t = [\theta_t^\top, \mu_t^\top]^\top$ denotes the full input to $m$ at time $t$; for any collection of vectors $v_i$, $v_{i:j}$ denotes the matrix with columns $v_i, v_{i+1}, \ldots, v_j$.

How do we do it?

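The method slides here were largely visual. A minimal numerical sketch of the central computation (the toy linear map `m` and the direct finite differencing below are assumptions for illustration; the paper's algorithm instead estimates these Jacobians from the deployment history): at the fixed point $\mu_* = m(\theta, \mu_*)$, the implicit function theorem gives $d\mu_*/d\theta = (I - \partial_2 m)^{-1}\,\partial_1 m$, the long-term sensitivity that a stateful-PerfGD-style update descends along.

```python
import numpy as np

# Sketch of the long-term Jacobian d mu_* / d theta (assumption: a
# known linear map m for checking; in practice only noisy evaluations
# of m along the deployment trajectory are available).
d = 2
A = 0.4 * np.eye(d)                  # contraction in mu: ||dm/dmu|| < 1
B = np.array([[1.0, 0.2], [0.0, 1.0]])

def m(theta, mu):                    # mu_t = m(theta_t, mu_{t-1})
    return A @ mu + B @ theta

def finite_diff(f, x, eps=1e-5):
    """Jacobian of f at x via central finite differences."""
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d); e[i] = eps
        J[:, i] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

theta = np.array([0.3, -0.1])
mu = np.linalg.solve(np.eye(d) - A, B @ theta)  # fixed point mu_*(theta)

d1m = finite_diff(lambda t: m(t, mu), theta)    # partial_1 m (wrt theta)
d2m = finite_diff(lambda u: m(theta, u), mu)    # partial_2 m (wrt mu)
# Implicit function theorem at mu_* = m(theta, mu_*):
#   d mu_* / d theta = (I - partial_2 m)^{-1} partial_1 m
dmu_star = np.linalg.solve(np.eye(d) - d2m, d1m)

# For this linear m the exact answer is (I - A)^{-1} B.
exact = np.linalg.solve(np.eye(d) - A, B)
print(np.max(np.abs(dmu_star - exact)))
```

This long-term Jacobian then enters the chain rule for $\nabla \mathcal{L}^*(\theta)$, combined with the explicit dependence of the loss on $\theta$.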

low-dimensional statistics?

  • Observation: individuals modify their behavior based on a low-dimensional proxy, such as a credit score or classification probability

  • How can we apply stateful PerfGD for a high-dimensional model without incurring a large error due to the high dimension?

  • $\mu_t = m(\theta_t,\mu_{t-1}) = \bar{m}(s(\theta_t,\mu_{t-1}),\mu_{t-1})$, where $s(\theta,\mu) \in \mathbb{R}^{d_s}$ is the low-dimensional statistic and $d_s \ll \dim(\theta)$

  • Chain rule: $\partial_1 m(\theta_t,\mu_{t-1}) = \partial_1 \bar{m}(s_t,\mu_{t-1})\,\partial_1 s(\theta_t,\mu_{t-1})$, where $s_t = s(\theta_t,\mu_{t-1})$; since $s$ is a known statistic, only the $d_s$-column factor $\partial_1 \bar{m}$ must be estimated from data
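A quick numeric check of this factorization (the maps `s` and `m_bar` and all dimensions below are hypothetical): with a scalar proxy, $\partial_1 \bar{m}$ has a single column, so it is far cheaper to estimate than the full $d_\mu \times \dim(\theta)$ Jacobian $\partial_1 m$.

```python
import numpy as np

# Toy check of m(theta, mu) = m_bar(s(theta, mu), mu) with a scalar
# proxy s (think: a credit-score threshold); all maps are illustrative.
d_theta, d_mu = 10, 3
w = np.linspace(1.0, 2.0, d_theta)

def s(theta, mu):       # low-dimensional statistic, d_s = 1
    return np.array([w @ theta + mu.sum()])

def m_bar(s_val, mu):   # population reacts to theta only through s
    return 0.5 * mu + np.tanh(s_val[0]) * np.ones(d_mu)

def m(theta, mu):
    return m_bar(s(theta, mu), mu)

theta = np.random.default_rng(0).normal(size=d_theta)
mu = np.zeros(d_mu)

# Chain rule: partial_1 m = partial_1 m_bar @ partial_1 s,
# shapes (3, 10) = (3, 1) @ (1, 10).
d1_mbar = (1 - np.tanh(s(theta, mu)[0]) ** 2) * np.ones((d_mu, 1))
d1_s = w.reshape(1, d_theta)
d1_m_chain = d1_mbar @ d1_s

# Compare against direct finite differences of m in theta.
eps = 1e-6
d1_m_fd = np.zeros((d_mu, d_theta))
for i in range(d_theta):
    e = np.zeros(d_theta); e[i] = eps
    d1_m_fd[:, i] = (m(theta + e, mu) - m(theta - e, mu)) / (2 * eps)

print(np.max(np.abs(d1_m_chain - d1_m_fd)))
```

The design point: the finite-difference estimation error scales with the number of columns being estimated, so working through $s$ replaces a $\dim(\theta)$-dimensional estimation problem with a $d_s$-dimensional one.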

experiments

spam classification:

Reference: Strategic classification

what else?

  • societal impacts: optimizing for the model's own induced distribution could possibly maximize a certain measure of negative externality for the population

  • future work: relaxing the paper's assumptions:

    • the state transitions are deterministic (a deterministic MDP)

    • the batch setting

Reference: Alternative microfoundations for strategic classification

fin