Mixed Membership Stochastic Block Model

Introduction to MMSBM

What is MMSBM?

Mixed Membership Stochastic Block Model (MMSBM) is a probabilistic model for recommendation systems that:

  • Discovers latent groups in both users and items

  • Allows users and items to belong to multiple groups simultaneously

  • Models rating patterns between groups

  • Makes predictions based on group interactions

Why use MMSBM?

Traditional recommendation systems often assume users and items have fixed characteristics. However, in reality:

  • A user might watch action movies on weekends but documentaries during the week

  • A mountain can be both a climbing spot and a skiing destination

  • A book might appeal to both science fiction fans and mystery readers

MMSBM captures this flexibility by allowing “mixed memberships” - both users and items can belong partially to multiple groups.

Model Structure

Core Components

  1. User Groups (K groups):

    • Each user u has a membership vector \(\theta_u\)

    • \(\theta_{uk}\) = probability of user u belonging to group k

    • Example: A user might be 70% “action fan” and 30% “climbing movies fan”:

      user_memberships = {
          'action_group': 0.7,
          'climbing_group': 0.3
      }
      
  2. Item Groups (L groups):

    • Each item i has a membership vector \(\eta_i\)

    • \(\eta_{i\ell}\) = probability of item i belonging to group ℓ

    • Example: A movie might be 60% “action” and 40% “drama”:

      movie_memberships = {

      ‘action_group’: 0.6, ‘drama_group’: 0.4

      }

  3. Rating Patterns (for each group pair):

    • \(p_{k\ell}(r)\) = probability of rating r between user group k and item group ℓ

    • Captures how different user groups tend to rate different item groups

    • Example: “action fans” might tend to rate “action movies” highly:

      rating_patterns = {
          ('action_fan', 'action_movie'): {
              5: 0.6,  # 60% chance of 5-star rating
              4: 0.3,  # 30% chance of 4-star rating
              3: 0.1   # 10% chance of 3-star rating
          }
      }
      

Mathematical Formulation

The Rating Process

When a user rates an item, the model assumes:

  1. The user acts as a member of some group k

  2. The item acts as a member of some group ℓ

  3. The rating follows the pattern for groups k and ℓ

This gives us the probability of a rating r:

\[Pr[r_{ui} = r] = \sum_{k,\ell} \theta_{uk} \eta_{i\ell} p_{k\ell}(r)\]

Implementation:

def prod_dist(x, theta, eta, pr):
    return (theta[x[0]][:, np.newaxis, np.newaxis] *
            (eta[x[1], :][:, np.newaxis] * pr)
           ).sum(axis=0).sum(axis=0)

Constraints

All vectors must represent valid probability distributions:

  1. User memberships sum to 1:

    \[\sum_k \theta_{uk} = 1 \quad \text{for all } u\]
  2. Item memberships sum to 1:

    \[\sum_\ell \eta_{i\ell} = 1 \quad \text{for all } i\]
  3. Rating probabilities sum to 1:

    \[\sum_r p_{k\ell}(r) = 1 \quad \text{for all } k,\ell\]

Implementation:

def normalize_with_d(df, d):
    """Normalize user/item memberships"""
    return df / [np.repeat(max(len(a), 1), df.shape[1])
                for a in d.values()]

def normalize_with_self(df):
    """Normalize 3D arrays"""
    temp = df.reshape((df.shape[0] * df.shape[1], df.shape[2]))
    return (
            temp / (np.where(temp.sum(axis=1) == 0, 1, temp.sum(axis=1)))[:, np.newaxis]
    ).reshape(df.shape)

Model Training

The training process has two main components:

  1. Multiple Sampling Runs:

    • Start with different random initializations

    • Run EM algorithm on each

    • Choose best result based on accuracy

  2. Cross-Validation Option:

    • Split data into folds

    • Train on each subset

    • Test on held-out data

Implementation:

def cv_fit(self, data, folds=5):
    """Cross-validated model fitting"""
    accuracies = []
    for f in range(folds):
        train, test = self._split_data(data, f)
        self.fit(train)
        acc = self.score(test)['accuracy']
        accuracies.append(acc)
    return accuracies

Making Predictions

To predict ratings for new user-item pairs:

  1. Use learned group memberships (\(\theta\), \(\eta\))

  2. Use learned rating patterns (\(p\))

  3. Compute expected rating using the probability formula

Implementation:

def predict(self, data):
    self._check_is_fitted()

    test = self.data_handler.format_test_data(data)
    self.test = test

    # Get the info for all the runs
    rats = [
        self.em.compute_prod_dist(test, a["theta"], a["eta"], a["pr"])
        for a in self.results
    ]
    prs = np.array([a["pr"] for a in self.results])
    likelihoods = np.array([a["likelihood"] for a in self.results])
    thetas = np.array([a["theta"] for a in self.results])
    etas = np.array([a["eta"] for a in self.results])

    best = self.choose_best_run(rats)

    # Store the cleaned best objects
    self.theta = self.data_handler.return_theta_indices(thetas[best])
    self.eta = self.data_handler.return_eta_indices(etas[best])
    self.pr = self.data_handler.return_pr_indices(prs[best])
    self.likelihood = likelihoods[best]

    # The prediction matrix is the average of all runs
    self.prediction_matrix = np.array(rats).mean(axis=0)

    return self.prediction_matrix

Interpreting Results

The model provides several insights:

  1. User Groups:

    • Which types of users exist?

    • How do different users combine these types?

    • Example: Discover “weekend watchers” vs “daily viewers”

  2. Item Groups:

    • What are the natural categories?

    • How do items span multiple categories?

    • Example: Find movies that bridge genres

  3. Rating Patterns:

    • How do different user types rate different item types?

    • Which combinations lead to high/low ratings?

    • Example: Understand what drives high ratings

Example Analysis:

# Get group memberships for a user
user_groups = model.theta.loc['user_123']
print("User group memberships:", user_groups)

# Get item categorization
item_groups = model.eta.loc['item_456']
print("Item group memberships:", item_groups)

# Look at rating patterns
rating_probs = model.pr
print("Rating patterns:", rating_probs)

Want to Learn More?