Mixed Membership Stochastic Block Model¶
Introduction to MMSBM¶
What is MMSBM?¶
Mixed Membership Stochastic Block Model (MMSBM) is a probabilistic model for recommendation systems that:
Discovers latent groups in both users and items
Allows users and items to belong to multiple groups simultaneously
Models rating patterns between groups
Makes predictions based on group interactions
Why use MMSBM?¶
Traditional recommendation systems often assume users and items have fixed characteristics. However, in reality:
A user might watch action movies on weekends but documentaries during the week
A mountain can be both a climbing spot and a skiing destination
A book might appeal to both science fiction fans and mystery readers
MMSBM captures this flexibility by allowing “mixed memberships” - both users and items can belong partially to multiple groups.
Model Structure¶
Core Components¶
User Groups (K groups):
Each user u has a membership vector \(\theta_u\)
\(\theta_{uk}\) = probability of user u belonging to group k
Example: A user might be 70% “action fan” and 30% “climbing movies fan”:
user_memberships = { 'action_group': 0.7, 'climbing_group': 0.3 }
Item Groups (L groups):
Each item i has a membership vector \(\eta_i\)
\(\eta_{i\ell}\) = probability of item i belonging to group ℓ
Example: A movie might be 60% “action” and 40% “drama”:
- movie_memberships = {
‘action_group’: 0.6, ‘drama_group’: 0.4
}
Rating Patterns (for each group pair):
\(p_{k\ell}(r)\) = probability of rating r between user group k and item group ℓ
Captures how different user groups tend to rate different item groups
Example: “action fans” might tend to rate “action movies” highly:
rating_patterns = { ('action_fan', 'action_movie'): { 5: 0.6, # 60% chance of 5-star rating 4: 0.3, # 30% chance of 4-star rating 3: 0.1 # 10% chance of 3-star rating } }
Mathematical Formulation¶
The Rating Process¶
When a user rates an item, the model assumes:
The user acts as a member of some group k
The item acts as a member of some group ℓ
The rating follows the pattern for groups k and ℓ
This gives us the probability of a rating r:
Implementation:
def prod_dist(x, theta, eta, pr):
return (theta[x[0]][:, np.newaxis, np.newaxis] *
(eta[x[1], :][:, np.newaxis] * pr)
).sum(axis=0).sum(axis=0)
Constraints¶
All vectors must represent valid probability distributions:
User memberships sum to 1:
\[\sum_k \theta_{uk} = 1 \quad \text{for all } u\]Item memberships sum to 1:
\[\sum_\ell \eta_{i\ell} = 1 \quad \text{for all } i\]Rating probabilities sum to 1:
\[\sum_r p_{k\ell}(r) = 1 \quad \text{for all } k,\ell\]
Implementation:
def normalize_with_d(df, d):
"""Normalize user/item memberships"""
return df / [np.repeat(max(len(a), 1), df.shape[1])
for a in d.values()]
def normalize_with_self(df):
"""Normalize 3D arrays"""
temp = df.reshape((df.shape[0] * df.shape[1], df.shape[2]))
return (
temp / (np.where(temp.sum(axis=1) == 0, 1, temp.sum(axis=1)))[:, np.newaxis]
).reshape(df.shape)
Model Training¶
The training process has two main components:
Multiple Sampling Runs:
Start with different random initializations
Run EM algorithm on each
Choose best result based on accuracy
Cross-Validation Option:
Split data into folds
Train on each subset
Test on held-out data
Implementation:
def cv_fit(self, data, folds=5):
"""Cross-validated model fitting"""
accuracies = []
for f in range(folds):
train, test = self._split_data(data, f)
self.fit(train)
acc = self.score(test)['accuracy']
accuracies.append(acc)
return accuracies
Making Predictions¶
To predict ratings for new user-item pairs:
Use learned group memberships (\(\theta\), \(\eta\))
Use learned rating patterns (\(p\))
Compute expected rating using the probability formula
Implementation:
def predict(self, data):
self._check_is_fitted()
test = self.data_handler.format_test_data(data)
self.test = test
# Get the info for all the runs
rats = [
self.em.compute_prod_dist(test, a["theta"], a["eta"], a["pr"])
for a in self.results
]
prs = np.array([a["pr"] for a in self.results])
likelihoods = np.array([a["likelihood"] for a in self.results])
thetas = np.array([a["theta"] for a in self.results])
etas = np.array([a["eta"] for a in self.results])
best = self.choose_best_run(rats)
# Store the cleaned best objects
self.theta = self.data_handler.return_theta_indices(thetas[best])
self.eta = self.data_handler.return_eta_indices(etas[best])
self.pr = self.data_handler.return_pr_indices(prs[best])
self.likelihood = likelihoods[best]
# The prediction matrix is the average of all runs
self.prediction_matrix = np.array(rats).mean(axis=0)
return self.prediction_matrix
Interpreting Results¶
The model provides several insights:
User Groups:
Which types of users exist?
How do different users combine these types?
Example: Discover “weekend watchers” vs “daily viewers”
Item Groups:
What are the natural categories?
How do items span multiple categories?
Example: Find movies that bridge genres
Rating Patterns:
How do different user types rate different item types?
Which combinations lead to high/low ratings?
Example: Understand what drives high ratings
Example Analysis:
# Get group memberships for a user
user_groups = model.theta.loc['user_123']
print("User group memberships:", user_groups)
# Get item categorization
item_groups = model.eta.loc['item_456']
print("Item group memberships:", item_groups)
# Look at rating patterns
rating_probs = model.pr
print("Rating patterns:", rating_probs)
Want to Learn More?¶
See Expectation-Maximization Algorithm for optimization details
Check Quick Start Guide for practical examples
Look at ../../api/modules for API details