MongoDB with Spring Boot: Production Data Modeling, Aggregations & Performance Guide (2026)
A complete production guide covering MongoDB document modeling patterns, embedded vs referenced design, compound indexing with the ESR rule, aggregation pipeline in Java, multi-document transactions, change streams, sharding, and Atlas operations.
1. When to Choose MongoDB
| Feature | PostgreSQL | MongoDB | Cassandra |
|---|---|---|---|
| Schema flexibility | Rigid | ✅ Dynamic | Moderate |
| Hierarchical data | Joins required | ✅ Native nesting | Manual denorm |
| ACID transactions | ✅ Full ACID | ✅ Multi-doc (4.0+) | ❌ LWT only |
| Write throughput | Moderate | ✅ High | ✅ Very high |
| Aggregation | ✅ SQL GROUP BY | ✅ Pipeline | ❌ Limited |
Use MongoDB for: product catalogs with varying attributes, content management, user activity logs, IoT time-series, mobile app backends with evolving schemas, and any use case where you query by a primary key and need the full nested document in one request.
2. Spring Boot Setup
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-mongodb</artifactId>
</dependency>
# application.yml
spring:
  data:
    mongodb:
      uri: mongodb+srv://user:password@cluster.mongodb.net/mydb?retryWrites=true&w=majority
      # Optional: append &readPreference=secondaryPreferred to route reads to secondaries for throughput.
      # Secondary reads can return stale data, and multi-document transactions must read from the primary.
3. Document Modeling: Embedded vs Referenced
Rule of thumb: Embed data that is always accessed together and has a bounded/small cardinality. Reference data that has an independent lifecycle, high cardinality, or is accessed separately.
// DON'T: push follower IDs into user document (unbounded growth, 16MB limit)
{
  "_id": "user123",
  "name": "Alice",
  "followerIds": ["u1","u2","u3",...,"u99999"] // grows to millions!
}
@Document(collection = "products")
public class Product {
    @Id
    private String id; // MongoDB ObjectId (time-sortable, unique)
    @Indexed(unique = true)
    private String sku;
    private String name;
    private String description;
    private String category; // @Indexed for facets

    // EMBED: variants are few (<20), always fetched with product
    private List<ProductVariant> variants; // {color, size, price, stock}

    // REFERENCE: reviews are many (unbounded) and accessed separately.
    // Do not hold a @DBRef list of reviews here: it would itself be an unbounded array.
    // Query the reviews collection by productId instead (see Review below).

    // Metadata
    @CreatedDate
    private Instant createdAt;
    @LastModifiedDate
    private Instant updatedAt;
}
// Separate reviews collection, queried independently and paginated
@Document(collection = "reviews")
public class Review {
    @Id private String id;
    @Indexed private String productId; // FK by convention
    private String userId;
    private int rating;
    private String comment;
    @CreatedDate private Instant createdAt;
}
4. Indexing: ESR Rule, TTL, Partial, Sparse
ESR Rule: For compound indexes, order fields as Equality → Sort → Range. This maximizes index usage and minimizes in-memory sorts.
// Compound index following ESR rule for: category=electronics, sort by price, range on stock
@CompoundIndex(def = "{'category': 1, 'price': 1, 'stock': 1}", name = "category_price_stock")
@Document(collection = "products")
public class Product { ... }

// TTL index: auto-delete OTP documents after 5 minutes
@Document(collection = "otps")
public class Otp {
    @Id private String id;
    @Indexed(expireAfterSeconds = 300) // TTL index
    private Date createdAt;
    private String code;
    private String userId;
}
// Partial index: only index active products (smaller index = faster)
// Must be done programmatically with MongoTemplate.
// Note: MongoDB rejects an index that sets both sparse and a partial filter, so sparse() is not used here.
mongoTemplate.indexOps(Product.class).ensureIndex(
    new Index("sku", Sort.Direction.ASC)
        .named("active_sku_idx")
        .partial(PartialIndexFilter.of(Criteria.where("status").is("ACTIVE")))
);
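To make the ESR ordering from this section concrete, here is a minimal plain-Java sketch that assembles a compound-index definition string in Equality, Sort, Range order. The class name EsrIndexDef and its build method are hypothetical helpers, not part of Spring Data or the MongoDB driver, and the sketch assumes ascending order on every field.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds an index definition with fields ordered Equality, Sort, Range. */
final class EsrIndexDef {

    private EsrIndexDef() {}

    /**
     * @param equality fields matched with exact values (e.g. category)
     * @param sort     fields used in the sort clause, in sort order
     * @param range    fields filtered with $gt/$lt-style bounds
     */
    static String build(List<String> equality, List<String> sort, List<String> range) {
        // Concatenate the three groups in ESR order.
        List<String> ordered = new ArrayList<>();
        ordered.addAll(equality);
        ordered.addAll(sort);
        ordered.addAll(range);

        // Render each field as an ascending key (simplification: no descending keys).
        List<String> parts = new ArrayList<>();
        for (String field : ordered) {
            parts.add("'" + field + "': 1");
        }
        return "{" + String.join(", ", parts) + "}";
    }
}
```

Calling EsrIndexDef.build(List.of("category"), List.of("price"), List.of("stock")) yields the same definition string used in the @CompoundIndex example above. The point of the helper is only to show that the caller classifies each field by how the query uses it, not by its position in the document.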
5. Aggregation Pipeline in Java
@Service
public class ProductAggregationService {
    @Autowired private MongoTemplate mongoTemplate;

    // Returns up to 50 rows, one per product, ordered by revenue (from and to are both inclusive days)
    public List<CategorySalesReport> getSalesReport(String category, LocalDate from, LocalDate to) {
        TypedAggregation<Order> agg = Aggregation.newAggregation(Order.class,
            // Stage 1: $match FIRST, filter before any computation
            Aggregation.match(
                Criteria.where("category").is(category)
                    // half-open range so that the whole "to" day is included
                    .and("createdAt").gte(from.atStartOfDay()).lt(to.plusDays(1).atStartOfDay())
                    .and("status").is("COMPLETED")),
            // Stage 2: $group, sum revenue and count orders per product
            Aggregation.group("productId")
                .sum("amount").as("totalRevenue")
                .count().as("orderCount")
                .avg("amount").as("avgOrderValue"),
            // Stage 3: $lookup, join with products collection
            // (after $group, _id holds productId; it must have the same type as products._id)
            Aggregation.lookup("products", "_id", "_id", "product"),
            Aggregation.unwind("product"),
            // Stage 4: $sort, top selling products first
            Aggregation.sort(Sort.by(Sort.Direction.DESC, "totalRevenue")),
            Aggregation.limit(50),
            // Stage 5: $project, shape output
            Aggregation.project("orderCount", "totalRevenue", "avgOrderValue")
                .and("product.name").as("productName")
        );
        // One result row per product; the original findFirst() silently dropped all but the top row
        return mongoTemplate.aggregate(agg, CategorySalesReport.class).getMappedResults();
    }
}
6. Spring Data MongoDB: Repository & MongoTemplate
// MongoRepository for simple CRUD
public interface ProductRepository extends MongoRepository<Product, String> {
    // Derived queries
    List<Product> findByCategoryAndPriceLessThan(String category, double maxPrice);

    @Query("{ 'category': ?0, 'variants.stock': { $gt: 0 } }")
    Page<Product> findAvailableByCategory(String category, Pageable pageable);
}
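// MongoTemplate for complex Criteria
@Service
public class ProductSearchService {
    @Autowired private MongoTemplate mongoTemplate;

    public Page<Product> search(ProductFilter filter, Pageable pageable) {
        // One Criteria per field: calling and("price") twice on a single chain
        // throws InvalidMongoDbApiUsageException when the query is built.
        List<Criteria> filters = new ArrayList<>();
        if (filter.getCategory() != null)
            filters.add(Criteria.where("category").is(filter.getCategory()));
        if (filter.getMinPrice() != null || filter.getMaxPrice() != null) {
            Criteria price = Criteria.where("price");
            if (filter.getMinPrice() != null) price.gte(filter.getMinPrice());
            if (filter.getMaxPrice() != null) price.lte(filter.getMaxPrice());
            filters.add(price);
        }
        if (filter.getKeyword() != null) // quote user input so it is matched literally; "i" = case-insensitive
            filters.add(Criteria.where("name").regex(Pattern.quote(filter.getKeyword()), "i"));

        Criteria criteria = filters.isEmpty()
            ? new Criteria()
            : new Criteria().andOperator(filters.toArray(new Criteria[0]));
        Query query = Query.query(criteria)
            .with(pageable)
            .with(Sort.by("createdAt").descending());
        List<Product> results = mongoTemplate.find(query, Product.class);
        long count = mongoTemplate.count(Query.query(criteria), Product.class);
        return new PageImpl<>(results, pageable, count);
    }
}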
7. Multi-Document Transactions
@Configuration
public class MongoConfig {
    @Bean
    public MongoTransactionManager transactionManager(MongoDatabaseFactory dbFactory) {
        return new MongoTransactionManager(dbFactory);
    }
}
@Service
public class OrderService {
    @Autowired private ProductRepository productRepository;
    @Autowired private OrderRepository orderRepository; // assumed to exist alongside ProductRepository

    @Transactional // ACID across multiple collections; requires a replica set
    public Order placeOrder(OrderRequest request) {
        // 1. Check and deduct inventory inside the transaction
        Product product = productRepository.findById(request.getProductId())
            .orElseThrow(() -> new ProductNotFoundException(request.getProductId()));
        if (product.getStock() < request.getQuantity()) {
            throw new InsufficientStockException();
        }
        product.setStock(product.getStock() - request.getQuantity());
        productRepository.save(product);

        // 2. Create order record
        Order order = Order.builder()
            .productId(request.getProductId())
            .userId(request.getUserId())
            .quantity(request.getQuantity())
            .amount(product.getPrice() * request.getQuantity())
            .status("PENDING")
            .build();
        // An unchecked exception thrown anywhere in this method rolls back both writes
        return orderRepository.save(order);
    }
}
8. Change Streams: Real-Time Events
@Component
public class OrderChangeStreamListener {
    // changeStream(...) returning a Flux belongs to ReactiveMongoTemplate
    // (spring-boot-starter-data-mongodb-reactive), not to MongoTemplate
    @Autowired private ReactiveMongoTemplate reactiveMongoTemplate;
    @Autowired private EventPublisher eventPublisher;

    @PostConstruct
    public void startListening() {
        ChangeStreamOptions.ChangeStreamOptionsBuilder builder = ChangeStreamOptions.builder()
            .filter(Aggregation.newAggregation(
                Aggregation.match(Criteria.where("operationType").in("insert", "update"))))
            .returnFullDocumentOnUpdate(); // without this, getBody() is null for update events
        BsonValue lastToken = loadLastResumeToken(); // application-specific helper; null on first start
        if (lastToken != null) {
            builder.resumeAfter(lastToken); // resume after restart
        }
        Flux<ChangeStreamEvent<Order>> stream = reactiveMongoTemplate.changeStream(
            "orders", builder.build(), Order.class);
        stream.subscribe(event -> {
            Order order = event.getBody();
            if ("insert".equals(event.getOperationType().getValue())) {
                eventPublisher.publish(new OrderCreatedEvent(order));
            }
            // Persist the token only after the event was handled, so a crash replays it instead of losing it
            saveResumeToken(event.getResumeToken()); // application-specific helper
        });
    }
}
10. Production Operations & Atlas
| Area | Action | Tool |
|---|---|---|
| Slow queries | Enable profiler level 1 with slowms=100 (level 2 records every operation, not just slow ones) | db.setProfilingLevel(1, {slowms: 100}) |
| Query explain | Check executionStats for index usage | cursor.explain("executionStats") |
| Index advisor | Atlas Performance Advisor auto-suggests indexes | Atlas UI |
| Backups | Atlas continuous backup with point-in-time restore | Atlas / mongodump |
11. Interview Questions & Checklist
A: Embed the first ~5 comments in the post document for fast display (no second query). Store all comments in a separate comments collection with a postId index for paginated loading of full comment threads. This pattern — "subset pattern" — balances read performance for the common case (show post + preview comments) and scalability for the edge case (post with 10k comments).
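The subset pattern described in this answer can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class SubsetPatternSketch, the Post stand-in, and the preview size of 5 are assumptions for illustration, with a Map playing the role of the separate comments collection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory illustration of the subset pattern: bounded embedded preview plus a full separate store. */
final class SubsetPatternSketch {

    static final int PREVIEW_SIZE = 5; // assumed preview size, mirroring "the first ~5 comments"

    /** Stand-in for the post document: holds only a bounded preview of comments. */
    static final class Post {
        final String id;
        final Deque<String> previewComments = new ArrayDeque<>();
        Post(String id) { this.id = id; }
    }

    /** Stand-in for the separate comments collection, keyed by postId. */
    private final Map<String, List<String>> commentsByPostId = new HashMap<>();

    void addComment(Post post, String comment) {
        // Always write to the full store.
        commentsByPostId.computeIfAbsent(post.id, k -> new ArrayList<>()).add(comment);
        // Keep the embedded preview bounded: once full, it stops growing.
        if (post.previewComments.size() < PREVIEW_SIZE) {
            post.previewComments.addLast(comment);
        }
    }

    /** Paginated read from the full store (page is zero-based). */
    List<String> page(String postId, int page, int size) {
        List<String> all = commentsByPostId.getOrDefault(postId, List.of());
        int from = Math.min(page * size, all.size());
        int to = Math.min(from + size, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

In MongoDB itself, a bounded embedded array of this kind could be maintained with $push combined with $each and $slice, while the full thread lives in the comments collection behind the postId index.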
- Always use replica set (required for transactions)
- Explicit schema validation via $jsonSchema
- Avoid unbounded arrays
- Follow ESR rule for compound indexes
- Use TTL index for session/OTP documents
- $match first in every aggregation pipeline
- Test explain("executionStats") for all queries
- Set maxTimeMS on all queries
- Use change streams for event-driven sync
- Atlas Performance Advisor in production
12. At BRAC IT: MongoDB for Audit Logs and Analytics
At BRAC IT we use MongoDB for two purposes: audit logging (every loan state transition) and analytics pre-aggregation (daily portfolio risk summaries). We chose MongoDB for these use cases over PostgreSQL for three reasons: our audit event schema evolves frequently as we add new event types and regulatory requirements, individual audit documents are self-contained (no joins needed), and write throughput is significantly higher than PostgreSQL for our insert-heavy workload.
Our audit collection stores one document per event. Each document is timestamped, tagged with the user, service, correlation ID, and contains the full before/after state of the entity:
{
  "_id": ObjectId("..."),
  "eventType": "LOAN_STATUS_CHANGED",
  "correlationId": "a3f4b2c1-...",
  "timestamp": ISODate("2026-04-28T10:32:00Z"),
  "actorId": "officer-uuid-123",
  "actorType": "LOAN_OFFICER",
  "entityId": "loan-uuid-456",
  "before": { "status": "PENDING", "assignedTo": null },
  "after": { "status": "APPROVED", "assignedTo": "officer-uuid-123" },
  "metadata": {
    "serviceVersion": "2.4.1",
    "hostId": "payment-service-pod-7d8f",
    "ipAddress": "10.0.1.45"
  }
}
This schema has evolved 14 times in three years. Adding a new field to "metadata" in MongoDB requires zero migration: new documents have the field, old documents do not. A @Document class in Spring Data simply adds the new field as nullable and it starts appearing in new documents immediately. In PostgreSQL, each of those changes would be an ALTER TABLE migration on a table with 50 million rows, plus a backfill whenever existing rows need a computed value.
For analytics, we use MongoDB's aggregation pipeline to pre-compute daily portfolio summaries and store them in a separate "snapshots" collection. Dashboards query the snapshot collection (millisecond response) rather than aggregating 50 million audit events on the fly (minutes). A scheduled job runs nightly:
// Static imports assumed: Aggregation.* and Criteria.where
@Scheduled(cron = "0 30 2 * * *") // 2:30 AM daily
public void computeDailySnapshot() {
    LocalDate today = LocalDate.now();
    LocalDate yesterday = today.minusDays(1);
    List<AggregationOperation> pipeline = List.of(
        // Half-open window: without the upper bound, events from 00:00 to 02:30 today would be counted too
        match(where("eventType").is("LOAN_DISBURSED")
            .and("timestamp").gte(yesterday.atStartOfDay()).lt(today.atStartOfDay())),
        group("loanPurpose")
            .count().as("totalLoans")
            .sum("amount").as("totalAmount")
            .avg("amount").as("avgAmount"),
        project("totalLoans", "totalAmount", "avgAmount")
            .and("_id").as("loanPurpose")
    );
    List<DailySnapshot> snapshots =
        mongoTemplate.aggregate(newAggregation(pipeline),
            "audit_events", DailySnapshot.class).getMappedResults();
    snapshotRepository.saveAll(snapshots);
}
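A nightly job like this one depends on getting the "yesterday" window exactly right. The following is a minimal sketch, assuming a hypothetical DayWindow helper and an explicit time zone, of a half-open window [start, end) that covers one calendar day without overlapping the next.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

/** Hypothetical helper: half-open time window [start, end) covering one calendar day in a zone. */
final class DayWindow {
    final Instant start; // inclusive
    final Instant end;   // exclusive

    private DayWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    static DayWindow of(LocalDate day, ZoneId zone) {
        // Midnight at the start of the day, and midnight at the start of the following day.
        Instant start = day.atStartOfDay(zone).toInstant();
        Instant end = day.plusDays(1).atStartOfDay(zone).toInstant();
        return new DayWindow(start, end);
    }

    boolean contains(Instant t) {
        return !t.isBefore(start) && t.isBefore(end);
    }
}
```

With Spring Data's Criteria, the two bounds would map to gte(window.start) and lt(window.end). Using an exclusive upper bound means consecutive daily windows never count the same event twice, and passing the zone explicitly avoids depending on the server's default time zone.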
13. Schema Design Anti-Patterns We Learned the Hard Way
Three MongoDB schema mistakes we made in production and how we fixed them:
Anti-pattern 1: Unbounded arrays. We initially embedded all loan transactions inside the loan document. After 18 months, some loan documents had grown to 16 MB, MongoDB's document size limit. Writes started failing with "document too large" errors in the middle of the night. The fix: extract transactions into a separate "loan_transactions" collection and reference by loan ID. Any array that can grow without bound belongs in a separate collection, not embedded.
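A rough budget calculation shows how quickly an embedded array can reach the limit. This is a back-of-the-envelope sketch only: the class DocumentSizeBudget is hypothetical, the byte sizes passed in are assumptions you would measure for your own documents, and the estimate ignores per-element BSON overhead.

```java
/** Hypothetical helper: estimates how many embedded items fit under the BSON document size limit. */
final class DocumentSizeBudget {

    /** MongoDB's maximum BSON document size: 16 MiB. */
    static final long MAX_BSON_BYTES = 16L * 1024 * 1024;

    private DocumentSizeBudget() {}

    /**
     * @param baseDocBytes size of the document without the embedded array
     * @param avgItemBytes average size of one embedded item
     * @return how many items fit before the document would exceed the limit
     */
    static long maxEmbeddedItems(long baseDocBytes, long avgItemBytes) {
        if (avgItemBytes <= 0) {
            throw new IllegalArgumentException("avgItemBytes must be positive");
        }
        long available = MAX_BSON_BYTES - baseDocBytes;
        return Math.max(0, available / avgItemBytes);
    }
}
```

For example, with an assumed 2 KiB base document and 512-byte items, maxEmbeddedItems(2048, 512) gives 32764 items. A long-lived record that accumulates entries every day can get there sooner than expected, and performance usually degrades well before the hard limit because every update rewrites a large document.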
Anti-pattern 2: Over-normalisation. Coming from a relational background, our first schema had loan documents referencing borrower documents by ID, with separate collections for addresses, guarantors, and collateral. Every query required $lookup (MongoDB's join equivalent). Performance was poor: $lookup queries the joined collection once for every input document, which is expensive on large collections and very slow whenever the foreign field is not indexed. The fix: embed small, stable, frequently-read sub-documents (borrower name, ID number, primary phone) directly in the loan document. Only use references for large or frequently-updated data.
Anti-pattern 3: Missing sparse indexes. We added a "referredBy" field to loan documents, and only 15% of loans have referrals. We created an index on referredBy expecting it to speed up referral reporting. The index was 85% null entries, taking up space without helping query performance. The fix: use sparse: true on indexes for fields that only a minority of documents contain. Sparse indexes skip documents where the indexed field is missing. A document that stores an explicit null is still indexed, so the field must be omitted rather than set to null.
// Sparse index for rarely-present field
db.loans.createIndex(
    { "referredBy": 1 },
    { sparse: true, name: "idx_referred_by_sparse" }
)

// In Spring Data MongoDB:
@Document(collection = "loans")
public class Loan {
    @Indexed(sparse = true)
    private String referredBy; // absent for most documents (Spring Data does not write null fields by default)
}
14. Change Streams in Production: Three Lessons Learned
We use MongoDB change streams to sync audit events to Elasticsearch in real time for full-text search. The architecture: a dedicated Spring Boot "sync service" subscribes to the audit_events collection change stream and indexes each event into Elasticsearch as it is inserted. The sync service has been running in production for 14 months with three hard-learned lessons:
Lesson 1: Persist your resume token. A change stream resumes from a position encoded in the resume token. If your sync service crashes or restarts and you have not persisted the resume token, you lose your position and will either miss events (if you resume from "now") or reprocess everything (if you restart from the beginning). We store the resume token in a MongoDB collection after processing each batch of events. On startup, we load the last saved token and resume from there.
Lesson 2: Handle primary elections gracefully. When MongoDB performs a primary election (during rolling upgrades, failovers, or routine maintenance), the change stream cursor becomes invalid. The subscription throws an exception. Your listener must catch this, discard the old cursor, wait a few seconds, and re-open the change stream from the last saved resume token.
Lesson 3: Make processing idempotent. Due to lesson 2 (cursor invalidation + resume), you may process the same event twice on reconnect. Your downstream system must handle duplicate events safely. Our Elasticsearch indexing is idempotent: we use the MongoDB document _id as the Elasticsearch document ID. Indexing the same document twice simply overwrites it — no duplicate records, no errors.
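The three lessons fit together, and a small simulation can show why. The sketch below is not the sync service's code: ResumableSyncSketch, its Event type, the integer tokens, and the in-memory Map standing in for Elasticsearch are all assumptions made for illustration. It simulates a cursor failure after an event was indexed but before its token was saved, then resumes and reprocesses that event safely.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simulated change-stream consumer: resumes from a saved token and indexes idempotently. */
final class ResumableSyncSketch {

    /** Stand-in for a change event; the token is a simple sequence number here. */
    static final class Event {
        final int token;
        final String docId;
        final String body;
        Event(int token, String docId, String body) {
            this.token = token;
            this.docId = docId;
            this.body = body;
        }
    }

    /** Stand-in for the search index, keyed by document id (upsert semantics). */
    final Map<String, String> index = new HashMap<>();
    /** Stand-in for the durably stored resume token; -1 means none saved yet. */
    int savedToken = -1;
    /** Counts deliveries, including duplicates caused by a resume. */
    int processedCount = 0;

    /**
     * Replays events after the saved token. When failAfter >= 0, simulates a cursor failure
     * after that many deliveries, before the token of the last delivery is saved.
     */
    void consume(List<Event> log, int failAfter) {
        int delivered = 0;
        for (Event e : log) {
            if (e.token <= savedToken) continue; // resume: skip what was already acknowledged
            index.put(e.docId, e.body);          // idempotent: the same id simply overwrites
            processedCount++;
            delivered++;
            if (failAfter >= 0 && delivered == failAfter) {
                throw new IllegalStateException("simulated cursor invalidation");
            }
            savedToken = e.token;                // persist only after successful processing
        }
    }

    /** Re-opens the stream after a failure, up to maxAttempts times. */
    void runWithRetry(List<Event> log, int failAfterOnFirstAttempt, int maxAttempts) {
        int failAfter = failAfterOnFirstAttempt;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                consume(log, failAfter);
                return;
            } catch (IllegalStateException ex) {
                failAfter = -1; // later attempts succeed in this simulation; real code would wait before retrying
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }
}
```

With three events and a failure injected after the second delivery, the second event is delivered twice, yet the index still ends up with exactly three entries. Saving the token after processing trades possible duplicates for no lost events, which is only safe because the write is keyed by document id.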
15. MongoDB Performance Checklist
Before going to production with a MongoDB-backed service, validate each item in this checklist:
- Every query uses an index: run db.collection.explain("executionStats") on your slowest queries; any COLLSCAN result is a missing index
- All arrays have an upper bound: if an array can grow indefinitely, move it to a separate collection
- Compound indexes follow ESR rule: Equality fields first, Sort fields second, Range fields last
- maxTimeMS set on all read queries: prevents runaway queries from consuming all server resources
- Connection pool sized correctly: default is 100; for high-throughput services, increase to 200–500 and monitor pool utilisation
- Write concern set appropriately: use w: "majority" for data you cannot afford to lose on primary failover
- Replica set configured: never run a single MongoDB node in production; minimum 3-node replica set
- Atlas Performance Advisor reviewed: or check slow query log daily during the first two weeks of production traffic
- Change stream resume tokens persisted: if you use change streams, store the resume token durably
- Index on high-cardinality fields only: indexes on low-cardinality fields (e.g., status with 3 values) hurt write performance more than they help reads