Aggregation in Practice: From Simple to Complex

The aggregation pipeline shows its real strength in real-world scenarios. Knowing $match, $group, and $lookup in theory is one thing; combining these stages to answer business questions is another. An e-commerce site wants “top-selling products last quarter by region”. An analytics dashboard needs “user engagement metrics with cohort analysis”. An inventory system requires “low-stock alerts with supplier lead times”. Such requirements translate into multi-stage pipelines with complex logic.

This chapter works through practical aggregation patterns systematically, from basic filters to multi-collection joins with transformed data. The focus is not on individual stages (covered in the previous chapter) but on combining them for real use cases. Each example is production-like: realistic data structures and business requirements rather than academic toy datasets.

38.1 Basic Pattern: Filter, Sort, Limit

The simplest use case: “Show me the cheapest 10 products in the Electronics category.” This is a straightforward pipeline, but it illustrates fundamental principles.

db.products.aggregate([
  // Stage 1: filter by category
  { $match: { 
      category: "Electronics",
      inStock: true 
  }},
  
  // Stage 2: sort by price
  { $sort: { price: 1 } },
  
  // Stage 3: Top 10
  { $limit: 10 },
  
  // Stage 4: projection for clean output
  { $project: {
      _id: 0,
      name: 1,
      price: 1,
      brand: 1
  }}
])

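An early $match reduces the amount of data. Instead of sorting millions of products, we sort only the few thousand in-stock Electronics items. The order $sort then $limit matters: reversed, the pipeline would take 10 arbitrary documents and sort only those. When $limit directly follows $sort, MongoDB optimizes the pair internally so that it only has to keep the top 10 results in memory rather than sorting the full set. $project comes last to produce clean output.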

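If an index on { category: 1, price: 1 } exists, MongoDB can serve both the category filter and the sort from that index: IXSCAN instead of COLLSCAN, and no in-memory sort. The inStock condition is then checked on the fetched documents; including inStock in the index (for example { category: 1, inStock: 1, price: 1 }) avoids that extra filtering. On large collections, this is the difference between milliseconds and seconds.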
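A minimal sketch of how this pattern could be parameterized and paired with the index mentioned above. The helper name cheapestInCategory and the constant TOP_N_INDEX are illustrative assumptions, not part of any MongoDB API; collection and field names follow the example.

```javascript
// Hypothetical helper: builds the filter-sort-limit pipeline for any category and N.
function cheapestInCategory(category, n) {
  return [
    { $match: { category: category, inStock: true } },    // filter first
    { $sort: { price: 1 } },                              // cheapest first
    { $limit: n },                                        // keep the top N
    { $project: { _id: 0, name: 1, price: 1, brand: 1 } } // clean output
  ];
}

// Index spec from the text: equality field first, then the sort field.
const TOP_N_INDEX = { category: 1, price: 1 };

const topTenPipeline = cheapestInCategory("Electronics", 10);

// In mongosh (needs a running database, so shown as comments only):
//   db.products.createIndex(TOP_N_INDEX)
//   db.products.aggregate(topTenPipeline)
//   db.products.explain("executionStats").aggregate(topTenPipeline)
console.log(JSON.stringify(topTenPipeline, null, 2));
```

In the explain output, an IXSCAN stage without a separate sort stage would indicate that the index supplies both the filter and the ordering.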

38.2 Grouping Pattern: Aggregated Statistics

A common analytics use case: “Average rating per product category, sorted by rating.” This requires grouping and aggregation.

db.reviews.aggregate([
  // Approved reviews only
  { $match: { status: "approved" } },
  
  // Group by product category
  { $group: {
      _id: "$product.category",
      avgRating: { $avg: "$rating" },
      reviewCount: { $sum: 1 },
      minRating: { $min: "$rating" },
      maxRating: { $max: "$rating" }
  }},
  
  // Filter: only categories with at least 10 reviews
  { $match: { reviewCount: { $gte: 10 } } },
  
  // Sort by average rating
  { $sort: { avgRating: -1 } },
  
  // Clean naming
  { $project: {
      _id: 0,
      category: "$_id",
      avgRating: { $round: ["$avgRating", 2] },
      reviewCount: 1,
      ratingRange: { 
        $concat: [
          { $toString: "$minRating" },
          " - ",
          { $toString: "$maxRating" }
        ]
      }
  }}
])

The second $match (after $group) filters on the aggregated reviewCount. This is only possible after aggregation: you cannot filter on reviewCount before $group because the field does not exist yet. The $round expression yields clean two-decimal averages, and $concat builds a human-readable range string.

In production you would probably also compute a measure of spread (standard deviation) to judge how reliable each average is, and perhaps weight by recency so that recent reviews count more. This would require more complex math operators.
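As a hedged sketch of the first idea, the $group stage could be extended with the $stdDevSamp accumulator. The output field name ratingStdDev and the helper sampleStdDev are assumptions chosen for illustration; the helper only mirrors the arithmetic in plain JavaScript so results can be checked by hand.

```javascript
// Sketch: the $group stage from the example, extended with a spread measure.
const groupWithSpread = {
  $group: {
    _id: "$product.category",
    avgRating: { $avg: "$rating" },
    // Sample standard deviation of the ratings in each category
    ratingStdDev: { $stdDevSamp: "$rating" },
    reviewCount: { $sum: 1 }
  }
};

// Plain-JavaScript equivalent of a sample standard deviation.
function sampleStdDev(values) {
  const n = values.length;
  if (n < 2) return null; // $stdDevSamp also yields null for a single value
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const sumSquares = values.reduce((a, v) => a + (v - mean) ** 2, 0);
  return Math.sqrt(sumSquares / (n - 1));
}

console.log(sampleStdDev([1, 2, 3])); // 1
```

A category with a high average but a large standard deviation is rated less consistently than one with the same average and a small deviation.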

38.3 Time-Series Pattern: Temporal Aggregation

Business metrics are often time-based: “Monthly revenue by region for last year.” This requires date extraction and multi-dimensional grouping.

db.sales.aggregate([
  // Filter: Last year
  { $match: {
      saleDate: {
        $gte: new Date("2023-01-01"),
        $lt: new Date("2024-01-01")
      }
  }},
  
  // Extract year, month, region
  { $project: {
      year: { $year: "$saleDate" },
      month: { $month: "$saleDate" },
      region: "$customerRegion",
      amount: 1
  }},
  
  // Group by year, month, region
  { $group: {
      _id: {
        year: "$year",
        month: "$month",
        region: "$region"
      },
      revenue: { $sum: "$amount" },
      transactionCount: { $sum: 1 },
      avgTransactionValue: { $avg: "$amount" }
  }},
  
  // Sort chronologically
  { $sort: {
      "_id.year": 1,
      "_id.month": 1,
      "_id.region": 1
  }},
  
  // Reshape for readability
  { $project: {
      _id: 0,
      year: "$_id.year",
      month: "$_id.month",
      region: "$_id.region",
      revenue: { $round: ["$revenue", 2] },
      transactionCount: 1,
      avgTransactionValue: { $round: ["$avgTransactionValue", 2] }
  }}
])

MongoDB's date operators ($year, $month, $dayOfWeek, etc.) are essential for time-series aggregations. They extract components from dates without complex application logic. For quarterly reports, you would use { $ceil: { $divide: [{ $month: "$saleDate" }, 3] } }.

An index on { saleDate: 1, customerRegion: 1 } would speed up the initial $match dramatically. Without an index, MongoDB scans the entire sales history.
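The quarter calculation can be sketched as a small expression builder. The helper names quarterExpr and quarterOfMonth are hypothetical; the second one repeats the same arithmetic in plain JavaScript to show which months land in which quarter.

```javascript
// Hypothetical helper: builds the quarter expression for a date field path.
function quarterExpr(dateField) {
  return { $ceil: { $divide: [{ $month: dateField }, 3] } };
}

// Same arithmetic in plain JavaScript (month is 1-12, as $month returns it).
function quarterOfMonth(month) {
  return Math.ceil(month / 3);
}

// Group key for a quarterly revenue report on the sales collection from the example.
const quarterlyGroup = {
  $group: {
    _id: {
      year: { $year: "$saleDate" },
      quarter: quarterExpr("$saleDate"),
      region: "$customerRegion"
    },
    revenue: { $sum: "$amount" }
  }
};

// Months 1, 3, 4, 6, 7, 9, 10, 12 map to quarters 1, 1, 2, 2, 3, 3, 4, 4
console.log([1, 3, 4, 6, 7, 9, 10, 12].map(quarterOfMonth));
```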

38.4 Join Pattern: Enriching with $lookup

Single-collection aggregations are limited. Real-world systems are often normalized: orders reference customers, products reference categories. $lookup performs the joins.

db.orders.aggregate([
  // Filter: completed orders from the last 30 days
  { $match: {
      orderDate: {
        $gte: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000)
      },
      status: "completed"
  }},
  
  // Join with customers
  { $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  }},
  
  // Unwind customer (single document)
  { $unwind: "$customer" },
  
  // Join with products for item details
  { $unwind: "$items" },
  { $lookup: {
      from: "products",
      localField: "items.productId",
      foreignField: "_id",
      as: "items.productDetails"
  }},
  { $unwind: "$items.productDetails" },
  
  // Group back into orders (undoing the $unwind)
  { $group: {
      _id: "$_id",
      orderId: { $first: "$_id" },
      customerName: { $first: "$customer.name" },
      customerEmail: { $first: "$customer.email" },
      orderDate: { $first: "$orderDate" },
      items: { 
        $push: {
          productName: "$items.productDetails.name",
          quantity: "$items.quantity",
          price: "$items.price"
        }
      },
      totalAmount: { $sum: { $multiply: ["$items.quantity", "$items.price"] } }
  }},
  
  // Sort by amount
  { $sort: { totalAmount: -1 } },
  
  // Clean output
  { $project: {
      _id: 0,
      orderId: 1,
      customerName: 1,
      customerEmail: 1,
      orderDate: 1,
      itemCount: { $size: "$items" },
      items: 1,
      totalAmount: { $round: ["$totalAmount", 2] }
  }}
])

This pipeline is non-trivial. It:
1. Filters orders
2. Joins with customers (1:1)
3. Unwinds the items array
4. Joins each item with products (one order, N products)
5. Groups back into orders (reversing the unwind)
6. Calculates totals

The critical part is $unwind + $lookup + $group. We unwind the items to join each one individually, then group back into orders. This is a common pattern for arrays of references.

This pipeline performs many lookups. 100 orders with an average of 5 items mean 500 product lookups. On large datasets this becomes slow. The alternative is denormalization: embedding the product name directly in the order item. The trade-off is redundancy versus performance.
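To make the denormalized alternative concrete, here is a sketch under the assumption that each order item carries a copied productName field (that field, the sample values, and the helper orderTotal are illustrative). With the name embedded, the order total can be computed per document without $unwind or a $lookup into products.

```javascript
// Assumed shape of a denormalized order: productName is copied into each item at write time.
const order = {
  _id: 1,
  customerId: 42,
  items: [
    { productId: "p1", productName: "USB-C Cable", quantity: 2, price: 9.5 },
    { productId: "p2", productName: "Keyboard", quantity: 1, price: 49 }
  ]
};

// Stage that sums quantity * price over the embedded items array.
const totalStage = {
  $addFields: {
    totalAmount: {
      $sum: {
        $map: {
          input: "$items",
          as: "item",
          in: { $multiply: ["$$item.quantity", "$$item.price"] }
        }
      }
    }
  }
};

// Plain-JavaScript equivalent of totalStage, for checking the arithmetic.
function orderTotal(o) {
  return o.items.reduce((sum, item) => sum + item.quantity * item.price, 0);
}

console.log(orderTotal(order)); // 68
```

The cost of this design is that a renamed product has to be updated in every order that embeds it, or the historical name is kept deliberately.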

38.5 Advanced Join: Pipeline $lookup with Filtering

The basic $lookup joins all matching documents. Sometimes you want only specific or transformed data. The pipeline form of $lookup allows a sub-pipeline inside the join.

db.customers.aggregate([
  // Filter: Active customers
  { $match: { status: "active" } },
  
  // Join with orders, but only the top 3
  { $lookup: {
      from: "orders",
      let: { custId: "$_id" },
      pipeline: [
        // Match orders for this customer
        { $match: {
            $expr: { $eq: ["$customerId", "$$custId"] },
            status: "completed"
        }},
        // Sort by amount
        { $sort: { amount: -1 } },
        // Top 3
        { $limit: 3 },
        // Project only the fields needed
        { $project: {
            _id: 1,
            amount: 1,
            orderDate: 1
        }}
      ],
      as: "topOrders"
  }},
  
  // Calculate customer metrics
  { $project: {
      customerName: "$name",
      email: 1,
      topOrders: 1,
      topOrdersTotal: { $sum: "$topOrders.amount" },
      topOrdersAvg: { $avg: "$topOrders.amount" }
  }},
  
  // Filter: only customers with at least 3 completed orders
  { $match: { 
      "topOrders.2": { $exists: true }
  }}
])

The sub-pipeline can do anything: filter, sort, limit, transform. This makes it more powerful than the basic $lookup. The let block defines variables (here custId) that are accessible in the sub-pipeline as $$custId.

For “customers with no orders in the last 90 days”, you would use a sub-pipeline that filters for recent orders, followed in the parent pipeline by { $match: { recentOrders: { $size: 0 } } }.
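The “no orders in the last 90 days” variant might look like the following sketch. The helper name customersWithoutRecentOrders and its now parameter are assumptions added so that the cutoff date is reproducible; collection and field names follow the earlier examples.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: customers without any order in the last `days` days.
function customersWithoutRecentOrders(days, now = new Date()) {
  const cutoff = new Date(now.getTime() - days * DAY_MS);
  return [
    { $lookup: {
        from: "orders",
        let: { custId: "$_id" },
        pipeline: [
          { $match: {
              $expr: { $eq: ["$customerId", "$$custId"] },
              orderDate: { $gte: cutoff }
          }},
          { $limit: 1 },             // one hit is enough to prove activity
          { $project: { _id: 1 } }
        ],
        as: "recentOrders"
    }},
    // Keep only customers whose joined array is empty
    { $match: { recentOrders: { $size: 0 } } },
    // Drop the helper array from the output
    { $project: { recentOrders: 0 } }
  ];
}

const inactivePipeline = customersWithoutRecentOrders(90, new Date("2024-04-01T00:00:00Z"));
console.log(inactivePipeline[0].$lookup.pipeline[0].$match.orderDate.$gte.toISOString());
// In mongosh: db.customers.aggregate(customersWithoutRecentOrders(90))
```

Limiting the sub-pipeline to one document keeps the joined array small, since only its emptiness matters here.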

38.6 Conditional Logic: Business Rules in Pipelines

Real-world aggregations often carry business logic: “Classify customers as VIP, Regular, or Basic based on spending, and flag them as Active or Inactive.” This requires conditional operators.

db.customers.aggregate([
  // Join with orders for the spending calculation
  { $lookup: {
      from: "orders",
      localField: "_id",
      foreignField: "customerId",
      as: "orders"
  }},
  
  // Calculate total spent
  { $addFields: {
      totalSpent: { $sum: "$orders.amount" },
      orderCount: { $size: "$orders" },
      lastOrderDate: { $max: "$orders.orderDate" }
  }},
  
  // Classify customer tier
  { $addFields: {
      tier: {
        $switch: {
          branches: [
            { 
              case: { $gte: ["$totalSpent", 10000] },
              then: "VIP"
            },
            {
              case: { $gte: ["$totalSpent", 1000] },
              then: "Regular"
            }
          ],
          default: "Basic"
        }
      },
      activityStatus: {
        $cond: {
          if: {
            $gte: [
              "$lastOrderDate",
              new Date(Date.now() - 90 * 24 * 60 * 60 * 1000)
            ]
          },
          then: "Active",
          else: "Inactive"
        }
      }
  }},
  
  // Group by tier for the summary
  { $group: {
      _id: "$tier",
      customerCount: { $sum: 1 },
      avgSpent: { $avg: "$totalSpent" },
      activeCount: {
        $sum: {
          $cond: [{ $eq: ["$activityStatus", "Active"] }, 1, 0]
        }
      }
  }},
  
  { $sort: { avgSpent: -1 } }
])

$switch and $cond allow complex business logic directly in the pipeline, without application code.
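Because $switch evaluates its branches in order and the first matching case wins, the thresholds are easy to get wrong at the boundaries. One way to pin them down is a plain-JavaScript mirror of the rules that can be unit-tested; the function names classifyTier and activityStatus are hypothetical and only restate the logic of the pipeline above.

```javascript
// Mirror of the $switch: the first matching branch wins, otherwise the default applies.
function classifyTier(totalSpent) {
  if (totalSpent >= 10000) return "VIP";
  if (totalSpent >= 1000) return "Regular";
  return "Basic";
}

// Mirror of the $cond: active if the last order falls within the past 90 days.
// Customers without orders (lastOrderDate is null) count as inactive.
function activityStatus(lastOrderDate, now = new Date()) {
  const cutoff = new Date(now.getTime() - 90 * 24 * 60 * 60 * 1000);
  return lastOrderDate !== null && lastOrderDate >= cutoff ? "Active" : "Inactive";
}

console.log(classifyTier(10000), classifyTier(9999.99), classifyTier(999));
// VIP Regular Basic
```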

38.7 Window Functions: Running Totals and Rankings

MongoDB 5.0+ provides window functions via $setWindowFields. Use cases: “running total revenue per month” or “rank products by sales within category”.

db.sales.aggregate([
  // Filter to one year
  { $match: {
      saleDate: {
        $gte: new Date("2024-01-01"),
        $lt: new Date("2025-01-01")
      }
  }},
  
  // Extract month
  { $addFields: {
      year: { $year: "$saleDate" },
      month: { $month: "$saleDate" }
  }},
  
  // Group by month
  { $group: {
      _id: { year: "$year", month: "$month" },
      monthlyRevenue: { $sum: "$amount" }
  }},
  
  // Sort chronologically
  { $sort: { "_id.year": 1, "_id.month": 1 } },
  
  // Calculate running total
  { $setWindowFields: {
      sortBy: { "_id.year": 1, "_id.month": 1 },
      output: {
        runningTotal: {
          $sum: "$monthlyRevenue",
          window: {
            documents: ["unbounded", "current"]
          }
        },
        movingAvg3Month: {
          $avg: "$monthlyRevenue",
          window: {
            documents: [-2, 0]  // Current + 2 previous
          }
        }
      }
  }},
  
  // Clean output
  { $project: {
      _id: 0,
      year: "$_id.year",
      month: "$_id.month",
      monthlyRevenue: { $round: ["$monthlyRevenue", 2] },
      runningTotal: { $round: ["$runningTotal", 2] },
      movingAvg3Month: { $round: ["$movingAvg3Month", 2] }
  }}
])

Window functions allow calculations over “windows” of documents: running totals, moving averages, rankings, lead/lag. Before 5.0, such calculations were difficult or impossible in pure MongoDB.
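Two short sketches may help here. The first shows how the ranking use case could be expressed with partitionBy and $rank; the input fields category and totalSold are assumptions about a preceding $group. The second, applyWindows, is a hypothetical plain-JavaScript simulation of the two document windows used above, to show what the bounds select.

```javascript
// Ranking within each category (assumes documents with category and totalSold fields).
const rankWithinCategory = {
  $setWindowFields: {
    partitionBy: "$category",
    sortBy: { totalSold: -1 },
    output: { rankInCategory: { $rank: {} } }
  }
};

// Simulation of the running total and the 3-month moving average.
function applyWindows(monthlyRevenue) {
  let running = 0;
  return monthlyRevenue.map((value, i) => {
    running += value; // window ["unbounded", "current"]
    // window [-2, 0]: the current document plus up to two preceding ones
    const win = monthlyRevenue.slice(Math.max(0, i - 2), i + 1);
    const avg = win.reduce((a, b) => a + b, 0) / win.length;
    return { monthlyRevenue: value, runningTotal: running, movingAvg3Month: avg };
  });
}

const rows = applyWindows([100, 200, 300, 400]);
console.log(rows.map(r => r.runningTotal));    // 100, 300, 600, 1000
console.log(rows.map(r => r.movingAvg3Month)); // 100, 150, 200, 300
```

Note that the first two months average over fewer than three values, because the window is cut off at the start of the data.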

38.8 Output to Collection: Materialized Views

Sometimes you want to store aggregation results persistently, for example for daily reports that should be cached. $out or $merge writes results to a collection.

db.orders.aggregate([
  { $match: { 
      orderDate: {
        $gte: new Date("2024-01-01"),
        $lt: new Date("2024-02-01")
      }
  }},
  
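  // items is an array: unwind it so that each order line is grouped on its own
  { $unwind: "$items" },

  { $group: {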
      _id: {
        productId: "$items.productId",
        category: "$items.category"
      },
      totalSold: { $sum: "$items.quantity" },
      revenue: { $sum: { 
        $multiply: ["$items.quantity", "$items.price"]
      }}
  }},
  
  { $sort: { revenue: -1 } },
  
  // Write results to a collection
  { $out: "monthly_product_stats_2024_01" }
])

$out replaces the entire target collection. For incremental updates, use $merge:

{ $merge: {
    into: "product_stats",
    on: "_id",  // merge key
    whenMatched: "replace",  // or "merge", "keepExisting"
    whenNotMatched: "insert"
}}

Materialized views suit complex aggregations that are queried often but updated rarely. Instead of running the expensive pipeline on every request, it runs once a day and writes its results to a collection that can then be queried quickly.
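A periodic refresh could be built around a small pipeline factory like the sketch below. The helper name monthlyProductStats and the choice to put year and month into the group key are assumptions, made so that repeated runs for different months do not overwrite each other in the product_stats collection.

```javascript
// Hypothetical helper: pipeline that refreshes product statistics for one month (month is 1-12).
function monthlyProductStats(year, month) {
  const start = new Date(Date.UTC(year, month - 1, 1));
  const end = new Date(Date.UTC(year, month, 1)); // month 12 rolls over into January
  return [
    { $match: { orderDate: { $gte: start, $lt: end } } },
    { $unwind: "$items" },
    { $group: {
        _id: {
          productId: "$items.productId",
          year: { $literal: year },
          month: { $literal: month }
        },
        totalSold: { $sum: "$items.quantity" },
        revenue: { $sum: { $multiply: ["$items.quantity", "$items.price"] } }
    }},
    // Upsert into the target collection instead of replacing it
    { $merge: {
        into: "product_stats",
        on: "_id",
        whenMatched: "replace",
        whenNotMatched: "insert"
    }}
  ];
}

const decemberPipeline = monthlyProductStats(2024, 12);
console.log(decemberPipeline[0].$match.orderDate.$lt.toISOString());
// In mongosh, e.g. from a scheduled job: db.orders.aggregate(monthlyProductStats(2024, 12))
```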

38.9 Real-World Pattern: Customer 360 View

A complex production pattern: the “360-degree customer view”, with all information about one customer in a single report.

db.customers.aggregate([
  { $match: { _id: ObjectId("...") } },  // Specific customer
  
  // Order history
  { $lookup: {
      from: "orders",
      let: { custId: "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$customerId", "$$custId"] } }},
        { $sort: { orderDate: -1 } },
        { $limit: 10 }
      ],
      as: "recentOrders"
  }},
  
  // Support tickets
  { $lookup: {
      from: "support_tickets",
      let: { custId: "$_id" },
      pipeline: [
        { $match: { $expr: { $eq: ["$customerId", "$$custId"] } }},
        { $group: {
            _id: "$status",
            count: { $sum: 1 }
        }}
      ],
      as: "supportStats"
  }},
  
  // Reviews
  { $lookup: {
      from: "reviews",
      localField: "_id",
      foreignField: "customerId",
      as: "reviews"
  }},
  
  // Calculate derived metrics
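  { $addFields: {
      // Caution: recentOrders holds at most 10 orders, so these two values
      // cover only those orders, not the customer's full history
      lifetimeValue: { $sum: "$recentOrders.amount" },
      avgOrderValue: { $avg: "$recentOrders.amount" },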
      reviewCount: { $size: "$reviews" },
      avgRating: { $avg: "$reviews.rating" },
      supportTicketCount: { $sum: "$supportStats.count" }
  }},
  
  // Clean output
  { $project: {
      customerName: "$name",
      email: 1,
      joinDate: "$createdAt",
      lifetimeValue: { $round: ["$lifetimeValue", 2] },
      avgOrderValue: { $round: ["$avgOrderValue", 2] },
      recentOrders: {
        $map: {
          input: "$recentOrders",
          as: "order",
          in: {
            orderId: "$$order._id",
            date: "$$order.orderDate",
            amount: "$$order.amount"
          }
        }
      },
      reviewCount: 1,
      avgRating: { $round: ["$avgRating", 1] },
      supportTicketCount: 1,
      supportBreakdown: "$supportStats"
  }}
])

This pipeline aggregates data from four collections (customers, orders, support tickets, reviews) into one comprehensive view. This is typical for CRM dashboards or customer-service tools.

Pattern	Stages	Use Case	Performance Tip
Filter-Sort-Limit	$match → $sort → $limit	Top-N queries	Index on match + sort fields
Group-Aggregate	$group → $sort	Statistics, reports	$match before $group
Time-Series	$project (date extract) → $group	Temporal analytics	Index on date field
Basic Join	$lookup → $unwind	Enrich with related data	Minimize lookups
Filtered Join	$lookup (pipeline) → …	Conditional joins	Filter in sub-pipeline
Conditional Logic	$addFields ($cond/$switch)	Business rules	-
Window Functions	$setWindowFields	Running totals, rankings	5.0+ only
Materialized View	… → $out/$merge	Cache complex aggregations	Schedule periodic refresh

Working with the aggregation pipeline in practice is a craft of trade-offs between complexity, performance, and maintainability. A 20-stage pipeline may be technically correct, but it is hard to debug and optimize. The best practice is to break complex aggregations into steps: test each stage individually with a temporary { $limit: 5 } stage, check the intermediate results, and optimize stage by stage. Start with a clear business requirement, design the pipeline logically, optimize with indexes and explain, and document complex logic for future maintainers. With this approach, aggregations go from confusing to powerful, capable of analytics that would otherwise require external systems.
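The stage-by-stage testing advice can be sketched as a helper that produces one truncated pipeline per stage. The name stagePrefixes is hypothetical, and the sample pipeline is a shortened version of the reviews example from section 38.2.

```javascript
// Hypothetical helper: for each stage, return the pipeline up to and including it,
// capped with a $limit so that only a few intermediate documents come back.
function stagePrefixes(pipeline, sampleSize = 5) {
  return pipeline.map((_, i) => [
    ...pipeline.slice(0, i + 1),
    { $limit: sampleSize }
  ]);
}

const reviewPipeline = [
  { $match: { status: "approved" } },
  { $group: { _id: "$product.category", avgRating: { $avg: "$rating" } } },
  { $sort: { avgRating: -1 } }
];

const steps = stagePrefixes(reviewPipeline);
console.log(steps.length); // 3

// In mongosh, each prefix can then be inspected in turn:
//   steps.forEach((p, i) => {
//     print(`after stage ${i + 1}`);
//     printjson(db.reviews.aggregate(p).toArray());
//   });
```

Pipelines ending in $out or $merge need that final stage removed before using such a helper, because those stages must come last.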
