Aggregate Data Models
Data Model
• A data model is a representation that we use
  to perceive and manipulate our data.
• It allows us to represent:
  – The data elements under analysis, and
  – How these are related to each other
• This representation depends on our
  perception.
     Data Model: Database View
• In the database field, it describes how we
  interact with the data in the database.
• This is distinct from the storage model:
  – The storage model describes how the database stores and
    manipulates the data internally.
• In an ideal world:
  – We should be ignorant of the storage model, but
  – In practice we need at least some insight to
    achieve decent performance
         Data Models: Example
• In common usage, a data model is the model of the
  specific data in an application
• A developer might point to an entity-relationship
  diagram and refer to it as the data
  model containing
  – customers,
  – orders and
  – products
       Data Model: Definition
• In this course we will use “data
  model” to refer to the model by which the
  database organizes data.
• It can be more formally defined as a
  meta-model
      The Data Model of the Last Decades
• The dominant data model of the last decades
  was the relational data model.
1. It can be represented as a set of tables.
2. Each table has rows, with each row
   representing some entity of interest.
3. We describe entities through columns
4. A column may refer to another row in the
   same or different table (relationship).
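A minimal sketch of these four points, using Python's built-in sqlite3 module. The customer and orders tables and their contents are hypothetical sample data, not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces references only when asked

# Two tables; each row describes one entity through its columns.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"  # the relationship
    " total_cents INTEGER NOT NULL)"
)

conn.execute("INSERT INTO customer VALUES (1, 'Martin')")
conn.execute("INSERT INTO orders VALUES (99, 1, 3245)")

# Following the relationship: the query takes tuples and returns tuples.
rows = conn.execute(
    "SELECT o.id, c.name, o.total_cents"
    " FROM orders AS o JOIN customer AS c ON c.id = o.customer_id"
).fetchall()

# A row that refers to a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO orders VALUES (100, 42, 500)")
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True
```

Here rows holds one flat tuple per order, and the customer name is reached only through the customer_id column.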
            NoSQL Data Model
• It moves away from the relational data model
• Each NoSQL database has a different model
  – Key-value,
  – Document,
  – Column-family,
  – Graph, and
  – Sparse (Index based)
• Of these, the first three share a common
  characteristic (Aggregate Orientation).
Relational Model
       vs
Aggregate Model
             Relational Model
• The relational model takes the information that
  we want to store and divides it into tuples (rows).
• However, a tuple is a limited data structure.
• It captures a set of values.
• So, we can’t nest one tuple within another to get
  nested records.
• Nor can we put a list of values or tuples within
  another.
             Relational Model
• This simplicity characterizes the relational
  model
• It allows us to think of data manipulation as
  operations that:
  – Take tuples as input, and
  – Return tuples
• Aggregate orientation takes a different
  approach.
               Aggregate Model
• It recognizes that we often want to operate on data units
  that have a more complex structure than a set of
  tuples.
• We can think in terms of a complex record that allows:
  – Lists,
  – Maps,
  – And other data structures to be nested inside it
• Key-value, document, and column-family databases
  use this complex structure.
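The contrast between flat tuples and a nested record can be sketched in Python as follows. All names and values (order_row, item_rows, assemble_order, the sample prices) are assumptions made for this illustration.

```python
# Relational view: flat tuples that point at each other through ids.
order_row = (99, 1)                                  # (order_id, customer_id)
item_rows = [(99, 27, 3245), (99, 31, 1500), (100, 27, 3245)]
#            (order_id, product_id, price_cents)
address_rows = [(99, "Chicago")]                     # (order_id, city)


def assemble_order(order_row, item_rows, address_rows):
    """Build one nested record (an aggregate) from the flat tuples."""
    order_id, customer_id = order_row
    cities = [city for (oid, city) in address_rows if oid == order_id]
    return {
        "id": order_id,
        "customer_id": customer_id,
        # A list nested inside the record.
        "items": [
            {"product_id": pid, "price_cents": price}
            for (oid, pid, price) in item_rows
            if oid == order_id
        ],
        # A map nested inside the record.
        "shipping_address": {"city": cities[0]},
    }


order = assemble_order(order_row, item_rows, address_rows)
```

An aggregate-oriented database would store and return something shaped like order directly, instead of leaving the assembly to a join.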
             Aggregate Model
• Aggregate is a term coming from Domain-
  Driven Design [Evans03]
  – An aggregate is a collection of related objects that
    we wish to treat as a unit. It is a unit for data
    manipulation and for the management of consistency.
• We like to update aggregates with atomic
  operations
• We like to communicate with our data storage
  in terms of aggregates
            Aggregate Models
• This definition matches well how key-value,
  document, and column-family databases work.
• With aggregates it is easier to work on a cluster,
  since they are a natural unit for replication and sharding.
• Aggregates are also easier for application
  programmers to work with, since they avoid much of the
  impedance mismatch problem of relational databases.
     Example of Relational Model
• Assume we are building an e-commerce website;
• We have to store information about: users, products,
  orders, shipping addresses, billing addresses, and
  payment data.
      Example of Relational Model
• As good relational soldiers:
  – Everything is normalized
  – No data is repeated in multiple tables.
  – We have referential integrity
Example of Relational Model
      Example of Aggregate Model
• We have two aggregates: Customers and Orders
• We use the black diamond composition to show
  how data fits into the aggregate structure
                     A possible aggregation
          Example of Aggregate Model
• The customer contains a list of billing addresses;
• The order contains a list of: order items, a shipping address, and
  payments
• The payment itself contains a billing address for that payment
            Example of Aggregate Model
• A single address appears 3 times, but instead of being referenced by an id it is copied each time
• This fits a domain where we don’t want shipping, payment and billing addresses to
  change
• What is the difference with respect to a relational representation?
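The customer and order aggregates described above can be sketched as plain Python dictionaries. The field names and sample values are assumptions for illustration; the point is that each address is a copy, not a reference.

```python
import copy

address = {"street": "1 Example Street", "city": "Chicago"}

# Customer aggregate: contains a list of billing addresses.
customer = {
    "id": 1,
    "name": "Martin",
    "billing_addresses": [copy.deepcopy(address)],
}

# Order aggregate: contains order items, a shipping address, and payments.
order = {
    "id": 99,
    "customer_id": 1,  # link to another aggregate, by id only
    "order_items": [{"product_id": 27, "price_cents": 3245}],
    "shipping_address": copy.deepcopy(address),
    "payments": [
        {
            "card": "hypothetical-card-token",
            "billing_address": copy.deepcopy(address),
        },
    ],
}

# The customer later moves: only the customer aggregate changes.
customer["billing_addresses"][0]["city"] = "Boston"
```

After the update, the order still records the address it was shipped and billed to. In a normalized relational design a single address row would typically be referenced by id from all three places, so one update would be visible everywhere.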
       Example of Aggregate Model
• The link between customer and the order is a
  relationship between aggregates
       Example of Aggregate Model
• A link from an order item would cross into a separate
  aggregate structure for products (not considered
  here)
• This is a kind of denormalization, similar to the tradeoffs
  made with relational databases, but it is more common with
  aggregates because we want to minimize the
  number of aggregates we access.
       Example of Aggregate Model
• We aggregate to minimize the number of
  aggregates we access during data interaction
• The important thing to notice is that:
  – We have to think about how the data is accessed
  – We make this part of our thinking when developing the
    application data model
• We could draw our aggregates differently; it
  really depends on the data access patterns.
• No universal answer for how to draw aggregate boundaries
• It depends entirely on how you tend to manipulate data!
  – Accesses on a single order at a time: first solution
  – Accesses on customers with all orders: second solution
• Context-specific
  – some applications will prefer one or the other
  – even within a single system
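The effect of the two boundary choices on the number of aggregates read can be sketched as follows. This is an illustration under assumptions: the in-memory dictionaries stand in for a store, no secondary indexes exist, and all names and data are hypothetical.

```python
# Solution 1: customers and orders are separate aggregates.
customers = {1: {"name": "Martin"}, 2: {"name": "Pramod"}}
orders = {
    99: {"customer_id": 1, "total_cents": 3245},
    100: {"customer_id": 1, "total_cents": 1500},
    101: {"customer_id": 2, "total_cents": 700},
}


def read_order_separate(order_id):
    """Return (order, aggregates_read): a single lookup by key."""
    return orders[order_id], 1


def read_customer_orders_separate(customer_id):
    """Return (orders, aggregates_read): every order aggregate is inspected."""
    found = [o for o in orders.values() if o["customer_id"] == customer_id]
    return found, len(orders)


# Solution 2: each customer aggregate contains all of its orders.
customers_nested = {
    1: {"name": "Martin", "orders": [{"id": 99, "total_cents": 3245},
                                     {"id": 100, "total_cents": 1500}]},
    2: {"name": "Pramod", "orders": [{"id": 101, "total_cents": 700}]},
}


def read_customer_orders_nested(customer_id):
    """Return (orders, aggregates_read): a single lookup by key."""
    return customers_nested[customer_id]["orders"], 1


def read_order_nested(order_id):
    """Return (order, aggregates_read): customers are scanned until a match."""
    reads = 0
    for customer in customers_nested.values():
        reads += 1
        for order in customer["orders"]:
            if order["id"] == order_id:
                return order, reads
    raise KeyError(order_id)
```

Each layout makes one access pattern a single read and turns the other into a scan, which is why the boundary has to follow the workload.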
• Focus on the unit of interaction with the data storage
• Pros:
  – it helps greatly with running on a cluster: data will be manipulated
    together, and thus should live on the same node!
• Cons:
  – an aggregate structure may help with some data interactions but be
    an obstacle for others.
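A minimal sketch of why aggregates suit a cluster: the aggregate key alone decides the node, so everything inside the aggregate lives together. The node names and the hash-modulo placement are assumptions for illustration; real databases use their own placement schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def node_for(aggregate_key: str) -> str:
    """Pick a node from the aggregate key with a stable hash."""
    digest = hashlib.sha256(aggregate_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# One dictionary per node stands in for that node's local storage.
cluster = {node: {} for node in NODES}


def put(aggregate_key, aggregate):
    """Store the whole aggregate on the single node chosen by its key."""
    cluster[node_for(aggregate_key)][aggregate_key] = aggregate


def get(aggregate_key):
    """Read the whole aggregate back by contacting one node only."""
    return cluster[node_for(aggregate_key)][aggregate_key]


put("order:99", {
    "order_items": [{"product_id": 27, "price_cents": 3245}],
    "shipping_address": {"city": "Chicago"},
})
```

Reading order:99 touches exactly one node, whereas data spread over several independently placed rows might require contacting several.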
Consider a Student information system consisting of 3 entities namely,
Student_info, Course_info, and Marksheet.
Following are the frequent queries in the workload:
1. List the details of students admitted to ‘F.Y.B.Sc’ course.
2. List the details of students staying in ‘Kothrud’ area and studying in
   ‘T.Y.B.Sc’
3. Find the maximum score value for ‘Databases’ subject
4. List the number of students failing in the subject ‘Computer networks’
   (marks < 40)
Given the above workload, derive an aggregate boundary, for aggregating the
three entities. Justify your answer.
Consequences of Aggregate Models
       No Distributable Storage
• Relational mapping can capture data elements
  and their relationships well.
• It does not need any notion of an aggregate entity,
  because it uses foreign key relationships.
• But we cannot distinguish the relationships that
  represent aggregations from those that don’t.
• As a result, we cannot take advantage of that
  knowledge to store and distribute our data.
       Marking Aggregate Tools
• Many data modeling techniques provide ways to
  mark aggregate structures in relational models
• However, they do not provide semantics that
  help in distinguishing relationships
• When working with aggregate-oriented
  databases, we have a clear view of the semantics
  of the data.
• We can focus on the unit of interaction with the
  data storage.
          Aggregate Ignorant
• Relational databases are aggregate-ignorant,
  since they have no concept of an aggregate
• Graph databases are also aggregate-ignorant.
• This is not always bad.
• In domains where it is difficult to draw
  aggregate boundaries, aggregate-ignorant
  databases are useful.
      Aggregate and Operations
• An order is a good aggregate when:
  – A customer is making and reviewing an order, and
  – When the retailer is processing orders
• However, when the retailer wants to analyze its
  product sales over the last months,
  aggregates are trouble.
• We need to dig into every aggregate to extract
  the sales history.
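The cost of such an analysis can be sketched as follows: with order aggregates, computing sales per product means opening every order. The order structure and amounts are hypothetical sample data.

```python
from collections import defaultdict

orders = [
    {"id": 99, "items": [
        {"product_id": 27, "quantity": 1, "price_cents": 3245},
    ]},
    {"id": 100, "items": [
        {"product_id": 27, "quantity": 2, "price_cents": 3245},
        {"product_id": 31, "quantity": 1, "price_cents": 1500},
    ]},
]


def sales_by_product(order_aggregates):
    """Sum revenue per product by scanning every order aggregate."""
    totals = defaultdict(int)
    for order in order_aggregates:      # every aggregate is read...
        for item in order["items"]:     # ...and opened up
            totals[item["product_id"]] += item["quantity"] * item["price_cents"]
    return dict(totals)


product_sales = sales_by_product(orders)
```

The aggregate boundary that made a single order cheap to read gives no help here: the work grows with the total number of orders, not with the number of products of interest.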
       Aggregate and Operations
• Aggregates may help with some operations and not with
  others.
• In cases where there is no clear view, aggregate-ignorant
  databases are the best option.
• But remember the point that drove us to
  aggregate models (cluster distribution).
• Running databases on a cluster is needed when
  dealing with huge quantities of data.
          Running on a Cluster
• It gives several advantages in computation
  power and data distribution
• However, it requires minimizing the number of
  nodes we query when gathering data
• By explicitly including aggregates, we give the
  database an important view of which
  information should be stored together
• But we still have the problem of querying
  historical data
Aggregates and Transactions
             ACID transactions
• Relational databases allow us to manipulate any
  combination of rows from any table in a single
  transaction.
• ACID transactions are:
  – Atomic,
  – Consistent,
  – Isolated, and
  – Durable
• The main point here is atomicity.
          Atomicity & RDBMS
• Many rows spanning many tables are updated
  in a single atomic operation
• It either succeeds or fails entirely
• Concurrent operations are isolated and we
  cannot see partial updates
• However, relational databases can still fail.
           Atomicity & NoSQL
• Aggregate-oriented NoSQL databases generally don’t
  support atomicity that spans multiple aggregates.
• This means that if we need to update multiple
  aggregates atomically, we have to manage that in the
  application code.
• Thus atomicity is one of the considerations
  when deciding how to divide up our data into
  aggregates
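A sketch of what managing a multi-aggregate update in application code can look like. AggregateStore is a hypothetical in-memory stand-in whose only atomic unit is a single aggregate, and place_order uses a simple compensating delete; this is one possible approach, not a prescribed one.

```python
import copy
import threading


class AggregateStore:
    """Hypothetical store: each put/get/delete on ONE aggregate is atomic."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data.get(key))

    def delete(self, key):
        with self._lock:
            self._data.pop(key, None)


def place_order(store, order):
    """Update two aggregates; undo the first write if the second step fails."""
    order_key = f"order:{order['id']}"
    customer_key = f"customer:{order['customer_id']}"
    store.put(order_key, order)              # atomic write to aggregate 1
    try:
        customer = store.get(customer_key)
        if customer is None:
            raise KeyError(customer_key)
        customer["order_ids"].append(order["id"])
        store.put(customer_key, customer)    # atomic write to aggregate 2
    except Exception:
        store.delete(order_key)              # compensation done by the application
        raise


store = AggregateStore()
store.put("customer:1", {"name": "Martin", "order_ids": []})
place_order(store, {"id": 99, "customer_id": 1})
```

Between the two writes another client can observe the order without the matching customer entry, and the read-modify-write on the customer is not isolated either. Putting data that must change together into one aggregate avoids this window.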
Aggregate Models in NoSQL
         Key-Value and Document
• Key-value and Document databases are strongly
  aggregate-oriented.
• Both of these types of databases consist of many
  aggregates, each with a key used to get the data.
• The two types of databases differ in that:
  – In a key-value store the aggregate is opaque (a blob)
  – In a document database we can see a structure in the
    aggregate.
      Key-Value and Document
• The advantage of opacity is that we can store
  whatever we like in the aggregate.
• The database may impose some size limit, but
  otherwise we have complete freedom
• A document store imposes limits on what we
  can place in it, defining a structure on the
  data.
       Key-Value and Document
• With a key-value store we can only access an aggregate by its key
• With a document store:
  – We can submit queries based on fields,
  – We can retrieve part of the aggregate, and
  – The database can create indexes based on the fields
    of the aggregate.
• But in practice they are used differently
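The difference in what the two kinds of store can do with an aggregate can be sketched as follows. KeyValueStore and DocumentStore are hypothetical in-memory classes written for this illustration, not the API of any real product; find does a linear scan where a real document database could use an index.

```python
import json


class KeyValueStore:
    """The value is an opaque blob: the store can only return it by key."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob: bytes):
        self._blobs[key] = blob

    def get(self, key) -> bytes:
        return self._blobs[key]


class DocumentStore:
    """The store sees the structure, so it can query and project fields."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc: dict):
        self._docs[key] = doc

    def get(self, key) -> dict:
        return self._docs[key]

    def find(self, field, value):
        """Return the keys of all documents whose field equals value."""
        return [k for k, d in self._docs.items() if d.get(field) == value]

    def project(self, key, fields):
        """Return only part of the aggregate."""
        doc = self._docs[key]
        return {f: doc[f] for f in fields if f in doc}


order = {"id": 99, "customer_id": 1, "city": "Chicago"}

kv = KeyValueStore()
kv.put("order:99", json.dumps(order).encode("utf-8"))  # the application encodes

docs = DocumentStore()
docs.put("order:99", order)
chicago_keys = docs.find("city", "Chicago")
city_only = docs.project("order:99", ["city"])
```

With the key-value store, finding all Chicago orders would mean fetching and decoding every blob in the application.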
       Key-Value and Document
• In practice, the line between key-value and
  document gets a bit blurry.
• An ID field is often put in a document database to do a
  key-value style lookup
• With key-value databases we mostly expect to look up
  aggregates using a key
• With document databases, we mostly expect to
  submit some form of query on the internal
  structure of the documents.
         Column-Family Stores
• One of the most influential NoSQL databases
  was Google’s BigTable [Chang et al.]
• Its name suggests a tabular structure, which it
  realizes with sparse columns and no schema.
• It is better to think of this structure not as a
  table, but as a two-level map.
         Column-Family Stores
• These BigTable-style data models are often referred
  to as column stores.
• Pre-NoSQL column stores like C-Store used
  SQL and the relational model.
• What makes column stores different is
  how they physically store data.
• Most databases have rows as the unit of storage,
  which helps write performance
          Column-Family Stores
• However, there are many scenarios where:
   – Writes are rare, but
   – You need to read a few columns of many rows at
     once
• In these situations, it’s better to store groups of
  columns for all rows as the basic storage unit.
• These kinds of databases are called column
  stores or column-family databases
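The read pattern above can be sketched by laying out the same data row by row and column by column. The data and field names are hypothetical, and the values_touched counters only illustrate how much data each layout has to go through; they are not a performance measurement.

```python
FIELDS = ("id", "name", "city", "total_cents")

# Row layout: each record is stored as one unit.
row_store = [
    (1, "Martin", "Chicago", 3245),
    (2, "Pramod", "Boston", 1500),
    (3, "Ada", "London", 700),
]

# Column layout: the values of each column are stored together.
column_store = {
    name: list(values) for name, values in zip(FIELDS, zip(*row_store))
}


def sum_totals_rows(rows):
    """Sum one field; every row is fetched whole to reach it."""
    total = 0
    values_touched = 0
    for row in rows:
        values_touched += len(row)
        total += row[3]
    return total, values_touched


def sum_totals_columns(columns):
    """Sum one field; only that column is fetched."""
    totals = columns["total_cents"]
    return sum(totals), len(totals)
```

Both functions return the same sum, but the row layout goes through all four fields of every record while the column layout reads only the one column it needs.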
          Column-Family Stores
• Column-family databases have a two-level aggregate
  structure.
• As in key-value stores, the first key is the row
  identifier.
• The difference is that retrieving a key returns a map
  of more detailed values.
• These second-level values are referred to as columns.
• Given a row, we can access all its column families
  or a particular element.
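The two-level map can be sketched with nested Python dictionaries. The row key, the column names, and the "family:column" naming used to group columns into families are assumptions made for this sketch.

```python
# First level: row key. Second level: column name -> value.
rows = {
    "1234": {
        "profile:name": "Martin",
        "profile:billing_address": "1 Example Street",
        "orders:ODR1001": "order data 1",
        "orders:ODR1002": "order data 2",
    },
}


def get_row(row_key):
    """All columns of one row."""
    return rows[row_key]


def get_family(row_key, family):
    """All columns of one family within a row, with the prefix removed."""
    prefix = family + ":"
    return {
        name[len(prefix):]: value
        for name, value in rows[row_key].items()
        if name.startswith(prefix)
    }


def get_column(row_key, family, column):
    """One particular element."""
    return rows[row_key][f"{family}:{column}"]
```

Rows are sparse in this representation: a second row could carry entirely different columns without any schema change.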
Example of Column Model
         Column-Family Stores
• They organize their columns into families.
• Each column is part of a family, and a column
  family acts as a unit of access.
• The data for a particular column family
  is then accessed together.
              Column-Family Stores:
              How to structure data
• In row-oriented:
  – each row is an aggregate (For example the customer
    with id 456),
  – with column families representing useful chunks of
    data (profile, order history) within that aggregate
• In column-oriented:
  – each column family defines a record type (e.g.
    customer profiles) with rows for each of the records.
  – You can think of a row as the join of records in all
    column families
                        Key Points
• An aggregate is a collection of data that we interact with as
  a unit.
• Aggregates form the boundaries for ACID operations with
  the database
• Key-value, document, and column-family databases can all
  be seen as forms of aggregate-oriented database
• Aggregates make it easier for the database to manage data
  storage over clusters
• Aggregate-oriented databases work best when most data
  interaction is done with the same aggregate
• Aggregate-ignorant databases are better when interactions
  use data organized in many different formations