Laboratorio Cini Data Science

Slogan

Data Architectures are the models, policies, rules, and standards that govern which data is collected and how it is stored, arranged, integrated, and used in data systems and organizations. They are a foundational pillar of any enterprise architecture: they organize data, set access criteria, and coordinate the various sources so that the data is usable and serves business objectives.

Description

A Data Architecture describes how an organization processes, stores, and uses its data. It sets out the criteria for processing operations across the whole data flow of the system.

Main techniques

With the abundance of data available today, organizations have diverse options for managing and analyzing it. Current research focuses on leveraging four significant Data Architectures, namely: Data Warehouse, Data Lake, Data Lakehouse, and Data Mesh. Each approach has unique characteristics, use cases, and benefits.

  • A Data Warehouse is a central repository of integrated data from one or more disparate sources. It stores current and historical data in a single place, where it is used to create analytical reports for workers throughout the enterprise. This lets companies query their data, draw insights from it, and make decisions. The data is loaded from operational systems (such as marketing or sales), may pass through an operational data store, and may require cleansing to ensure data quality before it is used in the data warehouse for reporting. Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main approaches used to build a data warehouse system.
  • A Data Lake is a central repository for storing vast amounts of raw, semi-structured, and unstructured data at scale. Unlike traditional databases, data lakes are designed to hold data in its native format without prior structuring. They rely on schema-on-read: structure is applied when the data is read for analysis, not when it is stored. Common processing frameworks, such as Apache Spark, are used for data processing and analysis. Data lakes simplify data exploration by letting users extract insights from raw data before structuring it. However, they can be challenging to manage because of the volume and diversity of the data they hold. Proper planning is necessary to avoid disorganization and poor performance when querying unstructured data.
  • A Data Lakehouse is a hybrid data architecture that aims to combine the benefits of data lakes and data warehouses. It can store both structured and semi-structured data, and it uses ETL and ELT processes to transform and load data for analytical querying. Data Lakehouses support advanced querying with SQL, which makes them compatible with a range of analytics tools and frameworks.
  • Data Mesh is a modern data architecture and organizational approach that aims to address the challenges of scaling and democratizing data within large, complex organizations. It represents a shift away from a centralized data approach to a more decentralized, domain-oriented model. Data Mesh promotes decentralized data ownership and management across domains. It encourages cross-functional teams to treat data as a product and take responsibility for its quality and governance, creating a data fabric that facilitates data discovery, access, and sharing. With Data Mesh, the responsibility for analytical data is shifted from the central data team to the domain teams, supported by a data platform team that provides a domain-agnostic data platform.
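The difference between the ETL and ELT approaches mentioned above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse; the sample rows, the table names (sales_etl, sales_raw, sales_elt), and the helper functions are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical raw rows from an operational sales system (illustrative data).
RAW_SALES = [
    ("2024-01-05", " north ", "120.50"),
    ("2024-01-05", "South", "80.00"),
    ("2024-01-06", "NORTH", "not-a-number"),  # dirty row that needs cleansing
]


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def etl(conn):
    """ETL: cleanse in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE sales_etl (day TEXT, region TEXT, amount REAL)")
    clean = [
        (day, region.strip().lower(), float(amount))
        for day, region, amount in RAW_SALES
        if is_number(amount)
    ]
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", clean)


def elt(conn):
    """ELT: load the raw rows unchanged, then transform inside the database."""
    conn.execute("CREATE TABLE sales_raw (day TEXT, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", RAW_SALES)
    # Simplified cleansing rule for this sketch: keep amounts starting with a digit.
    conn.execute(
        """
        CREATE TABLE sales_elt AS
        SELECT day, lower(trim(region)) AS region, CAST(amount AS REAL) AS amount
        FROM sales_raw
        WHERE amount GLOB '[0-9]*'
        """
    )


def totals_by_region(conn, table):
    """Return (region, total amount) pairs, ordered by region."""
    query = f"SELECT region, SUM(amount) FROM {table} GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
etl(conn)
elt(conn)
etl_totals = totals_by_region(conn, "sales_etl")
elt_totals = totals_by_region(conn, "sales_elt")
print(etl_totals, elt_totals)
```

Both paths produce the same report here. The practical difference is where the transformation runs and whether the raw rows remain available: in the ELT path the untouched sales_raw table stays in the store and can be transformed again later under different rules.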

Gaps and challenges:

  • Dealing with the complexity and variety of data sources, formats, and types, which are the main features of Big Data collections. Data can come from various internal and external systems, such as databases, applications, APIs, sensors, and social media. It can be structured, semi-structured, or unstructured, and can follow different schemas and standards at different quality levels. Managing this complexity and variety requires a clear understanding of the data landscape: its sources, characteristics, and dependencies. It also requires a robust data integration and transformation process that can handle different data formats and types and ensure data consistency and accuracy.
  • Ensuring that the Data Architecture can scale and perform as data volume, velocity, and variety grow. The Data Architecture needs to be flexible and adaptable to changing business needs and data demands. It also needs to be efficient enough to deliver data in a timely and reliable manner. In particular, it must be capable of handling data streams, i.e., the continuous, high-speed transfer of data from one or more sources for real-time processing into specific outputs.
  • Ensuring that the Data Architecture complies with the relevant regulations, policies, and standards for data governance and security.
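The integration challenge described above can be made concrete with a small sketch that maps two differently shaped sources onto one common schema. Everything in it is an assumption made for illustration: the two sample exports, the field names, the target schema (id, name, country), and the merge rule that later sources only fill in missing values.

```python
import csv
import io
import json

# Hypothetical exports from two source systems (illustrative data only).
CRM_CSV = "customer_id,full_name,country\n1,Ada Rossi,IT\n2,Luca Bianchi,it\n"
SHOP_JSON = (
    '[{"id": 2, "name": "Luca Bianchi", "address": {"country": "IT"}},'
    ' {"id": 3, "name": "Mia Verdi"}]'
)


def from_crm(text):
    """Map CSV rows to the common schema: id (int), name, country (or None)."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "id": int(row["customer_id"]),
            "name": row["full_name"],
            "country": row["country"].upper() or None,
        }


def from_shop(text):
    """Map nested JSON objects to the same common schema."""
    for item in json.loads(text):
        country = item.get("address", {}).get("country")
        yield {
            "id": item["id"],
            "name": item["name"],
            "country": country.upper() if country else None,
        }


def integrate(*sources):
    """Merge records by id; later sources only fill in missing fields."""
    merged = {}
    for source in sources:
        for record in source:
            current = merged.setdefault(record["id"], record)
            for key, value in record.items():
                if current.get(key) is None:
                    current[key] = value
    return [merged[key] for key in sorted(merged)]


customers = integrate(from_crm(CRM_CSV), from_shop(SHOP_JSON))
for customer in customers:
    print(customer)
```

Normalizing values such as the country code at the mapping step is what keeps the merged records consistent. A real integration process would also need explicit rules for conflicting values, which this sketch does not attempt.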

Objectives:

  • Definition of technologies, methodologies, and processes that govern data management activities
  • Definition of specific reference architectures

Practical Impact

Adopting a Data Architecture is crucial for any company or organization that needs to process and analyze data effectively and efficiently. Each architecture type is especially useful in specific scenarios, so understanding those scenarios helps in deciding which one is right. The choice depends on the organization's data types, intended uses, and goals. Many organizations currently combine different data processing and analytics architectures to suit their specific requirements, and such a mix can sometimes offer the most complete solution.

Sub-areas:

  • Data integration: different formats/sources
  • Efficient data handling: timely responses, streaming
  • Data governance: supporting harmonized data activities across the organization
  • Data security
  • Distributed data processing: Distributed data architectures can reduce the time to access data, offer redundancy, and increase flexibility.
