Big-Data Engineering

A starting point for anyone getting into data engineering.


5/8/2024 · 5 min read


This is one of many articles to come on Big-Data Engineering and all the buzz around it. In this article we cover the very basics of the field: what Big Data means to start with, what data engineering is, who a data engineer is, what a data engineer's responsibilities are, and finally the skills and technologies required.

1. What is Big-Data?

Big data can be described in terms of data management challenges that — due to increasing volume, velocity and variety of data — cannot be solved with traditional databases. While there are plenty of definitions for big data, most of them include what's commonly known as the "three V's" of big data:

Volume: Ranges from terabytes to petabytes of data

Variety: Includes data from a wide range of sources and formats (e.g. web logs, social media interactions, e-commerce and online transactions, financial transactions, etc)

Velocity: Increasingly, businesses have stringent requirements on the time between data being generated and actionable insights being delivered to users. Therefore, data needs to be collected, stored, processed, and analysed within relatively short windows, ranging from daily to real-time.

This data comes from myriad sources: smartphones and social media posts; sensors, such as traffic signals and utility meters; point-of-sale terminals; consumer wearables such as fit meters; electronic health records; and on and on.

Beyond the Three V's

More recently, big-data practitioners and thought leaders have proposed additional Vs:

Veracity: This refers to the quality of the collected data. If source data is not correct, analyses will be worthless. As the world moves toward automated decision-making, where computers make choices instead of humans, it becomes imperative that organisations be able to trust the quality of the data.

Variability: Data’s meaning is constantly changing. For example, language processing by computers is exceedingly difficult because words often have several meanings. Data scientists must account for this variability by creating sophisticated programs that understand context and meaning.

Visualisation: Data must be understandable to nontechnical stakeholders and decision makers. Visualisation is the creation of complex graphs that tell the data scientist’s story, transforming the data into information, information into insight, insight into knowledge, and knowledge into advantage.

Value: How can organisations make use of big data to improve decision-making? A McKinsey article about the potential impact of big data on health care in the U.S. suggested that big-data initiatives “could account for $300 billion to $450 billion in reduced health-care spending, or 12 to 17 percent of the $2.6 trillion baseline in US health-care costs.” The secrets hidden within big data can be a goldmine of opportunity and savings.

How Does Big Data Work?

With new tools that address the entire data management cycle, big data technologies make it technically and economically feasible, not only to collect and store larger datasets, but also to analyse them in order to uncover new and valuable insights. In most cases, big data processing involves a common data flow — from collection of raw data to consumption of actionable information (a minimal sketch of this flow, in Python, follows the list below).

  • Collect. Collecting the raw data — transactions, logs, mobile devices and more — is the first challenge many organisations face when dealing with big data. A good big data platform makes this step easier, allowing developers to ingest a wide variety of data — from structured to unstructured — at any speed — from real-time to batch.

  • Store. Any big data platform needs a secure, scalable, and durable repository to store data prior to, or even after, processing. Depending on your specific requirements, you may also need temporary stores for data in transit.

  • Process & Analyse. This is the step where data is transformed from its raw state into a consumable format — usually by means of sorting, aggregating, joining and even performing more advanced functions and algorithms. The resulting data sets are then stored for further processing or made available for consumption via business intelligence and data visualisations tools.

  • Consume & Visualize. Big data is all about getting high value, actionable insights from your data assets. Ideally, data is made available to stakeholders through self-service business intelligence and agile data visualisation tools that allow for fast and easy exploration of datasets. Depending on the type of analytics, end-users may also consume the resulting data in the form of statistical “predictions” — in the case of predictive analytics — or recommended actions — in the case of prescriptive analytics.

The Evolution of Big Data Processing

The big data ecosystem continues to evolve at an impressive pace. Today, a diverse set of analytic styles support multiple functions within the organisation.

  • Descriptive analytics help users answer the question: “What happened and why?” Examples include traditional query and reporting environments with scorecards and dashboards.

  • Predictive analytics help users estimate the probability of a given event in the future. Examples include early alert systems, fraud detection, preventive maintenance applications, and forecasting.

  • Prescriptive analytics provide specific (prescriptive) recommendations to the user. They address the question: "What should I do if x happens?" (A small sketch contrasting all three styles follows this list.)
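As a rough illustration of how the three styles differ in practice, here is a small Python sketch with made-up daily sales numbers; the naive linear trend stands in for a real predictive model.

```python
# Illustrative only: the daily sales figures below are made up.
import numpy as np
import pandas as pd

daily_sales = pd.Series(
    [120, 135, 128, 150, 162, 158, 171],
    index=pd.date_range("2024-04-01", periods=7),
)

# Descriptive analytics: what happened?
print("total sales:", daily_sales.sum())
print("best day:", daily_sales.idxmax().date())

# Predictive analytics: estimate a future value (here, a naive linear trend).
x = np.arange(len(daily_sales))
slope, intercept = np.polyfit(x, daily_sales.values, deg=1)
forecast = slope * len(daily_sales) + intercept
print("forecast for the next day:", round(forecast, 1))

# Prescriptive analytics: recommend an action based on the prediction.
if forecast > 170:
    print("recommendation: increase stock ahead of expected demand")
```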

2. What is Data engineering?

Data engineering is crucial to any tech-driven organisation. While the specifics of the role vary from job to job, its primary function is to develop, test, and maintain big data architectures, data pipelines, warehouses, and other processing systems.

3. Who is a Big-Data Engineer?

The person in charge of designing and developing data pipelines is known as a Big Data Engineer. They are the brains behind collecting data from various sources and organising it into datasets that analysts and data scientists can work with.

A data engineer’s ultimate goal is to retrieve, store, and distribute data throughout an organisation.

4. What are the responsibilities?
  • Design, create, and manage scalable ETL (extract, transform, load) systems and pipelines for various data sources

  • Manage, improve, and maintain existing data warehouse and data lake solutions

  • Optimize and improve existing data quality and data governance processes to improve performance and stability (a minimal example of such a check follows this list)

  • Build bespoke tools and algorithms for the data science and data analytics teams (and other data-driven teams across the business)

  • Work closely with business intelligence teams and software developers to define strategic objectives as data models

  • Work closely with the wider IT team to manage the business’s wider infrastructure

  • Explore the next generation of data-related tech to expand the organisation’s capacity and maintain a competitive edge.
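As a small illustration of the data-quality responsibility above, below is a minimal sketch of the kind of automated check a pipeline might run before loading a batch; the column names, sample data, and thresholds are hypothetical.

```python
# A minimal, hypothetical data-quality gate an ETL pipeline might run
# before loading a batch into the warehouse. Columns and data are made up.
import pandas as pd

def quality_report(df: pd.DataFrame, key_column: str) -> dict:
    """Return simple quality metrics: row count, null shares, duplicate keys."""
    return {
        "rows": len(df),
        "null_share": df.isna().mean().round(3).to_dict(),
        "duplicate_keys": int(df[key_column].duplicated().sum()),
    }

batch = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": [10.0, None, 5.5, 7.25],
})

print(quality_report(batch, key_column="order_id"))

# A real pipeline would fail the run (or quarantine the batch) when these
# metrics breach agreed thresholds, e.g. any duplicate keys or too many nulls.
```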

5. What are the skills and technologies?
  • Critical thinking, excellent communication, team working, and problem-solving

  • Data warehouse, data lake, data mart, and data modelling techniques

  • Hands-on experience using Python and SQL, and big data technologies like the Apache Stack

  • Experience using relational database management systems, e.g. PostgreSQL, MySQL

  • Understanding of batch and real-time data integration, data replication, data streaming, virtualisation, and so on.

  • Data storage and management, such as Hadoop HDFS, AWS S3, AWS Redshift, etc.

  • Computing engines such as Spark, MapReduce, Hive, Presto, etc. (a short Spark example follows this list)

  • Cluster managers such as YARN and Mesos

  • Reporting and visualisation of the data.
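To tie a few of these skills together, here is a short, hedged PySpark sketch of a typical batch job: read a day of events from object storage, aggregate them with Spark, and write the result back as Parquet. The bucket name, paths, and columns are hypothetical; in practice the input could equally sit in HDFS or on local disk.

```python
# Hypothetical batch job: the bucket, paths, and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-revenue").getOrCreate()

# Read a day of event data from object storage.
events = spark.read.json("s3a://example-bucket/events/2024-05-08/")

# Aggregate purchase events into revenue per country.
daily_revenue = (
    events.filter(F.col("event_type") == "purchase")
          .groupBy("country")
          .agg(F.sum("amount").alias("revenue"))
)

# Write the result back to a staging area as Parquet for downstream consumers.
daily_revenue.write.mode("overwrite").parquet("s3a://example-bucket/marts/daily_revenue/")

spark.stop()
```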

Hope this gives you a solid understanding to get started with data engineering. I will be covering a few more conceptual details in upcoming articles, so stay tuned.