Big data

"Big data" refers to the management and analysis of datasets that are too large, too fast-moving, or too complex for conventional databases to process efficiently.

Big data processes

Big data processes include:

  • Data ingestion: The process of collecting and importing data from various sources. Ingestion can be understood as "the absorption of information".

  • Data storage: Big data platforms typically use distributed storage systems, such as Hadoop HDFS, Amazon S3, and Google Cloud Storage. Distributed NoSQL databases such as MongoDB and Cassandra are also popular.

  • Data processing: This is where data is transformed into meaningful and actionable insights. Processing typically involves removing errors and duplicates, integrating data from different sources, and categorizing it. There are two main data processing models:

    • Batch processing: Processes data in large groups that have been collected over a period of time. Suitable for high-volume data. Used by tools like Apache Hadoop.

    • Real-time processing: Processes data as it flows in. Apache Flink and FineDataLink support this model.

  • Data analytics: Overlaps with data processing, but also includes the presentation and visualization of processed data to reveal insights, trends, and patterns. Machine learning tools can also be used to automate analysis.

  • Data mining: The process of discovering patterns and knowledge from large amounts of data. It involves using statistical and computational techniques to analyze data and extract useful information.

  • Data visualization: Maps, graphs, and charts make sense of raw numbers and text, helping to pinpoint trends, patterns, and outliers.
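
The cleaning step and the two processing models described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the event records, field layout, and function names (`clean`, `batch_totals`, `streaming_totals`) are hypothetical and do not come from Hadoop, Flink, or any other tool named here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Event = Tuple[int, str, Optional[float]]  # (event_id, user, amount)

# Hypothetical raw events, invented for this sketch.
RAW_EVENTS: List[Event] = [
    (1, "alice", 10.0),
    (2, "bob", 5.0),
    (2, "bob", 5.0),     # duplicate of event 2
    (3, "alice", None),  # invalid: amount is missing
    (4, "bob", 7.5),
]


def clean(events: Iterable[Event]) -> Iterator[Tuple[int, str, float]]:
    """Drop records with a missing amount or an already-seen event ID."""
    seen = set()
    for event_id, user, amount in events:
        if amount is None or event_id in seen:
            continue
        seen.add(event_id)
        yield event_id, user, amount


def batch_totals(events: Iterable[Tuple[int, str, float]]) -> Dict[str, float]:
    """Batch model: read the whole dataset, then return totals once."""
    totals: Dict[str, float] = defaultdict(float)
    for _, user, amount in events:
        totals[user] += amount
    return dict(totals)


def streaming_totals(
    events: Iterable[Tuple[int, str, float]]
) -> Iterator[Dict[str, float]]:
    """Streaming model: emit an updated snapshot after every event."""
    totals: Dict[str, float] = defaultdict(float)
    for _, user, amount in events:
        totals[user] += amount
        yield dict(totals)


cleaned = list(clean(RAW_EVENTS))
batch_result = batch_totals(cleaned)
stream_snapshots = list(streaming_totals(cleaned))
```

Both models arrive at the same final totals. The difference is that the streaming version has a usable intermediate result after each event, while the batch version produces nothing until the whole dataset has been read.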

Big data technologies

Big data technologies include:

  • Data lakes: Store and process multiple data formats, including structured, semi-structured, and unstructured data, ingested from multiple sources (e.g., on-premises, cloud, edge).

  • Data warehouses: Store structured and semi-structured data from multiple sources for analysis and reporting. Suitable for ad hoc analysis, custom reporting, and business intelligence support activities.

  • Stream processing systems: Handle streaming data. Suitable for applications that need an immediate response, e.g., fraud detection.

  • Distributed databases and distributed file systems: Store very large datasets across multiple nodes.

  • NoSQL databases: Store data differently from traditional relational databases, using flexible formats like JSON instead of rigid tables. This allows storage of large, unstructured datasets with high speed and scalability.
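
Two of the ideas above, spreading data across nodes and storing flexible JSON documents rather than fixed table rows, can be illustrated together with a toy in-process store. This is a minimal sketch, not how MongoDB, Cassandra, or any real database is implemented: the `ShardedDocumentStore` class, its methods, and the sample documents are all hypothetical, and each "node" is just a Python dictionary.

```python
import hashlib
import json
from typing import Any, Dict, List


class ShardedDocumentStore:
    """Toy document store that spreads JSON documents over several 'nodes'."""

    def __init__(self, node_count: int) -> None:
        # Each node is a plain dict; a real system would use separate machines.
        self.nodes: List[Dict[str, str]] = [{} for _ in range(node_count)]

    def _node_for(self, key: str) -> int:
        # Hash the key so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key: str, document: Dict[str, Any]) -> int:
        """Store a document as JSON text and return the index of its node."""
        node = self._node_for(key)
        self.nodes[node][key] = json.dumps(document)
        return node

    def get(self, key: str) -> Dict[str, Any]:
        """Fetch a document by key from the node that owns it."""
        return json.loads(self.nodes[self._node_for(key)][key])

    def find(self, field: str, value: Any) -> List[Dict[str, Any]]:
        """Scan every node for documents whose `field` equals `value`."""
        matches = []
        for node in self.nodes:
            for raw in node.values():
                doc = json.loads(raw)
                if doc.get(field) == value:
                    matches.append(doc)
        return matches


store = ShardedDocumentStore(node_count=3)
# Documents in the same store do not need to share a schema.
store.put("user:1", {"name": "Ada", "tags": ["admin"]})
store.put("user:2", {"name": "Lin", "email": "lin@example.com"})
store.put("order:9", {"user": "user:1", "total": 42.5,
                      "items": [{"sku": "A1", "qty": 2}]})
```

Lookups by key touch only one node, whereas `find` has to scan all of them. That trade-off is one reason key design matters in distributed stores.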

Big data platforms

Apache Hadoop

Developed in the early 2000s, Hadoop is an open-source framework built to process vast datasets across distributed clusters of computers. Hadoop is popular among enterprises due to its fault tolerance and scalability.

Key components of the platform, such as the Hadoop Distributed File System (HDFS) and MapReduce, allow businesses to store, process, and analyze structured and unstructured data on a large scale. Hadoop can also perform cluster analysis through integrations with tools like Apache Mahout, which provides scalable machine learning algorithms for clustering and classification.
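
The MapReduce programming model can be sketched with the classic word-count example. The code below is a single-process illustration of the map, shuffle, and reduce phases only. It does not use Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data moves fast"])
```

In a real cluster, the map and reduce calls would run in parallel on different machines, with the shuffle step moving data between them over the network.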

Apache Spark

Apache Spark was originally developed at UC Berkeley in 2009. It is a fast, open-source analytics engine designed for large-scale data processing, and one of the most popular data platforms. It handles both batch and real-time workloads. Because it can keep working data in memory, it often completes tasks faster than traditional disk-based systems.
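
The benefit of in-memory processing is easiest to see when the same data is used several times. The following is a toy analogy in plain Python, not Spark code: the `Dataset` class, its `cache` flag, and the `run_passes` helper are hypothetical, and the "disk" is a small temporary file.

```python
import os
import tempfile
from typing import List, Optional, Tuple


class Dataset:
    """Toy dataset backed by a text file with one integer per line."""

    def __init__(self, path: str, cache: bool) -> None:
        self.path = path
        self.cache = cache
        self.disk_reads = 0  # how many times the file was actually read
        self._cached: Optional[List[int]] = None

    def rows(self) -> List[int]:
        # Serve from memory when caching is on and the data is already loaded.
        if self.cache and self._cached is not None:
            return self._cached
        self.disk_reads += 1
        with open(self.path, encoding="utf-8") as handle:
            data = [int(line) for line in handle]
        if self.cache:
            self._cached = data
        return data


def run_passes(dataset: Dataset) -> Tuple[int, int, int]:
    """Three separate passes over the same data: sum, max, and count."""
    return sum(dataset.rows()), max(dataset.rows()), len(dataset.rows())


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(str(n) for n in range(1, 101)))

    disk_based = Dataset(path, cache=False)
    in_memory = Dataset(path, cache=True)
    disk_result = run_passes(disk_based)
    memory_result = run_passes(in_memory)
```

Both variants compute the same answers, but the uncached one reads the file once per pass while the cached one reads it once in total. Iterative workloads that revisit the same data many times gain the most from this.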

Spark supports several programming languages, including Scala, Java, Python, and R. It integrates with the Hadoop ecosystem and offers Spark SQL for querying data.

Google BigQuery

Developed by Google, BigQuery is a fully managed and serverless data warehouse designed for large-scale data processing.

Microsoft Azure HDInsight

Microsoft Azure HDInsight is a fully managed cloud service for processing and analyzing large datasets. The platform supports popular open-source frameworks such as Apache Hadoop and Apache Spark. It offers a scalable, reliable, and flexible infrastructure for deploying and managing clusters.

Databricks

Databricks provides a fully managed, scalable infrastructure for real-time data processing and complex analytics. Built on Apache Spark, Databricks aims to simplify the development and deployment of big data applications.