1. What are OLTP and OLAP:

OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two of the most important and widely used branches of SQL applications. Here are the details:

OLTP (Online Transaction Processing)

  1. What is OLTP: A system for managing real-time, transaction-oriented applications like retail POS and banking.
  1. Designed for: Handling frequent, short, atomic transactions.
  1. Focus: Real-time data processing for day-to-day operations.
  1. Normalized Data: Data is highly structured to avoid redundancy (follows 3NF or higher). It breaks data into multiple tables linked by relationships.
  1. CRUD Operations: Focuses on Create, Read, Update, and Delete operations.
  1. Typical SQL Queries: Simple, involving a few records.
  1. Indexed SQL Operations: Indexes speed up data retrieval by organizing records for faster searching (e.g., B-tree, hash indexes).
  1. Common Databases: MySQL, PostgreSQL, SQL Server, Oracle.

OLAP (Online Analytical Processing)

  1. What is OLAP: A system for analyzing large datasets, supporting complex queries for decision-making.
  1. Designed for: Data analysis and generating insights from historical data.
  1. Focus: Aggregating, slicing, and dicing data for business intelligence.
  1. Denormalized Data: Data is stored in fewer, wider tables (e.g., star schema) for easy access to large datasets.
  1. Complex Queries: Focuses on grouping, aggregating, and multi-table joins.
  1. Indexed SQL Operations: Uses indexes like bitmap indexes and materialized views to improve query performance.
  1. Common Databases: Amazon Redshift, Google BigQuery, Snowflake, Azure Synapse.
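
To make the contrast concrete, here is a minimal sketch of the two query styles, assuming a hypothetical orders table (order_id, customer_id, region, order_date, amount); all names are illustrative:

```sql
-- OLTP-style: a short, indexed lookup that touches a few rows
SELECT order_id, amount
FROM orders
WHERE customer_id = 1001
  AND order_date = '2024-01-15';

-- OLAP-style: scans history and aggregates across dimensions
SELECT region,
       EXTRACT(YEAR FROM order_date) AS order_year,
       SUM(amount) AS total_sales
FROM orders
GROUP BY region, EXTRACT(YEAR FROM order_date)
ORDER BY order_year, region;
```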
       

2. Power BI

Power BI is a business analytics service by Microsoft that allows users to create interactive visualizations and access business intelligence (BI) features. It enables users to generate dashboards and reports (friendly to non-technical stakeholders), and to set up automated reports to track key performance indicators (KPIs). Power BI integrates with Microsoft products like Excel, Azure, and SQL Server.

Procedures for Using Power BI:
1. Data Connection and Transformation:
    • Open Power BI and use the Get Data feature to connect to your data sources (e.g., Excel, SQL Server, or online services).
    • Use Power Query Editor to clean and transform your data. This could include tasks such as removing duplicates, filtering data, or combining data from different sources.
    • Once the data is ready, click Close & Apply to load it into Power BI for visualization.
1. Data Modeling:
    • In the Model view, establish relationships between tables by linking columns (usually primary and foreign keys).
    • Use DAX (Data Analysis Expressions) to create new calculated columns, measures (e.g., total sales), and complex calculations like year-over-year growth (see the DAX sketch after this list).
    • Ensure the model is structured efficiently to facilitate smooth querying and report creation.
1. Creating Data Visualizations:
    • Navigate to the Report view and start adding visualizations by dragging fields into charts, tables, or other graphical elements (e.g., bar charts, line graphs).
    • Customize the visualizations by adjusting colors, labels, and formatting to ensure clarity and readability.
    • Add slicers and filters to enable interactive exploration of data.
1. Building Dashboards and Reports:
    • After creating the visualizations, arrange them into an interactive dashboard or report.
    • Use drill-through features to allow users to explore more detailed data when needed.
    • Organize the layout so that stakeholders can easily interpret the data insights at a glance.
1. Automating Report Refreshes and Monitoring:
    • Set up automatic data refresh for your reports by configuring scheduled refresh times through the Power BI Service.
    • Create alerts that notify users when KPIs reach a specified threshold.
    • Make use of Power BI’s Q&A feature, allowing users to type natural language queries to explore the data.
1. Sharing Reports and Collaboration:
    • Publish your reports to the Power BI Service (cloud platform) for sharing and collaboration.
    • Create workspaces where teams can access, view, and collaborate on reports.
    • Assign different roles (e.g., viewer, editor) to control access to the reports and dashboards.
    • Use Power BI apps to package and share collections of dashboards and reports across the organization.
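
A minimal DAX sketch of the measures mentioned in the Data Modeling step, assuming a hypothetical Sales table with an Amount column and a marked date table Dates (all names are illustrative):

```dax
-- Basic measure: total of the Amount column
Total Sales = SUM(Sales[Amount])

-- Year-over-year growth via time intelligence over the Dates table
Sales LY = CALCULATE([Total Sales], SAMEPERIODLASTYEAR(Dates[Date]))
YoY Growth % = DIVIDE([Total Sales] - [Sales LY], [Sales LY])
```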
       

3. Schemas and Design Tools:

1. Logical Models and Schemas:
    • Star Schema: A central fact table surrounded by dimension tables. This is simple and efficient for querying but may introduce redundancy (see the DDL sketch after this list).
    • Snowflake Schema: Normalizes dimension tables to remove redundancy, optimizing storage but potentially reducing query performance.
    • Fact Constellation Schema: Contains multiple fact tables sharing dimension tables, allowing for more complex queries.
1. Design Tools:
    • ERDPlus (https://erdplus.com/):
        • An online tool used to create Entity Relationship Diagrams (ERDs), relational schemas, and dimensional models (such as star and snowflake schemas), and to generate SQL Data Definition Language (DDL) statements.
        • ERDPlus enables drawing entities, attributes, relationships, fact tables, and dimension tables, while also defining primary and foreign keys with referential integrity constraints.
    • Visual Paradigm (https://www.visual-paradigm.com/): Another tool for creating ERDs and relational schemas. The online version has limited features, but a free trial is available for the full version.
    • Microsoft Visio: A popular tool for designing flowcharts and ER diagrams. It is a paid tool but offers advanced functionality for database diagramming.
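
A minimal sketch of a star schema in SQL DDL (the kind of output ERDPlus can generate), assuming hypothetical sales fact and dimension tables; all table and column names are illustrative:

```sql
-- Dimension tables
CREATE TABLE dim_date (
    date_key     INT PRIMARY KEY,
    full_date    DATE,
    year         INT,
    month        INT
);

CREATE TABLE dim_product (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

-- Central fact table referencing the dimensions
CREATE TABLE fact_sales (
    date_key     INT REFERENCES dim_date(date_key),
    product_key  INT REFERENCES dim_product(product_key),
    quantity     INT,
    amount       DECIMAL(10,2)
);
```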
       

4. SQL Operations

Examples of SQL Operations:
• Joins combine data from multiple tables.
• Nested Queries are queries within other queries (correlated or non-correlated).
• Aggregation functions (SUM, AVG, etc.) summarize data, often combined with GROUP BY.
• HAVING filters results after aggregation.
• CUBE and ROLLUP are advanced aggregation techniques for multidimensional analysis.
• Slicing and Dicing involve filtering and reorganizing data from different perspectives in OLAP analysis.
• Views allow storing reusable queries as virtual tables, or as materialized tables for performance optimization.

Here’s an explanation of each SQL operation concept, including joins, nested queries, aggregation, ROLLUP, CUBE, slicing, dicing, HAVING, and views. The code sketches in this section assume hypothetical employees and departments tables; all names are illustrative.

1. Joins

Joins are used in SQL to combine data from two or more tables based on a related column between them. Example queries for each join type follow the list.
• Inner Join: Returns rows that have matching values in both tables. If there is no match, the row is excluded from the result.
• Left Outer Join: Returns all rows from the left table and the matching rows from the right table. If there is no match, NULL values are returned for columns from the right table.
• Right Outer Join: Returns all rows from the right table and the matching rows from the left table. If there is no match, NULL values are returned for columns from the left table.
• Full Outer Join: Returns all rows when there is a match in either table. Non-matching rows are filled with NULLs from the other table.
• Cross Join: Returns the Cartesian product of two tables, combining every row from the first table with every row from the second table.
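
A minimal sketch of each join type, assuming hypothetical employees (employee_id, name, department_id) and departments (department_id, department_name) tables; note that support varies by engine (e.g., MySQL lacks FULL OUTER JOIN):

```sql
-- Inner join: only employees whose department_id matches a department
SELECT e.name, d.department_name
FROM employees e
INNER JOIN departments d ON e.department_id = d.department_id;

-- Left outer join: all employees, NULL department_name when unmatched
SELECT e.name, d.department_name
FROM employees e
LEFT JOIN departments d ON e.department_id = d.department_id;

-- Right outer join: all departments, NULL name when unmatched
SELECT e.name, d.department_name
FROM employees e
RIGHT JOIN departments d ON e.department_id = d.department_id;

-- Full outer join: all rows from both sides, NULLs where unmatched
SELECT e.name, d.department_name
FROM employees e
FULL OUTER JOIN departments d ON e.department_id = d.department_id;

-- Cross join: every employee paired with every department
SELECT e.name, d.department_name
FROM employees e
CROSS JOIN departments d;
```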

2. Nested Queries (Subqueries)

Nested queries are queries embedded inside another query. They can be correlated or non-correlated. Examples of both follow.
• Non-Correlated Subqueries: The subquery is independent of the outer query and can run separately. It is like a function providing a result used by the outer query.
• Correlated Subqueries: The subquery depends on the outer query for its values, and it runs once for each row processed by the outer query.
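
Minimal sketches of both kinds, using the hypothetical employees table (assumed to have a salary column):

```sql
-- Non-correlated: the subquery runs once, independent of the outer query
SELECT name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Correlated: the subquery references the outer row (e.department_id),
-- so it is evaluated once per outer row
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.department_id = e.department_id
);
```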

3. Aggregation

Aggregation functions perform calculations on multiple rows of a table and return a single result.
• Common Aggregation Functions:
    • SUM(): Adds up all values in a column.
    • COUNT(): Counts the number of rows.
    • AVG(): Calculates the average value.
    • MAX(): Returns the highest value.
    • MIN(): Returns the lowest value.
• GROUP BY: Groups rows that have the same values into summary rows, often used with aggregation functions. An example follows.
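
A minimal GROUP BY sketch over the hypothetical employees table:

```sql
-- Headcount and average salary per department
SELECT department_id,
       COUNT(*)    AS employee_count,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;
```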

4. HAVING Clause

HAVING is used to filter data after GROUP BY and aggregation. Unlike WHERE, which filters rows before grouping, HAVING filters the aggregated results. An example follows.
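
A minimal sketch, keeping only departments whose average salary exceeds an illustrative threshold:

```sql
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 50000;  -- filters groups, not individual rows
```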

5. CUBE Operator

The CUBE operator in SQL is used for generating subtotals and grand totals across all combinations of a set of dimensions. It performs aggregation for all possible combinations of the listed columns.
• Example: the query sketched below returns totals for:
    • department_id + job_id
    • department_id only
    • job_id only
    • grand total
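
A minimal CUBE sketch, assuming the employees table also has a job_id column (syntax as in Oracle, SQL Server, and PostgreSQL; MySQL does not support CUBE):

```sql
SELECT department_id, job_id, SUM(salary) AS total_salary
FROM employees
GROUP BY CUBE (department_id, job_id);
```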

6. ROLLUP Operator

The ROLLUP operator is a special extension of GROUP BY that creates subtotals in a hierarchical fashion. It computes aggregates for a sequence of dimensions, moving from most detailed to least detailed.
• Example: the query sketched below returns totals for:
    • department_id + job_id
    • department_id only
    • grand total
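
The corresponding ROLLUP sketch; note that it omits the job_id-only subtotal that CUBE would produce:

```sql
SELECT department_id, job_id, SUM(salary) AS total_salary
FROM employees
GROUP BY ROLLUP (department_id, job_id);
```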

7. Slicing and Dicing

Slicing and dicing are OLAP techniques used to analyze data from different perspectives.
• Slicing: Cutting a data cube across a single dimension, creating a "slice" of the data for a specific value.
    • Example: Retrieving sales data for a specific year.
• Dicing: Cutting the data cube along multiple dimensions, retrieving a subset of data.
    • Example: Retrieving sales data for a specific year and region.
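
In SQL terms, both operations reduce to WHERE filters over dimension columns; a minimal sketch assuming a hypothetical sales table with year and region columns:

```sql
-- Slice: fix a single dimension (year)
SELECT * FROM sales WHERE year = 2023;

-- Dice: fix multiple dimensions (year and region)
SELECT * FROM sales WHERE year = 2023 AND region = 'EMEA';
```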

8. Views

A view is a virtual table based on the result of an SQL query. It acts like a table but does not physically store the data. Examples of both kinds follow.
• Virtual Views: Computed every time they are queried; always up-to-date, but can be slower since the query runs each time.
• Materialized Views: Precomputed and stored physically on disk, providing faster query performance, but they might contain outdated data if not refreshed.
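
Minimal sketches of both (CREATE MATERIALIZED VIEW syntax as in PostgreSQL and Oracle; MySQL does not support it natively):

```sql
-- Virtual view: re-executed on every query
CREATE VIEW high_earners AS
SELECT name, salary
FROM employees
WHERE salary > 100000;

-- Materialized view: results stored on disk, refreshed on demand
CREATE MATERIALIZED VIEW dept_salary_summary AS
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

REFRESH MATERIALIZED VIEW dept_salary_summary;  -- PostgreSQL refresh
```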
                             

5. HIVE

Hive is a data warehousing solution built on top of Hadoop, allowing users to run SQL-like queries (HiveQL) on large datasets stored in Hadoop’s HDFS (Hadoop Distributed File System). It acts as a bridge between SQL-based querying and the underlying distributed computing power of Hadoop. Here's how it works based on the PDF:
1. HiveQL (Hive Query Language):
    • Hive uses HiveQL, a SQL-like language, to perform queries on datasets. It simplifies querying by offering familiar SQL syntax to users, abstracting the underlying complexity of Hadoop’s MapReduce processes.
    • Basic Queries: Users can query tables using simple commands such as SELECT, WHERE, GROUP BY, etc., similar to standard SQL. Hive also supports advanced operations like joins and aggregate functions (e.g., COUNT, AVG).
1. How Hive Runs on Hadoop:
    • When a user submits a query via Hue (a web interface for Hadoop services) or a command-line interface, Hive converts that query into MapReduce jobs that run on the Hadoop cluster. These jobs break down large datasets, distribute the workload across multiple nodes, and then combine the results.
    • Query Execution: Queries written in HiveQL are parsed and transformed into one or more MapReduce jobs. For example, a SELECT query is broken into a Map phase (retrieving relevant data) and a Reduce phase (aggregating or filtering results).
1. Data Loading and Storage:
    • Hive interacts with Hadoop’s HDFS for storage. Data is stored in tables, with columns representing structured data (e.g., STRING, INT, DATE).
    • Loading Data: Data can be loaded from various sources (e.g., CSV files) stored in HDFS. In the example from the PDF, an employee dataset (employee.csv) is uploaded to Amazon S3 and then loaded into Hive (see the HiveQL sketch after this list).
1. Table Creation and Querying:
    • Table Creation: Before querying, tables need to be created with the correct schema (e.g., data types for each column). In the example provided, the employee_data table is created to store employee details with columns like Employee_Name, Salary, Position, etc.
    • Querying Data: Once the data is loaded, users can run queries like filtering (WHERE clauses), sorting (ORDER BY), and aggregating (GROUP BY).
1. Advanced Features:
    • Joins: Hive supports joining multiple tables. For example, you can join employee data with geographic data.
    • Functions and Expressions: Hive provides built-in functions (e.g., AVG(), SUM(), UPPER()) to perform data transformations and calculations.
1. Results and Visualization:
    • After a query is executed, Hive returns the results, which can be visualized in tools like Hue. Hue provides additional functionalities like data visualization (scatter plots, pie charts, etc.) to analyze results interactively.
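
A minimal HiveQL sketch of the table-creation, loading, and querying steps described above; the schema, column names, and S3 path are illustrative, not taken from the original PDF:

```sql
-- Create a table matching the CSV layout (illustrative schema)
CREATE TABLE employee_data (
    Employee_Name STRING,
    Position      STRING,
    Salary        INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the CSV (hypothetical S3 location, as on EMR)
LOAD DATA INPATH 's3://my-bucket/employee.csv' INTO TABLE employee_data;

-- Query: average salary per position, highest first
SELECT Position, AVG(Salary) AS avg_salary
FROM employee_data
GROUP BY Position
ORDER BY avg_salary DESC;
```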
                                   

6. How Elasticsearch and Kibana Work:

Elasticsearch Overview:

Elasticsearch is a distributed, RESTful search and analytics engine designed for scalability, speed, and real-time capabilities. It is used for full-text search, structured search, and complex data analysis across large datasets.

Kibana Overview:

Kibana is a visualization tool that works on top of Elasticsearch. It provides a user-friendly interface for querying, visualizing, and analyzing data stored in Elasticsearch.

How to Use Elasticsearch and Kibana Together:

1. Install Elasticsearch: Run Elasticsearch on your machine, ensuring it's available on localhost:9200.
1. Install Kibana: Launch Kibana on localhost:5601. Kibana provides an interface to visualize Elasticsearch data.
1. Add Sample Data: Use Kibana's interface to add sample datasets (e.g., eCommerce data).
1. Explore Data with Kibana:
    • Use Kibana’s Dev Tools to interact with the data using Elasticsearch queries.
    • Create visualizations such as bar charts, pie charts, and maps to explore and analyze the data indexed in Elasticsearch.
    • Build dashboards that aggregate different visualizations for comprehensive insights.

Key Concepts:

1. Domain Specific Language (DSL)

• DSL in Elasticsearch allows users to write queries in JSON format. It offers powerful querying capabilities, supporting both structured queries and full-text searches.
• Example: a basic query to retrieve all documents from an index follows.
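
A minimal sketch in Kibana Dev Tools console syntax; the index name my-index is illustrative:

```
GET /my-index/_search
{
  "query": {
    "match_all": {}
  }
}
```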

2. Creating an Index and Documents

• Index Creation: You can create an index where documents will be stored.
• Adding a Document: Documents are added to an index. A document is essentially a JSON object that represents the data. Sketches of both follow.
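
Minimal console sketches (the index name and document fields are illustrative):

```
PUT /my-index

POST /my-index/_doc
{
  "name": "Alice",
  "age": 30
}
```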

3. Searching Documents

• Querying: You can query documents stored in Elasticsearch using DSL, as sketched below.
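
A minimal match-query sketch (the field and value are illustrative):

```
GET /my-index/_search
{
  "query": {
    "match": { "name": "Alice" }
  }
}
```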

4. Bulk Ingestion

• Elasticsearch supports bulk operations, which allow multiple documents to be indexed, updated, or deleted in a single API call.
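
A minimal _bulk sketch; each action line is followed by its document on the next line (documents are illustrative, and delete actions take no document line):

```
POST /my-index/_bulk
{ "index": {} }
{ "name": "Alice", "age": 30 }
{ "index": {} }
{ "name": "Bob", "age": 25 }
{ "delete": { "_id": "1" } }
```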

5. Mapping

• Dynamic Mapping: Elasticsearch automatically detects and adds new fields when a document is indexed without a pre-defined schema.
• Explicit Mapping: You can manually define the structure of an index by specifying field types, such as text, integer, or float, as sketched below.
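
A minimal explicit-mapping sketch (field names are illustrative):

```
PUT /my-index-2
{
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age":  { "type": "integer" },
      "rate": { "type": "float" }
    }
  }
}
```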

6. Analyzer API

• Analyzers are used to break down text into tokens for searching. Elasticsearch provides different analyzers such as standard and keyword.
• Example: using the Analyzer API to analyze a text with the standard analyzer, as sketched below.
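
A minimal _analyze sketch; the standard analyzer lowercases the sample text and splits it into tokens:

```
POST /_analyze
{
  "analyzer": "standard",
  "text": "Quick Brown Foxes!"
}
```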

7. Coercion

• Coercion is Elasticsearch’s attempt to automatically clean and convert data types. For example, if a number is stored as a string, Elasticsearch may attempt to convert it to a number, as sketched below.
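
A minimal sketch: given the integer mapping for age defined above, indexing the string "30" is coerced to the number 30 (coercion is enabled by default for numeric fields):

```
POST /my-index-2/_doc
{
  "age": "30"
}
```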
                                                   

7. Elasticsearch on AWS

How AWS EMR and S3 work together to process large datasets with Hadoop:

1. Define EMR Cluster: Set up an EMR cluster with Hadoop on AWS

Amazon EMR (Elastic MapReduce) is a cloud-based service used to process large datasets efficiently by distributing the workload across multiple virtual machines (EC2 instances). EMR supports big data frameworks like Hadoop, Spark, and Presto.

2. Upload Data/Code to S3: Use S3 to store both input data and the JAR files

Amazon S3 (Simple Storage Service) is a scalable cloud storage service where you can store any amount of data. In this workflow, S3 is used to store the input data (e.g., raw logs or files) and the custom code (e.g., a compiled JAR file for Hadoop jobs).

3. Run Hadoop Job: Execute a customized MapReduce job on the EMR cluster

Hadoop works by dividing a large dataset into smaller chunks, each of which is processed in parallel across multiple nodes.
• Map Phase: The data is divided into key-value pairs, which are processed in parallel across the EMR cluster’s nodes.
• Reduce Phase: After processing, the results from the Map phase are combined and aggregated into the final output.
• Customized Code: In this step, your custom MapReduce job (e.g., wordcount.jar) is executed on the EMR cluster. This job processes the data stored in the /input folder of S3 and saves the results in the /output folder (see the sketch after this list).
• EMR Cluster: As the job runs, each task is distributed across the nodes in the EMR cluster, significantly reducing processing time for large datasets.
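
A minimal sketch of submitting such a job as an EMR step via the AWS CLI; the cluster ID, bucket, and JAR name are illustrative, and the exact Args depend on how the JAR is written:

```
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=WordCount,Jar=s3://my-bucket/wordcount.jar,Args=[s3://my-bucket/input,s3://my-bucket/output]
```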

4. View Output: Access the processed results in S3 and analyze them

• After the Hadoop job completes, the processed data is saved in the designated S3 output folder (e.g., /output).
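
A minimal sketch of retrieving the results with the AWS CLI; the bucket and paths are illustrative, and part-r-00000 is the conventional name of a reducer's output file:

```
aws s3 ls s3://my-bucket/output/
aws s3 cp s3://my-bucket/output/part-r-00000 ./results.txt
```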
                                                   
                                                   