RSS

PolyBase: I now Pronounce you SQL and Hadoop

27 Apr

 

Are you interested in Hadoop? With SQL Server 2016 just around the corner, a powerful new feature is being included called PolyBase.  To summarize what it is, just think about it as a marriage between SQL Server and Hadoop.  It is a way to store and query data on HDFS using SQL Server.

HDFS is a distributed file system that works differently than what we’re used to in the Windows OS side of things; the general principle is to use cheap commodity hardware that replicates data in order to account for availability and to prevent loss of data. With that in mind, it makes a great use case to store a lot of data cheaply for archiving purposes or can be used to store large quantities of data that been to be processed in large quantities as well.

For more information please visit: https://msdn.microsoft.com/en-us/library/mt143171.aspx

Now if you want to try it out for yourself, make sure you install the PolyBase Engine (from the SQL Server setup) and feel free to try the modified code sample below.


Create database PolyBaseConnect

----------------------------------------------------------------
-- AdventureWorksDW2016CTP3 samples: PolyBase
--Download here:
--https://www.microsoft.com/en-us/download/details.aspx?id=49502
--The sample code below was obtained from the Adventure works SQL2016CTP3 demos
--I have modified them slightly and added my own code
--Flat file used in example is in the SQLServer2016CTP3Samples.zip file
--In order to copy it to your Azure Blob Storage I recommend you download
--and use Azure Storage Explorer
----------------------------------------------------------------

-- This sample will show you how to query and load data from Azure blob storage
-- to AdventureWorks2016CTP3 database using PolyBase.

USE PolyBaseConnect
go

------------------------------- Configuration ------------------------------------------------------------------

-- Specify the type of data source you want to query.
-- Choose Option 7 for Azure blob storage.
exec sp_configure 'hadoop connectivity', 7;
Reconfigure;

-- Restart SQL Server to set the changes. This will automatically restart
-- SQL Server PolyBase Engine and SQL Server PolyBase Data Movement Service. 

-- Verify hadoop connectivity run_value is set to 7.
exec sp_configure 'hadoop connectivity';

------------------------------- Polybase Building Blocks --------------------------------------------------------

-- STEP 1: Create a database master key to encrypt database scoped
--credential secret in the next step.
-- Replace  with a password to encrypt the master key
--DROP MASTER KEY
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'P@ssw0rd';

-- STEP 2: Create a database scoped credential to authenticate against your
--Azure storage account.
-- Replace the  with your Azure storage account key (primary access key).
-- To find the key, open your storage account on Azure Portal
--(https://portal.azure.com/).
--DROP DATABASE SCOPED CREDENTIAL AzureStorageCredential
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = '',
SECRET = '';

select * from sys.database_credentials;

-- STEP 3: Create an external data source to specify location and credential
--for your Azure storage account.
-- Replace the  with your Azure storage blob container.
-- Replace the  with your Azure storage account name.
--DROP EXTERNAL DATA SOURCE AzurePolybaseStorage
CREATE EXTERNAL DATA SOURCE AzurePolybaseStorage
WITH (
   TYPE = HADOOP,
   LOCATION = 'wasbs://ConatainerName@StorageAccountName.blob.core.windows.net',
   CREDENTIAL = AzureStorageCredential
);
---'wasbs://[ContainerName]@[StorageAccountName].blob.core.windows.net',

select * from sys.external_data_sources;

-- Step 4: Create an external file format to specify the layout of
--data stored in Azure blob storage.
-- The data is in a pipe-delimited text file.
CREATE EXTERNAL FILE FORMAT TextFilePipe
WITH (
		FORMAT_TYPE = DelimitedText,
		FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

CREATE EXTERNAL FILE FORMAT TextFileComma
WITH (
		FORMAT_TYPE = DelimitedText,
		FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

select * from sys.external_file_formats;

-- Step 5: Create an external table to reference data stored
--in your Azure blob storage account.
-- Specify column properties for the table.
-- Replace LOCATION:  with the relative path of your file from the blob container.
-- If the file is directly under your blob container, the location
--would simply be 'FactResellerSalesArchive.txt'.
--DROP EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal
CREATE EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal (
	[ProductKey] [int] NOT NULL,
	[OrderDateKey] [int] NOT NULL,
	[DueDateKey] [int] NOT NULL,
	[ShipDateKey] [int] NOT NULL,
	[ResellerKey] [int] NOT NULL,
	[EmployeeKey] [int] NOT NULL,
	[PromotionKey] [int] NOT NULL,
	[CurrencyKey] [int] NOT NULL,
	[SalesTerritoryKey] [int] NOT NULL,
	[SalesOrderNumber] [nvarchar](20) NOT NULL,
	[SalesOrderLineNumber] [tinyint] NOT NULL,
	[RevisionNumber] [tinyint] NULL,
	[OrderQuantity] [smallint] NULL,
	[UnitPrice] [money] NULL,
	[ExtendedAmount] [money] NULL,
	[UnitPriceDiscountPct] [float] NULL,
	[DiscountAmount] [float] NULL,
	[ProductStandardCost] [money] NULL,
	[TotalProductCost] [money] NULL,
	[SalesAmount] [money] NULL,
	[TaxAmt] [money] NULL,
	[Freight] [money] NULL,
	[CarrierTrackingNumber] [nvarchar](25) NULL,
	[CustomerPONumber] [nvarchar](25) NULL,
	[OrderDate] [datetime] NULL,
	[DueDate] [datetime] NULL,
	[ShipDate] [datetime] NULL
)
WITH (
		LOCATION='FactResellerSalesArchive.txt', ---
		DATA_SOURCE=AzurePolybaseStorage, ---
		FILE_FORMAT=TextFilePipe---
);

select * from sys.tables;
select * from sys.external_tables;

-- Try running queries on your external table.
SELECT * FROM dbo.FactResellerSalesArchiveExternal; -- returns 5000 rows.

SELECT * FROM dbo.FactResellerSalesArchiveExternal -- returns 1959 rows
WHERE SalesAmount > 1000;

------------------------------- Load data into your database --------------------------------------------------------

-- Step 6: Load the data from Azure blob storage into
--a new table in your database.
SELECT * INTO dbo.FactResellerSalesArchive
FROM dbo.FactResellerSalesArchiveExternal; 

-- Try a select query on this table to confirm the data has been loaded correctly.
SELECT * FROM dbo.FactResellerSalesArchive;
---My customized code---
---Not part of the Adventure Works Demo---
---Load data into Polybase "table"
--Allow Polybase export
USE master;
GO
EXEC sp_configure 'show advanced option', '1';
GO
EXEC sp_configure 'allow polybase export', '1';
GO
RECONFIGURE;
EXEC sp_configure;

---Text file names are case Sensitive!!

USE PolyBaseConnect
GO
CREATE EXTERNAL TABLE External_TestPolyBaseData (
IdentityKey int,
FirstName Varchar(100),
LastName Varchar(100)
)

WITH (
        LOCATION='TestPolybaseData.txt',---
        DATA_SOURCE = AzurePolybaseStorage,---
        FILE_FORMAT = TextfileComma ---
    )
;

Insert into External_TestPolyBaseData (IdentityKey,FirstName,LastName)
Values (1234,'Test','Insert')

Select *
From External_TestPolyBaseData

Insert into External_TestPolyBaseData (IdentityKey,FirstName,LastName)
Values (4567,'Test','Insert')

--Insert as new Table

--CETAS Supported in APS and Azure SQL DW as of publishing this post,
--no support for SQL 2016 yet
--There is no gaurentee that this feature will be added to SQL 2016 in the future
--This code is purely for educational purposes, and this is currently
--supported in APS and SQL DW in Azure.
CREATE EXTERNAL TABLE dbo.External_FactInternetSales

WITH
(
        LOCATION='TestPolybaseData.txt',---
        DATA_SOURCE = AzurePolybaseStorage,---
        FILE_FORMAT = TextfileComma ---
    )
as
Select *
from AdventureWorksDW2014.dbo.FactInternetSales;

---The way to do it in SQL 2016
----Instead Must create Table Structure First
CREATE External TABLE [dbo].[External_FactInternetSales](
	[ProductKey] [int] NOT NULL,
	[OrderDateKey] [int] NOT NULL,
	[DueDateKey] [int] NOT NULL,
	[ShipDateKey] [int] NOT NULL,
	[CustomerKey] [int] NOT NULL,
	[PromotionKey] [int] NOT NULL,
	[CurrencyKey] [int] NOT NULL,
	[SalesTerritoryKey] [int] NOT NULL,
	[SalesOrderNumber] [nvarchar](20) NOT NULL,
	[SalesOrderLineNumber] [tinyint] NOT NULL,
	[RevisionNumber] [tinyint] NOT NULL,
	[OrderQuantity] [smallint] NOT NULL,
	[UnitPrice] [money] NOT NULL,
	[ExtendedAmount] [money] NOT NULL,
	[UnitPriceDiscountPct] [float] NOT NULL,
	[DiscountAmount] [float] NOT NULL,
	[ProductStandardCost] [money] NOT NULL,
	[TotalProductCost] [money] NOT NULL,
	[SalesAmount] [money] NOT NULL,
	[TaxAmt] [money] NOT NULL,
	[Freight] [money] NOT NULL,
	[CarrierTrackingNumber] [nvarchar](25) NULL,
	[CustomerPONumber] [nvarchar](25) NULL,
	[OrderDate] [datetime] NULL,
	[DueDate] [datetime] NULL,
	[ShipDate] [datetime] NULL
	)
WITH
(
        LOCATION='External_FactInternetSales.txt',---
        DATA_SOURCE = AzurePolybaseStorage,---
        FILE_FORMAT = TextfileComma ---
    );

Insert Into [dbo].[External_FactInternetSales]
Select *
from AdventureWorksDW2014.dbo.FactInternetSales;

--Join with query to data in WASB
Set Statistics IO ON
Select *
from dbo.External_FactInternetSales EFI
inner join AdventureWorksDW2014.dbo.DimCustomer C
on EFI.CustomerKey=C.CustomerKey

--sp_who2 shows 0 CPU 0 IO
Select SalesOrderNumber, AVG(unitPrice) as AverageUnitPrice,
sum(TotalProductCost) as SumOfTotalProductCost,
sum(SalesAmount) as SumOfSalesAmount
from dbo.External_FactInternetSales
Group by SalesOrderNumber

---ExternalPushDown Works on Hadoop not Azure Blob Storage
---THIS OPTION WILL NOT WORK IN WASB
Select SalesOrderNumber, AVG(unitPrice) as AverageUnitPrice,
sum(TotalProductCost) as SumOfTotalProductCost,
sum(SalesAmount) as SumOfSalesAmount
from dbo.External_FactInternetSales
Group by SalesOrderNumber
OPTION (FORCE EXTERNALPUSHDOWN)---

---Working with JSON Files
---Export to JSON File

Select FirstName as [Customer.FirstName],MiddleName as [Customer.MiddleName],
LastName as [Customer.LastName],BirthDate as [Customer.BirthDate],
MaritalStatus as [Customer.MaritalStatus],Gender as [Customer.Gender],
EmailAddress as [Customer.EmailAddress], YearlyIncome as [Customer.YearlyIncome],
TotalChildren as [Customer.TotalChildren],
GEO.City as [Customer.City],GEO.StateProvinceName as [Customer.State],
GEO.PostalCode as [Customer.ZipCode]
from AdventureWorksDW2014.dbo.DimCustomer C
INNER JOIN AdventureWorksDW2014.dbo.DimGeography GEO
on C.GeographyKey=GEO.GeographyKey
Where GEO.CountryRegionCode='US'
FOR JSON PATH, ROOT('Customer')

---Import from JSON File**************
---https://blogs.msdn.microsoft.com/sqlserverstorageengine/2015/10/07/bulk-importing-json-files-into-sql-server/
----Json sitting in my Azure Blob
/*Basic Format from Public ZipCode data
{ city; : AGAWAM, loc : [ -72.622739, 42.070206 ], pop : 15338,
state : MA, id : 01001 }
Important Note: id column was initially labled _id,
I did a find and replace to change it to id so it would work

--Sample data obtained from http://media.mongodb.org/zips.json
*/

CREATE EXTERNAL TABLE [dbo].[External_ZipCodes]
(
       [jsonrow] varchar(8000) NOT NULL
)
WITH (LOCATION = 'zips.json',
              DATA_SOURCE = AzurePolybaseStorage,
              FILE_FORMAT = TextFilePipe);

			  --FILE_FORMAT = TextFileWithDefaults

--Then within SQL Server you can parse the JSONrow and create a view on top of it. Example Below
CREATE VIEW dbo.ExternalZipData (City,Population,State,Location,ZipCode) AS
select city, pop, state
,geography::STGeomFromText('POINT ('+lon + ' ' + lat + ')', 4326) location
,id
from [External_ZipCodes]
CROSS APPLY OPENJSON(jsonrow)
              WITH (id nvarchar(20), city nvarchar(100),
                           pop int '$.pop', state nvarchar(2) '$.state',
                           lon varchar(20) '$.loc[0]', lat varchar(20) '$.loc[1]')

						   --ID is the zip code

---# of Rows in total 29467, 1 minute 45 seconds for execution
Select City,Population,State,ZipCode,Location, location.ToString(),
location.Lat as Latitude,location.Long as Longitude
from dbo.ExternalZipData
where city = 'Macon'

Select City,Population,State,ZipCode,Location, location.ToString(),
location.Lat as Latitude,location.Long as Longitude
from dbo.ExternalZipData
where state = 'NY'
---Make sure you look at the results in the spatial mapping tab
-- it draws out a map 🙂
Select state,Count(state)
from dbo.ExternalZipData
group by state
order by count(state) desc

---Clean up
USE PolyBaseConnect
GO

Drop EXTERNAL TABLE External_TestPolyBaseData
DROP EXTERNAL TABLE [dbo].[External_ZipCodes]
DROP EXTERNAL TABLE dbo.FactResellerSalesArchiveExternal
DROP EXTERNAL TABLE External_FactInternetSales
DROP VIEW [dbo].[ExternalZipData]
DROP TABLE [dbo].[FactResellerSalesArchive]
DROP EXTERNAL FILE FORMAT TextFilePipe
DROP EXTERNAL FILE FORMAT TextFileComma
--DROP EXTERNAL DATA SOURCE AzurePolybaseStorage
--DROP CREDENTIAL AzureStorageCredential

 
6 Comments

Posted by on April 27, 2016 in Other, SQL Server 2016

 

Tags: , ,

6 responses to “PolyBase: I now Pronounce you SQL and Hadoop

  1. abdelrahman

    August 23, 2016 at 7:10 AM

    thank you ;We want in Egypt to enter the field of my Big Data just possible doing simple Video on the subject and do not forget that complement BI

    Like

     
    • abdelrahman

      August 23, 2016 at 7:12 AM

      thank you ;We want in Egypt to enter the field of BigData if
      possible doing simple Video on the subject and do not forget that complement BI

      Like

       
  2. FirstBenny

    October 9, 2017 at 5:57 PM

    I have noticed you don’t monetize your site, don’t waste your traffic, you can earn extra cash every
    month because you’ve got hi quality content. If you want to know how to make extra bucks, search for: Mrdalekjd methods
    for $$$

    Like

     

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

 
%d bloggers like this: