
March 6, 2018

Transwarp Announces the First Ever Audited Result of the TPC-DS Benchmark with Cisco UCS Integrated Infrastructure for Big Data and Analytics

Tech Blog

The TPC-DS benchmark is a decision support benchmark originally published in 2006. Since then, no one had published a fully audited result of the TPC-DS benchmark. Until now. In March 2018, Cisco published the first audited result of the benchmark, run on Cisco UCS Integrated Infrastructure for Big Data and Analytics with Transwarp Data Hub v5.1. The result is available at http://www.tpc.org/tpcds/results/tpcds_advanced_sort.asp.

A key question presents itself: why did it take so long? What is it about the TPC-DS benchmark that makes it so challenging?

TPC-DS Test Process


The TPC-DS benchmark workload includes 99 queries, but the benchmark test is not just about those queries. The test process and metrics comprise the following six steps (refer to the TPC-DS Specification for more detail):

Data Load test (LD).

Power test (PT), which runs the 99 queries in one stream.

Throughput test 1 (TT1), which runs the 99 queries in multiple streams.

Maintenance test 1 (DM1), which updates the databases with the refresh functions.

Throughput test 2 (TT2), which reruns the 99 queries in multiple streams.

Maintenance test 2 (DM2), which re-updates the databases with the refresh functions.

The 99 queries are defined as templates. Each query has several substitution parameters, which dsqgen determines from a SEED that "is selected as the timestamp of the end of the database load time (Load End Time) expressed in the format mmddhhmmsss". This timestamp "guarantees that the query substitution parameter values are not known prior to running". In addition to the parameters, the query execution order differs between streams. Figure 1 shows the execution order.
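As an illustration, the SEED can be derived directly from the Load End Time. Below is a minimal Python sketch; the function name is invented for this example, and reading the trailing extra digit of "mmddhhmmsss" as tenths of a second is my assumption, not something the specification excerpt above confirms.

```python
from datetime import datetime

def dsqgen_seed(load_end: datetime) -> str:
    """Derive a dsqgen SEED string from the Load End Time.

    The spec gives the format as mmddhhmmsss; here the trailing extra
    digit is taken to be tenths of a second (an assumption).
    """
    # %m%d%H%M%S yields month, day, hour, minute, second (10 chars),
    # then one digit of sub-second precision is appended.
    return load_end.strftime("%m%d%H%M%S") + str(load_end.microsecond // 100000)
```

Because the seed is fixed only when the load finishes, the substitution parameters genuinely cannot be known in advance.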

The TPC-DS performance result depends on the elapsed time of each step (the time to generate the SQL is not part of the result). To get a better result, the system needs high I/O throughput to load the data, strong SQL performance for the power and throughput tests, and good transaction throughput for the maintenance tests. Not surprisingly, there are trade-offs between the different test steps in achieving an optimal result.
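The composite metric combines the step times above. Here is a hedged Python sketch of a TPC-DS v2-style QphDS@SF computation, with times in seconds; the exact weighting (stream-scaled power time, a 1% weight on load time) is my reading of the v2 metric and should be checked against the specification before relying on it.

```python
import math

def qph_ds(sf, streams, t_power, t_tt1, t_tt2, t_dm1, t_dm2, t_load):
    """Sketch of the composite QphDS@SF metric (TPC-DS v2 style).

    All t_* arguments are elapsed times in seconds. This is an
    illustrative reading of the metric, not the authoritative formula.
    """
    q = streams * 99                          # total queries executed
    t_pt = streams * t_power / 3600.0         # power test, stream-weighted
    t_tt = (t_tt1 + t_tt2) / 3600.0           # both throughput tests
    t_dm = (t_dm1 + t_dm2) / 3600.0           # both maintenance tests
    t_ld = 0.01 * streams * t_load / 3600.0   # load time, heavily discounted
    # Geometric combination: the fourth root of the product of step times.
    return math.floor(sf * q / (t_pt * t_tt * t_dm * t_ld) ** 0.25)
```

The geometric mean of the four step times is what forces the trade-off: a system cannot buy a good score by excelling at one step while neglecting the others.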

Key Factors for DBMS


Given this test process, SQL syntax support, transaction support, optimization, stability, and scalability are the key factors for the DBMS.

SQL Syntax Support


The TPC-DS Power and Throughput tests involve 99 queries. The queries use many advanced features, such as correlated and uncorrelated sub-queries, common sub-expressions, HAVING clauses with sub-queries, EXISTS, INTERSECT, and so on. Here is a summary of SQL syntax support for typical Hadoop-based systems and traditional vendors.

| DBMS                           | ANSI SQL                                          | Stored Procedures | TPC-DS 99 query templates                                  |
|--------------------------------|---------------------------------------------------|-------------------|------------------------------------------------------------|
| Hive                           | SQL'92 + some extensions from SQL'99 and SQL:2003 | No                | No TPC-approved templates in public                        |
| Impala                         | SQL'92 + some extensions from SQL'99 and SQL:2003 | No                | No TPC-approved templates in public                        |
| SparkSQL                       | SQL'92 + some extensions from SQL'99 and SQL:2003 | No                | No TPC-approved templates in public                        |
| Transwarp Inceptor             | SQL:2003 compatible                               | Yes               | 99 query templates with minimal changes (approved by TPC)  |
| Traditional DBMS (Oracle, DB2) | SQL:2003                                          | Yes               | TPC provides standard templates for Oracle/DB2/SQL Server, etc. |

According to the TPC-DS specification, only minimal changes to the query templates are allowed for the test.

Transaction Support


Data maintenance tests are a mandatory part of the benchmark execution. They simulate the periodic synchronization with source OLTP databases, an integral part of the data warehouse lifecycle. Maintenance tests consist of a set of refresh runs; the refresh functions are defined in Section 5 of the TPC-DS Specification. Here is Method 2 as an example:

Delete rows from *_returns with corresponding rows in *_sales

where d_date between Date1 and Date2

Delete rows from *_sales

where d_date between Date1 and Date2

According to the specification, all the transformations must be SQL-based. Transactional execution is therefore required of the DBMS, and it represents a major challenge in a distributed environment. Here is a summary of transaction support.

| DBMS               | Transaction Support        |
|--------------------|----------------------------|
| Hive               | Yes, with some limitations |
| Impala             | No                         |
| SparkSQL           | No                         |
| Transwarp Inceptor | Yes                        |
| Traditional DBMS   | Yes                        |
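For illustration, the two Method 2 deletes can be issued as one atomic unit in any SQL engine with transactions. Below is a minimal sketch using Python's built-in sqlite3; the trimmed store-channel schema and the ticket-number join key are assumptions made for the example, not the real TPC-DS schema.

```python
import sqlite3

# Toy two-table setup standing in for the store sales channel.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store_sales   (ss_ticket_number INT, ss_sold_date TEXT);
CREATE TABLE store_returns (sr_ticket_number INT, sr_returned_date TEXT);
INSERT INTO store_sales   VALUES (1, '2018-03-01'), (2, '2018-04-15');
INSERT INTO store_returns VALUES (1, '2018-03-05'), (2, '2018-04-20');
""")

date1, date2 = "2018-03-01", "2018-03-31"
with conn:  # both deletes commit, or roll back, as one transaction
    # Returns must go first: they are located via the sales rows
    # that are about to be deleted.
    conn.execute("""
        DELETE FROM store_returns
        WHERE sr_ticket_number IN (
            SELECT ss_ticket_number FROM store_sales
            WHERE ss_sold_date BETWEEN ? AND ?)""", (date1, date2))
    conn.execute(
        "DELETE FROM store_sales WHERE ss_sold_date BETWEEN ? AND ?",
        (date1, date2))
```

If either delete fails, the `with conn:` block rolls both back, which is exactly the atomicity the refresh functions demand of the system under test.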

Optimization


There are 99 queries in the TPC-DS Power and Throughput tests. The workloads fall into four categories: ad-hoc, reporting, iterative OLAP, and data mining. Most queries impose heavy CPU and I/O load, so the DBMS must be smart enough to handle all of these cases.

Most DBMSs support general optimizations such as Predicate Pushdown (PPD), Cost-Based Optimization (CBO), Rule-Based Optimization (RBO), Partition Pruning (PP), Vectorization (VEC), and Code Generation (CG). The system depends on the optimizer to achieve the best result.
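Predicate pushdown, for example, can be shown with a toy in-memory join. The sketch below is illustrative only; real optimizers rewrite relational plans rather than Python lists, and the tables here are invented for the example.

```python
# Fact table (day, amount) and dimension table (day, quarter label).
sales = [(d, 10) for d in range(1, 366)]
dates = [(d, "Q1" if d <= 90 else "rest") for d in range(1, 366)]

# Without pushdown: join everything, then filter the large result.
joined = [(d, amt, q) for (d, amt) in sales for (d2, q) in dates if d == d2]
naive = [row for row in joined if row[2] == "Q1"]

# With pushdown: filter the dimension first, then join a smaller input.
q1_dates = [(d, q) for (d, q) in dates if q == "Q1"]
pushed = [(d, amt, q) for (d, amt) in sales for (d2, q) in q1_dates if d == d2]

assert naive == pushed  # same answer, far less intermediate work
```

The point is that correctness is unchanged while the intermediate join input shrinks from 365 dimension rows to 90, which is why PPD matters so much under TPC-DS's heavy I/O load.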

Materialized Views (MV) and CUBEs are well-established techniques in traditional DBMSs for accelerating query performance. They are widely used in the TPC-H benchmark, but they are hard to exploit in the TPC-DS benchmark, because:

The TPC-DS specification restricts the use of materialization to the catalog sales channel.

The time to prepare and refresh materialization should be part of the performance result.

There are many possible combinations, and the SQL parameters differ between streams.

| DBMS                           | CBO/RBO/PPD/PP/CG/VEC | MV/CUBE |
|--------------------------------|-----------------------|---------|
| Hive                           | Yes                   | No      |
| Impala                         | Yes                   | No      |
| SparkSQL                       | Yes                   | No      |
| Transwarp Inceptor             | Yes                   | Yes     |
| Traditional DBMS (Oracle, DB2) | Yes                   | Yes     |

Traditional systems, due to architectural limitations, run into a single-node bottleneck when there is heavy data exchange. To reduce the impact, some vendors introduce special-purpose hardware to handle the data exchange, which makes the system expensive.

Stability


According to the TPC-DS execution rules: "If there is any test fail in any step, the benchmark run is invalid"; and "the System Under Test (SUT), including any database server(s), shall not be restarted at any time after the power test begins until after all tests have completed." The whole process takes a long time, so stability is another key factor, particularly for systems that run 24x7.

Many newer systems rely on aggressive in-memory computation and various kinds of caches to improve performance. Stability is a challenge for such systems when running the TPC-DS workloads without a break.

To improve stability, Transwarp Inceptor carefully tunes memory usage; for example, data is spilled to disk when its total size exceeds a threshold. For complex WITH ... AS (common table expression) queries, it creates temporary tables to hold intermediate results rather than inlining everything. This is the kind of trade-off between performance and stability that must be made.
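The spill-to-disk idea can be sketched as follows. This is an illustrative Python buffer, not Transwarp Inceptor's actual implementation; the class name and row-count threshold are invented for the example (real engines track bytes, not rows).

```python
import pickle
import tempfile

class SpillingBuffer:
    """Accumulate rows in memory, spilling batches to disk at a threshold."""

    def __init__(self, max_in_memory=1000):
        self.max_in_memory = max_in_memory
        self.rows = []          # in-memory batch
        self.spill_files = []   # paths of spilled batches, in order

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_in_memory:
            self._spill()

    def _spill(self):
        # Write the current batch to a temp file and free the memory.
        f = tempfile.NamedTemporaryFile(delete=False)
        pickle.dump(self.rows, f)
        f.close()
        self.spill_files.append(f.name)
        self.rows = []

    def __iter__(self):
        # Replay spilled batches first, then whatever is still in memory.
        for name in self.spill_files:
            with open(name, "rb") as f:
                yield from pickle.load(f)
        yield from self.rows
```

The trade-off is visible in the threshold: a low `max_in_memory` keeps the process safe under the long, uninterrupted TPC-DS run at the cost of extra disk I/O.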

Scalability


Hadoop is a distributed computing framework designed to scale out easily as data volume grows. In traditional systems, however, the data has to be carefully distributed among the nodes; scalability is poor, and it is hard to extend the system to handle continuously growing data.

Overall


Due to limitations in SQL or transaction support, most Hadoop-based vendors cannot run the full TPC-DS benchmark. Due to architectural and optimization restrictions, traditional DBMS vendors find it difficult to pass the test or cannot achieve good results, especially in a big-data environment.

The first ever audited result of the TPC-DS benchmark was run on Cisco UCS Integrated Infrastructure for Big Data and Analytics with Transwarp Data Hub v5.1. At the 10,000 GB scale factor, it achieves a composite query-per-hour metric of 1,580,649 QphDS and a price/performance of $0.64 USD / QphDS.

More information about the TPC-DS is available at http://www.tpc.org/tpcds/default.asp

More Information about the Transwarp Data Hub can be found at:  http://transwarp.io/?lang=en

For more information about Cisco UCS Integrated Infrastructure for Big Data and Analytics, visit: http://www.cisco.com/go/bigdata