Hadoop Benchmarking

TeraSort is one of Hadoop's most widely used benchmarks. The Hadoop distribution includes both the input generator and the sorting implementation: TeraGen generates the input, and TeraSort sorts it. This is a short tutorial on using the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input for a subsequent TeraSort run.

Generate input with TeraGen

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen <number of 100-byte rows> <output dir>
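Because each row is 100 bytes, the row count follows directly from the amount of data you want to generate. The sketch below shows that arithmetic; `rows_for_bytes` is a hypothetical helper (not part of Hadoop), and the output path in the commented command is a placeholder.

```shell
#!/bin/sh
# Each TeraGen row is 100 bytes, so rows = target bytes / 100.
ROW_BYTES=100

# rows_for_bytes is a hypothetical helper, not a Hadoop command.
rows_for_bytes() {
    echo $(( $1 / ROW_BYTES ))
}

# 1 GB (10^9 bytes) needs 10^7 rows; 1 TB (10^12 bytes) needs 10^10 rows.
ROWS_1GB=$(rows_for_bytes 1000000000)
echo "rows for 1 GB: $ROWS_1GB"

# Example invocation (the output path is a placeholder):
# hadoop jar hadoop-*examples*.jar teragen "$ROWS_1GB" /tmp/terasort-input
```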

To run TeraGen on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; the property name is the Hadoop 2 one). The -D option goes after the program name, not before jar:

$ hadoop jar hadoop-*examples*.jar teragen -Dmapreduce.job.maps=30 <number of 100-byte rows> <output dir>

The right number of mappers depends on the number of rows you generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, see the post linked from the original article.
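One way to reason about the mapper count is to look at how much data each map task would write. The sketch below assumes the rows are divided roughly evenly among map tasks; `bytes_per_map` is a hypothetical helper introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: approximate bytes written by each map task,
# assuming rows are split evenly and each row is 100 bytes.
bytes_per_map() {
    rows=$1
    maps=$2
    echo $(( rows / maps * 100 ))
}

# 10^10 rows (about 1 TB) spread over 30 map tasks.
echo "bytes per map task: $(bytes_per_map 10000000000 30)"
```

If the per-task figure looks too large for a single node to write comfortably, raising the map count is the obvious knob to try.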

Run TeraSort

After the data is generated, sort it with TeraSort:

$ hadoop jar hadoop-*examples*.jar terasort <input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.
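As a minimal sketch of setting the reducer count, the helper below prints a TeraSort command line that passes the Hadoop 2 property `mapreduce.job.reduces`. The function name `terasort_cmd`, the value 30, and the paths are assumptions for illustration; the helper only echoes the command and does not run Hadoop.

```shell
#!/bin/sh
# Hypothetical helper that prints a TeraSort command with an explicit
# reducer count. It only echoes the command; nothing is executed.
terasort_cmd() {
    reduces=$1
    input_dir=$2
    output_dir=$3
    echo "hadoop jar hadoop-*examples*.jar terasort -Dmapreduce.job.reduces=$reduces $input_dir $output_dir"
}

# Placeholder values: 30 reducers, example HDFS paths.
terasort_cmd 30 /tmp/terasort-input /tmp/terasort-output
```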

Validate the sorted output of TeraSort

TeraValidate checks that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate <output dir> <terasort-validate dir>
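The three steps can be chained in one script. The following is a sketch under stated assumptions, not a definitive benchmark harness: the row count, all directory paths, and the `run` and `benchmark` helpers are hypothetical, and `DRY_RUN` defaults to 1 so the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the full TeraGen -> TeraSort -> TeraValidate sequence.
# All values below are placeholders chosen for illustration.
EXAMPLES_JAR='hadoop-*examples*.jar'
ROWS=10000000                       # 10^7 rows x 100 bytes = 1 GB
IN_DIR=/tmp/terasort-input
OUT_DIR=/tmp/terasort-output
VALIDATE_DIR=/tmp/terasort-validate
DRY_RUN=${DRY_RUN:-1}               # 1 = print commands only

# Print the command in dry-run mode; otherwise run it and stop on failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" || exit 1
    fi
}

benchmark() {
    # $EXAMPLES_JAR is intentionally unquoted so the shell expands the glob.
    run hadoop jar $EXAMPLES_JAR teragen "$ROWS" "$IN_DIR"
    run hadoop jar $EXAMPLES_JAR terasort "$IN_DIR" "$OUT_DIR"
    run hadoop jar $EXAMPLES_JAR teravalidate "$OUT_DIR" "$VALIDATE_DIR"
}

benchmark
```

Setting `DRY_RUN=0` in the environment would execute the commands for real, which requires a working Hadoop installation and an examples jar in the current directory.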

Translated from: https://www.systutorials.com/hadoop-terasort-benchmark/
