【云原生 | 31】Docker运行实时流计算框架Apache Storm

作者简介：🏅云计算领域优质创作者🏅新星计划第三季python赛道第一名🏅 阿里云ACE认证高级工程师🏅
✒️个人主页：小鹏linux
💊个人社区：小鹏linux（个人社区）欢迎您的加入！

1. 关于Apache Storm

Apache Storm是一个分布式实时大数据处理系统。Storm设计用于在容错和水平可扩展方法中处理大量数据。它是一个流数据框架，具有最高的摄取率。虽然Storm是无状态的，它通过Apache ZooKeeper管理分布式环境和集群状态。通过Storm可以并行地对实时数据执行各种操作。Storm易于部署和操作，并且它可以保证每个消息将通过拓扑至少处理一次。

1.1 Apache Storm核心概念

Apache Storm从一端读取实时数据的原始流，并将其传递通过一系列小处理单元，并在另一端输出处理/有用的信息。

下图描述了Apache Storm的核心概念。

1.2 Apache Storm的组件

Tuple

Tuple是Storm中的主要数据结构。它是有序元素的列表。默认情况下，Tuple支持所有数据类型。通常，它被建模为一组逗号分隔的值，并传递到Storm集群。

Stream

流是元组的无序序列。

Spouts

流的源。通常，Storm从原始数据源（如Twitter Streaming API，Apache Kafka队列，Kestrel队列等）接受输入数据。否则，您可以编写spouts以从数据源读取数据。“ISpout”是实现spouts的核心接口，一些特定的接口是IRichSpout，BaseRichSpout，KafkaSpout等。

Bolts

Bolts是逻辑处理单元。Spouts将数据传递到Bolts和Bolts过程，并产生新的输出流。Bolts可以执行过滤，聚合，加入，与数据源和数据库交互的操作。Bolts接收数据并发射到一个或多个Bolts。 “IBolt”是实现Bolts的核心接口。一些常见的接口是IRichBolt，IBasicBolt等。

拓扑

Spouts和Bolts连接在一起，形成拓扑结构。实时应用程序逻辑在Storm拓扑中指定。简单地说，拓扑是有向图，其中顶点是计算，边缘是数据流。

简单拓扑从spouts开始。Spouts将数据发射到一个或多个Bolts。Bolt表示拓扑中具有最小处理逻辑的节点，并且Bolts的输出可以发射到另一个Bolts作为输入。

Storm保持拓扑始终运行，直到您终止拓扑。Apache Storm的主要工作是运行拓扑，并在给定时间运行任意数量的拓扑。

任务

现在你有一个关于Spouts和Bolts的基本想法。它们是拓扑的最小逻辑单元，并且使用单个Spout和Bolt阵列构建拓扑。应以特定顺序正确执行它们，以使拓扑成功运行。Storm执行的每个Spout和Bolt称为“任务”。简单来说，任务是Spouts或Bolts的执行。

进程

拓扑在多个工作节点上以分布式方式运行。Storm将所有工作节点上的任务均匀分布。工作节点的角色是监听作业，并在新作业到达时启动或停止进程。

流分组

数据流从Spouts流到Bolts，或从一个Bolts流到另一个Bolts。流分组控制元组在拓扑中的路由方式，并帮助我们了解拓扑中的元组流。有如下分组：

Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"'s may go to different tasks.
Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but are load balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](javadocs/org/apache/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.

1.3 Apache Storm工作流程

一个工作的Storm集群应该有一个Nimbus和一个或多个supervisors。另一个重要的节点是Apache ZooKeeper，它将用于nimbus和supervisors之间的协调。

现在让我们仔细看看Apache Storm的工作流程 −

最初，nimbus将等待“Storm拓扑”提交给它。
一旦提交拓扑，它将处理拓扑并收集要执行的所有任务和任务将被执行的顺序。
然后，nimbus将任务均匀分配给所有可用的supervisors。
在特定的时间间隔，所有supervisor将向nimbus发送心跳以通知它们仍然运行着。
当supervisor终止并且不向心跳发送心跳时，则nimbus将任务分配给另一个supervisor。
当nimbus本身终止时，supervisor将在没有任何问题的情况下对已经分配的任务进行工作。
一旦所有的任务都完成后，supervisor将等待新的任务进去。
同时，终止nimbus将由服务监控工具自动重新启动。
重新启动的网络将从停止的地方继续。同样，终止supervisor也可以自动重新启动。由于网络管理程序和supervisor都可以自动重新启动，并且两者将像以前一样继续，因此Storm保证至少处理所有任务一次。
一旦处理了所有拓扑，则网络管理器等待新的拓扑到达，并且类似地，管理器等待新的任务。

默认情况下，Storm集群中有两种模式：

本地模式 -此模式用于开发，测试和调试，因为它是查看所有拓扑组件协同工作的最简单方法。在这种模式下，我们可以调整参数，使我们能够看到我们的拓扑如何在不同的Storm配置环境中运行。在本地模式下，storm拓扑在本地机器上在单个JVM中运行。
生产模式 -在这种模式下，我们将拓扑提交到工作Storm集群，该集群由许多进程组成，通常运行在不同的机器上。如在storm的工作流中所讨论的，工作集群将无限地运行，直到它被关闭。

2. 在Docker上运行Apache Storm

Apache Storm是一个实时流计算框架，由Twitter在2014年正式开源，遵循Eclipse Public License 1.0。Storm基于Clojure等语言实现。

Storm集群与Hadoop集群在工作方式上十分相似，唯一区别在于 Hadoop上运行的是MapReduce任务，在Storm上运行的则是topology。MapReduce任务完成处理即会结束，而topology则永远在等待消息并处理（直到被停止）。

2.1 使用Compose搭建Storm集群

利用Docker Compose模板，用户可以在本地单机Docker环境快速地搭建一个Apache Storm集群，进行应用开发测试。

**Storm示例架构 **

其中包含如下容器：

·Zookeeper：Apache Zookeeper三节点部署。

·Nimbus：Storm Nimbus。

·Ui：Storm UI

·Supervisor：Storm Supervisor（一个或多个）。

·Topology：Topology部署工具，其中示例应用基于官方示例storm-starter代码构建。

**本地开发测试 **

首先从Github下载需要的代码：

$ git clone https://github.com/denverdino/docker-storm.git 
$ cd docker-swarm/local

代码库中的docker-compose.yml文件描述了典型的Storm应用架构。

用户可以直接运行下列命令构建测试镜像：

$ docker-compose build

现在可以用下面的命令来一键部署一个Storm应用：

$ docker-compose up -d

当UI容器启动后，用户可以访问容器的8080端口来打开操作界面，如图所示。

利用如下命令，可以伸缩supervisor的数量，比如伸缩到3个实例

$ docker-compose scale supervisor=3

用户也许会发现Web界面中并没有运行中的topology。这是因为Docker Compose目前只能保证容器的启动顺序，但是无法确保所依赖容器中的应用已经完全启动并可以被正常访问了。

为了解决这个问题，需要运行下面的命令来再次启动topolgoy服务应用来提交更新的拓扑：

$ docker-compose start topology

稍后刷新Storm UI，可以发现Storm应用已经部署成功了。

👑👑👑结束语👑👑👑

标签：云原生 docker 容器

本文转载自: https://blog.csdn.net/qq_62294245/article/details/127005341
版权归原作者 小鹏linux 所有，如有侵权，请联系我们删除。