datax-hdfs2stream


Prepare the Hadoop environment

master 192.168.0.200
slave1 192.168.0.201
slave2 192.168.0.202

On master, run start-all.sh, then verify the daemons on each node with jps:

master
5728 SecondaryNameNode
7828 Jps
5893 ResourceManager
5531 NameNode

slave1
3895 NodeManager
3772 DataNode
5646 Jps

slave2
3745 DataNode
5650 Jps
3868 NodeManager
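
To further confirm that both DataNodes registered with the NameNode, an HDFS report helps (optional; hdfs dfsadmin -report is a standard HDFS admin command):

[root@master hadoop]# hdfs dfsadmin -report
# Expect "Live datanodes (2)" with entries for 192.168.0.201 and 192.168.0.202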

[root@master hadoop]# cat core-site.xml
<configuration>
	<!-- NameNode address -->
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
	<!-- Base directory for files that Hadoop generates at runtime -->
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/home/hadoop/hadoopdata</value>
	</property>
</configuration>
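
The fs.defaultFS value is the address the DataX job must point at later. To read back the effective setting (hdfs getconf is a standard Hadoop CLI subcommand):

[root@master hadoop]# hdfs getconf -confKey fs.defaultFS
hdfs://master:9000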


[root@master hadoop]# cat hdfs-site.xml 
<configuration>
	<!-- Number of replicas HDFS keeps for each data block -->
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.http.address</name>
		<value>slave1:50070</value>
	</property>
</configuration>
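
Since the Notes section below uses the NameNode web UI on port 50070 to locate files, it is worth checking that HDFS also answers over HTTP. A minimal check against the WebHDFS REST API (LISTSTATUS is a standard WebHDFS operation, and WebHDFS is enabled by default in Hadoop 2.x; /user is where the sample files will live):

curl -s "http://192.168.0.200:50070/webhdfs/v1/user?op=LISTSTATUS"
# Returns a JSON FileStatuses listing of /user when HDFS and its HTTP endpoint are up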

Prepare two sample data files. datax.txt:
1 张一 21
2 张二 22
3 张三 23
4 张四 24
5 张五 25
datax2.txt:
1 李一 21
2 李二 22
3 李三 23
4 李四 24
5 李五 25
The fields are space-separated; this only needs to match the "fieldDelimiter": " " setting in the JSON job file below.

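Both files must be on HDFS before the job runs. A minimal upload sketch (assuming both files sit in the current local directory; hdfs dfs -mkdir/-put/-cat are standard HDFS shell commands):

[root@master hadoop]# hdfs dfs -mkdir -p /user
[root@master hadoop]# hdfs dfs -put datax.txt datax2.txt /user/
[root@master hadoop]# hdfs dfs -cat /user/datax.txt
1 张一 21
2 张二 22
3 张三 23
4 张四 24
5 张五 25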

hdfs2stream.json

{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/user/*",
                    "defaultFS": "hdfs://192.168.0.200:9000",
                    "column": [{
                            "index": 0,
                            "type": "long"
                        },
                        {
                            "index": 1,
                            "type": "string"
                        },
                        {
                            "index": 2,
                            "type": "long"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "fieldDelimiter": " "
                }
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {
                    "print": true
                }
            }
        }]
    }
}

Notes:

The "path": "/user/*" setting can be verified by browsing http://192.168.0.200:50070/explorer.html#/ to see where the files are stored.
The "defaultFS": "hdfs://192.168.0.200:9000" setting must match the NameNode address in core-site.xml (master resolves to 192.168.0.200):
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://master:9000</value>
	</property>
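
A quick way to confirm that the /user/* wildcard matches exactly the two sample files (hdfs dfs -ls is the standard listing command):

[root@master hadoop]# hdfs dfs -ls /user
# Should list /user/datax.txt and /user/datax2.txt, both of which /user/* matches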

Run and result

python datax.py hdfs2stream.json

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2020-10-30 16:19:59.688 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2020-10-30 16:19:59.698 [main] INFO  Engine - the machine info  =>

        osInfo: Oracle Corporation 1.8 25.261-b12
        jvmInfo:        Windows 10 amd64 10.0
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size
        PS Eden Space                  | 256.00MB                       | 256.00MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        PS Survivor Space              | 42.50MB                        | 42.50MB
        PS Old Gen                     | 683.00MB                       | 683.00MB
        Metaspace                      | -0.00MB                        | 0.00MB


2020-10-30 16:19:59.723 [main] INFO  Engine -
{
        "content":[
                {
                        "reader":{
                                "name":"hdfsreader",
                                "parameter":{
                                        "column":[
                                                {
                                                        "index":0,
                                                        "type":"long"
                                                },
                                                {
                                                        "index":1,
                                                        "type":"string"
                                                },
                                                {
                                                        "index":2,
                                                        "type":"long"
                                                }
                                        ],
                                        "defaultFS":"hdfs://192.168.0.200:9000",
                                        "encoding":"UTF-8",
                                        "fieldDelimiter":" ",
                                        "fileType":"text",
                                        "path":"/user/*"
                                }
                        },
                        "writer":{
                                "name":"streamwriter",
                                "parameter":{
                                        "print":true
                                }
                        }
                }
        ],
        "setting":{
                "speed":{
                        "channel":3
                }
        }
}

2020-10-30 16:19:59.748 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-10-30 16:19:59.750 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-10-30 16:19:59.752 [main] INFO  JobContainer - DataX jobContainer starts job.
2020-10-30 16:19:59.757 [main] INFO  JobContainer - Set jobId = 0
2020-10-30 16:19:59.782 [job-0] INFO  HdfsReader$Job - init() begin...
2020-10-30 16:20:00.205 [job-0] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":[]}
2020-10-30 16:20:00.206 [job-0] INFO  HdfsReader$Job - init() ok and end...
2020-10-30 16:20:00.222 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2020-10-30 16:20:00.222 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] do prepare work .
2020-10-30 16:20:00.223 [job-0] INFO  HdfsReader$Job - prepare(), start to getAllFiles...
2020-10-30 16:20:00.229 [job-0] INFO  HdfsReader$Job - get HDFS all files in path = [/user/*]
2020-10-30 16:20:07.291 [job-0] INFO  HdfsReader$Job - [hdfs://192.168.0.200:9000/user/datax.txt] is a [text] file, adding it to the source files list
2020-10-30 16:20:07.740 [job-0] INFO  HdfsReader$Job - [hdfs://192.168.0.200:9000/user/datax2.txt] is a [text] file, adding it to the source files list
2020-10-30 16:20:07.744 [job-0] INFO  HdfsReader$Job - Number of files to read: [2], list: [hdfs://192.168.0.200:9000/user/datax2.txt,hdfs://192.168.0.200:9000/user/datax.txt]
2020-10-30 16:20:07.746 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-10-30 16:20:07.748 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2020-10-30 16:20:07.749 [job-0] INFO  JobContainer - Job set Channel-Number to 3 channels.
2020-10-30 16:20:07.751 [job-0] INFO  HdfsReader$Job - split() begin...
2020-10-30 16:20:07.754 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] splits to [2] tasks.
2020-10-30 16:20:07.754 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [2] tasks.
2020-10-30 16:20:07.774 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2020-10-30 16:20:07.789 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2020-10-30 16:20:07.794 [job-0] INFO  JobContainer - Running by standalone Mode.
2020-10-30 16:20:07.804 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [2] channels for [2] tasks.
2020-10-30 16:20:07.821 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-10-30 16:20:07.821 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2020-10-30 16:20:07.838 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-10-30 16:20:07.851 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2020-10-30 16:20:07.910 [0-0-1-reader] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.917 [0-0-1-reader] INFO  Reader$Task - read start
2020-10-30 16:20:07.944 [0-0-0-reader] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.957 [0-0-0-reader] INFO  Reader$Task - read start
2020-10-30 16:20:07.962 [0-0-0-reader] INFO  Reader$Task - reading file : [hdfs://192.168.0.200:9000/user/datax2.txt]
2020-10-30 16:20:07.959 [0-0-1-reader] INFO  Reader$Task - reading file : [hdfs://192.168.0.200:9000/user/datax.txt]
2020-10-30 16:20:08.019 [0-0-0-reader] INFO  UnstructuredStorageReaderUtil - CsvReader using default values [{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}], csvReaderConfig is [null]
2020-10-30 16:20:08.019 [0-0-1-reader] INFO  UnstructuredStorageReaderUtil - CsvReader using default values [{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}], csvReaderConfig is [null]
2020-10-30 16:20:08.038 [0-0-0-reader] INFO  Reader$Task - end read source files...
2020-10-30 16:20:08.038 [0-0-1-reader] INFO  Reader$Task - end read source files...
1       李一    21
2       李二    22
3       李三    23
4       李四    24
5       李五    25
1       张一    21
2       张二    22
3       张三    23
4       张四    24
5       张五    25
2020-10-30 16:20:08.063 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[226]ms
2020-10-30 16:20:08.064 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[214]ms
2020-10-30 16:20:08.066 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-10-30 16:20:17.822 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.823 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2020-10-30 16:20:17.827 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2020-10-30 16:20:17.831 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] do post work.
2020-10-30 16:20:17.837 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2020-10-30 16:20:17.839 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: d:\java\datax\hook
2020-10-30 16:20:17.849 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%


         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 1                  | 1                  | 1                  | 0.032s             | 0.032s             | 0.032s
                 PS Scavenge          | 1                  | 1                  | 1                  | 0.016s             | 0.016s             | 0.016s

2020-10-30 16:20:17.851 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-10-30 16:20:17.853 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.900 [job-0] INFO  JobContainer -
Job start time                  : 2020-10-30 16:19:59
Job end time                    : 2020-10-30 16:20:17
Total elapsed time              :                 18s
Average throughput              :                5B/s
Record write speed              :              1rec/s
Total records read              :                  10
Total read/write failures       :                   0

