

This post walks through datax-hdfs2stream: using DataX to read text files from HDFS (hdfsreader) and print the records to the console (streamwriter).



5728 SecondaryNameNode
7828 Jps
5893 ResourceManager
5531 NameNode

3895 NodeManager
3772 DataNode
5646 Jps

3745 DataNode
5650 Jps
3868 NodeManager

cat hdfs-core.xml

[root@master hadoop]# cat hdfs-site.xml 
	<!-- Number of replicas HDFS keeps for the stored data -->

1 张一 21
2 张二 22
3 张三 23
4 张四 24
5 张五 25
1 李一 21
2 李二 22
3 李三 23
4 李四 24
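5 李五 25
The delimiter used in the data files just has to match the "fieldDelimiter": " " setting in the job file (a single space here).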

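To see why the delimiter has to match, here is a minimal sketch of splitting one of the sample rows above on a single space and casting the three fields to long, string, long, the same column types the job file declares. This is not how DataX parses files internally (hdfsreader uses its own CSV reader); parse_line is a made-up helper for illustration only.

```python
# Illustrative only: parse_line is a hypothetical helper, not part of DataX.

def parse_line(line, delimiter=" "):
    """Split one text line and cast the fields to (long, string, long)."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d: %r" % (len(fields), line))
    return int(fields[0]), fields[1], int(fields[2])


if __name__ == "__main__":
    # Correct delimiter: the row splits into three typed fields.
    print(parse_line("1 张一 21"))
    # Wrong delimiter: the whole row stays in one field and the check fails.
    try:
        parse_line("1 张一 21", delimiter=",")
    except ValueError as err:
        print("delimiter mismatch:", err)
```

With the wrong delimiter the row is never split, which is why index 1 and index 2 in the column list would have nothing to read.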

{
    "job": {
        "setting": {
            "speed": {
                "channel": 3
            }
        },
        "content": [{
            "reader": {
                "name": "hdfsreader",
                "parameter": {
                    "path": "/user/*",
                    "defaultFS": "hdfs://",
                    "column": [{
                            "index": 0,
                            "type": "long"
                        }, {
                            "index": 1,
                            "type": "string"
                        }, {
                            "index": 2,
                            "type": "long"
                        }
                    ],
                    "fileType": "text",
                    "encoding": "UTF-8",
                    "fieldDelimiter": " "
                }
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {
                    "print": true
                }
            }
        }]
    }
}


"path": "/user/*": where the files are stored can be checked at http://
"defaultFS": "hdfs://": the address of the HDFS NameNode.

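The job file can also be generated instead of edited by hand, which rules out mismatched braces and missing commas. A minimal sketch, assuming Python 3: build_job is a made-up helper, and the defaultFS value in the usage example is a placeholder, because the real NameNode address is not shown in this post.

```python
import json


def build_job(path, default_fs, column_types, delimiter=" ", channel=3):
    """Return a DataX job dict: hdfsreader (text files) -> streamwriter."""
    # One column entry per field, indexed from 0 in file order.
    columns = [{"index": i, "type": t} for i, t in enumerate(column_types)]
    return {
        "job": {
            "setting": {"speed": {"channel": channel}},
            "content": [{
                "reader": {
                    "name": "hdfsreader",
                    "parameter": {
                        "path": path,
                        "defaultFS": default_fs,
                        "column": columns,
                        "fileType": "text",
                        "encoding": "UTF-8",
                        "fieldDelimiter": delimiter,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": True},
                },
            }],
        }
    }


if __name__ == "__main__":
    # "hdfs://namenode.example:9000" is a placeholder, not the author's cluster.
    job = build_job("/user/*", "hdfs://namenode.example:9000",
                    ["long", "string", "long"])
    print(json.dumps(job, indent=4, ensure_ascii=False))
```

Redirecting the printed output to hdfs2stream.json gives a file with the same keys as the hand-written job.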

python datax.py hdfs2stream.json
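For scripted runs, the same launch can be wrapped in Python. A rough sketch, assuming the usual DataX layout where the launcher sits at bin/datax.py under the install directory. build_datax_command, run_job, and the paths in the usage line are illustrative and not taken from this post. The interpreter is a parameter because some DataX releases ship a launcher written for Python 2.

```python
import os
import subprocess


def build_datax_command(datax_home, job_file, interpreter="python"):
    """Build the argv list: <interpreter> <datax_home>/bin/datax.py <job_file>."""
    launcher = os.path.join(datax_home, "bin", "datax.py")
    return [interpreter, launcher, job_file]


def run_job(datax_home, job_file, interpreter="python"):
    """Run the job and return the process exit code (0 means success)."""
    return subprocess.call(build_datax_command(datax_home, job_file, interpreter))


if __name__ == "__main__":
    # Hypothetical paths; adjust them to the local install.
    print(build_datax_command("/opt/datax", "hdfs2stream.json"))
```

Passing the arguments as a list avoids shell quoting problems when the install path contains spaces.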

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.

2020-10-30 16:19:59.688 [main] INFO  VMInfo - VMInfo# operatingSystem class =>
2020-10-30 16:19:59.698 [main] INFO  Engine - the machine info  =>

        osInfo: Oracle Corporation 1.8 25.261-b12
        jvmInfo:        Windows 10 amd64 10.0
        cpu num:        4

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size
        PS Eden Space                  | 256.00MB                       | 256.00MB
        Code Cache                     | 240.00MB                       | 2.44MB
        Compressed Class Space         | 1,024.00MB                     | 0.00MB
        PS Survivor Space              | 42.50MB                        | 42.50MB
        PS Old Gen                     | 683.00MB                       | 683.00MB
        Metaspace                      | -0.00MB                        | 0.00MB

2020-10-30 16:19:59.723 [main] INFO  Engine -
                                        "fieldDelimiter":" ",

2020-10-30 16:19:59.748 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2020-10-30 16:19:59.750 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2020-10-30 16:19:59.752 [main] INFO  JobContainer - DataX jobContainer starts job.
2020-10-30 16:19:59.757 [main] INFO  JobContainer - Set jobId = 0
2020-10-30 16:19:59.782 [job-0] INFO  HdfsReader$Job - init() begin...
2020-10-30 16:20:00.205 [job-0] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":[]}
2020-10-30 16:20:00.206 [job-0] INFO  HdfsReader$Job - init() ok and end...
2020-10-30 16:20:00.222 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2020-10-30 16:20:00.222 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] do prepare work .
2020-10-30 16:20:00.223 [job-0] INFO  HdfsReader$Job - prepare(), start to getAllFiles...
2020-10-30 16:20:00.229 [job-0] INFO  HdfsReader$Job - get HDFS all files in path = [/user/*]
2020-10-30 16:20:07.291 [job-0] INFO  HdfsReader$Job - [hdfs://]是[text]类型的文件, 将该文件加入source files列表
2020-10-30 16:20:07.740 [job-0] INFO  HdfsReader$Job - [hdfs://]是[text]类型的文件, 将该文件加入source files列表
2020-10-30 16:20:07.744 [job-0] INFO  HdfsReader$Job - 您即将读取的文件数为: [2], 列表为: [hdfs://,hdfs://]
2020-10-30 16:20:07.746 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2020-10-30 16:20:07.748 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2020-10-30 16:20:07.749 [job-0] INFO  JobContainer - Job set Channel-Number to 3 channels.
2020-10-30 16:20:07.751 [job-0] INFO  HdfsReader$Job - split() begin...
2020-10-30 16:20:07.754 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] splits to [2] tasks.
2020-10-30 16:20:07.754 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] splits to [2] tasks.
2020-10-30 16:20:07.774 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2020-10-30 16:20:07.789 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2020-10-30 16:20:07.794 [job-0] INFO  JobContainer - Running by standalone Mode.
2020-10-30 16:20:07.804 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [2] channels for [2] tasks.
2020-10-30 16:20:07.821 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2020-10-30 16:20:07.821 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2020-10-30 16:20:07.838 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2020-10-30 16:20:07.851 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
2020-10-30 16:20:07.910 [0-0-1-reader] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.917 [0-0-1-reader] INFO  Reader$Task - read start
2020-10-30 16:20:07.944 [0-0-0-reader] INFO  HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]}
2020-10-30 16:20:07.957 [0-0-0-reader] INFO  Reader$Task - read start
2020-10-30 16:20:07.962 [0-0-0-reader] INFO  Reader$Task - reading file : [hdfs://]
2020-10-30 16:20:07.959 [0-0-1-reader] INFO  Reader$Task - reading file : [hdfs://]
2020-10-30 16:20:08.019 [0-0-0-reader] INFO  UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]
2020-10-30 16:20:08.019 [0-0-1-reader] INFO  UnstructuredStorageReaderUtil - CsvReader使用默认值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":" ","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值为[null]
2020-10-30 16:20:08.038 [0-0-0-reader] INFO  Reader$Task - end read source files...
2020-10-30 16:20:08.038 [0-0-1-reader] INFO  Reader$Task - end read source files...
1       李一    21
2       李二    22
3       李三    23
4       李四    24
5       李五    25
1       张一    21
2       张二    22
3       张三    23
4       张四    24
5       张五    25
2020-10-30 16:20:08.063 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[226]ms
2020-10-30 16:20:08.064 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[214]ms
2020-10-30 16:20:08.066 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2020-10-30 16:20:17.822 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.823 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2020-10-30 16:20:17.827 [job-0] INFO  JobContainer - DataX Writer.Job [streamwriter] do post work.
2020-10-30 16:20:17.831 [job-0] INFO  JobContainer - DataX Reader.Job [hdfsreader] do post work.
2020-10-30 16:20:17.837 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2020-10-30 16:20:17.839 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: d:\java\datax\hook
2020-10-30 16:20:17.849 [job-0] INFO  JobContainer -
         [total cpu info] =>
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu
                -1.00%                         | -1.00%                         | -1.00%

         [total gc info] =>
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime
                 PS MarkSweep         | 1                  | 1                  | 1                  | 0.032s             | 0.032s             | 0.032s
                 PS Scavenge          | 1                  | 1                  | 1                  | 0.016s             | 0.016s             | 0.016s

2020-10-30 16:20:17.851 [job-0] INFO  JobContainer - PerfTrace not enable!
2020-10-30 16:20:17.853 [job-0] INFO  StandAloneJobContainerCommunicator - Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s |  All Task WaitReaderTime 0.379s | Percentage 100.00%
2020-10-30 16:20:17.900 [job-0] INFO  JobContainer -
任务启动时刻                    : 2020-10-30 16:19:59
任务结束时刻                    : 2020-10-30 16:20:17
任务总计耗时                    :                 18s
任务平均流量                    :                5B/s
记录写入速度                    :              1rec/s
读出记录总数                    :                  10
读写失败总数                    :                   0
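
When the job runs unattended, the statistics line printed by StandAloneJobContainerCommunicator can be checked programmatically. A small sketch that pulls the record, byte, and error counts out of that line with a regular expression. The pattern is written against the log shown above and is an assumption about the format, not a documented DataX interface; parse_stats is a made-up name.

```python
import re

# Matches "Total N records, N bytes ... Error N records, N bytes".
STATS_RE = re.compile(
    r"Total (?P<records>\d+) records, (?P<bytes>\d+) bytes"
    r".*?Error (?P<err_records>\d+) records, (?P<err_bytes>\d+) bytes"
)


def parse_stats(line):
    """Return the counters from a DataX statistics line as ints, or None."""
    match = STATS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = (
        "2020-10-30 16:20:17.853 [job-0] INFO  StandAloneJobContainerCommunicator - "
        "Total 10 records, 50 bytes | Speed 5B/s, 1 records/s | "
        "Error 0 records, 0 bytes |  All Task WaitWriterTime 0.001s | "
        " All Task WaitReaderTime 0.379s | Percentage 100.00%"
    )
    stats = parse_stats(sample)
    print(stats)
    # A simple health check: every record read, none failed.
    print("ok" if stats and stats["err_records"] == 0 else "check the log")
```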





