Data Collection: Importing DB Data into Hadoop HDFS from the Web


Preface

Continuing from the previous post: there are three requirements in total, and this one records how to import MySQL table data into HDFS from the web side, mainly using the Sqoop2 tool. I have already written a post on Sqoop2, but that one worked from the command line.

This time everything is driven through the Java API, and there are quite a few pitfalls along the way.

Environment

  • OS Debian 8.7
  • Hadoop 2.6.5
  • SpringBoot 1.5.1.RELEASE
  • MySQL 5.7.17 Community Server
  • Sqoop 1.99.7

Project Dependencies

Without further ado, here is the pom.xml dependency file.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.infosys.sqoop</groupId>
    <artifactId>sqoop</artifactId>
    <version>1.0-SNAPSHOT</version>

    <name>sqoop</name>

    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.1.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <hadoop.version>2.6.5</hadoop.version>
        <sqoop.version>1.99.7</sqoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>

        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>javax.servlet-api</artifactId>
            <version>3.1.0</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.sqoop</groupId>
            <artifactId>sqoop-client</artifactId>
            <version>${sqoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.derby</groupId>
            <artifactId>derby</artifactId>
            <version>10.10.2.0</version>
        </dependency>


        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- Test -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.mrunit</groupId>
            <artifactId>mrunit</artifactId>
            <version>1.1.0</version>
            <classifier>hadoop2</classifier>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-minicluster</artifactId>
            <version>${hadoop.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>


    <build>
        <finalName>${project.artifactId}</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-archetype-plugin</artifactId>
                <version>2.2</version>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>

                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>

Spring Boot is used only to stand up the web layer quickly; beyond that, the Hadoop Client and the Sqoop client jars are what matter. Pay attention to the logging setup: several of these dependencies bring their own slf4j binding, so a few of them are excluded here.
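
The original post does not show the application entry point, but with these dependencies a single bootstrap class is all that is needed. A minimal sketch, assuming a standard Spring Boot main class (the class name and package are illustrative, not from the original project):

package com.infosys.sqoop;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Assumed bootstrap class -- not shown in the original post.
@SpringBootApplication
public class SqoopApplication {
    public static void main(String[] args) {
        SpringApplication.run(SqoopApplication.class, args);
    }
}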

Core Sqoop Code

This is only a demo, not a real project, so I took a shortcut and put all the logic in one place. In a real project it would have to be factored out properly: more than one kind of database needs to be supported, and some of the configuration options would be passed in from the web side; a hypothetical request object for that is sketched right below.
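
For illustration only, a request body along the following lines could carry those options. The class and field names are hypothetical and not part of the original project; in a real controller it would be bound with @RequestBody instead of reading hard-coded constants:

// Hypothetical request object for passing import options from the web side.
public class ImportRequest {
    private String jdbcUrl;          // e.g. jdbc:mysql://host:3306/db
    private String username;
    private String password;
    private String schemaName;       // source database/schema
    private String tableName;        // source table
    private String partitionColumn;  // column Sqoop uses to split the import
    private String hdfsOutputDir;    // target directory in HDFS
    // getters and setters omitted
}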

package com.infosys.sqoop.controller;

import com.infosys.sqoop.Driver;
import com.infosys.sqoop.ToFormat;
import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.client.SubmissionCallback;
import org.apache.sqoop.model.*;
import org.apache.sqoop.validation.Status;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

import static com.infosys.sqoop.Constans.*;

/**
 * Description:
 * Company: infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <lusyoe@163.com>
 * Version: 1.0
 * Created: 2017/3/12.
 */
@RestController
public class SqoopController {

    private static final Logger log = LoggerFactory.getLogger(SqoopController.class);


    @PostMapping(value = "/mysql2HDFS")
    public String mysqlToHDFS() throws Exception {

        Driver.mkdir(new String[]{HDFS_OUTPUT});

        SqoopClient client = new SqoopClient(SQOOP_URL);

        configSourceLink(client);

        configDestLink(client);

        configJob(client);

        startJob(client);

        return "SUCCESS!";
    }


    private void startJob(SqoopClient client) throws InterruptedException {

        // Start the job and register the submission callback
        MSubmission submission = client.startJob(JOB_NAME, DEFAULT_SUBMISSION_CALLBACKS, 100);

        log.debug("Job Submission Status: " + submission.getStatus());

        log.debug("Hadoop job id: " + submission.getExternalJobId());
        log.debug("Job link: " + submission.getExternalLink());

        if(submission.getStatus().isFailure()) {
            log.error("Submission has failed: " + submission.getError().getErrorSummary());
            log.error("Corresponding error details: " + submission.getError().getErrorDetails());
        }

    }


    protected static final SubmissionCallback DEFAULT_SUBMISSION_CALLBACKS = new SubmissionCallback() {
        @Override
        public void submitted(MSubmission submission) {
            log.info("Submission submitted: " + submission);
        }

        @Override
        public void updated(MSubmission submission) {
            log.info("Submission updated: " + submission);
        }

        @Override
        public void finished(MSubmission submission) {
            log.info("Submission finished: " + submission);
        }
    };



    private void configJob(SqoopClient client) {

        MJob job = client.createJob(FROM_LINK_NAME, TO_LINK_NAME);

        job.setName(JOB_NAME);

        // Configure the FROM (source) side of the job
        MFromConfig fromJobConfig = job.getFromJobConfig();
        fromJobConfig.getStringInput("fromJobConfig.schemaName")
                .setValue(SOURCE_DB);

        fromJobConfig.getStringInput("fromJobConfig.tableName")
                .setValue(SOURCE_TABLE);


        fromJobConfig.getStringInput("fromJobConfig.partitionColumn")
                .setValue("id");


        // Configure the TO (destination) side of the job
        MToConfig toJobConfig = job.getToJobConfig();


        toJobConfig.getStringInput("toJobConfig.outputDirectory")
                .setValue(HDFS_OUTPUT);

        toJobConfig.getEnumInput("toJobConfig.outputFormat").setValue(ToFormat.TEXT_FILE);


        client.saveJob(job);
    }


    /**
     * Configure the destination (HDFS) link
     */
    private void configDestLink(SqoopClient client) {
        MLink link = client.createLink("hdfs-connector");

        link.setName(TO_LINK_NAME);

        MLinkConfig linkConfig = link.getConnectorLinkConfig();

        linkConfig.getStringInput("linkConfig.confDir")
                .setValue("/home/hadoop/hadoop/etc/hadoop");

        Status status = client.saveLink(link);

        if (status.canProceed()) {
            log.debug("Created Link with Link Name : " + link.getName());
        } else {
            log.debug("Something went wrong creating the link");
        }
    }


    /**
     * Configure the source (JDBC) link
     */
    private void configSourceLink(SqoopClient client) {


        MLink link = client.createLink("generic-jdbc-connector");

        link.setName(FROM_LINK_NAME);

        MLinkConfig linkConfig = link.getConnectorLinkConfig();

        // Link properties: the JDBC connection string for the source database
        linkConfig.getStringInput("linkConfig.connectionString")
                .setValue(DB_SCHEMA);

        // The MySQL JDBC driver class
        linkConfig.getStringInput("linkConfig.jdbcDriver")
                .setValue("com.mysql.jdbc.Driver");

        // Database username and password
        linkConfig.getStringInput("linkConfig.username")
                .setValue(DB_USERNAME);
        linkConfig.getStringInput("linkConfig.password")
                .setValue(DB_PASSWD);

        // Important for MySQL: if this is left unset the generated SQL fails with a
        // syntax error, because by default the dialect encloses schema and table
        // names in double quotes ("), which MySQL does not accept.
        linkConfig.getStringInput("dialect.identifierEnclose")
                .setValue("");

        log.debug("source link conf = " + linkConfig.toString());

        Status status = client.saveLink(link);

        if (status.canProceed()) {
            log.debug("Created Link with Link Name : " + link.getName());
        } else {
            log.debug("Something went wrong creating the link");
        }

    }

}
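
With the application running on Spring Boot's default port (an assumption; adjust to your server.port), the import is triggered by a plain POST to /mysql2HDFS, for example with curl -X POST http://localhost:8080/mysql2HDFS. If you also want to expose the job's progress to the web side, a small status endpoint can be added to the same controller. The sketch below is an assumption on my part: it relies on the name-based getJobStatus(String) method of the 1.99.7 client and on Spring's GetMapping annotation being imported.

    // Sketch of a status endpoint for the same controller. Assumes
    // org.springframework.web.bind.annotation.GetMapping is imported and that
    // SqoopClient.getJobStatus(String) is available as in the 1.99.7 client API.
    @GetMapping(value = "/mysql2HDFS/status")
    public String jobStatus() {
        SqoopClient client = new SqoopClient(SQOOP_URL);
        MSubmission submission = client.getJobStatus(JOB_NAME);
        // MSubmission exposes the overall status and a progress value
        return submission.getStatus() + ", progress: " + submission.getProgress();
    }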

Hadoop Operations

The output directory of a Sqoop import into HDFS must be empty, otherwise the job fails. So here the Hadoop Client is used to create that path up front.

package com.infosys.sqoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;

import java.io.IOException;

import static com.infosys.sqoop.Constans.FS_DEFAULT_FS;
import static com.infosys.sqoop.Constans.HDFS_HOST;

/**
 * Description:
 * Company: infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <lusyoe@163.com>
 * Version: 1.0
 * Created: 2017/3/12.
 */
public class Driver {


    public static void mkdir(String[] args) throws IOException {

        Configuration conf = new Configuration();

        conf.set(FS_DEFAULT_FS, HDFS_HOST);

        GenericOptionsParser optionsParser = new GenericOptionsParser(conf, args);

        String[] remainingArgs = optionsParser.getRemainingArgs();
        if (remainingArgs.length < 1) {
            System.err.println("need a path");
            System.exit(2);
        }

        Path path = new Path(remainingArgs[0]);

        FileSystem fs = FileSystem.get(conf);

        if (!fs.exists(path)) {
            fs.mkdirs(path);
        }

    }
}
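
Conversely, if the directory is left over from an earlier run it will not be empty and the import fails, so it can be useful to clear it first. Below is a minimal sketch of such a helper, written in the style of the Driver class above; it is destructive and shown only as an illustration, not as part of the original project.

    // Sketch: recursively delete a leftover output directory before re-running
    // the import. fs.delete(path, true) removes the directory and its contents.
    public static void cleanOutput(String dir) throws IOException {
        Configuration conf = new Configuration();
        conf.set(FS_DEFAULT_FS, HDFS_HOST);
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(dir);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
    }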

Constants

Finally, here is the constants class as well.

package com.infosys.sqoop;

/**
 * Description:
 * Company: infosys (Infosys Technologies Ltd.)
 * Author: luhaoyuan <lusyoe@163.com>
 * Version: 1.0
 * Created: 2017/3/12.
 */
public final class Constans {

    public static final String FS_DEFAULT_FS = "fs.defaultFS";
    public static final String HDFS_HOST = "hdfs://e5:9000";

    public static final String SQOOP_URL = "http://192.168.1.2:12000/sqoop/";

    // mysql db name
    public static final String SOURCE_DB = "hadoopguide";

    // table name
    public static final String SOURCE_TABLE = "widgets";

    public static final String HDFS_OUTPUT = "/sqoop";

    // job name
    public static final String JOB_NAME = "web-job";

    // from link name
    public static final String FROM_LINK_NAME = "web-link";

    // to link name
    public static final String TO_LINK_NAME = "web-hdfs";


    // DB url
    public static final String DB_SCHEMA = "jdbc:mysql://192.168.1.2:3306/hadoopguide?useSSL=false";

    public static final String DB_USERNAME = "root";
    public static final String DB_PASSWD = "lu123";
}

Postscript

Writing this took the better part of a day, and there really are plenty of pitfalls. Some of the configuration keys have no constants exposed by the API, and MapReduce jobs are awkward to debug, so below are a few of the errors hit during testing.

java.lang.Throwable: 
The statement was aborted because it would have caused a duplicate key 
value in a unique or primary key constraint or unique index identified by 
'FK_SQ_LNK_NAME_UNIQUE' defined on 'SQ_LINK'

I suspect this was caused by not configuring how NULL database fields are handled; in any case, after I gave a value to a field that had been NULL, the error stopped appearing. Given that the unique constraint sits on SQ_LINK (the link table of the Sqoop repository), it may also simply mean that a link with the same name already exists, for example after hitting the endpoint a second time.
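
One way to avoid the constraint violation entirely is to remove any previously created link or job with the same name before calling createLink/createJob again. A rough sketch, assuming the name-based deleteJob(String)/deleteLink(String) methods of the 1.99.7 SqoopClient (the helper name is mine, not from the original code):

    // Sketch: drop leftover objects with our fixed names so the unique-name
    // constraints in the Sqoop repository are not violated. Jobs are removed
    // first because links referenced by a job cannot be deleted.
    private void cleanUpExisting(SqoopClient client) {
        for (MJob job : client.getJobs()) {
            if (JOB_NAME.equals(job.getName())) {
                client.deleteJob(JOB_NAME);
            }
        }
        for (MLink link : client.getLinks()) {
            if (FROM_LINK_NAME.equals(link.getName()) || TO_LINK_NAME.equals(link.getName())) {
                client.deleteLink(link.getName());
            }
        }
    }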

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: 
You have an error in your SQL syntax; check the manual that corresponds 
to your MySQL server version for the right syntax to use near 
'"hadoopguide"."widgets"' at line 1

This is the identifier-quoting problem mentioned above: the dialect's default enclosing character is not supported by MySQL, so it has to be overridden when configuring the source link:

linkConfig.getStringInput("dialect.identifierEnclose")
        .setValue("");

I have not tested other databases; if you have tried them, feel free to share your experience.

