
HDFS Reader

HDFS Reader provides the ability to read data stored on the Hadoop distributed file system (HDFS).

Currently HdfsReader supports the following file formats:

  • textfile(text)
  • orcfile(orc)
  • rcfile(rc)
  • sequence file(seq)
  • csvfile(csv)
  • parquet

Features and Limitations

  1. Supports textfile, orcfile, parquet, rcfile, sequence file, and csv format files; the file content must represent a logically two-dimensional table.
  2. Supports reading data of multiple types (all represented internally as String), column pruning, and column constants.
  3. Supports recursive directory traversal and wildcard patterns (* and ?) in paths.
  4. Supports common compression algorithms, including GZIP, SNAPPY, and ZLIB.
  5. Supports concurrent reading of multiple files.
  6. Supports compressed sequence files; currently only the lzo compression method is supported.
  7. For csv files, supports the compression formats gzip, bz2, zip, lzo, lzo_deflate, and snappy.
  8. The plugin is currently built against Hive 3.1.1 and Hadoop 3.1.1, and works normally in Hadoop 2.7.x, Hadoop 3.1.x, Hive 2.x, and Hive 3.1.x test environments; other versions are supported in theory, but please test further before using them in production environments.
  9. Supports Kerberos authentication.
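To illustrate the csv and compression support listed above, a minimal reader `parameter` fragment for reading gzip-compressed csv files might look like the following (the path, NameNode address, and column layout are hypothetical):

```json
{
  "path": "/data/logs/*.gz",
  "defaultFS": "hdfs://namenode:8020",
  "fileType": "csv",
  "compress": "gzip",
  "fieldDelimiter": ",",
  "nullFormat": "\\N",
  "column": [
    { "index": 0, "type": "long" },
    { "index": 1, "type": "string" }
  ]
}
```

Here `compress` must match the actual compression of the files on HDFS, and `nullFormat: "\\N"` causes source values of `\N` to be read as null fields.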

Configuration Example

```json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 3,
        "bytes": -1
      }
    },
    "content": {
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "path": "/user/hive/warehouse/mytable01/*",
          "defaultFS": "hdfs://xxx:port",
          "column": [
            {
              "index": 0,
              "type": "long"
            },
            {
              "index": 1,
              "type": "boolean"
            },
            {
              "type": "string",
              "value": "hello"
            },
            {
              "index": 2,
              "type": "double"
            }
          ],
          "fileType": "orc",
          "encoding": "UTF-8",
          "fieldDelimiter": ","
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": true
        }
      }
    }
  }
}
```

Configuration Parameters

| Configuration | Required | Data Type | Default | Description |
| :--- | :---: | :--- | :--- | :--- |
| path | Yes | string | None | File path to read |
| defaultFS | Yes | string | None | HDFS NameNode address; if HA mode is configured, this is the value of `defaultFS` |
| fileType | Yes | string | None | File type |
| column | Yes | `list<map>` | None | List of fields to read |
| fieldDelimiter | No | char | `,` | Field delimiter for text files; binary formats do not need this |
| encoding | No | string | `utf-8` | File encoding; currently only `utf-8` is supported |
| nullFormat | No | string | None | Characters that represent null; e.g. if configured as `"\\N"`, source data `"\N"` is treated as a null field |
| haveKerberos | No | boolean | None | Whether to enable Kerberos authentication; if enabled, the following two items must also be configured |
| kerberosKeytabFilePath | No | string | None | Path to the Kerberos keytab file, e.g. `/your/path/addax.service.keytab` |
| kerberosPrincipal | No | string | None | Kerberos principal, e.g. `addax/[email protected]` |
| compress | No | string | None | Compression format of the files to read |
| hadoopConfig | No | map | None | Advanced Hadoop settings, such as HA configuration |
| hdfsSitePath | No | string | None | Path to `hdfs-site.xml`; see the detailed explanation below |
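As a sketch of how the Kerberos and `hadoopConfig` parameters above fit together, the fragment below enables Kerberos and configures an HA nameservice. The nameservice name `cluster`, the hosts `node1`/`node2`, and the principal are hypothetical placeholders; the `dfs.*` keys are standard HDFS HA client settings:

```json
{
  "defaultFS": "hdfs://cluster",
  "haveKerberos": true,
  "kerberosKeytabFilePath": "/your/path/addax.service.keytab",
  "kerberosPrincipal": "addax/node1@EXAMPLE.COM",
  "hadoopConfig": {
    "dfs.nameservices": "cluster",
    "dfs.ha.namenodes.cluster": "nn1,nn2",
    "dfs.namenode.rpc-address.cluster.nn1": "node1:8020",
    "dfs.namenode.rpc-address.cluster.nn2": "node2:8020",
    "dfs.client.failover.proxy.provider.cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
  }
}
```

With HA configured this way, `defaultFS` refers to the logical nameservice rather than a single NameNode host, and the client fails over between `nn1` and `nn2` automatically.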