Skip to content

HDFS Writer

HDFS Writer provides the ability to write files in formats like TextFile, ORCFile, Parquet etc. to specified paths in HDFS file system. File content can be associated with tables in Hive.

Configuration Example

json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 2,
        "bytes": -1
      }
    },
    "content": {
      "reader": {
        "name": "streamreader",
        "parameter": {
          "column": [
            {
              "value": "Addax",
              "type": "string"
            },
            {
              "value": 19890604,
              "type": "long"
            },
            {
              "value": "1989-06-04 00:00:00",
              "type": "date"
            },
            {
              "value": true,
              "type": "bool"
            },
            {
              "value": "test",
              "type": "bytes"
            },
            {
              "value": "['tag1', 'tag2', 'tag3']",
              "type": "string"
            },
            {
              "value": "{'loc':'HZ','num':'12'}",
              "type": "string"
            }
          ],
          "sliceRecordCount": 1000
        },
        "writer": {
          "name": "hdfswriter",
          "parameter": {
            "defaultFS": "hdfs://xxx:port",
            "fileType": "orc",
            "path": "/user/hive/warehouse/writerorc.db/orcfull",
            "fileName": "xxxx",
            "column": [
              {
                "name": "col1",
                "type": "string"
              },
              {
                "name": "col2",
                "type": "int"
              },
              {
                "name": "col3",
                "type": "string"
              },
              {
                "name": "col4",
                "type": "boolean"
              },
              {
                "name": "col5",
                "type": "string"
              },
              {
                "name": "col6",
                "type": "array<string>"
              },
              {
                "name": "col7",
                "type": "map<string,string>"
              }
            ],
            "writeMode": "overwrite",
            "fieldDelimiter": "\u0001",
            "compress": "SNAPPY"
          }
        }
      }
    }
  }
}

Parameters

ConfigurationRequiredData TypeDefault ValueDescription
pathYesstringNoneFile path to read
defaultFSYesstringNoneDetailed description below
fileTypeYesstringNoneFile type, detailed description below
fileNameYesstringNoneFilename to write, used as prefix
columnYeslist<map>NoneList of fields to write
writeModeYesstringNoneWrite mode, detailed description below
skipTrashNobooleanfalseWhether to skip trash, related to writeMode configuration
fieldDelimiterNostring,Field delimiter for text files, not needed for binary files
encodingNostringutf-8File encoding configuration, currently only supports utf-8
nullFormatNostringNoneDefine characters representing null, e.g. if user configures "\\N", then if source data is "\N", treat as null field
haveKerberosNobooleanfalseWhether to enable Kerberos authentication, if enabled, need to configure the following two items
kerberosKeytabFilePathNostringNoneCredential file path for Kerberos authentication, e.g. /your/path/addax.service.keytab
kerberosPrincipalNostringNoneCredential principal for Kerberos authentication, e.g. addax/[email protected]
compressNostringNoneFile compression format, see below
hadoopConfigNomapNoneCan configure some advanced parameters related to Hadoop, such as HA configuration
preShellNolistNoneShell commands to execute before writing data, e.g. hive -e "truncate table test.hello"
postShellNolistNoneShell commands to execute after writing data, e.g. hive -e "select count(1) from test.hello"
ignoreErrorNobooleanfalseWhether to ignore errors from preShell and postShell commands
hdfsSitePathNostringNonePath to hdfs-site.xml, see explanation below
createPathNobooleanfalseWhether to create the target path if it does not exist (default: no; the task will fail by default)
bloomColumnsNolistNoneList of column names for which to create a bloom filter (effective only when fileType is orc)
bloomFppNodouble0.05False positive probability for the bloom filter (effective only when fileType is orc)

Path

The HDFS path where data will be stored. HdfsWriter will write multiple files under the configured path directory according to the concurrency settings. To associate the output with a Hive table, specify the Hive table's storage path on HDFS. For example, if the Hive warehouse path is /user/hive/warehouse/, database test and table hello, the storage path would be /user/hive/warehouse/test.db/hello. (If the table was created with a location property, use that location instead.)

defaultFS

The NameNode address for the Hadoop HDFS filesystem. Format: hdfs://ip:port, for example hdfs://127.0.0.1:9000. If HA is enabled, use the service name (e.g. hdfs://sandbox).

fileType

File type. Supported values:

  • text — Text file format
  • orc — ORC file format
  • parquet — Parquet file format
  • rc — RCFile format
  • seq — SequenceFile format
  • csv — Plain HDFS CSV-like (logical 2D table)

column

The list of fields to write. Partial column writes are not supported — to associate with a Hive table you must specify all field names and types. Each column entry uses name for the column name and type for the Hive type.

Example:

json
{
 "column": [
  {
   "name": "userName",
   "type": "string"
  },
  {
   "name": "age",
   "type": "long"
  },
  {
   "name": "salary",
   "type": "decimal(8,2)"
  }
 ]
}

Notes for decimal type:

  1. If precision and scale are not specified, the default decimal(38,10) is used.
  2. If only precision is specified but not scale, scale defaults to 0 (i.e. decimal(p,0)).
  3. If both precision and scale are specified, use decimal(p,s) as provided.

Since 5.0.1, compound types array and map are supported.

writeMode

Controls how existing data under the target path is handled before writing:

  • append: Do not remove existing data; write using the configured fileName and ensure no filename conflicts.
  • overwrite: If the target directory contains data, delete it first, then write.
  • nonConflict: If files with the fileName prefix already exist in the directory, fail with an error.

skipTrash

When writeMode is overwrite, this option controls whether deleted files/folders are moved to the HDFS trash. By default deletions go to the trash; set skipTrash to true to delete directly.

Implementation detail: Addax checks the Hadoop fs.trash.interval parameter. If it is unset or set to 0, Addax will set it to 10080 (7 days) so that trashed files are retained for seven days. This behavior gives a recovery window for data mistakenly deleted during jobs.

compress

When fileType is csv, supported compression formats are: gzip, bz2, zip, lzo, lzo_deflate, hadoop-snappy, framing-snappy.

Note: LZO has two stream formats: lzo and lzo_deflate; be careful to select the correct one. Also, Snappy currently has multiple stream formats — Addax supports the two most common formats: hadoop-snappy (Hadoop's snappy stream format) and framing-snappy (the Google-recommended framing format).

hadoopConfig

hadoopConfig can be used to provide advanced Hadoop configuration, for example HA settings:

json
{
 "hadoopConfig": {
  "dfs.nameservices": "cluster",
  "dfs.ha.namenodes.cluster": "nn1,nn2",
  "dfs.namenode.rpc-address.cluster.nn1": "node1.example.com:8020",
  "dfs.namenode.rpc-address.cluster.nn2": "node2.example.com:8020",
  "dfs.client.failover.proxy.provider.cluster": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
 }
}

Here cluster is the HA nameservice name and should match the value used in defaultFS. If your environment uses a different nameservice identifier, replace cluster accordingly in the configuration keys above.

preShell and postShell

preShell and postShell allow you to run shell commands before and after data writing. Typical uses include truncating a table before loading or checking row counts after loading. For example, to add a partition before writing you might use:

hive -e "alter table test.hello add partition(dt='${logdate}')"

ignoreError

Controls whether errors from preShell and postShell commands should be ignored. If true, failures in these commands do not cause the job to fail — errors are logged and execution continues. If false, failures of preShell/postShell will cause the job to fail.

hdfsSitePath

Introduced in 4.2.4, hdfsSitePath specifies the path to an hdfs-site.xml file, for example:

json
{
 "hdfsSitePath": "/etc/hadoop/conf/hdfs-site.xml"
}

If hdfsSitePath is set, the plugin will read the necessary HDFS configuration from that file, often removing the need to populate hadoopConfig manually. This approach is recommended when Addax is deployed within a Hadoop cluster.

Bloom filter

The bloomColumns and bloomFpp parameters were added in 6.0.10 and apply only when fileType is orc. They control creating bloom filters in ORC files to improve query performance. bloomColumns lists column names to create bloom filters for; bloomFpp sets the false-positive probability (default 0.05).

Example:

json
{
 "fileType": "orc",
 "bloomColumns": ["userName", "age"],
 "bloomFpp": 0.01
}

Short-circuit issue

If you encounter a warning like the following when running a job:

shell
WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.

This indicates that the Hadoop native libraries are missing, which prevents the short-circuit local reads feature from being used. This feature can improve local read performance for HDFS files but is not strictly required.

You can resolve this issue in one of two ways:

  1. Add the Hadoop native library path to the LD_LIBRARY_PATH environment variable, for example:
shell
## for HDP distribution
export LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:$LD_LIBRARY_PATH
## for CDH distribution
export LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native:$LD_LIBRARY_PATH
  1. Or pass the -Djava.library.path parameter to the addax.sh script, for example:
shell
bin/addax.sh -j  "-Djava.library.path=/usr/hdp/current/hadoop-client/lib/native" <your_job.json>

Type mapping

Addax internal typeHIVE data type
LongTINYINT, SMALLINT, INT, INTEGER, BIGINT
DoubleFLOAT, DOUBLE, DECIMAL
StringSTRING, VARCHAR, CHAR
BooleanBOOLEAN
DateDATE, TIMESTAMP
BytesBINARY
StringARRAY, MAP

Features and limitations

  1. Currently not supported: structs, union types.