InfoSphere Datastage – How to Improve Sequential File Performance Using Parallel Environment Variables

While extensive use of sequential files is not best practice, sometimes there is no way around it, due to legacy systems and/or existing processes. However, recently, I have encountered a number of customers who are seeing significant performance issues with sequential file intensive processes. Sometimes it’s the job design, but often when you look at the project configuration they still have the default values. This is a quick and easy thing to check and to adjust to get a quick performance win if they’ve not already been adjusted. These are delivered variables, but should seriously be considered for adjustment in nearly all data stage ETL projects. The adjustment must be based on the amount of available memory, the volume of workload that is sequential file intensive, and the environment you’re working in. Some experiential adjustment may be required, but I have provided a few recommendations below.

Environment Variable Properties

Category NameTypeParameter NamePromptSizeDefault Value
Parallel > Operator SpecificStringAPT_FILE_EXPORT_BUFFER_SIZESequential write buffer sizeAdjustable in 8 KB units.  Recommended values for Dev: 2048; Test & Prod: 4096.128
Parallel > Operator SpecificStringAPT_FILE_IMPORT_BUFFER_SIZESequential read buffer sizeAdjustable in 8 KB units.  Recommended values for Dev: 2048; Test & Prod: 4096.128

Related References

InfoSphere DataStage Performance Factors

It seems that the factors affecting Infosphere Datastage keep coming up, so, I thought it might be good to have an organized representation of what some of the major performance factors are and where they fit in the mix.

Load Strategy Zones

Source System

  • RDBMS
  • Indexes
  • Partitioning
  • Striping
  • Parallelism
  • Timestamps
  • Business rules
  • SQL Quality

Architecture

  • CPU’s
  • Memory
  • Drive Mappings & RAMDISK Use
  • Logical Nodes
  • Storage & Maintenance
  • Network I/O Latency

Infosphere and DataStage

  • Project Configuration
  • Nodes
  • Variable Configuration
  • Load Strategy
  • Job Design & Complexity
  • RDBMS Connection Type
  • SQL Quality
  • Licensing

Data Warehouse

  • RDBMS
  • Indexes
  • Partitioning
  • Striping
  • Parallelism
  • Timestamps
  • SQL Quality
  • Warehouse / Data Mart Pattern
  • Schema Structure

Related References

IBM Infosphere Datastage System CPU’s

When preparing to implement, size, and/or tune an IBM Infosphere Datastage system you must think holistically, not just in terms of Datastage needs.  Other system processes (e.g. backups, antivirus, Operations Console, etc.) consume resources and be taken into account.  Also, the size of your system can contribute to your performance issues and the need for tuning.

The Performance of InfoSphere DataStage is nearly exponentially to the number of CPU’s/Cores available for use for which licenses exist and for which nodes are aligned.  The allocation of CPU’s/Cores is cumulative to create the total number of CPU’s/Cores necessary for performance, working from the operating system foundation requirements upwards to additional performance memory.