A data warehouse is essentially a central, integrated database containing data from heterogeneous source systems within an organization. The data is transformed to remove inconsistencies, aggregated into summary data, and loaded into the data warehouse.
This database can be accessed by multiple users, ensuring that each group within the organization works from consistent, reliable data. To process the vast volumes of data from heterogeneous source systems effectively, ETL (Extraction, Transformation and Load) tools implement parallel processing.
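As a rough illustration of the extract, transform, and load steps described above, here is a minimal sketch in Python. The source names, field names, and in-memory "warehouse" are all hypothetical; a real ETL tool such as DataStage would do this at scale across many processes.

```python
# Minimal ETL sketch (hypothetical data, not the DataStage API): extract rows
# from heterogeneous sources, transform them to remove inconsistencies,
# aggregate into summary data, and load into a warehouse table.
from collections import defaultdict

def extract():
    # Two "source systems" with inconsistent formats for the same fields.
    crm = [{"region": "EMEA ", "sales": "100"}, {"region": "APAC", "sales": "50"}]
    erp = [{"region": "emea", "sales": 25}, {"region": "APAC", "sales": 75}]
    return crm + erp

def transform(rows):
    # Normalize inconsistent values, then aggregate into summary data.
    summary = defaultdict(int)
    for row in rows:
        region = row["region"].strip().upper()
        summary[region] += int(row["sales"])
    return dict(summary)

def load(summary, warehouse):
    warehouse.update(summary)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'EMEA': 125, 'APAC': 125}
```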
Parallel processing is divided into pipeline parallelism and partition parallelism.
IBM Information Server (DataStage) allows us to utilise both parallel processing methods.
Pipeline Parallelism:
DataStage pipelines data (when possible) from one stage to the next, and no extra configuration is required on your part. All the stages involved in the ETL (Extraction, Transformation and Load) process operate simultaneously: a downstream process starts as soon as data is supplied by the upstream process. Pipeline parallelism eliminates the need to store intermediate results on disk.
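The record-at-a-time flow above can be sketched with Python generators. This is illustrative only: generators interleave work rather than running stages as truly concurrent processes the way DataStage does, but they show the key property that each downstream stage consumes records as the upstream stage emits them, with no intermediate result set landed to disk.

```python
# Generator-pipeline sketch of pipeline parallelism: each stage pulls records
# from the previous stage one at a time, so nothing is materialized between
# stages. (DataStage runs the stages as genuinely concurrent processes.)
def extract_stage():
    for record in [" alice,10 ", " bob,20 ", " carol,30 "]:
        yield record

def transform_stage(records):
    for record in records:
        name, value = record.strip().split(",")
        yield (name, int(value))

def load_stage(records):
    table = []
    for record in records:
        # The "load" works while the upstream stages are still producing.
        table.append(record)
    return table

result = load_stage(transform_stage(extract_stage()))
print(result)  # [('alice', 10), ('bob', 20), ('carol', 30)]
```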
Partition Parallelism:
The aim of most partitioning operations is to produce partitions that are as close to equal in size as possible, ensuring an even load across processors. Partition parallelism is well suited to handling very large volumes of data, because it breaks the full data set into partitions, and each partition is handled by a separate instance of the job stages.
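A hash-partitioning step like the one described above can be sketched as follows. The partition count, key field, and `stage_instance` function are hypothetical; in DataStage each partition would be processed by its own player process rather than a plain function call.

```python
# Sketch of partition parallelism (illustrative, not the DataStage API):
# records are hash-partitioned on a key into near-equal chunks, and each
# partition is handled by its own instance of the stage logic.
N_PARTITIONS = 3

def partition(records, n):
    parts = [[] for _ in range(n)]
    for rec in records:
        # Hash partitioning: records with the same key land in the same part.
        parts[hash(rec["key"]) % n].append(rec)
    return parts

def stage_instance(part):
    # One instance of the job stage, working only on its own partition.
    return sum(rec["value"] for rec in part)

records = [{"key": f"k{i}", "value": i} for i in range(12)]
parts = partition(records, N_PARTITIONS)
totals = [stage_instance(p) for p in parts]
print(sum(totals))  # 66  (0 + 1 + ... + 11, regardless of how records split)
```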
Combining pipeline and partition parallelism:
A greater performance gain can be achieved by using pipeline and partition parallelism together. The data is partitioned, and the partitioned data flows through the pipeline, so that a downstream stage processes the partitioned data while the upstream stage is still running. DataStage allows us to use both parallel processing methods in parallel jobs.
Where business requirements dictate, the partitioned data can be repartitioned in DataStage, and the repartitioned data is not landed to disk.
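Combining the two ideas, the following sketch (hypothetical names and keys throughout) runs a generator pipeline per partition and then repartitions the in-flight records by a new key, entirely in memory, mirroring the point that repartitioned data is not written to disk.

```python
# Combined pipeline + partition sketch: each partition flows through its own
# generator pipeline, and a repartition step regroups records in memory.
def source(part):
    for rec in part:
        yield rec

def transform(recs):
    for key, value in recs:
        yield (key.upper(), value * 2)

def repartition(pipelines, n):
    # Redistribute in-flight records on a new key, without landing to disk.
    parts = [[] for _ in range(n)]
    for pipe in pipelines:
        for key, value in pipe:  # consumes while upstream pipelines still run
            parts[len(key) % n].append((key, value))
    return parts

initial = [[("a", 1), ("bb", 2)], [("ccc", 3), ("d", 4)]]
pipelines = [transform(source(p)) for p in initial]
final = repartition(pipelines, 2)
print(final)  # [[('BB', 4)], [('A', 2), ('CCC', 6), ('D', 8)]]
```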
Parallel processing environments:
The environment in which you run your DataStage jobs is determined by your system's architecture and hardware resources.
All parallel-processing environments can be categorized as:
SMP (Symmetrical Multi Processing)
Clusters or MPP (Massively Parallel Processing)
SMP (symmetric multiprocessing), shared memory:
Some hardware resources may be shared among processors.
Processors communicate via shared memory and run a single operating system.
All CPUs share system resources.
MPP (massively parallel processing), shared-nothing:
An MPP system is, in effect, a set of connected SMP systems.
Each processor has exclusive use of hardware resources.
MPP systems are physically housed in the same box.
Cluster Systems:
UNIX systems connected via networks
Cluster systems are often physically dispersed.
Understanding these parallel processing methods and environments enabled me to grasp the complete parallel job architecture in DataStage.