A data warehouse is essentially a central, integrated database containing data from heterogeneous source systems within an organization. The data is transformed to remove inconsistencies, aggregated into summary data, and loaded into the data warehouse.
This database can be accessed by multiple users, ensuring that each group within the organization works from consistent, reliable data. To process the vast volumes of data from heterogeneous source systems effectively, ETL (Extraction, Transformation and Load) tools implement parallel processing.
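As a rough illustration of the extract, transform, and load steps described above, here is a minimal sketch in Python. The source names, field names, and in-memory "warehouse" are all hypothetical; a real ETL tool such as DataStage would do this at scale across many processes.

```python
# Minimal ETL sketch (hypothetical data, not the DataStage API): extract rows
# from heterogeneous sources, transform them to remove inconsistencies,
# aggregate into summary data, and load into a warehouse table.
from collections import defaultdict

def extract():
    # Two "source systems" with inconsistent formats for the same fields.
    crm = [{"region": "EMEA ", "sales": "100"}, {"region": "APAC", "sales": "50"}]
    erp = [{"region": "emea", "sales": 25}, {"region": "APAC", "sales": 75}]
    return crm + erp

def transform(rows):
    # Normalize inconsistent values, then aggregate into summary data.
    summary = defaultdict(int)
    for row in rows:
        region = row["region"].strip().upper()
        summary[region] += int(row["sales"])
    return dict(summary)

def load(summary, warehouse):
    warehouse.update(summary)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse)  # {'EMEA': 125, 'APAC': 125}
```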
Parallel processing is divided into pipeline parallelism and partition parallelism.
IBM Information Server (DataStage) allows us to utilise both parallel processing methods.
Pipeline Parallelism:
DataStage pipelines data (when possible) from one stage to the next, and no extra configuration is required on your part. All the stages involved in the ETL (Extraction, Transformation and Load) process operate simultaneously: a downstream process starts as soon as data is supplied by the upstream process. Pipeline parallelism eliminates the need to store intermediate results on disk.
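The record-at-a-time flow above can be sketched with Python generators. This is illustrative only: generators interleave work rather than running stages as truly concurrent processes the way DataStage does, but they show the key property that each downstream stage consumes records as the upstream stage emits them, with no intermediate result set landed to disk.

```python
# Generator-pipeline sketch of pipeline parallelism: each stage pulls records
# from the previous stage one at a time, so nothing is materialized between
# stages. (DataStage runs the stages as genuinely concurrent processes.)
def extract_stage():
    for record in [" alice,10 ", " bob,20 ", " carol,30 "]:
        yield record

def transform_stage(records):
    for record in records:
        name, value = record.strip().split(",")
        yield (name, int(value))

def load_stage(records):
    table = []
    for record in records:
        # The "load" works while the upstream stages are still producing.
        table.append(record)
    return table

result = load_stage(transform_stage(extract_stage()))
print(result)  # [('alice', 10), ('bob', 20), ('carol', 30)]
```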
Partition Parallelism:
The aim of most partitioning operations is to produce partitions that are as close to equal in size as possible, ensuring an even load across processors. Partition parallelism is well suited to handling very large volumes of data, because it breaks the full data set into partitions, and each partition is handled by a separate instance of the job stages.
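A hash-partitioning step like the one described above can be sketched as follows. The partition count, key field, and `stage_instance` function are hypothetical; in DataStage each partition would be processed by its own player process rather than a plain function call.

```python
# Sketch of partition parallelism (illustrative, not the DataStage API):
# records are hash-partitioned on a key into near-equal chunks, and each
# partition is handled by its own instance of the stage logic.
N_PARTITIONS = 3

def partition(records, n):
    parts = [[] for _ in range(n)]
    for rec in records:
        # Hash partitioning: records with the same key land in the same part.
        parts[hash(rec["key"]) % n].append(rec)
    return parts

def stage_instance(part):
    # One instance of the job stage, working only on its own partition.
    return sum(rec["value"] for rec in part)

records = [{"key": f"k{i}", "value": i} for i in range(12)]
parts = partition(records, N_PARTITIONS)
totals = [stage_instance(p) for p in parts]
print(sum(totals))  # 66  (0 + 1 + ... + 11, regardless of how records split)
```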
Combining pipeline and partition parallelism:
A greater performance gain can be achieved by using pipeline and partition parallelism together. The data is partitioned, and the partitioned data flows through the pipeline, so that a downstream stage processes the partitioned data while the upstream stage is still running. DataStage allows us to use both parallel processing methods in parallel jobs.
Where business requirements dictate, the partitioned data can be repartitioned in DataStage, and the repartitioned data is not landed to disk.
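Combining the two ideas, the following sketch (hypothetical names and keys throughout) runs a generator pipeline per partition and then repartitions the in-flight records by a new key, entirely in memory, mirroring the point that repartitioned data is not written to disk.

```python
# Combined pipeline + partition sketch: each partition flows through its own
# generator pipeline, and a repartition step regroups records in memory.
def source(part):
    for rec in part:
        yield rec

def transform(recs):
    for key, value in recs:
        yield (key.upper(), value * 2)

def repartition(pipelines, n):
    # Redistribute in-flight records on a new key, without landing to disk.
    parts = [[] for _ in range(n)]
    for pipe in pipelines:
        for key, value in pipe:  # consumes while upstream pipelines still run
            parts[len(key) % n].append((key, value))
    return parts

initial = [[("a", 1), ("bb", 2)], [("ccc", 3), ("d", 4)]]
pipelines = [transform(source(p)) for p in initial]
final = repartition(pipelines, 2)
print(final)  # [[('BB', 4)], [('A', 2), ('CCC', 6), ('D', 8)]]
```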
Parallel processing environments:
The environment in which you run your DataStage jobs is determined by your system's architecture and hardware resources.
All parallel-processing environments can be categorized as:
SMP (Symmetrical Multi Processing)
Clusters or MPP (Massively Parallel Processing)
SMP (symmetric multiprocessing), shared memory:
Some hardware resources may be shared among processors.
Processors communicate via shared memory and run a single operating system.
All CPUs share system resources.
MPP (massively parallel processing), shared-nothing:
An MPP system is, in effect, a set of connected SMP systems.
Each processor has exclusive use of hardware resources.
MPP systems are physically housed in the same box.
Cluster Systems:
UNIX systems connected via networks
Cluster systems are often physically dispersed.
Understanding these parallel processing methods and environments enabled me to grasp the complete parallel job architecture in DataStage.