MACSio  0.9
Multi-purpose, Application-Centric, Scalable I/O Proxy App
MACSIO_MIF

Utilities supporting Multiple Independent File (MIF) Parallel I/O.

Data Structures

struct  _MACSIO_MIF_baton_t
 
struct  _MACSIO_MIF_ioFlags_t
 

Macros

#define MACSIO_MIF_READ   0
 
#define MACSIO_MIF_WRITE   1
 
#define MACSIO_MIF_SCR_OFF   0
 
#define MACSIO_MIF_SCR_ON   1
 

Typedefs

typedef struct _MACSIO_MIF_baton_t MACSIO_MIF_baton_t
 
typedef struct _MACSIO_MIF_ioFlags_t MACSIO_MIF_ioFlags_t
 
typedef void *(* MACSIO_MIF_CreateCB )(const char *fname, const char *nsname, void *udata)
 
typedef void *(* MACSIO_MIF_OpenCB )(const char *fname, const char *nsname, MACSIO_MIF_ioFlags_t ioFlags, void *udata)
 
typedef void(* MACSIO_MIF_CloseCB )(void *file, void *udata)
 

Functions

MACSIO_MIF_baton_t * MACSIO_MIF_Init (int numFiles, MACSIO_MIF_ioFlags_t ioFlags, MPI_Comm mpiComm, int mpiTag, MACSIO_MIF_CreateCB createCb, MACSIO_MIF_OpenCB openCb, MACSIO_MIF_CloseCB closeCb, void *clientData)
 Initialize MACSIO_MIF for a MIF I/O operation.
 
void MACSIO_MIF_Finish (MACSIO_MIF_baton_t *bat)
 End a MACSIO_MIF I/O operation and free resources.
 
void * MACSIO_MIF_WaitForBaton (MACSIO_MIF_baton_t *Bat, char const *fname, char const *nsname)
 Wait for exclusive access to the group's file.
 
void MACSIO_MIF_HandOffBaton (MACSIO_MIF_baton_t const *Bat, void *file)
 Release exclusive access to the group's file.
 
int MACSIO_MIF_RankOfGroup (MACSIO_MIF_baton_t const *Bat, int rankInComm)
 Rank of the group in which a given (global) rank exists.
 
int MACSIO_MIF_RankInGroup (MACSIO_MIF_baton_t const *Bat, int rankInComm)
 Rank within a group of a given (global) rank.
 

Detailed Description

Utilities supporting Multiple Independent File (MIF) Parallel I/O.

In the multiple, independent file (MIF) paradigm, parallelism is achieved through simultaneous access to multiple files. The application divides itself into file groups. For each file group, the application manages exclusive access among all the tasks of the group. I/O is serial within groups but parallel across groups. The number of files (groups) is wholly independent from the number of processors or mesh-parts. It is often chosen to match the number of independent I/O pathways available in the hardware between the compute nodes and the filesystem. The MIF paradigm is sometimes also called N->M because it is N tasks writing to M files (M<N).

In this paradigm, I/O requests are almost exclusively independent. However, there are scenarios where collective I/O requests can be made to work and might even make sense in the MIF paradigm. MIF is often referred to as Poor Man’s parallel I/O because the onus is on the application to manage the distribution of data across potentially many files. In truth, this illuminates the only salient distinction between Single Shared File (SSF) and MIF. In either paradigm, if you dig deep enough into the I/O stack, you soon discover that data is always being distributed across multiple files. The only difference is where in the stack the distribution into files is handled. In the MIF paradigm, it is handled explicitly by the application itself. In a typical SSF paradigm, it is handled by the underlying filesystem.

In MIF, MPI tasks are divided into N groups and each group is responsible for creating one of the N files. At any one moment, only one MPI task from each group has exclusive access to the file. Hence, I/O is serial within a group. However, because one task in each group is writing to its group's own file, simultaneously, I/O is parallel across groups.

pmpio_diagram.png

A call to MACSIO_MIF_Init() establishes this mapping of MPI tasks to file groups.

Within a group, access to the group's file is handled in a round-robin fashion. The first MPI task in the group creates the file and then iterates over all the mesh pieces (domains) it has. For each mesh piece, it creates a sub-directory within the file (e.g., a separate namespace for the objects) and writes the piece there. Then, the first MPI task closes the file and hands off exclusive access to the next task in the group. That MPI task opens the file and iterates over its own mesh pieces in the same way. Exclusive access to the file is then handed off to the next task. This process continues until all MPI tasks in the group have written their mesh pieces to unique sub-directories in the file.

Calls to MACSIO_MIF_WaitForBaton() and MACSIO_MIF_HandOffBaton() handle this handshaking of file creations and/or opens and bracket blocks of code that are performing MIF I/O.
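
The turn-taking described above can be sketched without MPI as a single-process simulation of one group. This is only an illustration of the ordering of create, open, work, and close steps, not the MACSIO_MIF implementation: the names simulate_group, CreateFn, OpenFn, CloseFn, WorkFn, FakeFile, and the demo_ callbacks are all hypothetical, and the "file" is just a text buffer that records what happened to it.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback shapes for this sketch, loosely modeled on the
 * create/open/close callbacks; these are NOT the MACSIO_MIF typedefs. */
typedef void *(*CreateFn)(const char *fname, void *udata);
typedef void *(*OpenFn)(const char *fname, void *udata);
typedef void  (*CloseFn)(void *file, void *udata);
typedef void  (*WorkFn)(void *file, int rankInGroup);

/* Simulate one group's round-robin access in a single process.
 * Each loop iteration stands in for one task holding the baton. */
void simulate_group(int groupSize, const char *fname, CreateFn createCb,
                    OpenFn openCb, CloseFn closeCb, WorkFn work, void *udata)
{
    int r;
    assert(groupSize > 0);
    for (r = 0; r < groupSize; r++) {
        /* "Wait for baton": the first task creates, all others open. */
        void *file = (r == 0) ? createCb(fname, udata) : openCb(fname, udata);
        work(file, r);
        /* "Hand off baton": close so the next task can open the file. */
        closeCb(file, udata);
    }
}

/* Demo "file": a text buffer that logs every operation applied to it. */
typedef struct { char text[256]; } FakeFile;

static void fake_append(FakeFile *f, const char *s)
{
    strncat(f->text, s, sizeof f->text - strlen(f->text) - 1);
}

void *demo_create(const char *fname, void *udata)
{
    FakeFile *f = (FakeFile *)udata;
    (void)fname;               /* the fake file has no name on disk */
    f->text[0] = '\0';         /* creating truncates the log */
    fake_append(f, "create;");
    return f;
}

void *demo_open(const char *fname, void *udata)
{
    FakeFile *f = (FakeFile *)udata;
    (void)fname;
    fake_append(f, "open;");
    return f;
}

void demo_close(void *file, void *udata)
{
    (void)udata;
    fake_append((FakeFile *)file, "close;");
}

void demo_work(void *file, int rankInGroup)
{
    char entry[32];
    snprintf(entry, sizeof entry, "write%d;", rankInGroup);
    fake_append((FakeFile *)file, entry);
}
```

Running simulate_group with a group size of 3 and the demo_ callbacks leaves the log reading create, write0, close, open, write1, close, open, write2, close, which matches the order described above. In a real run each iteration executes on a different MPI task, and the hand-off between tasks is an MPI message rather than a loop increment.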

After all groups have finished with their files, an optional final step may involve creating a master file which contains special metadata objects that point at all the pieces of mesh (domains) scattered about in the N files.

A call to MACSIO_MIF_Finish() frees up resources associated with handling MIF mappings.

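The basic coding structure for a MIF I/O operation is as follows:

MACSIO_MIF_baton_t *bat = MACSIO_MIF_Init(numFiles, ioFlags, mpiComm, mpiTag, createCb, openCb, closeCb, clientData);
/* Blocks until it is this processor's turn; returns whatever createCb or openCb returned */
SomeFileType *fhndl = (SomeFileType *) MACSIO_MIF_WaitForBaton(bat, "GroupName", "ProcName");
.
.
.
< processor's work on the file >
.
.
.
/* Closes the file and passes the baton to the next processor in the group */
MACSIO_MIF_HandOffBaton(bat, fhndl);
MACSIO_MIF_Finish(bat);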

Setting N equal to the number of MPI tasks results in a file-per-process configuration, which is typically not recommended. However, some applications do choose to run this way with good results. At the other extreme, setting N equal to 1 effectively serializes the I/O and is also not recommended. For large, parallel runs, there is typically a sweet spot in the selection of N that yields peak I/O performance. If N is too large, the I/O subsystem will likely be overwhelmed; if it is too small, system resources will likely be underutilized. This is illustrated below for various numbers of files and MPI task counts.

Ale3d_io_perf2.png
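
The two extremes of N can be made concrete by counting how many tasks must take turns on the busiest file. The sketch below assumes only the grouping rule stated in this page (tasks split as evenly as possible across N files); the function name max_serial_turns is hypothetical, and the count says nothing about actual I/O rates, which depend on the filesystem and hardware.

```c
#include <assert.h>

/* Size of the largest group, i.e. the longest chain of tasks that must
 * write one after another to the same file: ceil(numTasks / numFiles). */
int max_serial_turns(int numTasks, int numFiles)
{
    assert(numFiles > 0 && numFiles <= numTasks);
    return (numTasks + numFiles - 1) / numFiles;
}
```

With 1024 tasks, N = 1024 gives a single turn per file (file-per-process) while N = 1 gives 1024 turns on one file (fully serialized). Intermediate values trade fewer serialized turns against more files open at once.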

This approach to scalable, parallel I/O was originally developed in the late 1990s by Rob Neely, a lead software architect on ALE3D at the time. It and variations thereof have since been adopted by many codes and used productively through several transitions in orders of magnitude of MPI task counts from hundreds then to hundreds of thousands today.

There are a large number of advantages to MIF-IO over SSF-IO.


Data Structure Documentation

struct _MACSIO_MIF_baton_t

Definition at line 51 of file macsio_mif.c.

Data Fields
MACSIO_MIF_ioFlags_t ioFlags

Various flags controlling behavior.

MPI_Comm mpiComm

The MPI communicator being used

int commSize

The size of the MPI comm

int rankInComm

Rank of this processor in the MPI comm

int numGroups

Number of groups the MPI comm is divided into

int numGroupsWithExtraProc

Number of groups that contain one extra proc/rank

int groupSize

Nominal size of each group (some groups have one extra)

int groupRank

Rank of this processor's group

int commSplit

Rank of the last MPI task assigned to the groups that have one extra proc/rank

int rankInGroup

Rank of this processor within its group

int procBeforeMe

Rank of processor before this processor in the group

int procAfterMe

Rank of processor after this processor in the group

int mifErr

MIF error value

int mpiErr

MPI error value

int mpiTag

MPI message tag used for all messages here

MACSIO_MIF_CreateCB createCb

Create file callback

MACSIO_MIF_OpenCB openCb

Open file callback

MACSIO_MIF_CloseCB closeCb

Close file callback

void * clientData

Client data to be passed around in calls

struct _MACSIO_MIF_ioFlags_t

Definition at line 153 of file macsio_mif.h.

Data Fields
unsigned int do_wr: 1
unsigned int use_scr: 1
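
As a hedged illustration of how the two one-bit fields might be filled from the macros documented on this page, the sketch below uses a local mirror of the struct so that it compiles without macsio_mif.h. The type DemoIoFlags and the function make_io_flags are hypothetical, and the pairing of do_wr with MACSIO_MIF_READ/MACSIO_MIF_WRITE and of use_scr with MACSIO_MIF_SCR_OFF/MACSIO_MIF_SCR_ON is an assumption inferred from the names, not something this page states.

```c
#include <assert.h>

/* Local copies of the documented macro values, defined only if the real
 * header has not already provided them. */
#ifndef MACSIO_MIF_READ
#define MACSIO_MIF_READ    0
#define MACSIO_MIF_WRITE   1
#define MACSIO_MIF_SCR_OFF 0
#define MACSIO_MIF_SCR_ON  1
#endif

/* Hypothetical mirror of the documented flags struct (two 1-bit fields). */
typedef struct {
    unsigned int do_wr   : 1;
    unsigned int use_scr : 1;
} DemoIoFlags;

/* Build a flags value; the field-to-macro pairing is an assumption. */
DemoIoFlags make_io_flags(int readOrWrite, int scrOnOrOff)
{
    DemoIoFlags flags;
    assert(readOrWrite == MACSIO_MIF_READ || readOrWrite == MACSIO_MIF_WRITE);
    assert(scrOnOrOff == MACSIO_MIF_SCR_OFF || scrOnOrOff == MACSIO_MIF_SCR_ON);
    flags.do_wr   = (unsigned int)readOrWrite;
    flags.use_scr = (unsigned int)scrOnOrOff;
    return flags;
}
```

For example, make_io_flags(MACSIO_MIF_WRITE, MACSIO_MIF_SCR_OFF) yields do_wr set and use_scr clear. Because each field is a single bit, any value other than 0 or 1 would be silently truncated, which is why the sketch asserts on its inputs.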

Macro Definition Documentation

#define MACSIO_MIF_READ   0

Definition at line 148 of file macsio_mif.h.

#define MACSIO_MIF_WRITE   1

Definition at line 149 of file macsio_mif.h.

#define MACSIO_MIF_SCR_OFF   0

Definition at line 150 of file macsio_mif.h.

#define MACSIO_MIF_SCR_ON   1

Definition at line 151 of file macsio_mif.h.

Typedef Documentation

Definition at line 159 of file macsio_mif.h.

typedef void*(* MACSIO_MIF_CreateCB)(const char *fname, const char *nsname, void *udata)

Definition at line 160 of file macsio_mif.h.

typedef void*(* MACSIO_MIF_OpenCB)(const char *fname, const char *nsname, MACSIO_MIF_ioFlags_t ioFlags, void *udata)

Definition at line 161 of file macsio_mif.h.

typedef void(* MACSIO_MIF_CloseCB)(void *file, void *udata)

Definition at line 164 of file macsio_mif.h.

Function Documentation

MACSIO_MIF_baton_t * MACSIO_MIF_Init ( int  numFiles,
MACSIO_MIF_ioFlags_t  ioFlags,
MPI_Comm  mpiComm,
int  mpiTag,
MACSIO_MIF_CreateCB  createCb,
MACSIO_MIF_OpenCB  openCb,
MACSIO_MIF_CloseCB  closeCb,
void *  clientData 
)

Initialize MACSIO_MIF for a MIF I/O operation.

Creates and returns a MACSIO_MIF baton object establishing the mapping between MPI ranks and file groups for a MIF I/O operation.

All processors in the mpiComm communicator must call this function collectively with identical values for numFiles, ioFlags, and mpiTag.

The resultant baton object is used in subsequent calls to WaitFor and HandOff the baton to the next processor in each group.

MACSIO_MIF uses the createCb, openCb, and closeCb callback functions to execute baton waits and handoffs. The HandOff function closes a group's file, and the WaitFor function opens it, except for the first processor in each group, which creates the file instead.

Processors in the mpiComm communicator are broken into numFiles groups. If there is a remainder, R, after dividing the communicator size into numFiles groups, then the first R groups will have one additional processor.
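
The division rule above can be sketched as plain arithmetic. The code below is an illustration, not the library's source: it assumes ranks are assigned to groups in contiguous blocks (this page states how many processors each group gets, but not which ranks), and the names GroupSlot and map_rank_to_group are hypothetical.

```c
#include <assert.h>

/* Hypothetical result type for this sketch; not part of MACSIO_MIF. */
typedef struct {
    int groupRank;    /* which group (file) the rank belongs to */
    int rankInGroup;  /* position of the rank within that group */
    int groupSize;    /* number of ranks in that group */
} GroupSlot;

/* Map a communicator rank to its group. ASSUMPTION: ranks are assigned
 * to groups in contiguous blocks, and the first R groups (R being the
 * remainder) each hold one additional rank. */
GroupSlot map_rank_to_group(int commSize, int numFiles, int rankInComm)
{
    GroupSlot slot;
    int base, extra, split;

    assert(numFiles > 0 && numFiles <= commSize);
    assert(rankInComm >= 0 && rankInComm < commSize);

    base  = commSize / numFiles;   /* nominal group size */
    extra = commSize % numFiles;   /* R: groups holding one extra rank */
    split = extra * (base + 1);    /* first rank outside the larger groups */

    if (rankInComm < split) {
        /* Rank falls in one of the first R groups of size base + 1. */
        slot.groupRank   = rankInComm / (base + 1);
        slot.rankInGroup = rankInComm % (base + 1);
        slot.groupSize   = base + 1;
    } else {
        /* Rank falls in one of the remaining groups of size base. */
        slot.groupRank   = extra + (rankInComm - split) / base;
        slot.rankInGroup = (rankInComm - split) % base;
        slot.groupSize   = base;
    }
    return slot;
}
```

With 10 ranks and 3 files, the remainder is 1, so under the contiguous-block assumption group 0 holds ranks 0 to 3 and groups 1 and 2 hold ranks 4 to 6 and 7 to 9. The same arithmetic answers both questions that MACSIO_MIF_RankOfGroup() and MACSIO_MIF_RankInGroup() are documented to answer, although those functions may compute their results differently.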

Returns
The MACSIO_MIF baton object
Parameters
[in] numFiles: Number of resultant files. Note: this is entirely independent of number of processors. Typically, this number is chosen to match the number of independent I/O pathways between the nodes the application is executing on and the filesystem. Pass MACSIO_MIF_MAX for file-per-processor. Pass MACSIO_MIF_AUTO (currently not supported) to request that MACSIO_MIF determine and use an optimum file count.
[in] ioFlags: See MACSIO_MIF_ioFlags_t for meaning of flags.
[in] mpiComm: The MPI communicator containing all the MPI ranks that will marshall data in the MIF I/O operation.
[in] mpiTag: MPI message tag MACSIO_MIF will use in all MPI messages for this MIF I/O operation.
[in] createCb: Callback MACSIO_MIF should use to create a group's file
[in] openCb: Callback MACSIO_MIF should use to open a group's file
[in] closeCb: Callback MACSIO_MIF should use to close a group's file
[in] clientData: Optional, client specific data MACSIO_MIF will pass to callbacks

Definition at line 102 of file macsio_mif.c.

void MACSIO_MIF_Finish ( MACSIO_MIF_baton_t *  bat )

End a MACSIO_MIF I/O operation and free resources.

Parameters
[in] bat: The MACSIO_MIF baton handle

Definition at line 191 of file macsio_mif.c.

void * MACSIO_MIF_WaitForBaton ( MACSIO_MIF_baton_t *  Bat,
char const *  fname,
char const *  nsname 
)

Wait for exclusive access to the group's file.

All processors call this function collectively. For the first processor in each group, this call returns immediately. For all others in the group, it blocks, waiting for the processor before it to finish its work on the group's file and call the HandOff function.

Returns
A void pointer to whatever data instance the createCb or openCb methods return. The caller must cast this returned pointer to the correct type.
Parameters
[in] Bat: The MACSIO_MIF baton handle
[in] fname: The filename
[in] nsname: The namespace within the file to be used for objects in this code block.

Definition at line 210 of file macsio_mif.c.

void MACSIO_MIF_HandOffBaton ( MACSIO_MIF_baton_t const *  Bat,
void *  file 
)

Release exclusive access to the group's file.

This function closes the group's file for this processor and hands off control to the next processor in the group.

Parameters
[in] Bat: The MACSIO_MIF baton handle
[in] file: A void pointer to the group's file handle

Definition at line 284 of file macsio_mif.c.

int MACSIO_MIF_RankOfGroup ( MACSIO_MIF_baton_t const *  Bat,
int  rankInComm 
)

Rank of the group in which a given (global) rank exists.

Given the rank of a processor in mpiComm used in the MACSIO_MIF_Init() call, this function returns the rank of the group in which the given processor exists. This function can be called from any rank and will return correct values for any rank it is passed.

Parameters
[in] Bat: The MACSIO_MIF baton handle
[in] rankInComm: The (global) rank of a processor for which the rank of its group is desired

Definition at line 311 of file macsio_mif.c.

int MACSIO_MIF_RankInGroup ( MACSIO_MIF_baton_t const *  Bat,
int  rankInComm 
)

Rank within a group of a given (global) rank.

Given the rank of a processor in mpiComm used in the MACSIO_MIF_Init() call, this function returns its rank within its group. This function can be called from any rank and will return correct values for any rank it is passed.

Parameters
[in] Bat: The MACSIO_MIF baton handle
[in] rankInComm: The (global) rank of a processor for which the rank within its group is desired

Definition at line 339 of file macsio_mif.c.