File Manipulation

This is one of the last lessons we are doing on actual C usage. File manipulation is slightly strange in C in that there are some concepts to deal with first. C likes to abstract you away from the mechanics of actually reading data from a CD, hard drive or floppy (or anything else for that matter). The term abstraction means adding a layer between you and the actual mechanical bits. A good example of abstraction would be the ignition sequence on a car. You don’t care how the car starts up, you just stick your key in the ignition and turn. The car starts. Case closed. The interface is the same on every car, but underneath that, every car does it slightly differently, using different components and in some cases a completely different approach. But that is all hidden from you behind the ignition key. In other words, the start up sequence is abstracted from you. Get the idea?
Anyway, file access through C is also abstracted from you. You ask C to open a file (or a stream – we’ll get into that later in this lesson), you bring in the data, you process it, then you close it, and that’s it. C does the rest for you. The interface for opening files is standard across C implementations, although each will probably be handling it a little differently under the hood. Not that you care; that’s what abstraction is for :).
Can you tell I’m just about to buy a new car?

fopen and fclose
Before you use any of these functions, you need to #include <stdio.h> in your source file.
To open a file, you use the fopen function. The prototype looks like this:

FILE *fopen(const char *filename, const char *mode);

So what you are getting back is a pointer to a data type called FILE – which is defined inside of stdio.h by the way. So what is this? Well, it’s a data structure that the C runtime system uses to keep track of which file it has open in the operating system. You shouldn’t be messing with it at all, except to pass the pointer through to other file functions you call. This FILE pointer is what allows you to have several files open at once, since you open one file, get one pointer back, then open another one and get a different one back. When you want to do a file operation on one of the files, you would call fwrite or fread and pass it the pointer for the file you want to manipulate. Easy enough.
If, however, you ask for a file that the system doesn’t see, or won’t let you have, it will tell you by returning a NULL in the FILE pointer the function hands back.
Now, the file name is the file you want to open. This can and should contain the path to the file we want to open. Now we need a quick discussion on pathing and the current path. On a Windows system, every running program has a current path (the current working directory) that it is operating in. Often that’s the directory the program was started from, which may or may not be where the executable lives. So if you run a program from the Windows directory, then the system is currently ‘looking’ at that directory. If you hand fopen a filename of “jake.bla” then it’s going to go looking for that in the directory of /windows. And it probably won’t find it. It’s usually safest to put the full path of the file in the filename – that way messy assumptions are avoided. So the proper filename would be “c:/data/jake.bla”. That way the system will know “Oh, I should forget /windows and start at the top of the directory structure of the C hard drive.” With Windows 95 and up, you can use spaces and long file names.
Now, the mode thing can get complicated. This is how you pass the file system the info on what you want to do with the file, and what kind of file it is that you want to access.
This takes the form of a short string in two parts: how you want to access the file, and what kind of file it is.
For instance

FILE *data_file;

data_file = fopen("c:/data/jake.bla", "rb");
if (!data_file)
{
     printf("Could not open file c:/data/jake.bla for reading\n");
     return;
}

This will open the file c:/data/jake.bla for reading as a binary file, and put the file pointer into the variable data_file. If the system couldn’t open the file for some reason, it will return a NULL into data_file, and you can see us checking to see if this has indeed occurred. If it has, we give an error message and return from the function.
So those two characters – what are the possibilities for me, and why do I care if it’s a text file or just a normal file? Ok – the two files types that C natively supports are text and binary files. If you open the file as a text file, you can effectively print text to it – with a command that looks suspiciously like the printf command we used earlier. Remember that?

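If it’s a binary file, then you are just writing raw data straight out to the file, and every byte goes to and from the disk exactly as it is, zeros included. The major difference really, at the lowest level, is the way the system handles end-of-line characters. In a text file, the C runtime is allowed to translate them for you: on Windows, for instance, a ‘\n’ you write goes out as a carriage return and line feed pair, and gets turned back into a single ‘\n’ when you read it in. On systems like Linux there is no translation, and the two modes behave the same. In a binary file nothing is translated – you control the contents at a higher level. You’ll see what I mean when I list out some of the possible C file functions.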
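Here is a minimal sketch of raw bytes, zeros included, making a round trip through a binary file. The helper name roundtrip_bytes and its parameters are made up for illustration; only fopen, fwrite, fread and fclose come from stdio.h.

```c
#include <stdio.h>

/* Hypothetical helper: writes `count` raw bytes to `path` in binary mode,
   reads them back into `out`, and returns how many bytes came back
   (or -1 if the file could not be opened or written). */
long roundtrip_bytes(const char *path, const unsigned char *bytes,
                     size_t count, unsigned char *out)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(bytes, 1, count, f);
    if (fclose(f) != 0 || written != count)
        return -1;

    f = fopen(path, "rb");
    if (!f)
        return -1;
    size_t got = fread(out, 1, count, f);
    fclose(f);
    return (long)got;
}
```

Feeding it a buffer such as {1, 0, 2, 0, 3} should hand back all five bytes unchanged, because nothing in a binary stream treats a zero as special.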
The different ways you can access a file are read, write or append. Reading, well, you opened the file, all you can do is read in data. Write, obviously you are creating a file if none exists, or if one does, you are zeroing out the data that’s already there. For append, you are doing the same as writing, but you aren’t zeroing what’s already there, you are just tacking your data on the end of the current file.
So, the options are:-

r – read a text file.
w – write a text file
a – append a text file
rb – read a binary file
wb – write a binary file
ab – append a binary file
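To see the difference between the write and append options listed above, here is a small sketch. The helper write_text is hypothetical; it just opens a file with whatever mode string you give it, writes some text and closes the file again.

```c
#include <stdio.h>

/* Hypothetical helper: opens `path` with the given mode, writes `text`,
   and closes the file. Returns 0 on success, -1 on failure. */
int write_text(const char *path, const char *mode, const char *text)
{
    FILE *f = fopen(path, mode);
    if (!f)
        return -1;
    int ok = (fprintf(f, "%s", text) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling write_text("notes.txt", "w", "one\n") and then write_text("notes.txt", "a", "two\n") should leave both lines in the file. Calling it with "w" a second time throws away whatever was there and leaves only the new text.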

Now it is possible to read and write in the same operation. C gives you the capability, but to be honest, it’s possible to really get yourself in a messy situation here. Most of the time, when you need to do this, the best way to handle it is to read the entire file into memory, copy what you still need into a new area of memory, inserting the new stuff then, and then writing the whole thing out over the top of the one that’s there. Most apps do it this way. Of course there’s always a situation where the file size you are manipulating outstrips available memory. That’s when you would be using the read-write option. My point here though is not to assume you will use it, try and use it only in special cases when you really need to. Getting it slightly wrong is one real fast way to screw up important data files.
In order to use this mode, you would use

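r+, w+ or a+ to read/write or append/read/write a text file (r+ needs the file to exist already, w+ wipes or creates it, a+ always writes at the end)
r+b, w+b or a+b to do the same with a binary file (these can also be written rb+, wb+ and ab+)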
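As a rough sketch of one of those special cases, this is what changing a single byte in place might look like with the r+b mode. The helper name patch_byte is invented for this example, and fseek is covered later in the lesson.

```c
#include <stdio.h>

/* Hypothetical helper: overwrites one byte at `offset` in an existing file
   without touching the rest. Returns 0 on success, -1 on failure. */
int patch_byte(const char *path, long offset, unsigned char value)
{
    FILE *f = fopen(path, "r+b");   /* fails if the file does not exist */
    if (!f)
        return -1;
    int ok = (fseek(f, offset, SEEK_SET) == 0) &&
             (fwrite(&value, 1, 1, f) == 1);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

One thing to keep in mind with these modes: if you switch between reading and writing on the same open file, you need an fseek or fflush call in between, otherwise the results are not reliable.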

fread, fscanf, fprintf, fwrite & fflush

To actually read in data from a file, in the case of a binary file, you use the fread function. This reads in blocks of data in one go. You feed this function the size of the data you want to read in, a pointer to where to put it (remember to malloc space first!) and the file you want to read from. For some reason the designers of C decided to give you two arguments for deciding the size of the data you want to read in. The idea is for one to be the size of the data structure you are reading, and the other the number of those structures. Quite why it was beyond them that you can actually do the multiplication yourself I don’t know. Suffice to say I always do the calculation myself, and set the other argument to 1. Just be aware that fread returns the number of complete elements it read, so done this way it returns 1 if everything arrived and 0 if it didn’t. If you want to read in formatted text, then use the fscanf function. This works the same as described in the input section, so I won’t repeat it here, nor will I for the fprintf function. That’s the same as printf is.

fwrite is pretty much the same as fread: you feed it the size of the data you are writing, a pointer to where that data is, and the file you want it to get written to.

With fwrite comes another function, fflush. This one flushes all data in the outgoing stream and forces the system to actually write to a file. Often, when writing data to a file, the internals of the C runtime system won’t actually write anything until sufficient data has been gathered to make a call to the operating system to actually write out the data to a file worthwhile. How much that is depends on the implementation (stdio.h defines a default buffer size, BUFSIZ), but it’s typically somewhere between a few hundred bytes and a few kilobytes. Anything less than that and the C system just keeps the data around internally until either the file is closed, at which time it gets written, the buffer fills up, or an fflush call is made. The usefulness of this function is somewhat limited with today’s systems, but it’s still there should the occasion demand it.

fread
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

Reads data from the given stream into the array pointed to by ptr. It reads nmemb number of elements of size size. The total number of bytes read is (size*nmemb).
On success the number of elements read is returned. On error or end-of-file the total number of elements successfully read (which may be zero) is returned.

fscanf
int fscanf(FILE *stream, const char *format, ...);

This function is the same as the scanf function described in the advanced input section, except that it reads from the given stream instead of the keyboard.

fwrite
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

Writes data from the array pointed to by ptr to the given stream. It writes nmemb number of elements of size size. The total number of bytes written is (size*nmemb).
On success the number of elements written is returned. On error the total number of elements successfully written (which may be zero) is returned.
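Here is a short sketch of fwrite and fread working as a pair, using the size and nmemb arguments the way they were intended. The struct score type and the two function names are invented for this example.

```c
#include <stdio.h>

/* Illustrative record type; not part of the original lesson. */
struct score {
    int  player_id;
    long points;
};

/* Writes `count` records; returns 0 only if every record was written. */
int save_scores(const char *path, const struct score *scores, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(scores, sizeof scores[0], count, f);
    if (fclose(f) != 0)
        return -1;
    return written == count ? 0 : -1;
}

/* Reads up to `max` records; returns how many complete records were read. */
size_t load_scores(const char *path, struct score *scores, size_t max)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    size_t got = fread(scores, sizeof scores[0], max, f);
    fclose(f);
    return got;
}
```

Because the return value counts elements rather than bytes, asking load_scores for up to five records from a file that holds three should return 3. A file written this way is only safe to read back with a program built the same way, since the layout of a struct can differ between compilers and machines.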

fprintf
int fprintf(FILE *stream, const char *format, ...);

This function is the same as the printf function described in the advanced input section. The only difference is that instead of it printing to the screen, it prints to a file. Very useful for creating editable files, logging files, or any files which need to be viewed using an ASCII editor.
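A small sketch of fprintf and fscanf used together on a text file follows. The function names and the “name level” line format are assumptions made up for the example.

```c
#include <stdio.h>

/* Hypothetical helper: saves a name and a level as one line of text. */
int save_player(const char *path, const char *name, int level)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = (fprintf(f, "%s %d\n", name, level) > 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Reads the line back. `name` must have room for at least 32 chars.
   Returns 0 if both fields were parsed, -1 otherwise. */
int load_player(const char *path, char *name, int *level)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    /* %31s stops at whitespace and stores at most 31 chars plus the '\0' */
    int fields = fscanf(f, "%31s %d", name, level);
    fclose(f);
    return fields == 2 ? 0 : -1;
}
```

fscanf returns the number of fields it managed to fill in, so comparing that against the number you asked for is the simplest way to spot a damaged or unexpected file.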

fflush
int fflush(FILE *stream);

Flushes the output buffer of a stream. If stream is a null pointer, then all output buffers are flushed.
If successful, it returns zero. On error it returns EOF.
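One place fflush still earns its keep is a log file, where you want each line on disk before the program has a chance to crash. A minimal sketch, with log_line being a made-up name:

```c
#include <stdio.h>

/* Hypothetical helper: writes one line to an already-open log stream and
   pushes it out of the C runtime's buffer straight away.
   Returns 0 on success, -1 on failure. */
int log_line(FILE *log, const char *message)
{
    if (fprintf(log, "%s\n", message) < 0)
        return -1;
    return fflush(log) == 0 ? 0 : -1;
}
```

Note that fflush only hands the data to the operating system. Whether the operating system has physically written it to the disk at that moment is a separate matter that C does not control.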

feof, ferror, clearerr, rename & remove
A few more file related functions are provided to help you out. If you are processing the file byte by byte, or string by string, you will need to be able to tell when the file has ended. Testing feof will provide you with that functionality, although note that it only reports true after a read has actually tried to go past the end. If there are any in-file errors encountered – reading from a corrupt file for instance – then testing ferror will tell you that something went wrong on that stream. Almost all of the function calls will return a failure value if there is an error, but that value doesn’t distinguish between an error and simply hitting the end of the file; feof and ferror are how you tell the two apart. Neither will tell you what the error actually was; on many systems the errno variable holds that. In order to clear this error status, you must call clearerr – a freshly opened file starts with it clear. And then there are rename and remove. Fairly obvious what these are :).

feof
int feof(FILE *stream);

Tests the end-of-file indicator for the given stream. If the stream is at the end-of-file, then it returns a nonzero value. If it is not at the end of the file, then it returns zero.

ferror
int ferror(FILE *stream);

Tests the error indicator for the given stream. If the error indicator is set, then it returns a nonzero value. If the error indicator is not set, then it returns zero.

clearerr
void clearerr(FILE *stream);

Clears the end-of-file and error indicators for the given stream. Once set, the error indicator stays set until clearerr or rewind is called, or the stream is closed.
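Here is a sketch of how feof and ferror are typically used after a read loop stops. The function name count_bytes is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: counts the bytes in a file one at a time.
   Returns the count, or -1 if the file could not be opened or a read
   error (as opposed to a normal end of file) stopped the loop. */
long count_bytes(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    long count = 0;
    unsigned char byte;
    while (fread(&byte, 1, 1, f) == 1)
        count++;

    /* fread returns 0 in both cases, so ask the stream which one it was */
    int failed = ferror(f);
    int at_end = feof(f);
    fclose(f);
    return (!failed && at_end) ? count : -1;
}
```

The important detail is the order: read first, then test. Checking feof before the read tells you nothing, because the indicator is only set once a read has run off the end.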

rename
int rename(const char *old_filename, const char *new_filename);

Causes the filename referred to by old_filename to be changed to new_filename. If the filename pointed to by new_filename exists, the result is implementation-defined.
On success zero is returned. On error a nonzero value is returned and the file is still accessible by its old filename.

remove
int remove(const char *filename);

Deletes the given filename so that it is no longer accessible (unlinks the file). If the file is currently open, then the result is implementation-defined.
On success zero is returned. On failure a nonzero value is returned.
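A tiny sketch of rename and remove working together. Since what rename does when the new name already exists is implementation-defined, this hypothetical helper (replace_file is not a library function) deletes the target first.

```c
#include <stdio.h>

/* Hypothetical helper: moves `from` to `to`, deleting any existing `to`
   first because rename's behaviour on an existing target varies.
   Returns 0 on success, -1 on failure. */
int replace_file(const char *from, const char *to)
{
    remove(to);     /* fine if this fails because `to` does not exist */
    return rename(from, to) == 0 ? 0 : -1;
}
```

The obvious catch is that if the rename then fails, the old target is already gone, so only use something like this where losing the target is acceptable.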

File Pointers
The way that C reads in data internally is to use a file pointer. This is an internal pointer, and although you can get at it directly (it’s usually in the FILE data structure), you shouldn’t. stdio.h provides you with functions to affect this file pointer should you need to modify it. However, even though this is an internal pointer, you do need to be aware of its existence and what it might be set to. When you open a file for reading, it’s automatically set to the start of the file. When you open a file for writing, if you select the ‘w’ option, then it’s set to the start of the (now empty) file on disk. If you select the append type ‘a’ then it’s set to the end of the file, and every write goes to the end no matter where you move the pointer – along with ‘w’ wiping the existing contents, this is pretty much the only major difference between append and write to be honest.
Now when you read data in, the system increments the file pointer itself, and you don’t need to worry about it. In fact, if you simply open the file, read / write data and then close it, then you never even need to worry about the file pointer at all; it’s all done internally. However, one of the more sophisticated ways of handling a file is to read it all in one go. This requires you to know how big the file you are reading in is, and to do this you need to mess about with the pointers a bit.

fseek, fsetpos, ftell & fgetpos
These are the commands that you would use to move the file pointers around inside of the file access system. To put the file pointer at a specific place, you would either use fseek or fsetpos. If you use fsetpos, then you need to know specifically where in the file you want the pointer to go to. Using fseek means that the pointer will be set relative to a known point in the file, like either where you are now, the start of the file, or the end of the file. To get to the end of the file, you would simply fseek to the end of the file, with an offset of zero. An example of the use of fseek is given at the end of this lesson.
The difference between ftell & fseek and fgetpos & fsetpos is that both ftell and fseek work in bytes directly related to the file size. fsetpos and fgetpos work on a data structure called fpos_t. You need to declare a variable of this type yourself, and you need to have fgetpos fill it in before you can hand it to fsetpos. Basically you have to get a position before you can set one.
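As a quick illustration of seeking relative to a known point, this sketch reads the last few bytes of a binary file by seeking backwards from the end. The name read_tail is invented, and it assumes a binary file on a system where seeking from the end works as expected, which is the usual case.

```c
#include <stdio.h>

/* Hypothetical helper: reads the last `n` bytes of a file into `out`.
   Returns the number of bytes read, or 0 on failure (including when the
   file is shorter than `n` and the seek is refused). */
size_t read_tail(const char *path, void *out, long n)
{
    if (n <= 0)
        return 0;
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    size_t got = 0;
    /* a negative offset from SEEK_END moves backwards from the end */
    if (fseek(f, -n, SEEK_END) == 0)
        got = fread(out, 1, (size_t)n, f);
    fclose(f);
    return got;
}
```

Swapping SEEK_END for SEEK_SET would measure the offset from the start of the file instead, and SEEK_CUR measures it from wherever the file pointer currently is.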

fseek
int fseek(FILE *stream, long int offset, int whence);

Sets the file position of the given stream to offset bytes from the point given by whence, which is one of SEEK_SET (the start of the file), SEEK_CUR (the current position) or SEEK_END (the end of the file). The end-of-file indicator is cleared.
On success zero is returned. On error a nonzero value is returned.

fsetpos
int fsetpos(FILE *stream, const fpos_t *pos);

Sets the file position of the given stream to the given position. The argument pos is a position given by the function fgetpos. The end-of-file indicator is cleared.
On success zero is returned. On error a nonzero value is returned and the variable errno is set.

ftell
long int ftell(FILE *stream);

Returns the current file position of the given stream. If it is a binary stream, then the value is the number of bytes from the beginning of the file. If it is a text stream, then the value is a value useable by the fseek function to return the file position to the current position.
On success the current file position is returned. On error a value of -1L is returned and errno is set.

fgetpos
int fgetpos(FILE *stream, fpos_t *pos);

Gets the current file position of the stream and writes it to pos.
If successful, it returns zero. On error it returns a nonzero value and stores the error number in the variable errno.
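A short sketch of the get-it-before-you-set-it routine with fgetpos and fsetpos: bookmark a position, read past it, then jump back. The function name and parameters are made up for the example, and fgets is the standard line-reading function from stdio.h.

```c
#include <stdio.h>

/* Hypothetical helper: reads the first line of a text file twice by
   bookmarking the position before the first read and jumping back to it.
   `size` is the size of each buffer. Returns 0 on success, -1 on failure. */
int read_line_twice(const char *path, char *first, char *second, int size)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    fpos_t bookmark;    /* you declare it, fgetpos fills it in */
    int ok = (fgetpos(f, &bookmark) == 0) &&
             (fgets(first, size, f) != NULL) &&
             (fsetpos(f, &bookmark) == 0) &&
             (fgets(second, size, f) != NULL);
    fclose(f);
    return ok ? 0 : -1;
}
```

The contents of an fpos_t are private to the C runtime, so the only sensible things to do with one are to fill it with fgetpos and pass it back to fsetpos on the same stream.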

Streams
So let’s take a second and talk about streams. As we’ve seen, we can access files in the two ways demonstrated above (and there are more, but we don’t need to go into those right now, since that’s beyond the scope of this primer). Since there are different ways of accessing the data, we give it a verbal layer of abstraction, and call them “streams”. It’s a slightly geeky thing, but the idea is to get away from thinking in terms of ‘files’ and think in terms of streams of data coming in or going out. The idea is that you can open a stream to any device, not just files on the hard drive (although that’s almost always what you are going to end up doing). Streams can be from the keyboard, from sound devices like MIDI keyboards, CD-ROMs, video or audio streams from the internet and so on. The designers of C realized that they wouldn’t be able to predict what was coming in the future, so they designed C to be able to accommodate what might come, hence the use of streams. In actual fact, the two different ways of accessing data (binary or text) are considered two different stream types, even though they are just different ways of bringing in or writing out to a data file on disk. The end file is the same, it’s just the way C is allowing you to view it. Very much like the discussion of pointers in fact.
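To make the stream idea a bit more concrete, here is a sketch of a function that writes to whatever stream it is handed. The name write_report is made up; stdout is the standard output stream that stdio.h already provides.

```c
#include <stdio.h>

/* Hypothetical helper: writes a short report to whatever stream it is given.
   It has no idea whether `out` is the screen or a file on disk.
   Returns 0 on success, -1 on failure. */
int write_report(FILE *out, int items)
{
    return fprintf(out, "items processed: %d\n", items) > 0 ? 0 : -1;
}
```

Calling write_report(stdout, 3) sends the line to the screen, while passing a FILE pointer that came back from fopen sends exactly the same line to a file. The function itself does not change, which is the whole point of the abstraction.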

Some stuff to remember.
Remember to check to see if your operations are successful. File writing is one of those areas that you have the least control over, since you are dealing with a third entity, the hard drive, and you can never predict when that will get full or have problems. If you are dealing with files on a network (or just on your own machine) you may run into access denial problems. Sometimes, if the program crashes when writing to a file, and doesn’t get to shut it down correctly, then the operating system can assume the file is still in use by someone, and not allow you to either re-write to it, or delete it, until you reboot. Rebooting usually has the effect of resetting the operating system’s record of who has rights to the file – Windows 2k and Windows NT users take note, this can bite you in the ass.
Even if an operation like an fread or an fwrite comes back with an error, ALWAYS CLOSE THE FILE. Some OS’s get positively nasty if you don’t do this correctly. The one exception is a failed fopen: it hands you back a NULL, there is nothing to close, and you must not pass that NULL to fclose.

Something I’ve learned through my years of developing is that any time you are writing to a file, back up the original one first. Just rename it to whatever.bak, then commence manipulation on the ‘new’ file. Once you do the fclose on the new file, then go back and delete the .bak file. If you got as far as the fclose, then nothing bad happened while you were writing to the file. This protects you from a system crash in the middle of a write corrupting what might be vital data.
When I load that file up for reading, I always look first to see if its integrity is ok – I try and load it and if the data I get is not correct for what I’m expecting, I warn the user, then go see if there is a .bak version of the file. If there is, I load that up instead. It’s just one way of protecting your data from operating system glitches or crashes.
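The write side of that backup routine could be sketched roughly like this. It is only one possible way of doing it, under the assumption that the caller supplies both file names; save_with_backup is not a library function.

```c
#include <stdio.h>

/* Hypothetical helper sketching the backup-then-write routine described
   above. `bak_path` is wherever the caller wants the backup kept.
   Returns 0 on success, -1 if the write failed (in which case the backup,
   if there was one, is moved back into place). */
int save_with_backup(const char *path, const char *bak_path,
                     const void *data, size_t size)
{
    /* clear out any stale backup, then move the current file aside;
       both calls can fail harmlessly on the very first save */
    remove(bak_path);
    int had_original = (rename(path, bak_path) == 0);

    FILE *f = fopen(path, "wb");
    int ok = 0;
    if (f) {
        ok = (fwrite(data, 1, size, f) == size);
        if (fclose(f) != 0)
            ok = 0;
    }

    if (ok) {
        remove(bak_path);           /* new file is safely closed */
        return 0;
    }
    if (had_original) {
        remove(path);               /* discard the half-written file */
        rename(bak_path, path);     /* put the original back */
    }
    return -1;
}
```

If the program dies somewhere between the rename and the final remove, the .bak file is still sitting on disk, which is exactly what the loading code goes looking for.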

Another word of advice. When reading in files, it is usually better to read the file into memory in one go, then close it, and then parse the memory chunk. The less time the file is actually open the better, just in case of bad things happening elsewhere in the operating system. Below is some sample code that opens a file, sees how big it is, allocates some memory, then loads it in, and closes the file.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

bool readInFile(const char *filename)
{
    FILE *file_handle;
    char *file_data = NULL;
    long length;

    // open the file and tell us if something went bad.
    file_handle = fopen(filename, "rb");
    if (!file_handle)
    {
        printf("Could not open %s\n", filename);
        return false;
    }

    // work out how long the file is, then reset the file pointer
    fseek(file_handle, 0, SEEK_END);
    length = ftell(file_handle);
    fseek(file_handle, 0, SEEK_SET);
    if (length <= 0)
    {
        printf("Could not get the size of %s (or it is empty)\n", filename);
        fclose(file_handle);
        return false;
    }

    // allocate buffer space for the file, if we didn't get it, tell us
    file_data = malloc((size_t)length);
    if (!file_data)
    {
        printf("file %s too big to fit into memory\n", filename);
        fclose(file_handle);
        return false;
    }

    // read in data and then close file
    if (fread(file_data, (size_t)length, 1, file_handle) != 1)
    {
        printf("Could not read %s\n", filename);
        fclose(file_handle);
        free(file_data);
        return false;
    }
    fclose(file_handle);

    // process data

    // free malloced mem
    free(file_data);

    return true;
}

One nice thing about using this approach is that you don’t need to constantly be testing to see if you’ve run off the end of the file you are reading in. You know exactly how big the file is, and can just check that.
