Sanity Checking your Import/Export Data
A key issue that has brought many a process to a halt is the problem of data consistency. If you are transferring data to a system that relies on this data being syntactically correct, such as the transfer of plain text configuration files to a server system, then inconsistent data needs to be either sanitised or weeded out before use. This is particularly the case when updating existing filesystem data.
Take the scenario of a company (Acme Servers) that provides small-scale hosting for a large number of customers, across a fleet of 50 servers. Using their customer database, they can define all the websites and domains for a certain customer, and which server these sites should reside on. Acme Servers wrote a custom report that defines the virtual hosts for each server (like the ‘vhosts.conf’ file), and frequently transfers the correct file to each server in turn. This means both that they can rapidly update all their servers in one go, and that they don’t have to manage custom software on each server to “understand” a new data format, as the configuration files are already prepared in the correct format.
One day their office customer database goes down, and they can’t get to their customer data for an hour. Meanwhile, the server configuration ‘reports’ continue to be sent to each server, although they are totally blank. The net effect is that each of their 50 servers is rendered useless, due to a problem at their central office that really should not have affected their server fleet.
There are a number of ways to avoid this issue. The first is to sanitise data before it leaves; in the case of the server fleet example, the export script could ensure that no blank files are sent, by checking the file size. The transfer could be further safeguarded by using a local copy of Apache to check the configuration syntax before the file is sent.
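The pre-export check described above could look something like the following. This is a minimal sketch, not Acme's actual script: the function name `is_safe_to_export`, the `min_bytes` threshold and the `syntax_check_cmd` hook are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def is_safe_to_export(config_path, syntax_check_cmd=None, min_bytes=1):
    """Return True only if the generated file looks fit to send."""
    path = Path(config_path)
    # A missing or empty report is the failure seen in the outage scenario.
    if not path.is_file() or path.stat().st_size < min_bytes:
        return False
    # Whitespace-only output is as useless as an empty file.
    if not path.read_text(errors="replace").strip():
        return False
    if syntax_check_cmd:
        # Hypothetical hook: run an external checker with the file path
        # appended; a non-zero exit status is taken to mean "invalid".
        result = subprocess.run([*syntax_check_cmd, str(path)], capture_output=True)
        return result.returncode == 0
    return True
```

The checker command is left to the caller because it depends on the local setup. With Apache it might be something along the lines of `["httpd", "-t", "-f"]`, although a virtual host fragment would usually need to be included from a minimal wrapper configuration before it can be tested on its own.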
Another method I’ve used to avoid faulty exports is to pack your data into an archive before it leaves. This method involves storing all your data in a Zip or Tar archive stream, transferring, and unpacking it at the final destination. This can be easily accomplished on-the-fly with PHP or Perl in your export system.
When your destination picks up the archive and unpacks it, an archive with no useful data yields no files. If something went wrong during the export, nothing is extracted, and so the previous data is not overwritten.
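The archive approach can be sketched as follows. The article mentions PHP or Perl; Python's standard `tarfile` module is used here purely for illustration, and the helper names `pack_files` and `unpack_files` are hypothetical. Blank reports are skipped when packing, so an export taken during an outage produces an archive with nothing in it.

```python
import io
import tarfile
from pathlib import Path


def pack_files(files):
    """Pack a {name: bytes} mapping into an in-memory gzipped tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            if not data.strip():
                continue  # skip blank reports so they never reach the server
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack_files(archive_bytes, dest_dir):
    """Extract regular files into dest_dir and return the names written."""
    written = []
    dest = Path(dest_dir)
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            # Keep only the base name so a bad path cannot escape dest_dir.
            target = dest / Path(member.name).name
            target.write_bytes(archive.extractfile(member).read())
            written.append(target.name)
    return written
```

If `unpack_files` returns an empty list, the destination can leave its existing configuration in place and raise an alert instead of reloading.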
Locking and Uniqueness
Once your data arrives safely there are a dozen ways things can still go wrong - be it in the way the data is imported, or in the actions that occur afterwards.
If you are relying on a direct ‘file drop’ method like SCP with a periodic import mechanism, the following will be relevant factors:
- If your import process begins periodically, and there is a large batch of existing files to import, the previous import may not yet have finished. The net effect is chaos: while the first import script is importing, a second kicks in. The first script tries to delete the completed files, but they’re locked by the second. A third import script kicks in while the second is running, and so on. Implementing a mutex could help to ensure that only one instance of the import script runs at a time.
- Other locking problems could be caused by an administrator viewing the file. On UNIX systems, the fuser tool shows which processes are holding a file open, which helps when diagnosing general file locks. They can be handled programmatically by implementing an intermediary import stage that moves the file to a temporary location and checks that the original was deleted before beginning the import cycle.
- Finally, a common problem is that files that are being delivered may not have finished transferring at the time the import script picks them up. This too can be solved by file lock checking and intermediary import stages.
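The mutex and the intermediary stage from the list above might be sketched like this on a UNIX system. It is one possible approach under stated assumptions: the names `single_instance` and `claim_file` are hypothetical, the lock uses `fcntl.flock` (not available on Windows), and the staging directory is assumed to sit on the same filesystem as the drop directory so that the rename is atomic.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this process holds the import lock, else False."""
    handle = open(lock_path, "w")
    try:
        try:
            # Non-blocking exclusive lock: fails at once if another run holds it.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file releases the lock


def claim_file(path, staging_dir):
    """Move a dropped file into staging; return the new path or None."""
    target = os.path.join(staging_dir, os.path.basename(path))
    try:
        os.rename(path, target)  # atomic when both are on one filesystem
    except FileNotFoundError:
        return None  # another process already claimed it
    # Confirm the original is gone before the import cycle starts.
    return target if not os.path.exists(path) else None
```

An import script would wrap its whole run in `with single_instance(...) as acquired:` and exit quietly when `acquired` is false. Note that the rename on its own does not detect a file that is still being written; that needs a separate check, such as the file lock checking mentioned above.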
Another approach to file drop systems is to remove the ‘periodic import’ method and put a drop hook system in its place. On UNIX-based systems this could be accomplished by tailing the system auth log and passing the names of newly completed files to your import script.
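As a rough sketch of the drop hook idea, the function below picks completed uploads out of a stream of log lines. The line format is an assumption, loosely modelled on the kind of message an SFTP server can log when a file is closed; the pattern and the name `completed_uploads` are illustrative and would need adjusting to whatever your own log actually contains.

```python
import re

# Assumed line shape (not taken from the article):
#   close "/path/to/file" bytes read 0 written 1234
CLOSE_PATTERN = re.compile(
    r'close "(?P<path>[^"]+)" bytes read \d+ written (?P<written>\d+)'
)


def completed_uploads(log_lines):
    """Yield paths of files that were written to and then closed."""
    for line in log_lines:
        match = CLOSE_PATTERN.search(line)
        # Ignore files that were only read, i.e. downloads.
        if match and int(match.group("written")) > 0:
            yield match.group("path")
```

The function accepts any iterable of lines, so it can be fed from a process that follows the log file and hands each reported path to the import script as it appears.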