Cedar Backup Software Manual Kenneth J. Pronovici Copyright © 2005-2008 Kenneth J. Pronovici This work is free; you can redistribute it and/or modify it under the terms of the GNU General Public License (the "GPL"), Version 2, as published by the Free Software Foundation. For the purposes of the GPL, the "preferred form of modification" for this work is the original Docbook XML text files. If you choose to distribute this work in a compiled form (i.e. if you distribute HTML, PDF or Postscript documents based on the original Docbook XML text files), you must also consider image files to be "source code" if those images are required in order to construct a complete and readable compiled version of the work. This work is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Copies of the GNU General Public License are available from the Free Software Foundation website, http://www.gnu.org/. You may also write the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. ------------------------------------------------------------------------------- Table of Contents Preface Purpose Audience Conventions Used in This Book Typographic Conventions Icons Organization of This Manual Acknowledgments 1. Introduction What is Cedar Backup? How to Get Support History 2. Basic Concepts General Architecture Data Recovery Cedar Backup Pools The Backup Process The Collect Action The Stage Action The Store Action The Purge Action The All Action The Validate Action The Initialize Action The Rebuild Action Coordination between Master and Clients Managed Backups Media and Device Types Incremental Backups Extensions 3. Installation Background Installing on a Debian System Installing from Source Installing Dependencies Installing the Source Package 4. 
Command Line Tools Overview The cback command Introduction Syntax Switches Actions The cback-span command Introduction Syntax Switches Using cback-span Sample run 5. Configuration Overview Configuration File Format Sample Configuration File Reference Configuration Options Configuration Peers Configuration Collect Configuration Stage Configuration Store Configuration Purge Configuration Extensions Configuration Setting up a Pool of One Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure your writer device. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test your backup. Step 9: Modify the backup cron jobs. Setting up a Client Peer Node Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure the master in your backup pool. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test your backup. Step 9: Modify the backup cron jobs. Setting up a Master Peer Node Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure your writer device. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test connectivity to client machines. Step 9: Test your backup. Step 10: Modify the backup cron jobs. Configuring your Writer Device Device Types Devices identified by device name Devices identified by SCSI id Linux Notes Finding your Linux CD Writer Mac OS X Notes Optimized Blanking Strategy 6. Official Extensions System Information Extension Subversion Extension MySQL Extension PostgreSQL Extension Mbox Extension Encrypt Extension Split Extension Capacity Extension A.
Extension Architecture Interface B. Dependencies C. Data Recovery Finding your Data Recovering Filesystem Data Full Restore Partial Restore Recovering MySQL Data Recovering Subversion Data Recovering Mailbox Data Recovering Data split by the Split Extension D. Securing Password-less SSH Connections E. Copyright Preface Table of Contents Purpose Audience Conventions Used in This Book Typographic Conventions Icons Organization of This Manual Acknowledgments Purpose This software manual has been written to document the 2.0 series of Cedar Backup, originally released in early 2005. Audience This manual has been written for computer-literate administrators who need to use and configure Cedar Backup on their Linux or UNIX-like system. The examples in this manual assume the reader is relatively comfortable with UNIX and command-line interfaces. Conventions Used in This Book This section covers the various conventions used in this manual. Typographic Conventions Term Used for first use of important terms. Command Used for commands, command output, and switches Replaceable Used for replaceable items in code and text Filenames Used for file and directory names Icons Note This icon designates a note relating to the surrounding text. Tip This icon designates a helpful tip relating to the surrounding text. Warning This icon designates a warning relating to the surrounding text. Organization of This Manual Chapter 1, Introduction Provides some background about how Cedar Backup came to be, its history, some general information about what needs it is intended to meet, etc. Chapter 2, Basic Concepts Discusses the basic concepts of a Cedar Backup infrastructure, and specifies terms used throughout the rest of the manual. Chapter 3, Installation Explains how to install the Cedar Backup package either from the Python source distribution or from the Debian package. Chapter 4, Command Line Tools Discusses the various Cedar Backup command-line tools, including the primary cback command. 
Chapter 5, Configuration Provides detailed information about how to configure Cedar Backup. Chapter 6, Official Extensions Describes each of the officially-supported Cedar Backup extensions. Appendix A, Extension Architecture Interface Specifies the Cedar Backup extension architecture interface, through which third party developers can write extensions to Cedar Backup. Appendix B, Dependencies Provides some additional information about the packages which Cedar Backup relies on, including information about how to find documentation and packages on non-Debian systems. Appendix C, Data Recovery Cedar Backup provides no facility for restoring backups, assuming the administrator can handle this infrequent task. This appendix provides some notes for administrators to work from. Appendix D, Securing Password-less SSH Connections Password-less SSH connections are a necessary evil when remote backup processes need to execute without human interaction. This appendix describes some ways that you can reduce the risk to your backup pool should your master machine be compromised. Acknowledgments The structure of this manual and some of the basic boilerplate has been taken from the book Version Control with Subversion. Many thanks to the authors (and O'Reilly) for making this excellent reference available under a free and open license. There are not very many Cedar Backup users today, but almost all of them have contributed in some way to the documentation in this manual, either by asking questions, making suggestions or finding bugs. I'm glad to have them as users, and I hope that this new release meets their needs even better than the previous release. My wife Julie puts up with a lot. It's sometimes not easy to live with someone who hacks on open source code in his free time, even when you're a pretty good engineer yourself, like she is.
First, she managed to live with a dual-boot Debian and Windoze machine; then she managed to get used to IceWM rather than a prettier desktop; and eventually she even managed to cope with vim when she needed to. Now, even after all that, she has graciously volunteered to edit this manual. I much appreciate her skill with a red pen. Chapter 1. Introduction Table of Contents What is Cedar Backup? How to Get Support History "Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it." - Linus Torvalds, at the release of Linux 2.0.8 in July of 1996. What is Cedar Backup? Cedar Backup is a software package designed to manage system backups for a pool of local and remote machines. Cedar Backup understands how to back up filesystem data as well as MySQL and PostgreSQL databases and Subversion repositories. It can also be easily extended to support other kinds of data sources. Cedar Backup is focused around weekly backups to a single CD or DVD disc, with the expectation that the disc will be changed or overwritten at the beginning of each week. If your hardware is new enough (and almost all hardware is today), Cedar Backup can write multisession discs, allowing you to add incremental data to a disc on a daily basis. Besides offering command-line utilities to manage the backup process, Cedar Backup provides a well-organized library of backup-related functionality, written in the Python programming language. There are many different backup software implementations out there in the free software and open source world. Cedar Backup aims to fill a niche: it aims to be a good fit for people who need to back up a limited amount of important data to CD or DVD on a regular basis. Cedar Backup isn't for you if you want to back up your MP3 collection every night, or if you want to back up a few hundred machines.
However, if you administer a small set of machines and you want to run daily incremental backups for things like system configuration, current email, small web sites, a CVS or Subversion repository, or a small MySQL database, then Cedar Backup is probably worth your time. Cedar Backup has been developed on a Debian GNU/Linux system and is primarily supported on Debian and other Linux systems. However, since it is written in portable Python, it should run without problems on just about any UNIX-like operating system. In particular, full Cedar Backup functionality is known to work on Debian and SuSE Linux systems, and client functionality is also known to work on FreeBSD and Mac OS X systems. To run a Cedar Backup client, you really just need a working Python installation. To run a Cedar Backup master, you will also need a set of other executables, most of which are related to building and writing CD/DVD images. A full list of dependencies is provided in the section called "Installing Dependencies". How to Get Support Cedar Backup is open source software that is provided to you at no cost. It is provided with no warranty, not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. That said, someone can usually help you solve whatever problems you might see. If you experience a problem, your best bet is to write to the Cedar Backup Users mailing list. ^[1] This is a public list for all Cedar Backup users. If you write to this list, you might get help from me, or from some other user who has experienced the same thing you have. If you know that the problem you have found constitutes a bug, or if you would like to make an enhancement request, then feel free to file a bug report in the Cedar Solutions Bug Tracking System. ^[2] If you are not comfortable discussing your problem in public or listing it in a public database, or if you need to send along information that you do not want made public, then you can write .
That mail will go directly to me or to someone else who can help you. If you write the support address about a bug, a "scrubbed" bug report will eventually end up in the public bug database anyway, so if at all possible you should use the public reporting mechanisms. One of the strengths of the open-source software development model is its transparency. Regardless of how you report your problem, please try to provide as much information as possible about the behavior you observed and the environment in which the problem behavior occurred. ^[3] In particular, you should provide: the version of Cedar Backup that you are using; how you installed Cedar Backup (i.e. Debian package, source package, etc.); the exact command line that you executed; any error messages you received, including Python stack traces (if any); and relevant sections of the Cedar Backup log. It would be even better if you could describe exactly how to reproduce the problem, for instance by including your entire configuration file and/or specific information about your system that might relate to the problem. However, please do not provide huge sections of debugging logs unless you are sure they are relevant or unless someone asks for them. Tip Sometimes, the error that Cedar Backup displays can be rather cryptic. This is because under internal error conditions, the text related to an exception might get propagated all of the way up to the user interface. If the message you receive doesn't make much sense, or if you suspect that it results from an internal error, you might want to re-run Cedar Backup with the --stack option. This forces Cedar Backup to dump the entire Python stack trace associated with the error, rather than just printing the last message it received. This is good information to include along with a bug report, as well. History Cedar Backup began life in late 2000 as a set of Perl scripts called kbackup.
These scripts met an immediate need (which was to back up skyjammer.com and some personal machines) but proved to be unstable, overly verbose and rather difficult to maintain. In early 2002, work began on a rewrite of kbackup. The goal was to address many of the shortcomings of the original application, as well as to clean up the code and make it available to the general public. While doing research related to code I could borrow or base the rewrite on, I discovered that there was already an existing backup package with the name kbackup, so I decided to change the name to Cedar Backup instead. Because I had become fed up with the prospect of maintaining a large volume of Perl code, I decided to abandon that language in favor of Python. ^[4] At the time, I chose Python mostly because I was interested in learning it, but in retrospect it turned out to be a very good decision. From my perspective, Python has almost all of the strengths of Perl, but few of its inherent weaknesses (I feel that primarily, Python code often ends up being much more readable than Perl code). Around this same time, skyjammer.com and cedar-solutions.com were converted to run Debian GNU/Linux (potato) ^[5] and I entered the Debian new maintainer queue, so I also made it a goal to implement Debian packages along with a Python source distribution for the new release. Version 1.0 of Cedar Backup was released in June of 2002. We immediately began using it to back up skyjammer.com and cedar-solutions.com, where it proved to be much more stable than the original code. Since then, we have continued to use Cedar Backup for those sites, and Cedar Backup has picked up a handful of other users who have occasionally reported bugs or requested minor enhancements. In the meantime, I continued to improve as a Python programmer and also started doing a significant amount of professional development in Java. 
It soon became obvious that the internal structure of Cedar Backup 1.0, while much better than kbackup, still left something to be desired. In November 2003, I began an attempt at cleaning up the codebase. I converted all of the internal documentation to use Epydoc, ^[6] and updated the code to use the newly-released Python logging package ^[7] after having a good experience with Java's log4j. However, I was still not satisfied with the code, which did not lend itself to the automated regression testing I had used when working with junit in my Java code. So, rather than releasing the cleaned-up code, I instead began another ground-up rewrite in May 2004. With this rewrite, I applied everything I had learned from other Java and Python projects I had undertaken over the last few years. I structured the code to take advantage of Python's unique ability to blend procedural code with object-oriented code, and I made automated unit testing a primary requirement. The result is the 2.0 release, which is cleaner, more compact, better focused, and better documented than any release before it. Utility code is less application-specific, and is now usable as a general-purpose library. The 2.0 release also includes a complete regression test suite of over 3000 tests, which will help to ensure that quality is maintained as development continues into the future. ^[8] -------------- ^[1] See "SF Mailing Lists" at http://cedar-backup.sourceforge.net/. ^[2] See "SF Bug Tracking" at http://cedar-backup.sourceforge.net/. ^[3] See Simon Tatham's excellent bug reporting tutorial: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. ^[4] See http://www.python.org/. ^[5] Debian's stable releases are named after characters in the Toy Story movie. ^[6] Epydoc is a Python code documentation tool. See http://epydoc.sourceforge.net/. ^[7] See http://docs.python.org/lib/module-logging.html. ^[8] Tests are implemented using Python's unit test framework.
See http://docs.python.org/lib/module-unittest.html. Chapter 2. Basic Concepts Table of Contents General Architecture Data Recovery Cedar Backup Pools The Backup Process The Collect Action The Stage Action The Store Action The Purge Action The All Action The Validate Action The Initialize Action The Rebuild Action Coordination between Master and Clients Managed Backups Media and Device Types Incremental Backups Extensions General Architecture Cedar Backup is architected as a Python package (library) and a single executable (a Python script). The Python package provides both application-specific code and general utilities that can be used by programs other than Cedar Backup. It also includes modules that can be used by third parties to extend Cedar Backup or provide related functionality. The cback script is designed to run as root, since otherwise it's difficult to back up system directories or write to the CD/DVD device. However, pains are taken to use the backup user's effective user id (specified in configuration) when appropriate. Note: this does not mean that cback runs setuid ^[9] or setgid. However, all files on disk will be owned by the backup user, and all rsh-based network connections will take place as the backup user. The cback script is configured via command-line options and an XML configuration file on disk. The configuration file is normally stored in /etc/cback.conf, but this path can be overridden at runtime. See Chapter 5, Configuration for more information on how Cedar Backup is configured. Warning You should be aware that backups to CD/DVD media can probably be read by any user who has permissions to mount the CD/DVD writer. If you intend to leave the backup disc in the drive at all times, you may want to consider this when setting up device permissions on your machine. See also the section called "Encrypt Extension". Data Recovery Cedar Backup does not include any facility to restore backups.
Instead, it assumes that the administrator (using the procedures and references in Appendix C, Data Recovery) can handle the task of restoring their own system, using the standard system tools at hand. If I were to maintain recovery code in Cedar Backup, I would almost certainly end up in one of two situations. Either Cedar Backup would only support simple recovery tasks, and those via an interface a lot like that of the underlying system tools; or Cedar Backup would have to include a hugely complicated interface to support more specialized (and hence useful) recovery tasks like restoring individual files as of a certain point in time. In either case, I would end up trying to maintain critical functionality that would be rarely used, and hence would also be rarely tested by end-users. I am uncomfortable asking anyone to rely on functionality that falls into this category. My primary goal is to keep the Cedar Backup codebase as simple and focused as possible. I hope you can understand how the choice of providing documentation, but not code, seems to strike the best balance between managing code complexity and providing the functionality that end-users need. Cedar Backup Pools There are two kinds of machines in a Cedar Backup pool. One machine (the master) has a CD or DVD writer on it and writes the backup to disc. The others (clients) collect data to be written to disc by the master. Collectively, the master and client machines in a pool are called peer machines. Cedar Backup has been designed primarily for situations where there is a single master and a set of other clients that the master interacts with. However, it will just as easily work for a single machine (a backup pool of one) and in fact more users seem to use it like this than any other way.
The Backup Process The Cedar Backup backup process is structured in terms of a set of decoupled actions which execute independently (based on a schedule in cron) rather than through some highly coordinated flow of control. This design decision has both positive and negative consequences. On the one hand, the code is much simpler and can choose to simply abort or log an error if its expectations are not met. On the other hand, the administrator must coordinate the various actions during initial set-up. See the section called "Coordination between Master and Clients" (later in this chapter) for more information on this subject. A standard backup run consists of four steps (actions), some of which execute on the master machine, and some of which execute on one or more client machines. These actions are: collect, stage, store and purge. In general, more than one action may be specified on the command-line. If more than one action is specified, then actions will be taken in a sensible order (generally collect, stage, store, purge). A special all action is also allowed, which implies all of the standard actions in the same sensible order. The cback command also supports several actions that are not part of the standard backup run and cannot be executed along with any other actions. These actions are validate, initialize and rebuild. All of the various actions are discussed further below. See Chapter 5, Configuration for more information on how a backup run is configured. Flexibility Cedar Backup was designed to be flexible. It allows you to decide for yourself which backup steps you care about executing (and when you execute them), based on your own situation and your own priorities. As an example, I always back up every machine I own. I typically keep 7-10 days of staging directories around, but switch CD/DVD media mostly every week. That way, I can periodically take a disc off-site in case the machine gets stolen or damaged.
If you're not worried about these risks, then there's no need to write to disc. In fact, some users prefer to use their master machine as a simple "consolidation point". They don't back up any data on the master, and don't write to disc at all. They just use Cedar Backup to handle the mechanics of moving backed-up data to a central location. This isn't quite what Cedar Backup was written to do, but it is flexible enough to meet their needs. The Collect Action The collect action is the first action in a standard backup run. It executes on both master and client nodes. Based on configuration, this action traverses the peer's filesystem and gathers files to be backed up. Each configured high-level directory is collected up into its own tar file in the collect directory. The tar files can either be uncompressed (.tar) or compressed with either gzip (.tar.gz) or bzip2 (.tar.bz2). There are three supported collect modes: daily, weekly and incremental. Directories configured for daily backups are backed up every day. Directories configured for weekly backups are backed up on the first day of the week. Directories configured for incremental backups are traversed every day, but only the files which have changed (based on a saved-off SHA hash) are actually backed up. Collect configuration also allows for a variety of ways to filter files and directories out of the backup. For instance, administrators can configure an ignore indicator file ^[10] or specify absolute paths or filename patterns ^[11] to be excluded. You can even configure a backup "link farm" rather than explicitly listing files and directories in configuration. This action is optional on the master. You only need to configure and execute the collect action on the master if you have data to back up on that machine. If you plan to use the master only as a "consolidation point" to collect data from other machines, then there is no need to execute the collect action there.
If you run the collect action on the master, it behaves the same there as anywhere else, and you have to stage the master's collected data just like any other client (typically by configuring a local peer in the stage action). The Stage Action The stage action is the second action in a standard backup run. It executes on the master peer node. The master works down the list of peers in its backup pool and stages (copies) the collected backup files from each of them into a daily staging directory by peer name. For the purposes of this action, the master node can be configured to treat itself as a client node. If you intend to back up data on the master, configure the master as a local peer. Otherwise, just configure each of the clients as a remote peer. Local and remote client peers are treated differently. Local peer collect directories are assumed to be accessible via normal copy commands (i.e. on a mounted filesystem) while remote peer collect directories are accessed via an RSH-compatible command such as ssh. If a given peer is not ready to be staged, the stage process will log an error, abort the backup for that peer, and then move on to its other peers. This way, one broken peer cannot break a backup for other peers which are up and running. Keep in mind that Cedar Backup is flexible about what actions must be executed as part of a backup. If you would prefer, you can stop the backup process at this step, and skip the store step. In this case, the staged directories will represent your backup rather than a disc. Note Directories "collected" by another process can be staged by Cedar Backup. If the file cback.collect exists in a collect directory when the stage action is taken, then that directory will be staged. The Store Action The store action is the third action in a standard backup run. It executes on the master peer node.
The master machine determines the location of the current staging directory, and then writes the contents of that staging directory to disc. After the contents of the directory have been written to disc, an optional validation step ensures that the write was successful. If the backup is running on the first day of the week, if the drive does not support multisession discs, or if the --full option is passed to the cback command, the disc will be rebuilt from scratch. Otherwise, a new ISO session will be added to the disc each day the backup runs. This action is entirely optional. If you would prefer to just stage backup data from a set of peers to a master machine, and have the staged directories represent your backup rather than a disc, this is fine. Warning The store action is not supported on the Mac OS X (darwin) platform. On that platform, the "automount" function of the Finder interferes significantly with Cedar Backup's ability to mount and unmount media and write to the CD or DVD hardware. The Cedar Backup writer and image functionality works on this platform, but the effort required to fight the operating system about who owns the media and the device makes it nearly impossible to execute the store action successfully. Current Staging Directory The store action tries to be smart about finding the current staging directory. It first checks the current day's staging directory. If that directory exists, and it has not yet been written to disc (i.e. there is no store indicator), then it will be used. Otherwise, the store action will look for an unused staging directory for either the previous day or the next day, in that order. A warning will be written to the log under these circumstances (controlled by the configuration value). This behavior varies slightly when the --full option is in effect. Under these circumstances, any existing store indicator will be ignored.
Also, the store action will always attempt to use the current day's staging directory, ignoring any staging directories for the previous day or the next day. This way, running a full store action more than once concurrently will always produce the same results. (You might imagine a use case where a person wants to make several copies of the same full backup.) The Purge Action The purge action is the fourth and final action in a standard backup run. It executes both on the master and client peer nodes. Configuration specifies how long to retain files in certain directories, and older files and empty directories are purged. Typically, collect directories are purged daily, and stage directories are purged weekly or slightly less often (if a disc gets corrupted, older backups may still be available on the master). Some users also choose to purge the configured working directory (which is used for temporary files) to eliminate any leftover files which might have resulted from changes to configuration. The All Action The all action is a pseudo-action which causes all of the actions in a standard backup run to be executed together in order. It cannot be combined with any other actions on the command line. Extensions cannot be executed as part of the all action. If you need to execute an extended action, you must specify the other actions you want to run individually on the command line. ^[12] The all action does not have its own configuration. Instead, it relies on the individual configuration sections for all of the other actions. The Validate Action The validate action is used to validate configuration on a particular peer node, either master or client. It cannot be combined with any other actions on the command line. 
The validate action checks that the configuration file can be found, that the configuration file is valid, and that certain portions of the configuration file make sense (for instance, making sure that specified users exist, directories are readable and writable as necessary, etc.). The Initialize Action The initialize action is used to initialize media for use with Cedar Backup. This is an optional step. By default, Cedar Backup does not need to use initialized media and will write to whatever media exists in the writer device. However, if the "check media" store configuration option is set to true, Cedar Backup will check the media before writing to it and will error out if the media has not been initialized. Initializing the media consists of writing a mostly-empty image using a known media label (the media label will begin with "CEDAR BACKUP"). Note that only rewritable media (CD-RW, DVD+RW) can be initialized. It doesn't make any sense to initialize media that cannot be rewritten (CD-R, DVD+R), since Cedar Backup would then not be able to use that media for a backup. You can still configure Cedar Backup to check non-rewritable media; in this case, the check will also pass if the media is apparently unused (i.e. has no media label). The Rebuild Action The rebuild action is an exception-handling action that is executed independent of a standard backup run. It cannot be combined with any other actions on the command line. The rebuild action attempts to rebuild "this week's" disc from any remaining unpurged staging directories. Typically, it is used to make a copy of a backup, replace lost or damaged media, or to switch to new media mid-week for some other reason. To decide what data to write to disc again, the rebuild action looks back and finds the first day of the current week. Then, it finds any remaining staging directories between that date and the current date. If any staging directories are found, they are all written to disc in one big ISO session.
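The date arithmetic behind the rebuild action can be illustrated with a short Python sketch. This is not Cedar Backup's own code: the function names, the Sunday default for the first day of the week, and the <staging root>/YYYY/MM/DD directory layout are all assumptions made for illustration only.

```python
import datetime
import os


def week_start(today, starting_weekday=6):
    """Return the latest date on or before `today` that falls on `starting_weekday`.

    Weekdays follow datetime.date.weekday(): Monday is 0, Sunday is 6.
    The default of Sunday is an assumption made for this sketch.
    """
    offset = (today.weekday() - starting_weekday) % 7
    return today - datetime.timedelta(days=offset)


def staging_dirs_for_rebuild(staging_root, today, starting_weekday=6):
    """List existing staging directories from the start of the week through `today`.

    Assumes a hypothetical <staging_root>/YYYY/MM/DD layout; results are oldest first.
    """
    found = []
    day = week_start(today, starting_weekday)
    while day <= today:
        path = os.path.join(
            staging_root, day.strftime("%Y"), day.strftime("%m"), day.strftime("%d")
        )
        if os.path.isdir(path):  # days that were already purged are simply skipped
            found.append(path)
        day += datetime.timedelta(days=1)
    return found
```

Under these assumptions, everything returned by staging_dirs_for_rebuild() would be the input to the single ISO session described above; an empty list means there is nothing left to rebuild from.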
The rebuild action does not have its own configuration. It relies on configuration for other actions, especially the store action.

Coordination between Master and Clients

Unless you are using Cedar Backup to manage a "pool of one", you will need to set up some coordination between your clients and master to make everything work properly. This coordination isn't difficult (it mostly consists of making sure that operations happen in the right order), but some users are surprised that it is required and want to know why Cedar Backup can't just "take care of it for me".

Essentially, each client must finish collecting all of its data before the master begins staging it, and the master must finish staging data from a client before that client purges its collected data. Administrators may need to experiment with the time between the collect and purge cron entries so that the master has enough time to stage data before it is purged.

Managed Backups

Cedar Backup also supports an optional feature called the "managed backup". This feature is intended for use with remote clients where cron is not available (for instance, SourceForge shell accounts).

When managed backups are enabled, managed clients must still be configured as usual. However, rather than using a cron job on the client to execute the collect and purge actions, the master executes these actions on the client via a remote shell.

To make this happen, first set up one or more managed clients in Cedar Backup configuration. Then, invoke Cedar Backup with the --managed command-line option. Whenever Cedar Backup invokes an action locally, it will invoke the same action on each of the managed clients.

Technically, this feature works for any client, not just clients that don't have cron available. Used this way, it can simplify the setup process, because cron only has to be configured on the master. For some users, that may be motivation enough to use this feature all of the time.
However, please keep in mind that this feature depends on a stable network. If your network connection drops, your backup will be interrupted and will not be complete. It is even possible that some of the Cedar Backup metadata (like incremental backup state) will be corrupted. The risk is not high, but it is something you need to be aware of if you choose to use this optional feature.

Media and Device Types

Cedar Backup is focused around writing backups to CD or DVD media using a standard SCSI or IDE writer. In Cedar Backup terms, the disc itself is referred to as the media, and the CD/DVD drive is referred to as the device or sometimes the backup device. ^[13]

When using a new enough backup device, a new "multisession" ISO image ^[14] is written to the media on the first day of the week, and then additional multisession images are added to the media each day that Cedar Backup runs. This way, the media is complete and usable at the end of every backup run, but a single disc can be used all week long. If your backup device does not support multisession images (which is really unusual today), then a new ISO image will be written to the media each time Cedar Backup runs (and you should probably confine yourself to the "daily" backup mode to avoid losing data).

Cedar Backup currently supports four different kinds of CD media:

cdr-74
    74-minute non-rewritable CD media

cdrw-74
    74-minute rewritable CD media

cdr-80
    80-minute non-rewritable CD media

cdrw-80
    80-minute rewritable CD media

I have chosen to support just these four types of CD media because they seem to be the most "standard" of the various types commonly sold in the U.S. as of this writing (early 2005). If you regularly use an unsupported media type and would like Cedar Backup to support it, send me information about the capacity of the media in megabytes (MB) and whether it is rewritable.
Cedar Backup also supports two kinds of DVD media:

dvd+r
    Single-layer non-rewritable DVD+R media

dvd+rw
    Single-layer rewritable DVD+RW media

The underlying growisofs utility does support other kinds of media (including DVD-R, DVD-RW and Blu-ray) which work somewhat differently than standard DVD+R and DVD+RW media. I don't support these other kinds of media because I haven't had any opportunity to work with them. The same goes for dual-layer media of any type.

Incremental Backups

Cedar Backup supports three different kinds of backups for individual collect directories. These are daily, weekly and incremental backups. Directories using the daily mode are backed up every day. Directories using the weekly mode are only backed up on the first day of the week, or when the --full option is used. Directories using the incremental mode are always backed up on the first day of the week (like a weekly backup), but after that only the files which have changed are actually backed up on a daily basis.

In Cedar Backup, incremental backups are not based on date, but are instead based on saved checksums, one for each backed-up file. When a full backup is run, Cedar Backup gathers a checksum value ^[15] for each backed-up file. The next time an incremental backup is run, Cedar Backup checks its list of file/checksum pairs for each file that might be backed up. If the file's checksum value does not match the saved value, or if the file does not appear in the list of file/checksum pairs, then it will be backed up and a new checksum value will be placed into the list. Otherwise, the file will be ignored and the checksum value will be left unchanged.

Cedar Backup stores the file/checksum pairs in .sha files in its working directory, one file per configured collect directory. The mappings in these files are reset at the start of the week or when the --full option is used.
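The checksum comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code Cedar Backup actually uses: the function names file_digest and select_changed_files are hypothetical, the saved mapping is an in-memory dictionary standing in for the contents of a .sha file, and SHA-1 is assumed based on footnote 15.

```python
import hashlib


def file_digest(path):
    """Return the SHA-1 hex digest of a file's contents."""
    sha = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so that large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            sha.update(chunk)
    return sha.hexdigest()


def select_changed_files(paths, saved, full=False):
    """Return the files to back up and update the `saved` mapping in place.

    `saved` maps a file path to the digest recorded at its last backup
    (a stand-in for one .sha file).  Hypothetical helper for illustration.
    """
    if full:
        # A full backup resets the mappings, so every file is backed up.
        saved.clear()
    to_back_up = []
    for path in paths:
        digest = file_digest(path)
        # Back up files that are new or whose contents have changed.
        if saved.get(path) != digest:
            to_back_up.append(path)
            saved[path] = digest
    return to_back_up
```

Because the decision depends only on file contents, a file whose modification time changes but whose contents do not is not backed up again.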
Because the .sha files are used for an entire week, you should never purge the working directory more frequently than once per week.

Extensions

Imagine that there is a third party developer who understands how to back up a certain kind of database repository. This third party might want to integrate his or her specialized backup into the Cedar Backup process, perhaps thinking of the database backup as a sort of "collect" step.

Prior to Cedar Backup 2.0, any such integration would have been completely independent of Cedar Backup itself. The "external" backup functionality would have had to maintain its own configuration and would not have had access to any Cedar Backup configuration.

Starting with version 2.0, Cedar Backup allows extensions to the backup process. An extension is an action that isn't part of the standard backup process (i.e. not collect, stage, store or purge), but can be executed by Cedar Backup when properly configured.

Extension authors implement an "action process" function with a certain interface, and are allowed to add their own sections to the Cedar Backup configuration file, so that all backup configuration can be centralized. Then, the action process function is associated with an action name which can be executed from the cback command line like any other action.

Hopefully, as the Cedar Backup 2.0 user community grows, users will contribute their own extensions back to the community. Well-written general-purpose extensions will be accepted into the official codebase.

Note

Users should see Chapter 5, Configuration for more information on how extensions are configured, and Chapter 6, Official Extensions for details on all of the officially-supported extensions.

Developers may be interested in Appendix A, Extension Architecture Interface.
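To make the idea of named actions executed "in a sensible order" more concrete, here is a small Python sketch of how standard and extended actions might be ordered by sequence number. Everything in it is an assumption made for illustration: the numeric values, the order_actions function and the "mysql" action name are hypothetical and are not taken from Cedar Backup's code or configuration format.

```python
# Hypothetical sequence numbers for the four standard actions.  In a real
# setup, extended actions get their sequence numbers from configuration.
STANDARD_ACTIONS = {"collect": 100, "stage": 200, "store": 300, "purge": 400}


def order_actions(requested, extensions):
    """Sort requested action names into execution order.

    `extensions` maps an extended action name to its configured sequence
    number.  Illustrative helper only, not part of Cedar Backup.
    """
    order = {}
    for name in requested:
        if name in STANDARD_ACTIONS:
            order[name] = STANDARD_ACTIONS[name]
        elif name in extensions:
            order[name] = extensions[name]
        else:
            raise ValueError("unknown action: %s" % name)
    # Lower sequence numbers run first, whatever order the user typed.
    return sorted(order, key=order.get)
```

For example, under these assumed values, order_actions(["purge", "mysql", "collect"], {"mysql": 150}) would place the hypothetical "mysql" action between collect and purge.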
--------------

^[9] See http://en.wikipedia.org/wiki/Setuid

^[10] Analogous to .cvsignore in CVS

^[11] In terms of Python regular expressions

^[12] Some users find this surprising, because extensions are configured with sequence numbers. I did it this way because I felt that running extensions as part of the all action would sometimes result in surprising behavior. I am not planning to change the way this works.

^[13] My original backup device was an old Sony CRX140E 4X CD-RW drive. It has since died, and I currently develop using a Lite-On 1673S DVD±RW drive.

^[14] An ISO image is the standard way of creating a filesystem to be copied to a CD or DVD. It is essentially a "filesystem-within-a-file" and many UNIX operating systems can actually mount ISO image files just like hard drives, floppy disks or actual CDs. See Wikipedia for more information: http://en.wikipedia.org/wiki/ISO_image.

^[15] The checksum is actually an SHA cryptographic hash. See Wikipedia for more information: http://en.wikipedia.org/wiki/SHA-1.

Chapter 3. Installation

Table of Contents

Background
Installing on a Debian System
Installing from Source
    Installing Dependencies
    Installing the Source Package

Background

There are two different ways to install Cedar Backup. The easiest way is to install the pre-built Debian packages. This method is painless and ensures that all of the correct dependencies are available, etc.

If you are running a Linux distribution other than Debian or you are running some other platform like FreeBSD or Mac OS X, then you must use the Python source distribution to install Cedar Backup. When using this method, you need to manage all of the dependencies yourself.

Non-Linux Platforms

Cedar Backup has been developed on a Debian GNU/Linux system and is primarily supported on Debian and other Linux systems. However, since it is written in portable Python, it should run without problems on just about any UNIX-like operating system.
In particular, full Cedar Backup functionality is known to work on Debian and SuSE Linux systems, and client functionality is also known to work on FreeBSD and Mac OS X systems.

To run a Cedar Backup client, you really just need a working Python installation. To run a Cedar Backup master, you will also need a set of other executables, most of which are related to building and writing CD/DVD images. A full list of dependencies is provided further on in this chapter.

If you would like to use Cedar Backup on a non-Linux system, you should install the Python source distribution along with all of the indicated dependencies. Then, please report back to the Cedar Backup Users mailing list ^[16] with information about your platform and any problems you encountered.

Installing on a Debian System

The easiest way to install Cedar Backup onto a Debian system is by using a tool such as apt-get or aptitude.

If you are running a Debian release which contains Cedar Backup, you can use your normal Debian mirror as an APT data source. (The Debian "etch" release is the first release to contain Cedar Backup.) Otherwise, you need to install from the Cedar Solutions APT data source. To do this, add the Cedar Solutions APT data source to your /etc/apt/sources.list file. ^[17]

After you have configured the proper APT data source, install Cedar Backup using this set of commands:

   $ apt-get update
   $ apt-get install cedar-backup2 cedar-backup2-doc

Several of the Cedar Backup dependencies are listed as "recommended" rather than required. If you are installing Cedar Backup on a master machine, you must install some or all of the recommended dependencies, depending on which actions you intend to execute. The stage action normally requires ssh, and the store action requires eject and either cdrecord/mkisofs or dvd+rw-tools. Clients must also install some sort of ssh server if a remote master will collect backups from them.
If you would prefer, you can also download the .deb files and install them by hand with a tool such as dpkg. You can find these files in the Cedar Solutions APT source. ^[18]

In either case, once the package has been installed, you can proceed to configuration as described in Chapter 5, Configuration.

Note

The Debian package-management tools must generally be run as root. It is safe to install Cedar Backup to a non-standard location and run it as a non-root user. However, to do this, you must install the source distribution instead of the Debian package.

Installing from Source

On platforms other than Debian, Cedar Backup is installed from a Python source distribution. ^[19] You will have to manage dependencies on your own.

Tip

Many UNIX-like distributions provide an automatic or semi-automatic way to install packages like the ones Cedar Backup requires (think RPMs for Mandrake or RedHat, Gentoo's Portage system, the Fink project for Mac OS X, or the BSD ports system). If you are not sure how to install these packages on your system, you might want to check out Appendix B, Dependencies. This appendix provides links to "upstream" source packages, plus as much information as I have been able to gather about packages for non-Debian platforms.

Installing Dependencies

Cedar Backup requires a number of external packages in order to function properly. Before installing Cedar Backup, you must make sure that these dependencies are met.

Cedar Backup is written in Python and requires version 2.3 or greater of the language. Version 2.3 was released on 29 July 2003, so by now most current Linux and BSD distributions should include it. You must install Python on every peer node in a pool (master or client).

Additionally, remote client peer nodes must be running an RSH-compatible server, such as the ssh server, and master nodes must have an RSH-compatible client installed if they need to connect to remote peer machines.
Master machines also require several other system utilities, most having to do with writing and validating CD/DVD media. On master machines, you must make sure that these utilities are available if you want to run the store action:

  * mkisofs
  * eject
  * mount
  * umount
  * volname

Then, you need this utility if you are writing CD media:

  * cdrecord

or these utilities if you are writing DVD media:

  * growisofs

All of these utilities are common and are easy to find for almost any UNIX-like operating system.

Installing the Source Package

Python source packages are fairly easy to install. They are distributed as .tar.gz files which contain Python source code, a manifest and an installation script called setup.py.

Once you have downloaded the source package from the Cedar Solutions website, ^[18] untar it:

   $ zcat CedarBackup2-2.0.0.tar.gz | tar xvf -

This will create a directory called (in this case) CedarBackup2-2.0.0. The version number in the directory will always match the version number in the filename.

If you have root access and want to install the package to the "standard" Python location on your system, then you can install the package in two simple steps:

   $ cd CedarBackup2-2.0.0
   $ python setup.py install

Make sure that you are using Python 2.3 or better to execute setup.py.

You may also wish to run the unit tests before actually installing anything. Run them like so:

   $ python util/test.py

If any unit test reports a failure on your system, please email me the output from the unit test, so I can fix the problem. ^[20] This is particularly important for non-Linux platforms where I do not have a test system available to me.

Some users might want to choose a different install location or change other install parameters.
To get more information about how setup.py works, use the --help option:

   $ python setup.py --help
   $ python setup.py install --help

In any case, once the package has been installed, you can proceed to configuration as described in Chapter 5, Configuration.

--------------

^[16] See "SF Mailing Lists" at http://cedar-backup.sourceforge.net/.

^[17] See "SF Bug Tracking" at http://cedar-backup.sourceforge.net/.

^[18] See http://cedar-solutions.com/debian.html.

^[19] See http://docs.python.org/lib/module-distutils.html.

^[20]

Chapter 4. Command Line Tools

Table of Contents

Overview
The cback command
    Introduction
    Syntax
    Switches
    Actions
The cback-span command
    Introduction
    Syntax
    Switches
    Using cback-span
    Sample run

Overview

Cedar Backup comes with two command-line programs, the cback and cback-span commands. The cback command is the primary command line interface and the only Cedar Backup program that most users will ever need.

Users that have a lot of data to back up (more than will fit on a single CD or DVD) can use the interactive cback-span tool to split their data between multiple discs.

The cback command

Introduction

Cedar Backup's primary command-line interface is the cback command. It controls the entire backup process.
Syntax

The cback command has the following syntax:

   Usage: cback [switches] action(s)

   The following switches are accepted:

     -h, --help         Display this usage/help listing
     -V, --version      Display version information
     -b, --verbose      Print verbose output as well as logging to disk
     -q, --quiet        Run quietly (display no output to the screen)
     -c, --config       Path to config file (default: /etc/cback.conf)
     -f, --full         Perform a full backup, regardless of configuration
     -M, --managed      Include managed clients when executing actions
     -N, --managed-only Include ONLY managed clients when executing actions
     -l, --logfile      Path to logfile (default: /var/log/cback.log)
     -o, --owner        Logfile ownership, user:group (default: root:adm)
     -m, --mode         Octal logfile permissions mode (default: 640)
     -O, --output       Record some sub-command (i.e. cdrecord) output to the log
     -d, --debug        Write debugging information to the log (implies --output)
     -s, --stack        Dump a Python stack trace instead of swallowing exceptions
     -D, --diagnostics  Print runtime diagnostics to the screen and exit

   The following actions may be specified:

     all                Take all normal actions (collect, stage, store, purge)
     collect            Take the collect action
     stage              Take the stage action
     store              Take the store action
     purge              Take the purge action
     rebuild            Rebuild "this week's" disc if possible
     validate           Validate configuration only
     initialize         Initialize media for use with Cedar Backup

   You may also specify extended actions that have been defined in
   configuration.

   You must specify at least one action to take.  More than one of
   the "collect", "stage", "store" or "purge" actions and/or
   extended actions may be specified in any arbitrary order; they
   will be executed in a sensible order.  The "all", "rebuild",
   "validate", and "initialize" actions may not be combined with
   other actions.

Note that the all action only executes the standard four actions. It never executes any of the configured extensions. ^[21]

Switches

-h, --help
    Display usage/help listing.
-V, --version
    Display version information.

-b, --verbose
    Print verbose output to the screen as well as writing to the logfile. When this option is enabled, most information that would normally be written to the logfile will also be written to the screen.

-q, --quiet
    Run quietly (display no output to the screen).

-c, --config
    Specify the path to an alternate configuration file. The default configuration file is /etc/cback.conf.

-f, --full
    Perform a full backup, regardless of configuration. For the collect action, this means that any existing information related to incremental backups will be ignored and rewritten; for the store action, this means that a new disc will be started.

-M, --managed
    Include managed clients when executing actions. If the action being executed is listed as a managed action for a managed client, execute the action on that client after executing the action locally.

-N, --managed-only
    Include only managed clients when executing actions. If the action being executed is listed as a managed action for a managed client, execute the action on that client, but do not execute the action locally.

-l, --logfile
    Specify the path to an alternate logfile. The default logfile is /var/log/cback.log.

-o, --owner
    Specify the ownership of the logfile, in the form user:group. The default ownership is root:adm, to match the Debian standard for most logfiles. This value will only be used when creating a new logfile. If the logfile already exists when the cback command is executed, it will retain its existing ownership and mode. Only user and group names may be used, not numeric uid and gid values.

-m, --mode
    Specify the permissions for the logfile, using the numeric mode as in chmod(1). The default mode is 0640 (-rw-r-----). This value will only be used when creating a new logfile. If the logfile already exists when the cback command is executed, it will retain its existing ownership and mode.

-O, --output
    Record some sub-command output to the logfile.
    When this option is enabled, all output from system commands will be logged. This might be useful for debugging or just for reference. Cedar Backup uses system commands mostly for dealing with the CD/DVD recorder and its media.

-d, --debug
    Write debugging information to the logfile. This option produces a high volume of output, and would generally only be needed when debugging a problem. This option implies the --output option, as well.

-s, --stack
    Dump a Python stack trace instead of swallowing exceptions. This forces Cedar Backup to dump the entire Python stack trace associated with an error, rather than just propagating the last message it received back up to the user interface. Under some circumstances, this is useful information to include along with a bug report.

-D, --diagnostics
    Display runtime diagnostic information and then exit. This diagnostic information is often useful when filing a bug report.

Actions

You can find more information about the various actions in the section called "The Backup Process" (in Chapter 2, Basic Concepts). In general, you may specify any combination of the collect, stage, store or purge actions, and the specified actions will be executed in a sensible order. Or, you can specify one of the all, rebuild, validate, or initialize actions (but these actions may not be combined with other actions).

If you have configured any Cedar Backup extensions, then the actions associated with those extensions may also be specified on the command line. If you specify any other actions along with an extended action, the actions will be executed in a sensible order per configuration. The all action never executes extended actions, however.

The cback-span command

Introduction

Cedar Backup was designed (and is still primarily focused) around weekly backups to a single CD or DVD. Most users who back up more data than fits on a single disc seem to stop their backup process at the stage step, using Cedar Backup as an easy way to collect data.
However, some users have expressed a need to write these large kinds of backups to disc, if not every day, then at least occasionally. The cback-span tool was written to meet those needs. If you have staged more data than fits on a single CD or DVD, you can use cback-span to split that data between multiple discs.

cback-span is not a general-purpose disc-splitting tool. It is a specialized program that requires Cedar Backup configuration to run. All it can do is read Cedar Backup configuration, find any staging directories that have not yet been written to disc, and split the files in those directories between discs.

cback-span accepts many of the same command-line options as cback, but must be run interactively. It cannot be run from cron. This is intentional. It is intended to be a useful tool, not a new part of the backup process (that is the purpose of an extension).

In order to use cback-span, you must configure your backup such that the largest individual backup file can fit on a single disc. The command will not split a single file onto more than one disc. All it can do is split large directories onto multiple discs. Files in those directories will be arbitrarily split up so that space is utilized most efficiently.

Syntax

The cback-span command has the following syntax:

   Usage: cback-span [switches]

   Cedar Backup 'span' tool.

   This Cedar Backup utility spans staged data between multiple discs.
   It is a utility, not an extension, and requires user interaction.
   The following switches are accepted, mostly to set up underlying
   Cedar Backup functionality:

     -h, --help     Display this usage/help listing
     -V, --version  Display version information
     -b, --verbose  Print verbose output as well as logging to disk
     -c, --config   Path to config file (default: /etc/cback.conf)
     -l, --logfile  Path to logfile (default: /var/log/cback.log)
     -o, --owner    Logfile ownership, user:group (default: root:adm)
     -m, --mode     Octal logfile permissions mode (default: 640)
     -O, --output   Record some sub-command (i.e. cdrecord) output to the log
     -d, --debug    Write debugging information to the log (implies --output)
     -s, --stack    Dump a Python stack trace instead of swallowing exceptions

Switches

-h, --help
    Display usage/help listing.

-V, --version
    Display version information.

-b, --verbose
    Print verbose output to the screen as well as writing to the logfile. When this option is enabled, most information that would normally be written to the logfile will also be written to the screen.

-c, --config
    Specify the path to an alternate configuration file. The default configuration file is /etc/cback.conf.

-l, --logfile
    Specify the path to an alternate logfile. The default logfile is /var/log/cback.log.

-o, --owner
    Specify the ownership of the logfile, in the form user:group. The default ownership is root:adm, to match the Debian standard for most logfiles. This value will only be used when creating a new logfile. If the logfile already exists when the cback command is executed, it will retain its existing ownership and mode. Only user and group names may be used, not numeric uid and gid values.

-m, --mode
    Specify the permissions for the logfile, using the numeric mode as in chmod(1). The default mode is 0640 (-rw-r-----). This value will only be used when creating a new logfile. If the logfile already exists when the cback command is executed, it will retain its existing ownership and mode.

-O, --output
    Record some sub-command output to the logfile.
    When this option is enabled, all output from system commands will be logged. This might be useful for debugging or just for reference. Cedar Backup uses system commands mostly for dealing with the CD/DVD recorder and its media.

-d, --debug
    Write debugging information to the logfile. This option produces a high volume of output, and would generally only be needed when debugging a problem. This option implies the --output option, as well.

-s, --stack
    Dump a Python stack trace instead of swallowing exceptions. This forces Cedar Backup to dump the entire Python stack trace associated with an error, rather than just propagating the last message it received back up to the user interface. Under some circumstances, this is useful information to include along with a bug report.

Using cback-span

As discussed above, cback-span is an interactive command. It cannot be run from cron.

You can typically use the default answer for most questions. The only two questions that you may not want the default answer for are the fit algorithm and the cushion percentage.

The cushion percentage is used by cback-span to determine what capacity to shoot for when splitting up your staging directories. A 650 MB disc does not fit fully 650 MB of data. It's usually more like 627 MB of data. The cushion percentage tells cback-span how much overhead to reserve for the filesystem. The default of 4% is usually OK, but if you have problems you may need to increase it slightly.

The fit algorithm tells cback-span how it should determine which items should be placed on each disc. If you don't like the result from one algorithm, you can reject that solution and choose a different algorithm.

The four available fit algorithms are:

worst
    The worst-fit algorithm.

    The worst-fit algorithm proceeds through a sorted list of items (sorted from smallest to largest) until running out of items or meeting capacity exactly.
    If capacity is exceeded, the item that caused capacity to be exceeded is thrown away and the next one is tried. The algorithm effectively includes the maximum number of items possible in its search for optimal capacity utilization. It tends to be somewhat slower than either the best-fit or alternate-fit algorithm, probably because on average it has to look at more items before completing.

best
    The best-fit algorithm.

    The best-fit algorithm proceeds through a sorted list of items (sorted from largest to smallest) until running out of items or meeting capacity exactly. If capacity is exceeded, the item that caused capacity to be exceeded is thrown away and the next one is tried. The algorithm effectively includes the minimum number of items possible in its search for optimal capacity utilization. For large lists of mixed-size items, it's not unusual to see the algorithm achieve 100% capacity utilization by including fewer than 1% of the items. Probably because it often has to look at fewer of the items before completing, it tends to be a little faster than the worst-fit or alternate-fit algorithms.

first
    The first-fit algorithm.

    The first-fit algorithm proceeds through an unsorted list of items until running out of items or meeting capacity exactly. If capacity is exceeded, the item that caused capacity to be exceeded is thrown away and the next one is tried. This algorithm generally performs more poorly than the other algorithms both in terms of capacity utilization and item utilization, but can be as much as an order of magnitude faster on large lists of items because it doesn't require any sorting.

alternate
    A hybrid algorithm that I call alternate-fit.

    This algorithm tries to balance small and large items to achieve better end-of-disk performance. Instead of just working one direction through a list, it alternately works from the start and end of a sorted list (sorted from smallest to largest), throwing away any item which causes capacity to be exceeded.
    The algorithm tends to be slower than the best-fit and first-fit algorithms, and slightly faster than the worst-fit algorithm, probably because of the number of items it considers on average before completing. It often achieves slightly better capacity utilization than the worst-fit algorithm, while including slightly fewer items.

Sample run

Below is a log showing a sample cback-span run.

   ================================================
              Cedar Backup 'span' tool
   ================================================

   This the Cedar Backup span tool.  It is used to split up staging
   data when that staging data does not fit onto a single disc.

   This utility operates using Cedar Backup configuration.  Configuration
   specifies which staging directory to look at and which writer device
   and media type to use.

   Continue? [Y/n]:
   ===

   Cedar Backup store configuration looks like this:

      Source Directory...: /tmp/staging
      Media Type.........: cdrw-74
      Device Type........: cdwriter
      Device Path........: /dev/cdrom
      Device SCSI ID.....: None
      Drive Speed........: None
      Check Data Flag....: True
      No Eject Flag......: False

   Is this OK? [Y/n]:
   ===

   Please wait, indexing the source directory (this may take a while)...
   ===

   The following daily staging directories have not yet been written to disc:

      /tmp/staging/2007/02/07
      /tmp/staging/2007/02/08
      /tmp/staging/2007/02/09
      /tmp/staging/2007/02/10
      /tmp/staging/2007/02/11
      /tmp/staging/2007/02/12
      /tmp/staging/2007/02/13
      /tmp/staging/2007/02/14

   The total size of the data in these directories is 1.00 GB.

   Continue? [Y/n]:
   ===

   Based on configuration, the capacity of your media is 650.00 MB.

   Since estimates are not perfect and there is some uncertainly in
   media capacity calculations, it is good to have a "cushion",
   a percentage of capacity to set aside.  The cushion reduces the
   capacity of your media, so a 1.5% cushion leaves 98.5% remaining.

   What cushion percentage? [4.00]:
   ===

   The real capacity, taking into account the 4.00% cushion, is 627.25 MB.
   It will take at least 2 disc(s) to store your 1.00 GB of data.

   Continue? [Y/n]:
   ===

   Which algorithm do you want to use to span your data across
   multiple discs?

   The following algorithms are available:

      first....: The "first-fit" algorithm
      best.....: The "best-fit" algorithm
      worst....: The "worst-fit" algorithm
      alternate: The "alternate-fit" algorithm

   If you don't like the results you will have a chance to try a
   different one later.

   Which algorithm? [worst]:
   ===

   Please wait, generating file lists (this may take a while)...
   ===

   Using the "worst-fit" algorithm, Cedar Backup can split your data
   into 2 discs.

      Disc 1: 246 files, 615.97 MB, 98.20% utilization
      Disc 2: 8 files, 412.96 MB, 65.84% utilization

   Accept this solution? [Y/n]: n
   ===

   Which algorithm do you want to use to span your data across
   multiple discs?

   The following algorithms are available:

      first....: The "first-fit" algorithm
      best.....: The "best-fit" algorithm
      worst....: The "worst-fit" algorithm
      alternate: The "alternate-fit" algorithm

   If you don't like the results you will have a chance to try a
   different one later.

   Which algorithm? [worst]: alternate
   ===

   Please wait, generating file lists (this may take a while)...
   ===

   Using the "alternate-fit" algorithm, Cedar Backup can split your data
   into 2 discs.

      Disc 1: 73 files, 627.25 MB, 100.00% utilization
      Disc 2: 181 files, 401.68 MB, 64.04% utilization

   Accept this solution? [Y/n]: y
   ===

   Please place the first disc in your backup device.
   Press return when ready.
   ===

   Initializing image...
   Writing image to disc...

--------------

^[21] Some users find this surprising, because extensions are configured with sequence numbers. I did it this way because I felt that running extensions as part of the all action would sometimes result in "surprising" behavior. Better to be definitive than confusing.

Chapter 5.
Configuration Table of Contents Overview Configuration File Format Sample Configuration File Reference Configuration Options Configuration Peers Configuration Collect Configuration Stage Configuration Store Configuration Purge Configuration Extensions Configuration Setting up a Pool of One Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure your writer device. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test your backup. Step 9: Modify the backup cron jobs. Setting up a Client Peer Node Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure the master in your backup pool. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test your backup. Step 9: Modify the backup cron jobs. Setting up a Master Peer Node Step 1: Decide when you will run your backup. Step 2: Make sure email works. Step 3: Configure your writer device. Step 4: Configure your backup user. Step 5: Create your backup tree. Step 6: Create the Cedar Backup configuration file. Step 7: Validate the Cedar Backup configuration file. Step 8: Test connectivity to client machines. Step 9: Test your backup. Step 10: Modify the backup cron jobs. Configuring your Writer Device Device Types Devices identified by by device name Devices identified by SCSI id Linux Notes Finding your Linux CD Writer Mac OS X Notes Optimized Blanking Stategy Overview Configuring Cedar Backup is unfortunately somewhat complicated. The good news is that once you get through the initial configuration process, you'll hardly ever have to change anything. Even better, the most typical changes (i.e. adding and removing directories from a backup) are easy. 
First, familiarize yourself with the concepts in Chapter 2, Basic Concepts. In particular, be sure that you understand the differences between a master and a client. (If you only have one machine, then your machine will act as both a master and a client, and we'll refer to your setup as a pool of one.) Then, install Cedar Backup per the instructions in Chapter 3, Installation. Once everything has been installed, you are ready to begin configuring Cedar Backup. Look over the section called "The cback command" (in Chapter 4, Command Line Tools) to become familiar with the command line interface. Then, look over the section called "Configuration File Format" (below) and create a configuration file for each peer in your backup pool. To start with, create a very simple configuration file, then expand it later. Decide now whether you will store the configuration file in the standard place (/etc/cback.conf) or in some other location. After you have all of the configuration files in place, configure each of your machines, following the instructions in the appropriate section below (for master, client or pool of one). Since the master and client(s) must communicate over the network, you won't be able to fully configure the master without configuring each client and vice versa. The instructions are clear on what needs to be done. Which Platform? Cedar Backup has been designed for use on all UNIX-like systems. However, since it was developed on a Debian GNU/Linux system, and because I am a Debian developer, the packaging is prettier and the setup is somewhat simpler on a Debian system than on a system where you install from source. The configuration instructions below have been generalized so they should work well regardless of what platform you are running (i.e. RedHat, Gentoo, FreeBSD, etc.). If instructions vary for a particular platform, you will find a note related to that platform.
I am always open to adding more platform-specific hints and notes, so write me if you find problems with these instructions. Configuration File Format Cedar Backup is configured through an XML ^[22] configuration file, usually called /etc/cback.conf. The configuration file contains the following sections: reference, options, collect, stage, store, purge and extensions. All configuration files must contain the two general configuration sections, the reference section and the options section. Besides that, administrators need only configure actions they intend to use. For instance, on a client machine, administrators will generally only configure the collect and purge sections, while on a master machine they will have to configure all four action-related sections. ^[23] The extensions section is always optional and can be omitted unless extensions are in use. Note Even though the Mac OS X (darwin) filesystem is not case-sensitive, Cedar Backup configuration is generally case-sensitive on that platform, just like on all other platforms. For instance, even though the files "Ken" and "ken" might be the same on the Mac OS X filesystem, an exclusion in Cedar Backup configuration for "ken" will only match the file if it is actually on the filesystem with a lower-case "k" as its first letter. This won't surprise the typical UNIX user, but might surprise someone who's gotten into the "Mac Mindset". Sample Configuration File Both the Python source distribution and the Debian package come with a sample configuration file. The Debian package includes a stripped config file in /etc/cback.conf and a larger sample in /usr/share/doc/cedar-backup2/examples/cback.conf.sample. This is a sample configuration file similar to the one provided in the source package. Documentation below provides more information about each of the individual configuration sections. Kenneth J.
Pronovici 1.3 Sample tuesday /opt/backup/tmp backup group /usr/bin/scp -B debian local /opt/backup/collect /opt/backup/collect daily targz .cbignore /etc incr /home/root/.profile weekly /opt/backup/staging /opt/backup/staging cdrw-74 cdwriter /dev/cdrw 0,0,0 4 Y Y Y /opt/backup/stage 7 /opt/backup/collect 0 Reference Configuration The reference configuration section contains free-text elements that exist only for reference. The section itself is required, but the individual elements may be left blank if desired. This is an example reference configuration section: Kenneth J. Pronovici Revision 1.3 Sample Yet to be Written Config Tool (tm) The following elements are part of the reference configuration section: author Author of the configuration file. Restrictions: None revision Revision of the configuration file. Restrictions: None description Description of the configuration file. Restrictions: None generator Tool that generated the configuration file, if any. Restrictions: None Options Configuration The options configuration section contains configuration options that are not specific to any one action. This is an example options configuration section: tuesday /opt/backup/tmp backup backup /usr/bin/scp -B /usr/bin/ssh /usr/bin/cback collect, purge cdrecord /opt/local/bin/cdrecord mkisofs /opt/local/bin/mkisofs collect echo "I AM A PRE-ACTION HOOK RELATED TO COLLECT" collect echo "I AM A POST-ACTION HOOK RELATED TO COLLECT" The following elements are part of the options configuration section: starting_day Day that starts the week. Cedar Backup is built around the idea of weekly backups. The starting day of the week is the day that media will be rebuilt from scratch and that incremental backup information will be cleared. Restrictions: Must be a day of the week in English, i.e. monday, tuesday, etc. The validation is case-sensitive. working_dir Working (temporary) directory to use for backups.
This directory is used for writing temporary files, such as tar files or ISO filesystem images as they are being built. It is also used to store day-to-day information about incremental backups. The working directory should contain enough free space to hold temporary tar files (on a client) or to build an ISO filesystem image (on a master). Restrictions: Must be an absolute path backup_user Effective user that backups should run as. This user must exist on the machine which is being configured and should not be root (although that restriction is not enforced). This value is also used as the default remote backup user for remote peers. Restrictions: Must be non-empty backup_group Effective group that backups should run as. This group must exist on the machine which is being configured, and should not be root or some other "powerful" group (although that restriction is not enforced). Restrictions: Must be non-empty rcp_command Default rcp-compatible copy command for staging. The rcp command should be the exact command used for remote copies, including any required options. If you are using scp, you should pass it the -B option, so scp will not ask for any user input (which could hang the backup). A common example is something like /usr/bin/scp -B. This value is used as the default value for all remote peers. Technically, this value is not needed by clients, but we require it for all config files anyway. Restrictions: Must be non-empty rsh_command Default rsh-compatible command to use for remote shells. The rsh command should be the exact command used for remote shells, including any required options. This value is used as the default value for all managed clients. It is optional, because it is only used when executing actions on managed clients. However, each managed client must either be able to read the value from options configuration or must set the value explicitly.
Restrictions: Must be non-empty cback_command Default cback-compatible command to use on managed remote clients. The cback command should be the exact command used for executing cback on a remote managed client, including any required command-line options. Do not list any actions in the command line, and do not include the --full command-line option. This value is used as the default value for all managed clients. It is optional, because it is only used when executing actions on managed clients. However, each managed client must either be able to read the value from options configuration or must set the value explicitly. Note: if this command-line is complicated, it is often better to create a simple shell script on the remote host to encapsulate all of the options. Then, just reference the shell script in configuration. Restrictions: Must be non-empty managed_actions Default set of actions that are managed on remote clients. This is a comma-separated list of actions that the master will manage on behalf of remote clients. Typically, it would include only collect-like actions and purge. This value is used as the default value for all managed clients. It is optional, because it is only used when executing actions on managed clients. However, each managed client must either be able to read the value from options configuration or must set the value explicitly. Restrictions: Must be non-empty. override Command to override with a customized path. This is a subsection which contains a command to override with a customized path. This functionality would be used if root's $PATH does not include a particular required command, or if there is a need to use a version of a command that is different than the one listed on the $PATH. Most users will only use this section when directed to, in order to fix a problem. This section is optional, and can be repeated as many times as necessary.
This subsection must contain the following two fields: command Name of the command to be overridden, i.e. "cdrecord". Restrictions: Must be a non-empty string. abs_path The absolute path where the overridden command can be found. Restrictions: Must be an absolute path. pre_action_hook Hook configuring a command to be executed before an action. This is a subsection which configures a command to be executed immediately before a named action. It provides a way for administrators to associate their own custom functionality with standard Cedar Backup actions or with arbitrary extensions. This section is optional, and can be repeated as many times as necessary. This subsection must contain the following two fields: action Name of the Cedar Backup action that the hook is associated with. The action can be a standard backup action (collect, stage, etc.) or can be an extension action. No validation is done to ensure that the configured action actually exists. Restrictions: Must be a non-empty string. command Name of the command to be executed. This item can either specify the path to a shell script of some sort (the recommended approach) or can include a complete shell command. Note: if you choose to provide a complete shell command rather than the path to a script, you need to be aware of some limitations of Cedar Backup's command-line parser. You cannot use a subshell (via the `command` or $(command) syntaxes) or any shell variable in your command line. Additionally, the command-line parser only recognizes the double-quote character (") to delimit groupings or strings on the command-line. The bottom line is, you are probably best off writing a shell script of some sort for anything more sophisticated than very simple shell commands. Restrictions: Must be a non-empty string. post_action_hook Hook configuring a command to be executed after an action. This is a subsection which configures a command to be executed immediately after a named action.
It provides a way for administrators to associate their own custom functionality with standard Cedar Backup actions or with arbitrary extensions. This section is optional, and can be repeated as many times as necessary. This subsection must contain the following two fields: action Name of the Cedar Backup action that the hook is associated with. The action can be a standard backup action (collect, stage, etc.) or can be an extension action. No validation is done to ensure that the configured action actually exists. Restrictions: Must be a non-empty string. command Name of the command to be executed. This item can either specify the path to a shell script of some sort (the recommended approach) or can include a complete shell command. Note: if you choose to provide a complete shell command rather than the path to a script, you need to be aware of some limitations of Cedar Backup's command-line parser. You cannot use a subshell (via the `command` or $(command) syntaxes) or any shell variable in your command line. Additionally, the command-line parser only recognizes the double-quote character (") to delimit groupings or strings on the command-line. The bottom line is, you are probably best off writing a shell script of some sort for anything more sophisticated than very simple shell commands. Restrictions: Must be a non-empty string. Peers Configuration The peers configuration section contains a list of the peers managed by a master. This section is only required on a master. This is an example peers configuration section: machine1 local /opt/backup/collect machine2 remote backup /opt/backup/collect machine3 remote Y backup /opt/backup/collect /usr/bin/scp /usr/bin/ssh /usr/bin/cback collect, purge The following elements are part of the peers configuration section: peer (local version) Local client peer in a backup pool. This is a subsection which contains information about a specific local client peer managed by a master. 
This section can be repeated as many times as is necessary. At least one remote or local peer must be configured. The local peer subsection must contain the following fields: name Name of the peer, typically a valid hostname. For local peers, this value is only used for reference. However, it is good practice to list the peer's hostname here, for consistency with remote peers. Restrictions: Must be non-empty, and unique among all peers. type Type of this peer. This value identifies the type of the peer. For a local peer, it must always be local. Restrictions: Must be local. collect_dir Collect directory to stage from for this peer. The master will copy all files in this directory into the appropriate staging directory. Since this is a local peer, the directory is assumed to be reachable via normal filesystem operations (i.e. cp). Restrictions: Must be an absolute path. peer (remote version) Remote client peer in a backup pool. This is a subsection which contains information about a specific remote client peer managed by a master. A remote peer is one which can be reached via an rsh-based network call. This section can be repeated as many times as is necessary. At least one remote or local peer must be configured. The remote peer subsection must contain the following fields: name Hostname of the peer. For remote peers, this must be a valid DNS hostname or IP address which can be resolved during an rsh-based network call. Restrictions: Must be non-empty, and unique among all peers. type Type of this peer. This value identifies the type of the peer. For a remote peer, it must always be remote. Restrictions: Must be remote. managed Indicates whether this peer is managed. A managed peer (or managed client) is a peer for which the master manages all of the backup activities via a remote shell. This field is optional. If it doesn't exist, then N will be assumed. Restrictions: Must be a boolean (Y or N). collect_dir Collect directory to stage from for this peer.
The master will copy all files in this directory into the appropriate staging directory. Since this is a remote peer, the directory is assumed to be reachable via rsh-based network operations (i.e. scp or the configured rcp command). Restrictions: Must be an absolute path. backup_user Name of backup user on the remote peer. This username will be used when copying files from the remote peer via an rsh-based network connection. This field is optional. If it doesn't exist, the backup will use the default backup user from the options section. Restrictions: Must be non-empty. rcp_command The rcp-compatible copy command for this peer. The rcp command should be the exact command used for remote copies, including any required options. If you are using scp, you should pass it the -B option, so scp will not ask for any user input (which could hang the backup). A common example is something like /usr/bin/scp -B. This field is optional. If it doesn't exist, the backup will use the default rcp command from the options section. Restrictions: Must be non-empty. rsh_command The rsh-compatible command for this peer. The rsh command should be the exact command used for remote shells, including any required options. This value only applies if the peer is managed. This field is optional. If it doesn't exist, the backup will use the default rsh command from the options section. Restrictions: Must be non-empty cback_command The cback-compatible command for this peer. The cback command should be the exact command used for executing cback on the peer as part of a managed backup. This value must include any required command-line options. Do not list any actions in the command line, and do not include the --full command-line option. This value only applies if the peer is managed. This field is optional. If it doesn't exist, the backup will use the default cback command from the options section.
Note: if this command-line is complicated, it is often better to create a simple shell script on the remote host to encapsulate all of the options. Then, just reference the shell script in configuration. Restrictions: Must be non-empty managed_actions Set of actions that are managed for this peer. This is a comma-separated list of actions that the master will manage on behalf of this peer. Typically, it would include only collect-like actions and purge. This value only applies if the peer is managed. This field is optional. If it doesn't exist, the backup will use the default list of managed actions from the options section. Restrictions: Must be non-empty. Collect Configuration The collect configuration section contains configuration options related to the collect action. This section contains a variable number of elements, including an optional exclusion section and a repeating subsection used to specify which directories and/or files to collect. You can also configure an ignore indicator file, which lets users mark their own directories as not backed up. Using a Link Farm Sometimes, it's not very convenient to list directories one by one in the Cedar Backup configuration file. For instance, when backing up your home directory, you often exclude as many directories as you include. The ignore file mechanism can be of some help, but it still isn't very convenient if there are a lot of directories to ignore (or if new directories pop up all of the time). In this situation, one option is to use a link farm rather than listing all of the directories in configuration. A link farm is a directory that contains nothing but a set of soft links to other files and directories. Normally, Cedar Backup does not follow soft links, but you can override this behavior for individual directories using the link_depth and dereference options (see below). When using a link farm, you still have to deal with each backed-up directory individually, but you don't have to modify configuration.
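As a rough illustration of the link farm idea, the following Python sketch builds a directory that contains only soft links to the directories you want backed up. It is not part of Cedar Backup; the helper name build_link_farm and any paths you pass to it are hypothetical.

```python
import os
import tempfile


def build_link_farm(farm_dir, targets):
    """Create one soft link in farm_dir per target path and return the link paths.

    Hypothetical helper for illustration only; Cedar Backup does not ship it.
    """
    os.makedirs(farm_dir, exist_ok=True)
    links = []
    for target in targets:
        # Name each link after the last component of the path it points at.
        link = os.path.join(farm_dir, os.path.basename(target.rstrip(os.sep)))
        if not os.path.lexists(link):  # leave existing entries untouched
            os.symlink(target, link)
        links.append(link)
    return links
```

A collect directory pointed at such a farm would presumably need a link_depth of at least 1, plus the dereference flag, so that the link targets themselves end up in the backup. Adding a new directory to the backup then means adding one link rather than editing configuration.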
Some users find that this works better for them. In order to actually execute the collect action, you must have configured at least one collect directory or one collect file. However, if you are only including collect configuration for use by an extension, then it's OK to leave out these sections. The validation will take place only when the collect action is executed. This is an example collect configuration section: /opt/backup/collect daily targz .cbignore /etc .*\.conf /home/root/.profile /etc /var/log incr /opt weekly /opt/large backup .*tmp The following elements are part of the collect configuration section: collect_dir Directory to collect files into. On a client, this is the directory which tarfiles for individual collect directories are written into. The master then stages files from this directory into its own staging directory. This field is always required. It must contain enough free space to collect all of the backed-up files on the machine in a compressed form. Restrictions: Must be an absolute path collect_mode Default collect mode. The collect mode describes how frequently a directory is backed up. See the section called "The Collect Action" (in Chapter 2, Basic Concepts) for more information. This value is the collect mode that will be used by default during the collect process. Individual collect directories (below) may override this value. If all individual directories provide their own value, then this default value may be omitted from configuration. Note: if your backup device does not support multisession discs, then you should probably use the daily collect mode to avoid losing data. Restrictions: Must be one of daily, weekly or incr. archive_mode Default archive mode for collect files. The archive mode maps to the way that a backup file is stored.
A value of tar means just a tarfile (file.tar); a value of targz means a gzipped tarfile (file.tar.gz); and a value of tarbz2 means a bzipped tarfile (file.tar.bz2). This value is the archive mode that will be used by default during the collect process. Individual collect directories (below) may override this value. If all individual directories provide their own value, then this default value may be omitted from configuration. Restrictions: Must be one of tar, targz or tarbz2. ignore_file Default ignore file name. The ignore file is an indicator file. If it exists in a given directory, then that directory will be recursively excluded from the backup as if it were explicitly excluded in configuration. The ignore file provides a way for individual users (who might not have access to Cedar Backup configuration) to control which of their own directories get backed up. For instance, users with a ~/tmp directory might not want it backed up. If they create an ignore file in their directory (e.g. ~/tmp/.cbignore), then Cedar Backup will ignore it. This value is the ignore file name that will be used by default during the collect process. Individual collect directories (below) may override this value. If all individual directories provide their own value, then this default value may be omitted from configuration. Restrictions: Must be non-empty exclude List of paths or patterns to exclude from the backup. This is a subsection which contains a set of absolute paths and patterns to be excluded across all configured directories. For a given directory, the set of absolute paths and patterns to exclude is built from this list and any list that exists on the directory itself. Directories cannot override or remove entries that are in this list, however. This section is optional, and if it exists can also be empty. The exclude subsection can contain one or more of each of the following fields: abs_path An absolute path to be recursively excluded from the backup.
If a directory is excluded, then all of its children are also recursively excluded. For instance, a value /var/log/apache would exclude any files within /var/log/apache as well as files within other directories under /var/log/apache. This field can be repeated as many times as is necessary. Restrictions: Must be an absolute path. pattern A pattern to be recursively excluded from the backup. The pattern must be a Python regular expression. ^[24] It is assumed to be bounded at front and back by the beginning and end of the string (i.e. it is treated as if it begins with ^ and ends with $). If the pattern causes a directory to be excluded, then all of the children of that directory are also recursively excluded. For instance, a value .*apache.* might match the /var/log/apache directory. This would exclude any files within /var/log/apache as well as files within other directories under /var/log/apache. This field can be repeated as many times as is necessary. Restrictions: Must be non-empty file A file to be collected. This is a subsection which contains information about a specific file to be collected (backed up). This section can be repeated as many times as is necessary. At least one collect directory or collect file must be configured when the collect action is executed. The collect file subsection contains the following fields: abs_path Absolute path of the file to collect. Restrictions: Must be an absolute path. collect_mode Collect mode for this file. The collect mode describes how frequently a file is backed up. See the section called "The Collect Action" (in Chapter 2, Basic Concepts) for more information. This field is optional. If it doesn't exist, the backup will use the default collect mode. Note: if your backup device does not support multisession discs, then you should probably confine yourself to the daily collect mode, to avoid losing data. Restrictions: Must be one of daily, weekly or incr. archive_mode Archive mode for this file.
The archive mode maps to the way that a backup file is stored. A value of tar means just a tarfile (file.tar); a value of targz means a gzipped tarfile (file.tar.gz); and a value of tarbz2 means a bzipped tarfile (file.tar.bz2). This field is optional. If it doesn't exist, the backup will use the default archive mode. Restrictions: Must be one of tar, targz or tarbz2. dir A directory to be collected. This is a subsection which contains information about a specific directory to be collected (backed up). This section can be repeated as many times as is necessary. At least one collect directory or collect file must be configured when the collect action is executed. The collect directory subsection contains the following fields: abs_path Absolute path of the directory to collect. The path may be either a directory, a soft link to a directory, or a hard link to a directory. All three are treated the same at this level. The contents of the directory will be recursively collected. The backup will contain all of the files in the directory, as well as the contents of all of the subdirectories within the directory, etc. Soft links within the directory are treated as files, i.e. they are copied verbatim (as a link) and their contents are not backed up. Restrictions: Must be an absolute path. collect_mode Collect mode for this directory. The collect mode describes how frequently a directory is backed up. See the section called "The Collect Action" (in Chapter 2, Basic Concepts) for more information. This field is optional. If it doesn't exist, the backup will use the default collect mode. Note: if your backup device does not support multisession discs, then you should probably confine yourself to the daily collect mode, to avoid losing data. Restrictions: Must be one of daily, weekly or incr. archive_mode Archive mode for this directory. The archive mode maps to the way that a backup file is stored.
A value of tar means just a tarfile (file.tar); a value of targz means a gzipped tarfile (file.tar.gz); and a value of tarbz2 means a bzipped tarfile (file.tar.bz2). This field is optional. If it doesn't exist, the backup will use the default archive mode. Restrictions: Must be one of tar, targz or tarbz2. ignore_file Ignore file name for this directory. The ignore file is an indicator file. If it exists in a given directory, then that directory will be recursively excluded from the backup as if it were explicitly excluded in configuration. The ignore file provides a way for individual users (who might not have access to Cedar Backup configuration) to control which of their own directories get backed up. For instance, users with a ~/tmp directory might not want it backed up. If they create an ignore file in their directory (e.g. ~/tmp/.cbignore), then Cedar Backup will ignore it. This field is optional. If it doesn't exist, the backup will use the default ignore file name. Restrictions: Must be non-empty link_depth Link depth value to use for this directory. The link depth is the maximum depth of the tree at which soft links should be followed. So, a depth of 0 does not follow any soft links within the collect directory, a depth of 1 follows only links immediately within the collect directory, a depth of 2 follows the links at the next level down, etc. This field is optional. If it doesn't exist, the backup will assume a value of zero, meaning that soft links within the collect directory will never be followed. Restrictions: If set, must be an integer ≥ 0. dereference Whether to dereference soft links. If this flag is set, links that are being followed will be dereferenced before being added to the backup. The link will be added (as a link), and then the directory or file that the link points at will be added as well. This value only applies to a directory where soft links are being followed (per the link_depth configuration option).
It never applies to a configured collect directory itself, only to other directories within the collect directory. This field is optional. If it doesn't exist, the backup will assume that links should never be dereferenced. Restrictions: Must be a boolean (Y or N). exclude List of paths or patterns to exclude from the backup. This is a subsection which contains a set of paths and patterns to be excluded within this collect directory. This list is combined with the program-wide list to build a complete list for the directory. This section is entirely optional, and if it exists can also be empty. The exclude subsection can contain one or more of each of the following fields: abs_path An absolute path to be recursively excluded from the backup. If a directory is excluded, then all of its children are also recursively excluded. For instance, a value /var/log/apache would exclude any files within /var/log/apache as well as files within other directories under /var/log/apache. This field can be repeated as many times as is necessary. Restrictions: Must be an absolute path. rel_path A relative path to be recursively excluded from the backup. The path is assumed to be relative to the collect directory itself. For instance, if the configured directory is /opt/web, a configured relative path of something/else would exclude the path /opt/web/something/else. If a directory is excluded, then all of its children are also recursively excluded. For instance, a value something/else would exclude any files within something/else as well as files within other directories under something/else. This field can be repeated as many times as is necessary. Restrictions: Must be non-empty. pattern A pattern to be excluded from the backup. The pattern must be a Python regular expression. ^[24] It is assumed to be bounded at front and back by the beginning and end of the string (i.e. it is treated as if it begins with ^ and ends with $).
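The practical effect of this implicit anchoring is easy to miss, so here is a minimal Python sketch of it. It only mimics the rule as described in this section (wrapping the pattern in ^ and $); it is not taken from the Cedar Backup source, and the function name is_excluded is hypothetical.

```python
import re


def is_excluded(path, patterns):
    """Return True if any pattern matches the whole path.

    Illustration only: each pattern is wrapped in ^ and $, following the
    description in this manual, so a bare substring does not match.
    """
    for pattern in patterns:
        if re.compile("^" + pattern + "$").match(path):
            return True
    return False


if __name__ == "__main__":
    print(is_excluded("/var/log/apache", [".*apache.*"]))  # True
    print(is_excluded("/var/log/apache", ["apache"]))      # False
```

In other words, a pattern meant to match anywhere in a path needs explicit leading and trailing .* wildcards.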
If the pattern causes a directory to be excluded, then all of the children of that directory are also recursively excluded. For instance, a value .*apache.* might match the /var/log/apache directory. This would exclude any files within /var/log/apache as well as files within other directories under /var/log/apache.

This field can be repeated as many times as is necessary.

Restrictions: Must be non-empty.

Stage Configuration

The stage configuration section contains configuration options related to the stage action. The section indicates where data from peers can be staged to.

This section can also (optionally) override the list of peers so that not all peers are staged. If you provide any peers in this section, then the list of peers here completely replaces the list of peers in the peers configuration section for the purposes of staging.

This is an example stage configuration section for the simple case where the list of peers is taken from peers configuration:

   /opt/backup/stage

This is an example stage configuration section that overrides the default list of peers:

   /opt/backup/stage
   machine1 local /opt/backup/collect
   machine2 remote backup /opt/backup/collect

The following elements are part of the stage configuration section:

staging_dir

Directory to stage files into.

This is the directory into which the master stages collected data from each of the clients. Within the staging directory, data is staged into date-based directories by peer name. For instance, peer "daystrom" backed up on 19 Feb 2005 would be staged into something like 2005/02/19/daystrom relative to the staging directory itself.

This field is always required. The directory must contain enough free space to stage all of the files collected from all of the various machines in a backup pool. Many administrators set up purging to keep staging directories around for a week or more, which requires even more space.
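The date-based layout can be sketched in Python as follows. This is a minimal illustration rather than Cedar Backup's own code: the function name staged_path is hypothetical, and the YYYY/MM/DD/peer layout is inferred from the "something like 2005/02/19/daystrom" example above.

```python
import datetime
import posixpath


def staged_path(staging_dir, peer_name, day):
    """Build the per-peer, per-day directory under the staging directory."""
    # Layout assumed from the example in the text: YYYY/MM/DD/<peer name>.
    return posixpath.join(staging_dir, day.strftime("%Y/%m/%d"), peer_name)


print(staged_path("/opt/backup/stage", "daystrom", datetime.date(2005, 2, 19)))
# prints /opt/backup/stage/2005/02/19/daystrom
```

Because each day gets its own directory, keeping a week of staged data means holding roughly seven of these trees at once, which is where the extra space requirement comes from.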
Restrictions: Must be an absolute path.

peer (local version)

Local client peer in a backup pool.

This is a subsection which contains information about a specific local client peer to be staged (backed up). A local peer is one whose collect directory can be reached without requiring any rsh-based network calls. It is possible that a remote peer might be staged as a local peer if its collect directory is mounted to the master via NFS, AFS or some other method.

This section can be repeated as many times as is necessary. At least one remote or local peer must be configured.

Remember, if you provide any local or remote peer in staging configuration, the global peer configuration is completely replaced by the staging peer configuration.

The local peer subsection must contain the following fields:

name

Name of the peer, typically a valid hostname.

For local peers, this value is only used for reference. However, it is good practice to list the peer's hostname here, for consistency with remote peers.

Restrictions: Must be non-empty, and unique among all peers.

type

Type of this peer.

This value identifies the type of the peer. For a local peer, it must always be local.

Restrictions: Must be local.

collect_dir

Collect directory to stage from for this peer.

The master will copy all files in this directory into the appropriate staging directory. Since this is a local peer, the directory is assumed to be reachable via normal filesystem operations (i.e. cp).

Restrictions: Must be an absolute path.

peer (remote version)

Remote client peer in a backup pool.

This is a subsection which contains information about a specific remote client peer to be staged (backed up). A remote peer is one whose collect directory can only be reached via an rsh-based network call.

This section can be repeated as many times as is necessary. At least one remote or local peer must be configured.
Remember, if you provide any local or remote peer in staging configuration, the global peer configuration is completely replaced by the staging peer configuration.

The remote peer subsection must contain the following fields:

name

Hostname of the peer.

For remote peers, this must be a valid DNS hostname or IP address which can be resolved during an rsh-based network call.

Restrictions: Must be non-empty, and unique among all peers.

type

Type of this peer.

This value identifies the type of the peer. For a remote peer, it must always be remote.

Restrictions: Must be remote.

collect_dir

Collect directory to stage from for this peer.

The master will copy all files in this directory into the appropriate staging directory. Since this is a remote peer, the directory is assumed to be reachable via rsh-based network operations (i.e. scp or the configured rcp command).

Restrictions: Must be an absolute path.

backup_user

Name of backup user on the remote peer.

This username will be used when copying files from the remote peer via an rsh-based network connection.

This field is optional. If it doesn't exist, the backup will use the default backup user from the options section.

Restrictions: Must be non-empty.

rcp_command

The rcp-compatible copy command for this peer.

The rcp command should be the exact command used for remote copies, including any required options. If you are using scp, you should pass it the -B option, so scp will not ask for any user input (which could hang the backup). A common example is something like /usr/bin/scp -B.

This field is optional. If it doesn't exist, the backup will use the default rcp command from the options section.

Restrictions: Must be non-empty.

Store Configuration

The store configuration section contains configuration options related to the store action. This section contains several optional fields. Most fields control the way media is written using the writer device.
This is an example store configuration section:

   /opt/backup/stage cdrw-74 cdwriter /dev/cdrw 0,0,0 4 Y Y Y N weekly 1.3

The following elements are part of the store configuration section:

source_dir

Directory whose contents should be written to media.

This directory must be a Cedar Backup staging directory, as configured in the staging configuration section. Only certain data from that directory (typically, data from the current day) will be written to disc.

Restrictions: Must be an absolute path.

device_type

Type of the device used to write the media.

This field controls which type of writer device will be used by Cedar Backup. Currently, Cedar Backup supports CD writers (cdwriter) and DVD writers (dvdwriter).

This field is optional. If it doesn't exist, the cdwriter device type is assumed.

Restrictions: If set, must be either cdwriter or dvdwriter.

media_type

Type of the media in the device.

Unless you want to throw away a backup disc every week, you are probably best off using rewritable media.

You must choose a media type that is appropriate for the device type you chose above. For more information on media types, see the section called "Media and Device Types" (in Chapter 2, Basic Concepts).

Restrictions: Must be one of cdr-74, cdrw-74, cdr-80 or cdrw-80 if device type is cdwriter; or one of dvd+r or dvd+rw if device type is dvdwriter.

target_device

Filesystem device name for writer device.

This value is required for both CD writers and DVD writers.

This is the UNIX device name for the writer drive, for instance /dev/scd0 or /dev/cdrw.

In some cases, this device name is used to directly write to media. This is true all of the time for DVD writers, and is true for CD writers when a SCSI id (see below) has not been specified.

Besides this, the device name is also needed in order to do several pre-write checks (such as whether the device might already be mounted) as well as the post-write consistency check, if enabled.

Restrictions: Must be an absolute path.
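The device_type and media_type restrictions above can be summarized in a small Python sketch. This is only an illustration of the documented rules: the names VALID_MEDIA and check_store_media are hypothetical and do not come from Cedar Backup itself.

```python
# Media types permitted for each device type, as listed in the
# media_type restrictions above.
VALID_MEDIA = {
    "cdwriter": {"cdr-74", "cdrw-74", "cdr-80", "cdrw-80"},
    "dvdwriter": {"dvd+r", "dvd+rw"},
}


def check_store_media(media_type, device_type=None):
    """Return the effective device type, or raise ValueError on a mismatch."""
    # device_type is optional; cdwriter is the documented default.
    effective = device_type or "cdwriter"
    if effective not in VALID_MEDIA:
        raise ValueError("unknown device type: %s" % effective)
    if media_type not in VALID_MEDIA[effective]:
        raise ValueError("media type %s is not valid for %s" % (media_type, effective))
    return effective


print(check_store_media("cdrw-74"))              # device type defaults to cdwriter
print(check_store_media("dvd+rw", "dvdwriter"))
```

A mismatched pair, such as dvd+r with cdwriter, raises ValueError in this sketch.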
target_scsi_id

SCSI id for the writer device.

This value is optional for CD writers and is ignored for DVD writers.

If you have configured your CD writer hardware to work through the normal filesystem device path, then you can leave this parameter unset. Cedar Backup will just use the target device (above) when talking to cdrecord.

Otherwise, if you have SCSI CD writer hardware or you have configured your non-SCSI hardware to operate like a SCSI device, then you need to provide Cedar Backup with a SCSI id it can use when talking with cdrecord.

For the purposes of Cedar Backup, a valid SCSI identifier must either be in the standard SCSI identifier form scsibus,target,lun or in the specialized-method form method:scsibus,target,lun.

An example of a standard SCSI identifier is 1,6,2. Today, the two most common examples of the specialized-method form are ATA:scsibus,target,lun and ATAPI:scsibus,target,lun, but you may occasionally see other values (like OLDATAPI in some forks of cdrecord).

See the section called "Configuring your Writer Device" for more information on writer devices and how they are configured.

Restrictions: If set, must be a valid SCSI identifier.

drive_speed

Speed of the drive, i.e. 2 for a 2x device.

This field is optional. If it doesn't exist, the underlying device-related functionality will use the default drive speed.

For DVD writers, it is best to leave this value unset, so growisofs can pick an appropriate speed. For CD writers, since media can be speed-sensitive, it is probably best to set a sensible value based on your specific writer and media.

Restrictions: If set, must be an integer ≥ 1.

check_data

Whether the media should be validated.

This field indicates whether a resulting image on the media should be validated after the write completes, by running a consistency check against it. If this check is enabled, the contents of the staging directory are directly compared to the media, and an error is reported if there is a mismatch.
Practice shows that some drives can encounter an error when writing a multisession disc, but not report any problems. This consistency check allows us to catch the problem. By default, the consistency check is disabled, but most users should choose to enable it unless they have a good reason not to.

This field is optional. If it doesn't exist, then N will be assumed.

Restrictions: Must be a boolean (Y or N).

check_media

Whether the media should be checked before writing to it.

By default, Cedar Backup does not check its media before writing to it. It will write to any media in the backup device. If you set this flag to Y, Cedar Backup will make sure that the media has been initialized before writing to it. (Rewritable media is initialized using the initialize action.)

If the configured media is not rewritable (like CD-R), then this behavior is modified slightly. For this kind of media, the check passes either if the media has been initialized or if the media appears unused.

This field is optional. If it doesn't exist, then N will be assumed.

Restrictions: Must be a boolean (Y or N).

warn_midnite

Whether to generate warnings for crossing midnight.

This field indicates whether warnings should be generated if the store operation has to cross a midnight boundary in order to find data to write to disc. For instance, a warning would be generated if valid store data was only found in the day before or day after the current day.

Configuration for some users is such that the store operation will always cross a midnight boundary, so they will not care about this warning. Other users will expect to never cross a boundary, and want to be notified that something "strange" might have happened.

This field is optional. If it doesn't exist, then N will be assumed.

Restrictions: Must be a boolean (Y or N).

no_eject

Indicates that the writer device should not be ejected.

Under some circumstances, Cedar Backup ejects (opens and closes) the writer device.
This is done because some writer devices need to re-load the media before noticing a media state change (like a new session).

For most writer devices this is safe, because they have a tray that can be opened and closed. If your writer device does not have a tray and Cedar Backup does not properly detect this, then set this flag. Cedar Backup will then never issue an eject command to your writer.

Note: this could cause problems with your backup. For instance, with many writers, the check data step may fail if the media is not reloaded first. If this happens to you, you may need to get a different writer device.

This field is optional. If it doesn't exist, then N will be assumed.

Restrictions: Must be a boolean (Y or N).

blank_behavior

Optimized blanking strategy.

For more information about Cedar Backup's optimized blanking strategy, see the section called "Optimized Blanking Strategy".

This entire configuration section is optional. However, if you choose to provide it, you must configure both a blanking mode and a blanking factor.

blank_mode

Blanking mode.

Restrictions: Must be one of "daily" or "weekly".

blank_factor

Blanking factor.

Restrictions: Must be a floating point number ≥ 0.

Purge Configuration

The purge configuration section contains configuration options related to the purge action. This section contains a set of directories to be purged, along with information about the schedule at which they should be purged.

Typically, Cedar Backup should be configured to purge collect directories daily (retain days of 0).

If you are tight on space, staging directories can also be purged daily. However, if you have space to spare, you should consider purging about once per week. That way, if your backup media is damaged, you will be able to recreate the week's backup using the rebuild action.

You should also purge the working directory periodically, once every few weeks or once per month.
This way, if any unneeded files are left around, perhaps because a backup was interrupted or because configuration changed, they will eventually be removed. The working directory should not be purged any more frequently than once per week, otherwise you will risk destroying data used for incremental backups.

This is an example purge configuration section:

   /opt/backup/stage 7
   /opt/backup/collect 0

The following elements are part of the purge configuration section:

dir

A directory to purge within.

This is a subsection which contains information about a specific directory to purge within.

This section can be repeated as many times as is necessary. At least one purge directory must be configured.

The purge directory subsection contains the following fields:

abs_path

Absolute path of the directory to purge within.

The contents of the directory will be purged based on age. The purge will remove any files that were last modified more than "retain days" days ago. Empty directories will also eventually be removed. The purge directory itself will never be removed.

The path may be either a directory, a soft link to a directory, or a hard link to a directory. Soft links within the directory (if any) are treated as files.

Restrictions: Must be an absolute path.

retain_days

Number of days to retain old files.

Once it has been more than this many days since a file was last modified, it is a candidate for removal.

Restrictions: Must be an integer ≥ 0.

Extensions Configuration

The extensions configuration section is used to configure third-party extensions to Cedar Backup. If you don't intend to use any extensions, or don't know what extensions are, then you can safely leave this section out of your configuration file. It is optional.

Extensions configuration is used to specify "extended actions" implemented by code external to Cedar Backup. An administrator can use this section to map command-line Cedar Backup actions to third-party extension functions.
Each extended action has a name, which is mapped to a Python function within a particular module. Each action also has an index associated with it. This index is used to properly order execution when more than one action is specified on the command line. The standard actions have predefined indexes, and extended actions are interleaved into the normal order of execution using those indexes. The collect action has index 100, the stage action has index 200, the store action has index 300 and the purge action has index 400.

Warning

Extended actions should always be configured to run before the standard action they are associated with. This is because of the way indicator files are used in Cedar Backup. For instance, the staging process considers the collect action to be complete for a peer if the file cback.collect can be found in that peer's collect directory.

If you were to run the standard collect action before your other collect-like actions, the indicator file would be written after the collect action completes but before all of the other actions even run. Because of this, there's a chance the stage process might back up the collect directory before the entire set of collect-like actions has completed, and you would get no warning about this in your email!

So, imagine that a third-party developer provided a Cedar Backup extension to back up a certain kind of database repository, and you wanted to map that extension to the "database" command-line action. You have been told that this function is called "foo.bar()". You think of this backup as a "collect" kind of action, so you want it to be performed immediately before the collect action.

To configure this extension, you would list an action with a name "database", a module "foo", a function name "bar" and an index of "99".
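The index-based ordering can be sketched in Python as follows. This is an illustration only, not Cedar Backup's implementation. The XML element names are assumed to mirror the field names in this section (action, name, module, function, index), the extensions wrapper element is also an assumption, and the helper names load_extended_indexes and execution_order are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment: element names are assumed to match
# the field names documented in this section.
SAMPLE = """
<extensions>
   <action>
      <name>database</name>
      <module>foo</module>
      <function>bar</function>
      <index>99</index>
   </action>
</extensions>
"""

# Predefined indexes of the standard actions, as stated in the text.
STANDARD_INDEXES = {"collect": 100, "stage": 200, "store": 300, "purge": 400}


def load_extended_indexes(xml_text):
    """Map each extended action name to its configured index."""
    root = ET.fromstring(xml_text)
    return {
        action.findtext("name"): int(action.findtext("index"))
        for action in root.findall("action")
    }


def execution_order(requested, extended_indexes):
    """Sort the requested actions by index, interleaving extended actions."""
    indexes = dict(STANDARD_INDEXES)
    indexes.update(extended_indexes)
    return sorted(requested, key=lambda name: indexes[name])


extended = load_extended_indexes(SAMPLE)
print(execution_order(["purge", "database", "collect", "stage"], extended))
# prints ['database', 'collect', 'stage', 'purge']
```

With index 99, the database action sorts immediately ahead of collect (100), so the collect indicator file is written only after the collect-like extension has run.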
This is how the hypothetical action would be configured:

   database foo bar 99

The following elements are part of the extensions configuration section:

action

This is a subsection that contains configuration related to a single extended action.

This section can be repeated as many times as is necessary.

The action subsection contains the following fields:

name

Name of the extended action.

Restrictions: Must be a non-empty string consisting of only lower-case letters and digits.

module

Name of the Python module associated with the extension function.

Restrictions: Must be a non-empty string and a valid Python identifier.

function

Name of the Python extension function within the module.

Restrictions: Must be a non-empty string and a valid Python identifier.

index

Index of action, for execution ordering.

Restrictions: Must be an integer ≥ 0.

Setting up a Pool of One

Cedar Backup has been designed primarily for situations where there is a single master and a set of other clients that the master interacts with. However, it will just as easily work for a single machine (a backup pool of one).

Once you complete all of these configuration steps, your backups will run as scheduled out of cron. Any errors that occur will be reported in daily emails to your root user (or the user that receives root's email). If you don't receive any emails, then you know your backup worked.

Note: all of these configuration steps should be run as the root user, unless otherwise indicated.

Tip

This setup procedure discusses how to set up Cedar Backup in the "normal case" for a pool of one. If you would like to modify the way Cedar Backup works (for instance, by ignoring the store stage and just letting your backup sit in a staging directory), you can do that. You'll just have to modify the procedure below based on information in the remainder of the manual.

Step 1: Decide when you will run your backup.

There are four parts to a Cedar Backup run: collect, stage, store and purge.
The usual way of setting off these steps is through a set of cron jobs. Although you won't create your cron jobs just yet, you should decide now when you will run your backup so you are prepared for later.

Backing up large directories and creating ISO filesystem images can be intensive operations, and could slow your computer down significantly. Choose a backup time that will not interfere with normal use of your computer. Usually, you will want the backup to occur every day, but it is possible to configure cron to execute the backup only one day per week, three days per week, etc.

Warning

Because of the way Cedar Backup works, you must ensure that your backup always runs on the first day of your configured week. This is because Cedar Backup will only clear incremental backup information and re-initialize your media when running on the first day of the week. If you skip running Cedar Backup on the first day of the week, your backups will likely be "confused" until the next week begins, or until you re-run the backup using the --full flag.

Step 2: Make sure email work