From 3b2675635c3fcadbc55a5338b1279fe26d352452 Mon Sep 17 00:00:00 2001
From: Aaron Schulz
Date: Thu, 18 Oct 2012 14:03:15 -0700
Subject: [PATCH] [FileBackend] Created README file for the file backend classes.

This patch adds some documentation for the MediaWiki FileBackend system.

Change-Id: I3bf8690f97be783c056c3daf39ff200766e7c8e0
---
 includes/filebackend/README | 198 ++++++++++++++++++++++++++++++++++++
 1 file changed, 198 insertions(+)
 create mode 100644 includes/filebackend/README

diff --git a/includes/filebackend/README b/includes/filebackend/README
new file mode 100644
index 0000000000..b6ced005a2
--- /dev/null
+++ b/includes/filebackend/README
@@ -0,0 +1,198 @@
+/*!
+\ingroup FileBackend
+\page file_backend_design File backend design
+
+Some notes on the FileBackend architecture.
+
+\section intro Introduction
+
+To abstract away the differences among different types of storage media,
+MediaWiki provides an interface known as FileBackend. Any MediaWiki
+interaction with stored files should thus use a FileBackend object.
+
+Different types of backing storage media are supported (ranging from local
+filesystems to distributed object stores). The types include:
+
+* FSFileBackend (used for mounted filesystems)
+* SwiftFileBackend (used for Swift or Ceph Rados+RGW object stores)
+* FileBackendMultiWrite (useful for transitioning from one backend to another)
+
+Configuration documentation for each type of backend can be found in the inline
+documentation of its __construct() method.
+
+
+\section setup Setup
+
+File backends are registered in LocalSettings.php via the global variable
+$wgFileBackends. To access one of the defined backends, one would use
+FileBackendGroup::singleton()->get( ) which returns a FileBackend object
+handle. Such handles are reused for any subsequent get() call (singleton
+paradigm). FileBackend objects cache the results of requests such as file stats
+and SHA1 hashes, as well as TCP connection handles.
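The handle reuse described above can be pictured with a small sketch. It is written in Python purely for illustration and is not MediaWiki code: `BackendRegistry`, `BackendHandle`, the `"local-fs"` name and the config keys are all hypothetical stand-ins. The point is only that a second lookup returns the same object, so anything the handle has cached is still available.

```python
from typing import Dict


class BackendHandle:
    """Hypothetical stand-in for a backend handle that caches stat results."""

    def __init__(self, name: str, config: dict) -> None:
        self.name = name
        self.config = config
        self.stat_cache: Dict[str, dict] = {}


class BackendRegistry:
    """Hypothetical registry: builds each handle lazily, then reuses it."""

    def __init__(self, configs: Dict[str, dict]) -> None:
        self._configs = configs
        self._handles: Dict[str, BackendHandle] = {}

    def get(self, name: str) -> BackendHandle:
        if name not in self._configs:
            raise KeyError(f"No backend registered under the name '{name}'")
        if name not in self._handles:
            # First request for this name: construct the handle once.
            self._handles[name] = BackendHandle(name, self._configs[name])
        return self._handles[name]


registry = BackendRegistry({"local-fs": {"kind": "filesystem"}})
first = registry.get("local-fs")
first.stat_cache["a/b/file1"] = {"size": 42}
second = registry.get("local-fs")  # same object, cache still populated
```

Because `first` and `second` are the same object, a cache filled through one is visible through the other. That is the practical benefit of sharing one handle per backend.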
+
+\par Note:
+Some backends may require additional PHP extensions to be enabled or may rely on a
+MediaWiki extension. This is often the case when a FileBackend subclass makes use of an
+upstream client API for communicating with the backing store.
+
+
+\section fileoperations File operations
+
+The MediaWiki FileBackend API supports various operations on either files or
+directories. See FileBackend.php for full documentation of each function.
+
+
+\subsection reading Reading
+
+The following operations are supported for reading from a backend:
+
+On files:
+* stat a file for basic information (timestamp, size)
+* read a file into a string or several files into a map of path names to strings
+* download a file or set of files to temporary files (on a mounted file system)
+* get the SHA1 hash of a file
+* get various properties of a file (stat information, content time, mime information, ...)
+
+On directories:
+* get a list of files directly under a directory
+* get a recursive list of files under a directory
+* get a list of directories directly under a directory
+* get a recursive list of directories under a directory
+
+\par Note:
+Backend handles should return directory listings as iterators, although in some cases
+they may just be simple arrays (which can still be iterated over). Iterators allow callers to
+traverse a large number of file listings without consuming excessive RAM in the process. Either the
+memory consumed is flatly bounded (if the iterator does paging) or it is proportional to the depth
+of the portion of the directory tree being traversed (if the iterator works via recursion).
+
+
+\subsection writing Writing
+
+The following operations are supported for writing or changing data in the backend:
+
+On files:
+* store (copying a mounted filesystem file into storage)
+* create (creating a file within storage from a string)
+* copy (within storage)
+* move (within storage)
+* delete (within storage)
+* lock/unlock (lock or unlock a file in storage)
+
+The following operations are supported for writing directories in the backend:
+* prepare (create parent container and directories for a path)
+* secure (try to lock down access to a container)
+* publish (try to reverse the effects of secure)
+* clean (remove empty containers or directories)
+
+
+\subsection invokingoperation Invoking an operation
+
+Generally, callers should use doOperations() or doQuickOperations() when doing
+batches of changes, rather than making a series of single-operation calls. This
+makes the system tolerate high latency much better by pipelining operations
+when possible.
+
+doOperations() should be used for working on important original data, i.e. when
+consistency is important. It will only pipeline operations that do not
+depend on each other. It is best if the operations that do not depend on each
+other occur in consecutive groups. This function can also log file changes to
+a journal (see FileJournal), which can be used to sync two backend instances.
+One might use this function for user uploads of files, for example.
+
+doQuickOperations() is more geared toward ephemeral items that can be easily
+regenerated from original data. It will always pipeline without checking for
+dependencies within the operation batch. One might use this function for
+creating and purging generated thumbnails of original files, for example.
+
+
+\section consistency Consistency
+
+Not all backing stores are sequentially consistent by default.
+Various FileBackend functions
+offer a "latest" option that can be passed in to ensure (or try to ensure) that the latest
+version of the file is read. Some backing stores are consistent by default, but callers should
+always assume that without this option, stale data may be read. Stale reads do occur in practice
+on stores that are only eventually consistent.
+
+Note that file listing functions have no "latest" flag, and thus some systems may return stale
+data. Thus callers should avoid assuming that listings contain changes made by the current client
+or any other client from a very short time ago. For example, creating a file under a directory
+and then immediately doing a file listing operation on that directory may result in a listing
+that does not include that file.
+
+
+\section locking Locking
+
+Locking is effective if and only if a proper lock manager is registered and is
+actually being used by the backend. Lock managers can be registered in LocalSettings.php
+using the $wgLockManagers global configuration variable.
+
+For object stores, locking is not generally useful for avoiding partially
+written or read objects, since most stores use Multi-Version Concurrency
+Control (MVCC) to avoid this. However, locking can be important when:
+* One or more operations must be done without objects changing in the meantime.
+* A file read is used to determine a file write or DB change.
+  For example, doOperations() first checks that there will be no "file already exists"
+  or "file does not exist" type errors before attempting a given operation batch. This works
+  by doing a stat on the files first, and is only safe if the files are locked in the meantime.
+
+When locking, callers should also use the latest available file data for reads.
+Also, one should always lock the file *before* reading it, not after.
+If stale data is used
+to determine a write, there will be some data corruption, even once reads of the original file
+finally start returning the updated data without using the "latest" option (eventual consistency).
+
+Since acquiring locks can fail, and lock managers can be non-blocking, callers should:
+* Acquire all required locks up front
+* Be prepared for the case where locks fail to be acquired
+* Possibly retry acquiring certain locks
+
+MVCC is also a useful pattern to use on top of the backend interface, because operations
+are not atomic, even with doOperations(), so doing complex batch file changes or changing files
+and updating a database row can result in partially written "transactions". One should avoid
+changing files once they have been stored, except perhaps with ephemeral data that are tolerant
+of some inconsistency.
+
+Callers can use their own locking (e.g. SELECT FOR UPDATE) if it is more convenient, but note
+that all callers that change any of the files should then go through functions that acquire these
+locks. For example, if a caller just directly uses the file backend store() function, it will
+ignore any custom "FOR UPDATE" locks, which can cause problems.
+
+\section objectstore Object stores
+
+Support for object stores (like Amazon S3/Swift) drives much of the API and design
+decisions of FileBackend, but using any POSIX-compliant file system works fine.
+The system essentially stores "files" in "containers". For a mounted file
+system as a backing store, these will just be "files" under "directories". For
+an object store as a backing store, the "files" will be "objects" stored in
+"containers".
+
+
+\section file_obj_diffs File and Object store differences
+
+An advantage of object stores is the reduced number of round trips. This is
+achieved by avoiding the need to create each parent directory before placing a
+file somewhere. On file systems, this cost grows the deeper the directory hierarchy is.
+With both
+object stores and file systems, using "/" in filenames allows for the
+intuitive use of directory functions. For example, creating a file in Swift
+called "container/a/b/file1" will mean that:
+- a "directory listing" of "container/a" will contain "b",
+- and a "file listing" of "container/a/b" will contain "file1"
+
+This means that switching from an object store to a file system and vice versa
+using the FileBackend interface will generally be harmless. You must be aware of
+some caveats though:
+
+* In a filesystem, you cannot have a file and a directory at the same path,
+  whereas this is possible in an object store.
+* Some file systems have file name length restrictions or overall path length
+  restrictions that others do not. The same goes for object stores, which might
+  have a maximum object length or a limitation regarding the number of files
+  under a container or volume.
+* Latency varies among systems; certain access patterns may not be tolerable for
+  certain backends but may hold up for others. Some backend subclasses use
+  MediaWiki's object caching for serving stat requests, which can greatly
+  reduce latency. Making sure that the backend has pipelining (see the
+  "parallelize" and "concurrency" settings) enabled can also combat latency in
+  batch operation scenarios.
+
+*/
--
2.20.1