How the File Upload Process Works
The File Migration service maintains in memory information regarding directories,
subdirectories, and files within each directory tree to monitor. When started, the service
first looks for changed and new files in the configured directories. To detect changes to
existing files, the service looks at the extended attribute
user.s3uploadtime
, which is set on each file and contains the modification
time (in milliseconds) when the file was last uploaded to the AWS S3 bucket. It then places
the new and changed files in a queue for upload, sorted by the oldest modification timestamp.
The service then uploads multiple files in parallel and provides logs as files upload, fail to
upload, or complete uploading.
.tmp
or all
files in /tmp
directory can be set up to be ignored. Additionally, only files
are considered for upload; empty directories, symbolic links, and tables or streams are
ignored.Scanning for New and Changed Files
The service scans for new and changed files in the monitored directories using a full scan or a lite scan.
During a full scan, the service examines each file in a directory to determine if it has changed since the last time the directory was examined. If the file contains changes, then the file is examined more carefully.
If the file was uploaded, the service retrieves the modification time of the file, which was set on the file at the time the file was uploaded. If the file was not uploaded or if the file was uploaded but the modification time changed since the last upload, the service places the file in the queue for upload.
A full scan also discovers new directories and adds them to the list of directories to be monitored.
Queueing Files for Upload
Files are placed in the queue with a configurable delay from the last known modification time of the file to avoid uploading a file that is still changing.
For example, with the default of 15 seconds, a file that was modified 5 seconds ago would remain on the queue for another 10 seconds, whereas a file modified 5 minutes ago would be immediately considered for upload. This allows files to be uploaded roughly in order by last modification time.
The service places (by default, 10) files in the queue for parallel upload. The scanning service is paused if the pending queue exceeds 10 times the number of files allowed for parallel update to avoid too many pending files.
Uploading Files to S3
Uploads are in one of three states: waiting to be uploaded, uploading, finished
with uploading. By default, the service performs a check on the status of the files
being uploaded every 3 seconds and detects uploads that are approaching completion.
The service then sets the extended attribute (user.s3uploadtime
)
to the file on file system to indicate that the file was uploaded.
If an error occurs during upload, the file is requeued for another try after a brief wait, which is 5 minutes by default. If the file could not be uploaded after a specified number of attempts, the file is dequeued to allow other files to upload. Note that the file is rediscovered, if it is still there, during the next full scan and queued for upload.
/tmp/foo
after upload becomes the key
tmp/foo
on AWS S3 bucket.Setting Extended Attributes
When the upload completes, the service uses an extended attribute,
user.s3uploadtime
, to set the timestamp on the file to when the
file was uploaded. This extended attribute, which contains the last upload
timestamp, is used by the service to determine if a file has changed since the
file was last uploaded.
Purging the Files and Directories
The service also optionally deletes files after uploading and deletes empty directories. You can configure file deletion after upload in hours or fractions of an hour and configure empty directory deletion in minutes. Both can be set to zero, which disables the deletion function. Both functions are disabled by default.
Deletion of files and directories happen only during a full scan:
If the file has been uploaded in its present form and if
it matches or exceeds the interval configured for file deletion, the file is
immediately deleted. If it does not match the configured interval for file
deletion, a message is printed (to the log file, filemigrate.log
) as
to when the file will be deleted if it will be deleted fairly soon.
Files that were never uploaded (such as the files that were ignored) will not be deleted by the service.
If the directory is empty and the modification timestamp matches or exceeds the configured interval of time for directory deletion, the directory is deleted.
The act of deleting a file or directory from a parent directory causes the modification timestamp of the parent directory to change. Therefore, an idle or empty (parent) directory is deleted only after the configured interval of time since the deletion of its children (file or subdirectory).
The recursive deletion of directories has a slow bottom up behavior: first empty leaf directories are deleted and then the parents are deleted during a future full scan after the interval of time has elapsed.
If a directory contains files that were never uploaded (ignored), the directory will not be deleted.