btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages. It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.
Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too. When this
happens, we drop some of the important blocks written by mkfs.
This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread. This way we don't have to care
about the blocksize of the device.
This also drops the invalidate_bdev() call. It wasn't very polite to
invalidate during the scan either. mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Even though nlmclnt_reclaim() is only one call into the stack frame,
928 bytes on the stack seems like a lot. Recode to dynamically
allocate the request structure once from within the reclaimer task,
then pass this pointer into nlmclnt_reclaim() for reuse on
subsequent calls.
smatch analysis:
fs/lockd/clntproc.c:620 nlmclnt_reclaim() warn: 'reqst' puts
928 bytes on stack
Also remove redundant assignment of 0 after memset.
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Currently, NFSd is ready to operate in network namespace based containers.
So let's drop check for "init_net" and make it able to fly.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This tracker uses khelper kthread to execute binaries.
Execution itself is done from kthread context - i.e. global root is used.
This is not suitable for containers with own root.
So, disable this tracker for a while.
Note: one of possible solutions can be pass "init" callback to khelper, which
will swap root to desired one.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Functuon "exports_open" is used for both "/proc/fs/nfs/exports" and
"/proc/fs/nfsd/exports" files.
Now NFSd filesystem is containerised, so proper net can be taken from
superblock for "/proc/fs/nfsd/exports" reader.
But for "/proc/fs/nfsd/exports" only current->nsproxy->net_ns can be used.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This patch makes NFSD file system superblock to be created per net.
This makes possible to get proper network namespace from superblock instead of
using hard-coded "init_net".
Note: NFSd fs super-block holds network namespace. This garantees, that
network namespace won't disappear from underneath of it.
This, obviously, means, that in case of kill of a container's "init" (which is not a mount
namespace, but network namespace creator) netowrk namespace won't be
destroyed.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Passing this pointer is redundant since it's stored on cache_detail structure,
which is also passed to sunrpc_cache_pipe_upcall () function.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
For most of SUNRPC caches (except NFS DNS cache) cache_detail->cache_upcall is
redundant since all that it's implementations are doing is calling
sunrpc_cache_pipe_upcall() with proper function address argument.
Cache request function address is now stored on cache_detail structure and
thus all the code can be simplified.
Now, for those cache details, which doesn't have cache_upcall callback (the
only one, which still has is nfs_dns_resolve_template)
sunrpc_cache_pipe_upcall will be called instead.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This callback will allow to simplify upcalls in further patches in this
series.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This is a cleanup patch.
Such helpers like nfs_cache_init() and nfs_cache_destroy() are redundant,
because they are just a wrappers around sunrpc_init_cache_detail() and
sunrpc_destroy_cache_detail() respectively.
So let's remove them completely and move corresponding logic to
nfs_cache_register_net() and nfs_cache_unregister_net() respectively (since
they are called together anyway).
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This cache was the first containerized and doesn't use net-aware cache
creation and destruction helpers.
This is a cleanup patch which just makes code looks clearer and reduce amount
of lines of code.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use ERR_PTR()/IS_ERR() abstraction instead of passing in a separate
pointer to an integer for the error code, as a code cleanup.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The code to read in directory blocks and verify their metadata
checksums was replicated in ten different places across
fs/ext4/namei.c, and the code was buggy in subtle ways in a number of
those replicated sites. In some cases, ext4_error() was called with a
training newline. In others, in particularly in empty_dir(), it was
possible to call ext4_dirent_csum_verify() on an index block, which
would trigger false warnings requesting the system adminsitrator to
run e2fsck.
By refactoring the code, we make the code more readable, as well as
shrinking the compiled object file by over 700 bytes and 50 lines of
code.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we are converting local data to an extent format as a result of
adding an attribute, the type of data contained in the local fork
determines the behaviour that needs to occur.
xfs_bmap_add_attrfork_local() already handles the directory data
case specially by using S_ISDIR() and calling out to
xfs_dir2_sf_to_block(), but with verifiers we now need to handle
each different type of metadata specially and different metadata
formats require different verifiers (and eventually block header
initialisation).
There is only a single place that we add and attribute fork to
the inode, but that is in the attribute code and it knows nothing
about the specific contents of the data fork. It is only the case of
local data that is the issue here, so adding code to hadnle this
case in the attribute specific code is wrong. Hence we are really
stuck trying to detect the data fork contents in
xfs_bmap_add_attrfork_local() and performing the correct callout
there.
Luckily the current cases can be determined by S_IS* macros, and we
can push the work off to data specific callouts, but each of those
callouts does a lot of work in common with
xfs_bmap_local_to_extents(). The only reason that this fails for
symlinks right now is is that xfs_bmap_local_to_extents() assumes
the data fork contains extent data, and so attaches a a bmap extent
data verifier to the buffer and simply copies the data fork
information straight into it.
To fix this, allow us to pass a "formatting" callback into
xfs_bmap_local_to_extents() which is responsible for setting the
buffer type, initialising it and copying the data fork contents over
to the new buffer. This allows callers to specify how they want to
format the new buffer (which is necessary for the upcoming CRC
enabled metadata blocks) and hence make xfs_bmap_local_to_extents()
useful for any type of data fork content.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The trylock log force invoked via xfs_buf_item_push() can attempt
to acquire xa_lock, thus leading to a recursion bug when called
with xa_lock held.
This log force was originally added to xfs_buf_trylock() to address
xfsaild stalls due to pinned and stale buffers. Since the addition
of this behavior, the log item pushing code had been reworked to
detect and track pinned items to inform xfsaild to issue a log
force itself when necessary. As such, the log force on trylock
failure is redundant and safe to remove.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The buffer pinned check and trylock sequence in xfs_buf_item_push()
can race with an active transaction on marking the buffer pinned.
This can result in the buffer becoming pinned and stale after the
initial check and the trylock failure, but before the check in
xfs_buf_trylock() that issues a log force. If the log force is
issued from this context, a spinlock recursion occurs on xa_lock.
Prepare xfs_buf_item_push() to handle the race by detecting a
pinned buffer after the trylock failure so xfsaild issues a log
force from a safe context. This, along with various previous fixes,
renders the log force in xfs_buf_trylock() redundant.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Speculative preallocation based on the current file size works well
for contiguous files, but is sub-optimal for sparse files where the
EOF preallocation can fill holes and result in large amounts of
zeros being written when it is not necessary.
The algorithm is modified to prevent EOF speculative preallocation
from triggering larger allocations on IO patterns of
truncate--to-zero-seek-write-seek-write-.... which results in
non-sparse files for large files. This, unfortunately, is the way cp
now behaves when copying sparse files and so needs to be fixed.
What this code does is that it looks at the existing extent adjacent
to the current EOF and if it determines that it is a hole we disable
speculative preallocation altogether. To avoid the next write from
doing a large prealloc, it takes the size of subsequent
preallocations from the current size of the existing EOF extent.
IOWs, if you leave a hole in the file, it resets preallocation
behaviour to the same as if it was a zero size file.
Example new behaviour:
$ xfs_io -f -c "pwrite 0 31m" \
-c "pwrite 33m 1m" \
-c "pwrite 128m 1m" \
-c "fiemap -v" /mnt/scratch/blah
wrote 32505856/32505856 bytes at offset 0
31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec)
wrote 1048576/1048576 bytes at offset 34603008
1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec)
wrote 1048576/1048576 bytes at offset 134217728
1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec)
/mnt/scratch/blah:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..65535]: 96..65631 65536 0x0
1: [65536..67583]: hole 2048
2: [67584..69631]: 67680..69727 2048 0x0
3: [69632..262143]: hole 192512
4: [262144..264191]: 262240..264287 2048 0x1
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Print some additional debugging context to hopefully help to debug a
warning which is getting triggered by xfstests #74.
Also remove extraneous newlines from when printk's were converted to
ext4_warning() and ext4_msg().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Some messages printed related to a WARN_ON(1) were printed using
KERN_NOTICE. Use KERN_WARNING or ext4_warning() instead so that
context related to the WARN_ON() is printed at the same printk warning
level (and log files, etc.)
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The current code in pnfs_destroy_all_layouts() assumes that removing
the layout from the server->layouts list is sufficient to make it
invisible to other processes. This ignores the fact that most
users access the layout through the nfs_inode->layout...
There is further breakage due to lack of reference counting of the
layouts, meaning that the whole thing Oopses at the drop of a hat.
The code in initiate_bulk_draining() is almost correct, and can be
used as a model for pnfs_destroy_all_layouts(), so move that
code to pnfs.c, and refactor the code to allow us to choose between
a single filesystem bulk recall, and a recall of all layouts.
Also note that initiate_bulk_draining() currently calls iput() while
holding locks. Fix that too.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
Allow individual fields of the layout to be fetched via getxattr.
The ceph.dir.layout.* vxattr with "disappear" if the exists_cb
indicates there no dir layout set.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
This virtual xattr will only appear when there is a dir layout policy
set on the directory. It can be set via setxattr and removed via
removexattr (implemented by the MDS).
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Implement a new method to generate the ceph.file.layout vxattr using
the new framework.
Use 'stripe_unit' instead of 'chunk_size'.
Include pool name, either as a string or as an integer.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Only include vxattrs in the result if they are not hidden and exist
(as determined by the exists_cb callback).
Note that the buffer size we return when 0 is passed in always includes
vxattrs that *might* exist, forming an upper bound.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Change the vxattr handling for getxattr so that vxattrs are checked
prior to any xattr content, and never after. Enforce vxattr existence
via the exists_cb callback.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Allow for a callback to dynamically determine if a vxattr exists for
the given inode.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
If we do not explicitly recognized a vxattr (e.g., as readonly), pass
the request through to the MDS and deal with it there.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
If we do not specifically understand a setxattr on a ceph.* virtual
xattr, send it through to the MDS. This allows us to implement new
functionality via the MDS without direct support on the client side.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Add ability to flag virtual xattrs as hidden, such that you can
getxattr them but they do not appear in listxattr.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Sam Lang <sam.lang@inktank.com>
Add two helper functions get_option_uid and get_option_gid to handle
the work of parsing uid and gids paramaters from the command line and
making kuids and kgids out of them.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In cifs_unix_to_basic_fattr only update the cifs_fattr with an id if
it is valid after conversion.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use INVALID_UID and INVALID_GID instead of NO_CHANGE_64 to indicate
the value should not be changed.
In cifs_fill_unix_set_info convert from kuids and kgids into uids and
gids that will fit in FILE_UNIX_BASIC_INFO.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Update id_mode_to_cifs_acl to take a kuid_t and a kgid_t.
Replace NO_CHANGE_32 with INVALID_UID and INVALID_GID, and tests for
NO_CHANGE_32 with uid_valid and gid_valid.
Carefully unpack the value returned from request_key. memcpy the
value into the expected type. The convert the uid/gid into a
kuid/kgid. And then only if the result is a valid kuid or kgid update
fuid/fgid.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
keyring_alloc has been updated to take a kuid_t and kgid_t so
pass GLOBAL_ROOT_UID instead of 0 for the uid and GLOBAL_ROOT_GID
instead of 0 for the gid.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The assumption that sizeof(uid_t) is the same as sizeof(gid_t) is
completely reasonable but since we can verify the condition at
compile time.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The cifs protocol has a 64bit space for uids and gids, while linux
only supports a 32bit space today. Instead of silently truncating
64bit cifs ids, replace cifs ids that do not fit in the 32bit linux
id space with the default uid and gids for the cifs mount.
Cc: Steve French <smfrench@gmail.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use uid_eq(uid, GLOBAL_ROOT_UID) instead of !uid.
Use gid_eq(gid, GLOBAL_ROOT_GID) instead of !gid.
Use uid_eq(uid, INVALID_UID) instead of uid == -1
Use gid_eq(uid, INVALID_GID) instead of gid == -1
Use uid = GLOBAL_ROOT_UID instead of uid = 0;
Use gid = GLOBAL_ROOT_GID instead of gid = 0;
Use !uid_eq(uid1, uid2) instead of uid1 != uid2.
Use !gid_eq(gid1, gid2) instead of gid1 != gid2.
Use uid_eq(uid1, uid2) instead of uid1 == uid2.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Change uid and gid in struct nfsd4_cb_sec to be of type kuid_t and
kgid_t.
In nfsd4_decode_cb_sec when reading uids and gids off the wire convert
them to kuids and kgids, and if they don't convert to valid kuids or
valid kuids ignore RPC_AUTH_UNIX and don't fill in any of the fields.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In struct nfs4_ace remove the member who and replace it with an
anonymous union holding who_uid and who_gid. Allowing typesafe
storage uids and gids.
Add a helper pace_gt for sorting posix_acl_entries.
In struct posix_user_ace_state to replace uid with a union
of kuid_t uid and kgid_t gid.
Remove all initializations of the deprecated posic_acl_entry
e_id field. Which is not present when user namespaces are enabled.
Split find_uid into two functions find_uid and find_gid that work
in a typesafe manner.
In nfs4xdr update nfsd4_encode_fattr to deal with the changes
in struct nfs4_ace.
Rewrite nfsd4_encode_name to take a kuid_t and a kgid_t instead
of a generic id and flag if it is a group or a uid. Replace
the group flag with a test for a valid gid.
Modify nfsd4_encode_user to take a kuid_t and call the modifed
nfsd4_encode_name.
Modify nfsd4_encode_group to take a kgid_t and call the modified
nfsd4_encode_name.
Modify nfsd4_encode_aclname to take an ace instead of taking the
fields of an ace broken out. This allows it to detect if the ace is
for a user or a group and to pass the appropriate value while still
being typesafe.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
When reading uids and gids off the wire convert them to
kuids and kgids. If the conversion results in an invalid
result don't set the ATTR_UID or ATTR_GID.
When putting kuids and kgids onto the wire first convert
them to uids and gids the other side will understand.
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>