Commit graph

4490 commits

Author SHA1 Message Date
NeilBrown
5959cded91 blk: Ensure users for current->bio_list can see the full list.
commit f5fe1b51905df7cfe4fdfd85c5fb7bc5b71a094f upstream.

Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.

There are two places which walk the list and requeue selected bios,
and others that check if the list is empty.  These are no longer
correct.

So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.

Signed-off-by: NeilBrown <neilb@suse.com>
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe <axboe@fb.com>
Cc: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-08 09:30:36 +02:00
Shaohua Li
ad5166415f md/raid1/10: fix potential deadlock
commit 61eb2b43b99ebdc9bc6bc83d9792257b243e7cb3 upstream.

Neil Brown pointed out a potential deadlock in raid 10 code with
bio_split/chain. The raid1 code could have the same issue, but recent
barrier rework makes it less likely to happen. The deadlock happens in
below sequence:

1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request will split bio to bio1 and bio2
3. __make_request(bio1), wait_barrer, add underlayer disk bio to
current->bio_list
4. __make_request(bio2), wait_barrer

If raise_barrier happens between 3 & 4, since wait_barrier runs at 3,
raise_barrier waits for IO completion from 3. And since raise_barrier
sets barrier, 4 waits for raise_barrier. But IO from 3 can't be
dispatched because raid10_make_request() doesn't finished yet.

The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
   } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Note, this only happens in read path. In write path, the bio is flushed to
underlaying disks either by blk flush (from schedule) or offladed to raid1/10d.
It's queued in current->bio_list.

Cc: Coly Li <colyli@suse.de>
Suggested-by: NeilBrown <neilb@suse.com>
Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-26 13:05:57 +02:00
Mikulas Patocka
21582cd0b6 dm: flush queued bios when process blocks to avoid deadlock
commit d67a5f4b5947aba4bfe9a80a2b86079c215ca755 upstream.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks
by redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However other deadlocks (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

** the related dm-snapshot deadlock is detailed here:
https://www.redhat.com/archives/dm-devel/2016-July/msg00065.html

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule() call.  Consequently,
when the process blocks on a mutex, the bios queued on
current->bio_list are dispatched to independent workqueus and they can
complete without waiting for the mutex to be available.

The structure blk_plug contains an entry cb_list and this list can contain
arbitrary callback functions that are called when the process blocks.
To implement this fix DM (ab)uses the onstack plug's cb_list interface
to get its flush_current_bio_list() called at schedule() time.

This fixes the snapshot deadlock - if the map method blocks,
flush_current_bio_list() will be called and it redirects bios waiting
on current->bio_list to appropriate workqueues.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-18 19:14:34 +08:00
colyli@suse.de
6f9c02ab9d md linear: fix a race between linear_add() and linear_congested()
commit 03a9e24ef2aaa5f1f9837356aed79c860521407a upstream.

Recently I receive a bug report that on Linux v3.0 based kerenl, hot add
disk to a md linear device causes kernel crash at linear_congested(). From
the crash image analysis, I find in linear_congested(), mddev->raid_disks
contains value N, but conf->disks[] only has N-1 pointers available. Then
a NULL pointer deference crashes the kernel.

There is a race between linear_add() and linear_congested(), RCU stuffs
used in these two functions cannot avoid the race. Since Linuv v4.0
RCU code is replaced by introducing mddev_suspend().  After checking the
upstream code, it seems linear_congested() is not called in
generic_make_request() code patch, so mddev_suspend() cannot provent it
from being called. The possible race still exists.

Here I explain how the race still exists in current code.  For a machine
has many CPUs, on one CPU, linear_add() is called to add a hard disk to a
md linear device; at the same time on other CPU, linear_congested() is
called to detect whether this md linear device is congested before issuing
an I/O request onto it.

Now I use a possible code execution time sequence to demo how the possible
race happens,

seq    linear_add()                linear_congested()
 0                                 conf=mddev->private
 1   oldconf=mddev->private
 2   mddev->raid_disks++
 3                              for (i=0; i<mddev->raid_disks;i++)
 4                                bdev_get_queue(conf->disks[i].rdev->bdev)
 5   mddev->private=newconf

In linear_add() mddev->raid_disks is increased in time seq 2, and on
another CPU in linear_congested() the for-loop iterates conf->disks[i] by
the increased mddev->raid_disks in time seq 3,4. But conf with one more
element (which is a pointer to struct dev_info type) to conf->disks[] is
not updated yet, accessing its structure member in time seq 4 will cause a
NULL pointer deference fault.

To fix this race, there are 2 parts of modification in the patch,
 1) Add 'int raid_disks' in struct linear_conf, as a copy of
    mddev->raid_disks. It is initialized in linear_conf(), always being
    consistent with pointers number of 'struct dev_info disks[]'. When
    iterating conf->disks[] in linear_congested(), use conf->raid_disks to
    replace mddev->raid_disks in the for-loop, then NULL pointer deference
    will not happen again.
 2) RCU stuffs are back again, and use kfree_rcu() in linear_add() to
    free oldconf memory. Because oldconf may be referenced as mddev->private
    in linear_congested(), kfree_rcu() makes sure that its memory will not
    be released until no one uses it any more.
Also some code comments are added in this patch, to make this modification
to be easier understandable.

This patch can be applied for kernels since v4.0 after commit:
3be260cc18 ("md/linear: remove rcu protections in favour of
suspend/resume"). But this bug is reported on Linux v3.0 based kernel, for
people who maintain kernels before Linux v4.0, they need to do some back
back port to this patch.

Changelog:
 - V3: add 'int raid_disks' in struct linear_conf, and use kfree_rcu() to
       replace rcu_call() in linear_add().
 - v2: add RCU stuffs by suggestion from Shaohua and Neil.
 - v1: initial effort.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:52 +01:00
Heinz Mauelshagen
2937e22c23 dm raid: fix data corruption on reshape request
commit d36a19541fe8f392778ac137d60f9be8dfdd8f9d upstream.

The lvm2 sequence to manage dm-raid constructor flags that trigger a
rebuild or a reshape is defined as:

1) load table with flags (e.g. rebuild/delta_disks/data_offset)
2) clear out the flags in lvm2 metadata
3) store the lvm2 metadata, reload the table to reset the flags
   previously established during the initial load (1) -- in order to
   prevent repeatedly requesting a rebuild or a reshape on activation

Currently, loading an inactive table with rebuild/reshape flags
specified will cause dm-raid to rebuild/reshape on resume and thus start
updating the raid metadata (about the progress).  When the second table
reload, to reset the flags, occurs the constructor accesses the volatile
progress state kept in the raid superblocks.  Because the active mapping
is still processing the rebuild/reshape, that position will be stale by
the time the device is resumed.

In the reshape case, this causes data corruption by processing already
reshaped stripes again.  In the rebuild case, it does _not_ cause data
corruption but instead involves superfluous rebuilds.

Fix by keeping the raid set frozen during the first resume and then
allow the rebuild/reshape during the second resume.

Fixes: 9dbd1aa3a ("dm raid: add reshaping support to the target")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:44 +01:00
Mike Snitzer
b7f874eedc dm round robin: revert "use percpu 'repeat_count' and 'current_path'"
commit 37a098e9d10db6e2efc05fe61e3a6ff2e9802c53 upstream.

The sloppy nature of lockless access to percpu pointers
(s->current_path) in rr_select_path(), from multiple threads, is
causing some paths to used more than others -- which results in less
IO performance being observed.

Revert these upstream commits to restore truly symmetric round-robin
IO submission in DM multipath:

b0b477c dm round robin: use percpu 'repeat_count' and 'current_path'
802934b dm round robin: do not use this_cpu_ptr() without having preemption disabled

There is no benefit to all this complexity if repeat_count = 1 (which is
the recommended default).

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:44 +01:00
Mikulas Patocka
bad6c16b81 dm stats: fix a leaked s->histogram_boundaries array
commit 6085831883c25860264721df15f05bbded45e2a2 upstream.

Fixes: dfcfac3e4c ("dm stats: collect and report histogram of IO latencies")
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:44 +01:00
Joe Thornber
9987feba90 dm cache: fix corruption seen when using cache > 2TB
commit ca763d0a53b264a650342cee206512bc92ac7050 upstream.

A rounding bug due to compiler generated temporary being 32bit was found
in remap_to_cache().  A localized cast in remap_to_cache() fixes the
corruption but this preferred fix (changing from uint32_t to sector_t)
eliminates potential for future rounding errors elsewhere.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-03-12 06:41:44 +01:00
Kent Overstreet
8629aed238 bcache: Make gc wakeup sane, remove set_task_state()
commit be628be09563f8f6e81929efbd7cf3f45c344416 upstream.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-23 17:44:36 +01:00
Mike Snitzer
621d101a7e dm rq: cope with DM device destruction while in dm_old_request_fn()
commit 4087a1fffe38106e10646606a27f10d40451862d upstream.

Fixes a crash in dm_table_find_target() due to a NULL struct dm_table
being passed from dm_old_request_fn() that races with DM device
destruction.

Reported-by: artem@flashgrid.io
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-02-14 15:25:32 -08:00
NeilBrown
5cc85ef4ff md: fix refcount problem on mddev when stopping array.
commit e2342ca832726a840ca6bd196dd2cc073815b08a upstream.

md_open() gets a counted reference on an mddev using mddev_find().
If it ends up returning an error, it must drop this reference.

There are two error paths where the reference is not dropped.
One only happens if the process is signalled and an awkward time,
which is quite unlikely.
The other was introduced recently in commit af8d8e6f0.

Change the code to ensure the drop the reference when returning an error,
and make it harded to re-introduce this sort of bug in the future.

Reported-by: Marc Smith <marc.smith@mcc.edu>
Fixes: af8d8e6f03 ("md: changes for MD_STILL_CLOSED flag")
Signed-off-by: NeilBrown <neilb@suse.com>
Acked-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12 11:39:35 +01:00
Shaohua Li
60a931c20d md: MD_RECOVERY_NEEDED is set for mddev->recovery
commit 82a301cb0ea2df8a5c88213094a01660067c7fb4 upstream.

Fixes: 90f5f7ad4f38("md: Wait for md_check_recovery before attempting device
removal.")

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-12 11:39:34 +01:00
Konstantin Khlebnikov
8beb252f2b md/raid5: limit request size according to implementation limits
commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5 upstream.

Current implementation employ 16bit counter of active stripes in lower
bits of bio->bi_phys_segments. If request is big enough to overflow
this counter bio will be completed and freed too early.

Fortunately this not happens in default configuration because several
other limits prevent that: stripe_cache_size * nr_disks effectively
limits count of active stripes. And small max_sectors_kb at lower
disks prevent that during normal read/write operations.

Overflow easily happens in discard if it's enabled by module parameter
"devices_handle_discard_safely" and stripe_cache_size is set big enough.

This patch limits requests size with 256Mb - 8Kb to prevent overflows.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Shaohua Li <shli@kernel.org>
Cc: Neil Brown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-09 08:32:22 +01:00
Benjamin Marzinski
f5dca4881f dm space map metadata: fix 'struct sm_metadata' leak on failed create
commit 314c25c56c1ee5026cf99c570bdfe01847927acb upstream.

In dm_sm_metadata_create() we temporarily change the dm_space_map
operations from 'ops' (whose .destroy function deallocates the
sm_metadata) to 'bootstrap_ops' (whose .destroy function doesn't).

If dm_sm_metadata_create() fails in sm_ll_new_metadata() or
sm_ll_extend(), it exits back to dm_tm_create_internal(), which calls
dm_sm_destroy() with the intention of freeing the sm_metadata, but it
doesn't (because the dm_space_map operations is still set to
'bootstrap_ops').

Fix this by setting the dm_space_map operations back to 'ops' if
dm_sm_metadata_create() fails when it is set to 'bootstrap_ops'.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Acked-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:16 +01:00
Heinz Mauelshagen
461f272954 dm raid: fix discard support regression
commit 11e2968478edc07a75ee1efb45011b3033c621c2 upstream.

Commit ecbfb9f118 ("dm raid: add raid level takeover support") moved the
configure_discard_support() call from raid_ctr() to raid_preresume().

Enabling/disabling discard _must_ happen during table load (through the
.ctr hook).  Fix this regression by moving the
configure_discard_support() call back to raid_ctr().

Fixes: ecbfb9f118 ("dm raid: add raid level takeover support")
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Bart Van Assche
e362c317ba dm rq: fix a race condition in rq_completed()
commit d15bb3a6467e102e60d954aadda5fb19ce6fd8ec upstream.

It is required to hold the queue lock when calling blk_run_queue_async()
to avoid that a race between blk_run_queue_async() and
blk_cleanup_queue() is triggered.

Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Ondrej Kozina
2c017f77e1 dm crypt: mark key as invalid until properly loaded
commit 265e9098bac02bc5e36cda21fdbad34cb5b2f48d upstream.

In crypt_set_key(), if a failure occurs while replacing the old key
(e.g. tfm->setkey() fails) the key must not have DM_CRYPT_KEY_VALID flag
set.  Otherwise, the crypto layer would have an invalid key that still
has DM_CRYPT_KEY_VALID flag set.

Signed-off-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Wei Yongjun
3fae2a9e99 dm flakey: return -EINVAL on interval bounds error in flakey_ctr()
commit bff7e067ee518f9ed7e1cbc63e4c9e01670d0b71 upstream.

Fix to return error code -EINVAL instead of 0, as is done elsewhere in
this function.

Fixes: e80d1c805a ("dm: do not override error code returned from dm_get_device()")
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Bart Van Assche
e74fb82228 dm table: an 'all_blk_mq' table must be loaded for a blk-mq DM device
commit 301fc3f5efb98633115bd887655b19f42c6dfaa8 upstream.

When dm_table_set_type() is used by a target to establish a DM table's
type (e.g. DM_TYPE_MQ_REQUEST_BASED in the case of DM multipath) the
DM core must go on to verify that the devices in the table are
compatible with the established type.

Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature")
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Mike Snitzer
470b6910f7 dm table: fix 'all_blk_mq' inconsistency when an empty table is loaded
commit 6936c12cf809850180b24947271b8f068fdb15e9 upstream.

An earlier DM multipath table could have been build ontop of underlying
devices that were all using blk-mq.  In that case, if that active
multipath table is replaced with an empty DM multipath table (that
reflects all paths have failed) then it is important that the
'all_blk_mq' state of the active table is transfered to the new empty DM
table.  Otherwise dm-rq.c:dm_old_prep_tio() will incorrectly clone a
request that isn't needed by the DM multipath target when it is to issue
IO to an underlying blk-mq device.

Fixes: e83068a5 ("dm mpath: add optional "queue_mode" feature")
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com>
Tested-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06 10:40:15 +01:00
Linus Torvalds
6c286e812d Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
 "There are several bug fixes queued:

   - fix raid5-cache recovery bugs

   - fix discard IO error handling for raid1/10

   - fix array sync writes bogus position to superblock

   - fix IO error handling for raid array with external metadata"

* tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md: be careful not lot leak internal curr_resync value into metadata. -- (all)
  raid1: handle read error also in readonly mode
  raid5-cache: correct condition for empty metadata write
  md: report 'write_pending' state when array in sync
  md/raid5: write an empty meta-block when creating log super-block
  md/raid5: initialize next_checkpoint field before use
  RAID10: ignore discard error
  RAID1: ignore discard error
2016-11-05 11:34:07 -07:00
NeilBrown
1217e1d199 md: be careful not lot leak internal curr_resync value into metadata. -- (all)
mddev->curr_resync usually records where the current resync is up to,
but during the starting phase it has some "magic" values.

 1 - means that the array is trying to start a resync, but has yielded
     to another array which shares physical devices, and also needs to
     start a resync
 2 - means the array is trying to start resync, but has found another
     array which shares physical devices and has already started resync.

 3 - means that resync has commensed, but it is possible that nothing
     has actually been resynced yet.

It is important that this value not be visible to user-space and
particularly that it doesn't get written to the metadata, as the
resync or recovery checkpoint.  In part, this is because it may be
slightly higher than the correct value, though this is very rare.
In part, because it is not a multiple of 4K, and some devices only
support 4K aligned accesses.

There are two places where this value is propagates into either
->curr_resync_completed or ->recovery_cp or ->recovery_offset.
These currently avoid the propagation of values 1 and 3, but will
allow 3 to leak through.

Change them to only propagate the value if it is > 3.

As this can cause an array to fail, the patch is suitable for -stable.

Cc: stable@vger.kernel.org (v3.7+)
Reported-by: Viswesh <viswesh.vichu@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:05 -07:00
Tomasz Majchrzak
7449f699b2 raid1: handle read error also in readonly mode
If write is the first operation on a disk and it happens not to be
aligned to page size, block layer sends read request first. If read
operation fails, the disk is set as failed as no attempt to fix the
error is made because array is in auto-readonly mode. Similarily, the
disk is set as failed for read-only array.

Take the same approach as in raid10. Don't fail the disk if array is in
readonly or auto-readonly mode. Try to redirect the request first and if
unsuccessful, return a read error.

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:04 -07:00
Shaohua Li
9a8b27fac5 raid5-cache: correct condition for empty metadata write
As long as we recover one metadata block, we should write the empty metadata
write. The original code could make recovery corrupted if only one meta is
valid.

Reported-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-28 22:04:03 -07:00
Linus Torvalds
e0f3e6a7cc - A couple DM raid and DM mirror fixes
- A couple .request_fn request-based DM NULL pointer fixes
 
 - A fix for a DM target reference count leak, on target load error, that
   prevented associated DM target kernel module(s) from being removed
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJYEo+lAAoJEMUj8QotnQNaGfkH/jGqr4bj4l2Ty3QgV95fYW7+
 lqp4Flkevm35HotEGKuuizvqbbVrj57BCGLE+dV48/X2cv5QbUFht6QBu9iJTrk6
 Q7VqyBOvDDnOZHIof5CfKBeLZ2gd8YHZwUpYvzJcThSWS1+LjeVqg8a33LMZroMQ
 rghVxFCIKy6LqCryIiTHk1t+OfmuBz3S2LXcQXFY7XAPpWq/f+V66gthTZUpm86+
 Gu1xOHQlvnmf5xnDUxCpPVbQNY334D/aSbU73i2cdvfL1pkxBFNcI+LbPcu+sNP9
 ugGjPj4etbIRsVysuW3fLhn2kKqaXXVuD1rLTQ+C3ytciI+RQJvG892gWhAABRQ=
 =apHk
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mike Snitzer:

 - a couple DM raid and DM mirror fixes

 - a couple .request_fn request-based DM NULL pointer fixes

 - a fix for a DM target reference count leak, on target load error,
   that prevented associated DM target kernel module(s) from being
   removed

* tag 'dm-4.9-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm table: fix missing dm_put_target_type() in dm_table_add_target()
  dm rq: clear kworker_task if kthread_run() returned an error
  dm: free io_barrier after blk_cleanup_queue call
  dm raid: fix activation of existing raid4/10 devices
  dm mirror: use all available legs on multiple failures
  dm mirror: fix read error on recovery after default leg failure
  dm raid: fix compat_features validation
2016-10-28 09:27:58 -07:00
Tomasz Majchrzak
16f889499a md: report 'write_pending' state when array in sync
If there is a bad block on a disk and there is a recovery performed from
this disk, the same bad block is reported for a new disk. It involves
setting MD_CHANGE_PENDING flag in rdev_set_badblocks. For external
metadata this flag is not being cleared as array state is reported as
'clean'. The read request to bad block in RAID5 array gets stuck as it
is waiting for a flag to be cleared - as per commit c3cce6cda1
("md/raid5: ensure device failure recorded before write request
returns.").

The meaning of MD_CHANGE_PENDING and MD_CHANGE_CLEAN flags has been
clarified in commit 070dc6dd71 ("md: resolve confusion of
MD_CHANGE_CLEAN"), however MD_CHANGE_PENDING flag has been used in
personality error handlers since and it doesn't fully comply with
initial purpose. It was supposed to notify that write request is about
to start, however now it is also used to request metadata update.
Initially (in md_allow_write, md_write_start) MD_CHANGE_PENDING flag has
been set and in_sync has been set to 0 at the same time. Error handlers
just set the flag without modifying in_sync value. Sysfs array state is
a single value so now it reports 'clean' when MD_CHANGE_PENDING flag is
set and in_sync is set to 1. Userspace has no idea it is expected to
take some action.

Swap the order that array state is checked so 'write_pending' is
reported ahead of 'clean' ('write_pending' is a misleading name but it
is too late to rename it now).

Signed-off-by: Tomasz Majchrzak <tomasz.majchrzak@intel.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:19 -07:00
Zhengyuan Liu
56056c2e7d md/raid5: write an empty meta-block when creating log super-block
If superblock points to an invalid meta block, r5l_load_log will set
create_super with true and create an new superblock, this runtime path
would always happen if we do no writing I/O to this array since it was
created. Writing an empty meta block could avoid this unnecessary
action at the first time we created log superblock.

Another reason is for the corretness of log recovery. Currently we have
bellow code to guarantee log revocery to be correct.

        if (ctx.seq > log->last_cp_seq + 1) {
                int ret;

                ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
                if (ret)
                        return ret;
                log->seq = ctx.seq + 11;
                log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
                r5l_write_super(log, ctx.pos);
        } else {
                log->log_start = ctx.pos;
                log->seq = ctx.seq;
        }

If we just created a array with a journal device, log->log_start and
log->last_checkpoint should all be 0, then we write three meta block
which are valid except mid one and supposed crash happened. The ctx.seq
would equal to log->last_cp_seq + 1 and log->log_start would be set to
position of mid invalid meta block after we did a recovery, this will
lead to problems which could be avoided with this patch.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:18 -07:00
Zhengyuan Liu
28cd88e2b4 md/raid5: initialize next_checkpoint field before use
No initial operation was done to this field when we
load/recovery the log, it got assignment only when IO
to raid disk was finished. So r5l_quiesce may use wrong
next_checkpoint to reclaim log space, that would make
reclaimable space calculation confused.

Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:18 -07:00
Shaohua Li
579ed34f7b RAID10: ignore discard error
This is the counterpart of raid10 fix. If a write error occurs, raid10
will try to rewrite the bio in small chunk size. If the rewrite fails,
raid10 will record the error in bad block. narrow_write_error will
always use WRITE for the bio, but actually it could be a discard. Since
discard bio hasn't payload, write the bio will cause different issues.
But discard error isn't fatal, we can safely ignore it. This is what
this patch does.

This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.

Cc: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:17 -07:00
Shaohua Li
e3f948cd32 RAID1: ignore discard error
If a write error occurs, raid1 will try to rewrite the bio in small
chunk size. If the rewrite fails, raid1 will record the error in bad
block. narrow_write_error will always use WRITE for the bio, but
actually it could be a discard. Since discard bio hasn't payload, write
the bio will cause different issues. But discard error isn't fatal, we
can safely ignore it. This is what this patch does.

This issue should exist since discard is added, but only exposed with
recent arbitrary bio size feature.

Reported-and-tested-by: Sitsofe Wheeler <sitsofe@gmail.com>
Cc: stable@vger.kernel.org (v3.6)
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-24 15:28:17 -07:00
tang.junhui
dafa724bf5 dm table: fix missing dm_put_target_type() in dm_table_add_target()
dm_get_target_type() was previously called so any error returned from
dm_table_add_target() must first call dm_put_target_type().  Otherwise
the DM target module's reference count will leak and the associated
kernel module will be unable to be removed.

Also, leverage the fact that r is already -EINVAL and remove an extra
newline.

Fixes: 36a0456 ("dm table: add immutable feature")
Fixes: cc6cbe1 ("dm table: add always writeable feature")
Fixes: 3791e2f ("dm table: add singleton feature")
Cc: stable@vger.kernel.org # 3.2+
Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-24 11:17:46 -04:00
Mike Snitzer
937fa62e8a dm rq: clear kworker_task if kthread_run() returned an error
cleanup_mapped_device() calls kthread_stop() if kworker_task is
non-NULL.  Currently the assigned value could be a valid task struct or
an error code (e.g -ENOMEM).  Reset md->kworker_task to NULL if
kthread_run() returned an erorr.

Fixes: 7193a9defc ("dm rq: check kthread_run return for .request_fn request-based DM")
Cc: stable@vger.kernel.org # 4.8
Reported-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18 14:02:04 -04:00
Tahsin Erdogan
d09960b003 dm: free io_barrier after blk_cleanup_queue call
dm_old_request_fn() has paths that access md->io_barrier.  The party
destroying io_barrier should ensure that no future execution of
dm_old_request_fn() is possible.  Move io_barrier destruction to below
blk_cleanup_queue() to ensure this and avoid a NULL pointer crash during
request-based DM device shutdown.

Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-18 12:02:08 -04:00
Heinz Mauelshagen
b052b07c39 dm raid: fix activation of existing raid4/10 devices
dm-raid 1.9.0 fails to activate existing RAID4/10 devices that have the
old superblock format (which does not have takeover/reshaping support
that was added via commit 33e53f0685).

Fix validation path for old superblocks by reverting to the old raid4
layout and basing checks on mddev->new_{level,layout,...} members in
super_init_validation().

Cc: stable@vger.kernel.org # 4.8
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-17 16:41:31 -04:00
Heinz Mauelshagen
12a7cf5ba6 dm mirror: use all available legs on multiple failures
When any leg(s) have failed, any read will cause a new operational
default leg to be selected and the read is resubmitted to it.  If that
new default leg fails the read too, no other still accessible legs are
used to resubmit the read again -- thus failing the io.

Fix by allowing the read to get resubmitted until all operational legs
have been exhausted.  Also, remove any details.bi_dev use as a flag.

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14 11:55:17 -04:00
Heinz Mauelshagen
dcb2ff5641 dm mirror: fix read error on recovery after default leg failure
If a default leg has failed, any read will cause a new operational
default leg to be selected and the read is resubmitted.  But until now
the read will return failure even though it was successful due to
resubmission.  The reason for this is bio->bi_error was not being
cleared before resubmitting the bio.

Fix by clearing bio->bi_error before resubmission.

Fixes: 4246a0b63b ("block: add a bi_error field to struct bio")
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-14 11:54:10 -04:00
Petr Mladek
3989144f86 kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker()		-> __kthread_init_worker()
init_kthread_worker()		-> kthread_init_worker()
init_kthread_work()		-> kthread_init_work()
insert_kthread_work()		-> kthread_insert_work()
queue_kthread_work()		-> kthread_queue_work()
flush_kthread_work()		-> kthread_flush_work()
flush_kthread_worker()		-> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[arnd@arndb.de: fix linux-next merge conflict]
 Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Andy Whitcroft
5c33677c87 dm raid: fix compat_features validation
In ecbfb9f118 ("dm raid: add raid level takeover support") a new
compatible feature flag was added.  Validation for these compat_features
was added but this only passes for new raid mappings with this feature
flag.  This causes previously created raid mappings to be failed at
import.

Check compat_features for the only valid combination.

Fixes: ecbfb9f118 ("dm raid: add raid level takeover support")
Cc: stable@vger.kernel.org # v4.8
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-10-11 15:19:17 -04:00
Linus Torvalds
12e3d3cdd9 Merge branch 'for-4.9/block-irq' of git://git.kernel.dk/linux-block
Pull blk-mq irq/cpu mapping updates from Jens Axboe:
 "This is the block-irq topic branch for 4.9-rc. It's mostly from
  Christoph, and it allows drivers to specify their own mappings, and
  more importantly, to share the blk-mq mappings with the IRQ affinity
  mappings. It's a good step towards making this work better out of the
  box"

* 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
  blk_mq: linux/blk-mq.h does not include all the headers it depends on
  blk-mq: kill unused blk_mq_create_mq_map()
  blk-mq: get rid of the cpumask in struct blk_mq_tags
  nvme: remove the post_scan callout
  nvme: switch to use pci_alloc_irq_vectors
  blk-mq: provide a default queue mapping for PCI device
  blk-mq: allow the driver to pass in a queue mapping
  blk-mq: remove ->map_queue
  blk-mq: only allocate a single mq_map per tag_set
  blk-mq: don't redistribute hardware queues on a CPU hotplug event
2016-10-09 17:29:33 -07:00
Linus Torvalds
48915c2cbc . various fixes and cleanups for request-based DM core
. add support for delaying the requeue of requests; used by DM multipath
   when all paths have failed and 'queue_if_no_path' is enabled
 
 . DM cache improvements to speedup the loading metadata and the writing
   of the hint array
 
 . fix potential for a dm-crypt crash on device teardown
 
 . remove dm_bufio_cond_resched() and just using cond_resched()
 
 . change DM multipath to return a reservation conflict error
   immediately; rather than failing the path and retrying (potentially
   indefinitely)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJX7n9KAAoJEMUj8QotnQNab74IANm+rW2uYdpLNCxWUmcaih0d
 BK8dLS/Mz35S0TRSekvynuBcPx18VP2Zueulc+aHTWcT4sj79l6KnVYT9g6c98rL
 zzcv10QTteqhiiWwFmPHsZgv5dW8Y5wiRdt+SqcQ5sAHMFci6C05gzp9caNu7VTs
 fbcLUdyYm40y3j84Lx/+ABXgnBhq+40OTtdnYSkEmLtdscPLzwpHgPmMctkrEl7e
 7mqGC1KbDDzartqOZOeGP2P2qOCNN21qA+8ctMw9Xyze33uwvj7Vx6cro6e28wMm
 ZClY9XNGlfuW9dCNtFR9o6NXS6NIK30UJbKqyZPPsK+70JrOgzh6GzQnwSXdyNs=
 =7SkG
 -----END PGP SIGNATURE-----

Merge tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mike Snitzer:

 - various fixes and cleanups for request-based DM core

 - add support for delaying the requeue of requests; used by DM
   multipath when all paths have failed and 'queue_if_no_path' is
   enabled

 - DM cache improvements to speedup the loading metadata and the writing
   of the hint array

 - fix potential for a dm-crypt crash on device teardown

 - remove dm_bufio_cond_resched() and just using cond_resched()

 - change DM multipath to return a reservation conflict error
   immediately; rather than failing the path and retrying (potentially
   indefinitely)

* tag 'dm-4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (24 commits)
  dm mpath: always return reservation conflict without failing over
  dm bufio: remove dm_bufio_cond_resched()
  dm crypt: fix crash on exit
  dm cache metadata: switch to using the new cursor api for loading metadata
  dm array: introduce cursor api
  dm btree: introduce cursor api
  dm cache policy smq: distribute entries to random levels when switching to smq
  dm cache: speed up writing of the hint array
  dm array: add dm_array_new()
  dm mpath: delay the requeue of blk-mq requests while all paths down
  dm mpath: use dm_mq_kick_requeue_list()
  dm rq: introduce dm_mq_kick_requeue_list()
  dm rq: reduce arguments passed to map_request() and dm_requeue_original_request()
  dm rq: add DM_MAPIO_DELAY_REQUEUE to delay requeue of blk-mq requests
  dm: convert wait loops to use autoremove_wake_function()
  dm: use signal_pending_state() in dm_wait_for_completion()
  dm: rename task state function arguments
  dm: add two lockdep_assert_held() statements
  dm rq: simplify dm_old_stop_queue()
  dm mpath: check if path's request_queue is dying in activate_path()
  ...
2016-10-09 17:16:18 -07:00
Linus Torvalds
513a4befae Merge branch 'for-4.9/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
 "This is the main pull request for block layer changes in 4.9.

  As mentioned at the last merge window, I've changed things up and now
  do just one branch for core block layer changes, and driver changes.
  This avoids dependencies between the two branches. Outside of this
  main pull request, there are two topical branches coming as well.

  This pull request contains:

   - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

   - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
     Followup dependency fix from Geert.

   - General fixes from Bart, Baoyou, Guoqing, and Linus W.

   - CFQ async write starvation fix from Glauber.

   - Add supprot for delayed kick of the requeue list, from Mike.

   - Pull out the scalable bitmap code from blk-mq-tag.c and make it
     generally available under the name of sbitmap. Only blk-mq-tag uses
     it for now, but the blk-mq scheduling bits will use it as well.
     From Omar.

   - bdev thaw error progagation from Pierre.

   - Improve the blk polling statistics, and allow the user to clear
     them. From Stephen.

   - Set of minor cleanups from Christoph in block/blk-mq.

   - Set of cleanups and optimizations from me for block/blk-mq.

   - Various nvme/nvmet/nvmeof fixes from the various folks"

* 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
  fs/block_dev.c: return the right error in thaw_bdev()
  nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
  nvme/scsi: Remove power management support
  nvmet: Make dsm number of ranges zero based
  nvmet: Use direct IO for writes
  admin-cmd: Added smart-log command support.
  nvme-fabrics: Add host_traddr options field to host infrastructure
  nvme-fabrics: revise host transport option descriptions
  nvme-fabrics: rework nvmf_get_address() for variable options
  nbd: use BLK_MQ_F_BLOCKING
  blkcg: Annotate blkg_hint correctly
  cfq: fix starvation of asynchronous writes
  blk-mq: add flag for drivers wanting blocking ->queue_rq()
  blk-mq: remove non-blocking pass in blk_mq_map_request
  blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
  block: export bio_free_pages to other modules
  lightnvm: propagate device_add() error code
  lightnvm: expose device geometry through sysfs
  lightnvm: control life of nvm_dev in driver
  blk-mq: register device instead of disk
  ...
2016-10-07 14:42:05 -07:00
Linus Torvalds
c23112e039 Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD updates from Shaohua Li:
 "This update includes:

   - new AVX512 instruction based raid6 gen/recovery algorithm

   - a couple of md-cluster related bug fixes

   - fix a potential deadlock

   - set nonrotational bit for raid array with SSD

   - set correct max_hw_sectors for raid5/6, which hopefuly can improve
     performance a little bit

   - other minor fixes"

* tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  md: set rotational bit
  raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
  raid5: handle register_shrinker failure
  raid5: fix to detect failure of register_shrinker
  md: fix a potential deadlock
  md/bitmap: fix wrong cleanup
  raid5: allow arbitrary max_hw_sectors
  lib/raid6: Add AVX512 optimized xor_syndrome functions
  lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
  lib/raid6: Add AVX512 optimized recovery functions
  lib/raid6: Add AVX512 optimized gen_syndrome functions
  md-cluster: make resync lock also could be interruptted
  md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
  md-cluster: convert the completion to wait queue
  md-cluster: protect md_find_rdev_nr_rcu with rcu lock
  md-cluster: clean related infos of cluster
  md: changes for MD_STILL_CLOSED flag
  md-cluster: remove some unnecessary dlm_unlock_sync
  md-cluster: use FORCEUNLOCK in lockres_free
  md-cluster: call md_kick_rdev_from_array once ack failed
2016-10-07 09:45:43 -07:00
Linus Torvalds
597f03f9d1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq custom
     list conversion comes through the block tree and makes the diffstat
     tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

   The relayfs changes which conflicted with the conversion have been
   shipped to me by Andrew.

   The remaining lot is targeted for 4.10 so that we finally can remove
   the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...
2016-10-03 19:43:08 -07:00
Shaohua Li
bb086a89a4 md: set rotational bit
if all disks in an array are non-rotational, set the array
non-rotational.

This only works for array with all disks populated at startup. Support
for disk hotadd/hotremove could be added later if necessary.

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Shaohua Li <shli@fb.com>
2016-10-03 10:20:27 -07:00
Hannes Reinecke
8ff232c1a8 dm mpath: always return reservation conflict without failing over
If dm-mpath encounters an reservation conflict it should not fail the
path (as communication with the target is not affected) but should
rather retry on another path.  However, in doing so we might be inducing
a ping-pong between paths, with no guarantee of any forward progress.
And arguably a reservation conflict is an unexpected error, so we should
be passing it upwards to allow the application to take appropriate
steps.

This change resolves a show-stopper problem seen with the pNFS SCSI
layout because it is trivial to hit reservation conflict based failover
loops without it.

Doubts were raised about the implications of this change relative to
products like IBM's SVC.  But there is little point withholding a fix
for Linux because a proprietary product may or may not have some issues
in its implementation of how it interfaces with Linux.  In the future,
if there is glaring evidence that this change is certainly problematic
we can revisit it.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # tweaked header
2016-09-29 10:57:07 -04:00
Peter Zijlstra
7cd326747f dm bufio: remove dm_bufio_cond_resched()
Use cond_resched() like everybody else.

Mikulas explained why dm_bufio_cond_resched() was introduced to begin
with (hopefully cond_resched can be improved accordingly) here:
https://www.redhat.com/archives/dm-devel/2016-September/msg00112.html

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair Kergon <agk@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mike Snitzer <snitzer@redhat.com> # added last comment in header
2016-09-22 11:15:07 -04:00
Rabin Vincent
f659b10087 dm crypt: fix crash on exit
As the documentation for kthread_stop() says, "if threadfn() may call
do_exit() itself, the caller must ensure task_struct can't go away".
dm-crypt does not ensure this and therefore crashes when crypt_dtr()
calls kthread_stop().  The crash is trivially reproducible by adding a
delay before the call to kthread_stop() and just opening and closing a
dm-crypt device.

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
 task: ffff88003bd0df40 task.stack: ffff8800375b4000
 RIP: 0010: kthread_stop+0x52/0x300
 Call Trace:
  crypt_dtr+0x77/0x120
  dm_table_destroy+0x6f/0x120
  __dm_destroy+0x130/0x250
  dm_destroy+0x13/0x20
  dev_remove+0xe6/0x120
  ? dev_suspend+0x250/0x250
  ctl_ioctl+0x1fc/0x530
  ? __lock_acquire+0x24f/0x1b10
  dm_ctl_ioctl+0x13/0x20
  do_vfs_ioctl+0x91/0x6a0
  ? ____fput+0xe/0x10
  ? entry_SYSCALL_64_fastpath+0x5/0xbd
  ? trace_hardirqs_on_caller+0x151/0x1e0
  SyS_ioctl+0x41/0x70
  entry_SYSCALL_64_fastpath+0x1f/0xbd

This problem was introduced by bcbd94ff48 ("dm crypt: fix a possible
hang due to race condition on exit").

Looking at the description of that patch (excerpted below), it seems
like the problem it addresses can be solved by just using
set_current_state instead of __set_current_state, since we obviously
need the memory barrier.

| dm crypt: fix a possible hang due to race condition on exit
|
| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
| It is possible that the processor reorders memory accesses so that
| kthread_should_stop() is executed before __set_current_state().  If
| such reordering happens, there is a possible race on thread
| termination: [...]

So this patch just reverts the aforementioned patch and changes the
__set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...).  This
fixes the crash and should also fix the potential hang.

Fixes: bcbd94ff48 ("dm crypt: fix a possible hang due to race condition on exit")
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22 11:15:06 -04:00
Joe Thornber
f177940a80 dm cache metadata: switch to using the new cursor api for loading metadata
This change offers a pretty significant performance improvement.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22 11:15:05 -04:00
Joe Thornber
fdd1315aa5 dm array: introduce cursor api
More efficient way to iterate an array due to prefetching (makes use of
the new dm_btree_cursor_* api).

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22 11:15:04 -04:00
Joe Thornber
7d111c81fa dm btree: introduce cursor api
This uses prefetching to speed up iteration through a btree.

Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2016-09-22 11:15:04 -04:00