Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hpb v21 #3

Open
wants to merge 65 commits into
base: main
Choose a base branch
from
Open

Hpb v21 #3

wants to merge 65 commits into from

Conversation

avri-altman-wdc
Copy link
Contributor

WDC host mode v3 based on Samsung device mode v21.

ahunter6 and others added 30 commits January 7, 2021 22:24
Update sysfs documentation for addition of DeepSleep power mode.

Link: https://lore.kernel.org/r/[email protected]
Fixes: fe1d4c2 ("scsi: ufs: Add DeepSleep feature")
Signed-off-by: Adrian Hunter <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Phil Oester reported that a fix for a possible buffer overrun that I sent
caused a regression that manifests in this output:

 Event Message: A PCI parity error was detected on a component at bus 0 device 5 function 0.
 Severity: Critical
 Message ID: PCI1308

The original code tried to handle the sense data pointer differently when
using 32-bit 64-bit DMA addressing, which would lead to a 32-bit dma_addr_t
value of 0x11223344 to get stored

32-bit kernel:       44 33 22 11 ?? ?? ?? ??
64-bit LE kernel:    44 33 22 11 00 00 00 00
64-bit BE kernel:    00 00 00 00 44 33 22 11

or a 64-bit dma_addr_t value of 0x1122334455667788 to get stored as

32-bit kernel:       88 77 66 55 ?? ?? ?? ??
64-bit kernel:       88 77 66 55 44 33 22 11

In my patch, I tried to ensure that the same value is used on both 32-bit
and 64-bit kernels, and picked what seemed to be the most sensible
combination, storing 32-bit addresses in the first four bytes (as 32-bit
kernels already did), and 64-bit addresses in eight consecutive bytes (as
64-bit kernels already did), but evidently this was incorrect.

Always storing the dma_addr_t pointer as 64-bit little-endian,
i.e. initializing the second four bytes to zero in case of 32-bit
addressing, apparently solved the problem for Phil, and is consistent with
what all 64-bit little-endian machines did before.

I also checked in the history that in previous versions of the code, the
pointer was always in the first four bytes without padding, and that
previous attempts to fix 64-bit user space, big-endian architectures and
64-bit DMA were clearly flawed and seem to have introduced made this worse.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 381d34e ("scsi: megaraid_sas: Check user-provided offsets")
Fixes: 107a60d ("scsi: megaraid_sas: Add support for 64bit consistent DMA")
Fixes: 94cd65d ("[SCSI] megaraid_sas: addded support for big endian architecture")
Fixes: 7b2519a ("[SCSI] megaraid_sas: fix 64 bit sense pointer truncation")
Reported-by: Phil Oester <[email protected]>
Tested-by: Phil Oester <[email protected]>
Signed-off-by: Arnd Bergmann <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Building ufshcd-pltfrm.c on arch/s390/ has a linker error since S390 does
not support IOMEM, so add a dependency on HAS_IOMEM.

s390-linux-ld: drivers/scsi/ufs/ufshcd-pltfrm.o: in function `ufshcd_pltfrm_init':
ufshcd-pltfrm.c:(.text+0x38e): undefined reference to `devm_platform_ioremap_resource'

where that devm_ function is inside an #ifdef CONFIG_HAS_IOMEM/#endif
block.

Link: lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
Fixes: 03b1781 ("[SCSI] ufs: Add Platform glue driver for ufshcd")
Cc: "James E.J. Bottomley" <[email protected]>
Cc: "Martin K. Petersen" <[email protected]>
Cc: Alim Akhtar <[email protected]>
Cc: Avri Altman <[email protected]>
Cc: [email protected]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Randy Dunlap <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Commit 2aa0102 ("scsi: ibmvfc: Use correlation token to tag commands")
sets the vfcFrame correlation token to the pointer handle of the associated
ibmvfc_event. However, that commit failed to cast the pointer to an
appropriate type which in this case is a u64. As such sparse warnings are
generated for both correlation token assignments.

 ibmvfc.c:2375:36: sparse: incorrect type in argument 1 (different base types)
 ibmvfc.c:2375:36: sparse: expected unsigned long long [usertype] val
 ibmvfc.c:2375:36: sparse: got struct ibmvfc_event *[assigned] evt

Add the appropriate u64 casts when assigning an ibmvfc_event as a
correlation token.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 2aa0102 ("scsi: ibmvfc: Use correlation token to tag commands")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Tyrel Datwyler <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
When gate_work/ungate_work experience an error during hibern8_enter or exit
we can livelock:

 ufshcd_err_handler()
   ufshcd_scsi_block_requests()
   ufshcd_reset_and_restore()
     ufshcd_clear_ua_wluns() -> stuck
   ufshcd_scsi_unblock_requests()

In order to avoid this, ufshcd_clear_ua_wluns() can be called per recovery
flows such as suspend/resume, link_recovery, and error_handler.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 1918651 ("scsi: ufs: Clear UAC for RPMB after ufshcd resets")
Reviewed-by: Can Guo <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
When non-fatal error like line-reset happens, ufshcd_err_handler() starts
to abort tasks by ufshcd_try_to_abort_task(). When it tries to issue a task
management request, we hit two warnings:

WARNING: CPU: 7 PID: 7 at block/blk-core.c:630 blk_get_request+0x68/0x70
WARNING: CPU: 4 PID: 157 at block/blk-mq-tag.c:82 blk_mq_get_tag+0x438/0x46c

After fixing the above warnings we hit another tm_cmd timeout which may be
caused by unstable controller state:

__ufshcd_issue_tm_cmd: task management cmd 0x80 timed-out

Then, ufshcd_err_handler() enters full reset, and kernel gets stuck. It
turned out ufshcd_print_trs() printed too many messages on console which
requires CPU locks. Likewise hba->silence_err_logs, we need to avoid too
verbose messages. This is actually not an error case.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 69a6c26 ("scsi: ufs: Use blk_{get,put}_request() to allocate and free TMFs")
Reviewed-by: Can Guo <[email protected]>
Signed-off-by: Jaegeuk Kim <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Commit 0b2894c ("scsi: docs: ABI: sysfs-driver-ufs: Add DeepSleep
power mode") adds new entries in tables of sysfs-driver-ufs ABI
documentation, but formatted the table incorrectly.

Hence, make htmldocs warns:

  ./Documentation/ABI/testing/sysfs-driver-ufs:{915,956}:
  WARNING: Malformed table. Text in column margin in table line 15.

Rectify table formatting for DeepSleep power mode.

Link: https://lore.kernel.org/r/[email protected]
Acked-by: Adrian Hunter <[email protected]>
Signed-off-by: Lukas Bulwahn <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
If the port is in SRP_RPORT_FAIL_FAST state when srp_reconnect_rport() is
entered, a transition to SDEV_BLOCK would be illegal, and a kernel WARNING
would be triggered. Skip scsi_target_block() in this case.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Bart Van Assche <[email protected]>
Signed-off-by: Martin Wilck <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
…ompleted

A race condition exists between the response handler getting called because
of exchange_mgr_reset() (which clears out all the active XIDs) and the
response we get via an interrupt.

Sequence of events:

	 rport ba0200: Port timeout, state PLOGI
	 rport ba0200: Port entered PLOGI state from PLOGI state
	 xid 1052: Exchange timer armed : 20000 msecs      xid timer armed here
	 rport ba0200: Received LOGO request while in state PLOGI
	 rport ba0200: Delete port
	 rport ba0200: work event 3
	 rport ba0200: lld callback ev 3
	 bnx2fc: rport_event_hdlr: event = 3, port_id = 0xba0200
	 bnx2fc: ba0200 - rport not created Yet!!
	 /* Here we reset any outstanding exchanges before
	 freeing rport using the exch_mgr_reset() */
	 xid 1052: Exchange timer canceled
	 /* Here we got two responses for one xid */
	 xid 1052: invoking resp(), esb 20000000 state 3
	 xid 1052: invoking resp(), esb 20000000 state 3
	 xid 1052: fc_rport_plogi_resp() : ep->resp_active 2
	 xid 1052: fc_rport_plogi_resp() : ep->resp_active 2

Skip the response if the exchange is already completed.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Javed Hasan <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
When ioread32() returns 0xFFFFFFFF, we should execute cleanup functions
like other error handling paths before returning.

Link: https://lore.kernel.org/r/[email protected]
Acked-by: Karan Tilak Kumar <[email protected]>
Signed-off-by: Dinghao Liu <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Commit a351290 ("scsi: target: tcmu: Use priv pointer in se_cmd")
modified tcmu_free_cmd() to set NULL to priv pointer in se_cmd. However,
se_cmd can be already freed by work queue triggered in
target_complete_cmd(). This caused BUG KASAN use-after-free [1].

To fix the bug, do not touch priv pointer in tcmu_free_cmd(). Instead, set
NULL to priv pointer before target_complete_cmd() calls. Also, to avoid
unnecessary priv pointer change in tcmu_queue_cmd(), modify priv pointer in
the function only when tcmu_free_cmd() is not called.

[1]
BUG: KASAN: use-after-free in tcmu_handle_completions+0x1172/0x1770 [target_core_user]
Write of size 8 at addr ffff88814cf79a40 by task cmdproc-uio0/14842

CPU: 2 PID: 14842 Comm: cmdproc-uio0 Not tainted 5.11.0-rc2 #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 3.2 11/22/2019
Call Trace:
 dump_stack+0x9a/0xcc
 ? tcmu_handle_completions+0x1172/0x1770 [target_core_user]
 print_address_description.constprop.0+0x18/0x130
 ? tcmu_handle_completions+0x1172/0x1770 [target_core_user]
 ? tcmu_handle_completions+0x1172/0x1770 [target_core_user]
 kasan_report.cold+0x7f/0x10e
 ? tcmu_handle_completions+0x1172/0x1770 [target_core_user]
 tcmu_handle_completions+0x1172/0x1770 [target_core_user]
 ? queue_tmr_ring+0x5d0/0x5d0 [target_core_user]
 tcmu_irqcontrol+0x28/0x60 [target_core_user]
 uio_write+0x155/0x230
 ? uio_vma_fault+0x460/0x460
 ? security_file_permission+0x4f/0x440
 vfs_write+0x1ce/0x860
 ksys_write+0xe9/0x1b0
 ? __ia32_sys_read+0xb0/0xb0
 ? syscall_enter_from_user_mode+0x27/0x70
 ? trace_hardirqs_on+0x1c/0x110
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fcf8b61905f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 b9 fc ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 44 24 08 e8 0c fd ff ff 48
RSP: 002b:00007fcf7b3e6c30 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fcf8b61905f
RDX: 0000000000000004 RSI: 00007fcf7b3e6c78 RDI: 000000000000000c
RBP: 00007fcf7b3e6c80 R08: 0000000000000000 R09: 00007fcf7b3e6aa8
R10: 000000000b01c000 R11: 0000000000000293 R12: 00007ffe0c32a52e
R13: 00007ffe0c32a52f R14: 0000000000000000 R15: 00007fcf7b3e7640

Allocated by task 383:
 kasan_save_stack+0x1b/0x40
 ____kasan_kmalloc.constprop.0+0x84/0xa0
 kmem_cache_alloc+0x142/0x330
 tcm_loop_queuecommand+0x2a/0x4e0 [tcm_loop]
 scsi_queue_rq+0x12ec/0x2d20
 blk_mq_dispatch_rq_list+0x30a/0x1db0
 __blk_mq_do_dispatch_sched+0x326/0x830
 __blk_mq_sched_dispatch_requests+0x2c8/0x3f0
 blk_mq_sched_dispatch_requests+0xca/0x120
 __blk_mq_run_hw_queue+0x93/0xe0
 process_one_work+0x7b6/0x1290
 worker_thread+0x590/0xf80
 kthread+0x362/0x430
 ret_from_fork+0x22/0x30

Freed by task 11655:
 kasan_save_stack+0x1b/0x40
 kasan_set_track+0x1c/0x30
 kasan_set_free_info+0x20/0x30
 ____kasan_slab_free+0xec/0x120
 slab_free_freelist_hook+0x53/0x160
 kmem_cache_free+0xf4/0x5c0
 target_release_cmd_kref+0x3ea/0x9e0 [target_core_mod]
 transport_generic_free_cmd+0x28b/0x2f0 [target_core_mod]
 target_complete_ok_work+0x250/0xac0 [target_core_mod]
 process_one_work+0x7b6/0x1290
 worker_thread+0x590/0xf80
 kthread+0x362/0x430
 ret_from_fork+0x22/0x30

Last potentially related work creation:
 kasan_save_stack+0x1b/0x40
 kasan_record_aux_stack+0xa3/0xb0
 insert_work+0x48/0x2e0
 __queue_work+0x4e8/0xdf0
 queue_work_on+0x78/0x80
 tcmu_handle_completions+0xad0/0x1770 [target_core_user]
 tcmu_irqcontrol+0x28/0x60 [target_core_user]
 uio_write+0x155/0x230
 vfs_write+0x1ce/0x860
 ksys_write+0xe9/0x1b0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Second to last potentially related work creation:
 kasan_save_stack+0x1b/0x40
 kasan_record_aux_stack+0xa3/0xb0
 insert_work+0x48/0x2e0
 __queue_work+0x4e8/0xdf0
 queue_work_on+0x78/0x80
 tcm_loop_queuecommand+0x1c3/0x4e0 [tcm_loop]
 scsi_queue_rq+0x12ec/0x2d20
 blk_mq_dispatch_rq_list+0x30a/0x1db0
 __blk_mq_do_dispatch_sched+0x326/0x830
 __blk_mq_sched_dispatch_requests+0x2c8/0x3f0
 blk_mq_sched_dispatch_requests+0xca/0x120
 __blk_mq_run_hw_queue+0x93/0xe0
 process_one_work+0x7b6/0x1290
 worker_thread+0x590/0xf80
 kthread+0x362/0x430
 ret_from_fork+0x22/0x30

The buggy address belongs to the object at ffff88814cf79800 which belongs
to the cache tcm_loop_cmd_cache of size 896.

Link: https://lore.kernel.org/r/[email protected]
Fixes: a351290 ("scsi: target: tcmu: Use priv pointer in se_cmd")
Cc: [email protected] # v5.9+
Acked-by: Bodo Stroesser <[email protected]>
Signed-off-by: Shin'ichiro Kawasaki <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
While testing live partition mobility, we have observed occasional crashes
of the Linux partition. What we've seen is that during the live migration,
for specific configurations with large amounts of memory, slow network
links, and workloads that are changing memory a lot, the partition can end
up being suspended for 30 seconds or longer. This resulted in the following
scenario:

CPU 0                          CPU 1
-------------------------------  ----------------------------------
scsi_queue_rq                    migration_store
 -> blk_mq_start_request          -> rtas_ibm_suspend_me
  -> blk_add_timer                 -> on_each_cpu(rtas_percpu_suspend_me
              _______________________________________V
             |
             V
    -> IPI from CPU 1
     -> rtas_percpu_suspend_me
                                     -> __rtas_suspend_last_cpu

-- Linux partition suspended for > 30 seconds --
                                      -> for_each_online_cpu(cpu)
                                           plpar_hcall_norets(H_PROD
 -> scsi_dispatch_cmd
                                      -> scsi_times_out
                                       -> scsi_abort_command
                                        -> queue_delayed_work
  -> ibmvfc_queuecommand_lck
   -> ibmvfc_send_event
    -> ibmvfc_send_crq
     - returns H_CLOSED
   <- returns SCSI_MLQUEUE_HOST_BUSY
-> __blk_mq_requeue_request

                                      -> scmd_eh_abort_handler
                                       -> scsi_try_to_abort_cmd
                                         - returns SUCCESS
                                       -> scsi_queue_insert

Normally, the SCMD_STATE_COMPLETE bit would protect against the command
completion and the timeout, but that doesn't work here, since we don't
check that at all in the SCSI_MLQUEUE_HOST_BUSY path.

In this case we end up calling scsi_queue_insert on a request that has
already been queued, or possibly even freed, and we crash.

The patch below simply increases the default I/O timeout to avoid this race
condition. This is also the timeout value that nearly all IBM SAN storage
recommends setting as the default value.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Brian King <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Correct the spelling of Nagle's name in a comment.

Link: https://lore.kernel.org/r/2921.1610694423@turing-police
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Valdis Klētnieks <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Parameter ql2xenforce_iocb_limit is enabled by default.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 89c72f4 ("scsi: qla2xxx: Add IOCB resource tracking")
Reviewed-by: Daniel Wagner <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Enzo Matsumiya <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The UFS core has received a substantial rework this cycle. This in
turn has caused a merge conflict in linux-next. Merge 5.11/scsi-fixes
into 5.12/scsi-queue and resolve the conflict.

Signed-off-by: Martin K. Petersen <[email protected]>
This was supposed to be "data" instead of "&data".  The current code will
corrupt the stack.

Link: https://lore.kernel.org/r/YA6E0geUlL9Hs04A@mwanda
Fixes: dbf1f53 ("scsi: qla2xxx: Implementation to get and manage host, target stats and initiator port")
Acked-by: Saurav Kashyap <[email protected]>
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The "pmb" pointer is freed at the start of the function and then freed
again in the error handling code.

Link: https://lore.kernel.org/r/YA6E8rO51hE56SVw@mwanda
Fixes: 92d7f7b ("[SCSI] lpfc: NPIV: add NPIV support on top of SLI-3")
Signed-off-by: Dan Carpenter <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The patch to switch using SAM status values had some typos; fix them up.

Link: https://lore.kernel.org/r/[email protected]
Fixes: 491152c ("scsi: ncr53c8xx: Use SAM status values")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Hannes Reinecke <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
lpfc depends on irq_poll library, but it is not selected automatically.
When irq_poll is not selected, compiling it can run into following error

ERROR: modpost: "irq_poll_init" [drivers/scsi/lpfc/lpfc.ko] undefined!
ERROR: modpost: "irq_poll_sched" [drivers/scsi/lpfc/lpfc.ko] undefined!
ERROR: modpost: "irq_poll_complete" [drivers/scsi/lpfc/lpfc.ko] undefined!

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: James Smart <[email protected]>
Signed-off-by: Tong Zhang <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The platform_get_irq() check for -EPROBE_DEFER was to ensure that all the
steps to add the SCSI host are not done and then only to realise that the
probe needs to be deferred.

However, since there is now an earlier check for this in
hisi_sas_interrupt_preinit(), this check is superfluous and may be removed.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: John Garry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Now that v2 and v3 hw expose their HW queues (and so shost.nr_hw_queues is
set), remove the conditional checks in hisi_sas_task_prep().

This change would affect v1 HW performance (as it does not expose HW
queues), but nobody uses it and support may be dropped soon.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Xiang Chen <[email protected]>
Signed-off-by: John Garry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Add a config option to enable debugfs support by default. And if debugfs
support is enabled by default, dump count default value is increased to 50
as generally users want something bigger than the current default in that
situation.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luo Jiaxing <[email protected]>
Signed-off-by: John Garry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
If the controller reset occurs at the same time as driver removal, it may
be possible that the interrupts have been released prior to the host
softreset, and calling pci_irq_vector() there causes a WARN:

WARNING: CPU: 37 PID: 1542 /pci/msi.c:1275 pci_irq_vector+0xc0/0xd0
Call trace:
pci_irq_vector+0xc0/0xd0
disable_host_v3_hw+0x58/0x5b0 [hisi_sas_v3_hw]
soft_reset_v3_hw+0x40/0xc0 [hisi_sas_v3_hw]
hisi_sas_controller_reset+0x150/0x260 [hisi_sas_main]
hisi_sas_rst_work_handler+0x3c/0x58 [hisi_sas_main]

To fix, flush the driver workqueue prior to releasing the interrupts to
ensure any resets have been completed.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luo Jiaxing <[email protected]>
Signed-off-by: John Garry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The controller provides trace FIFO DFX tool to assist link fault debugging
and link optimization. This tool can be helpful when debugging link faults
without SAS analyzers. Each PHY has an independent trace FIFO interface.

The user can configure the trace FIFO tool of one PHY by using the
following six interfaces:

signal_sel: select signal group applies to different scenarios.
  0x0: linkrate negotiation
  0x1: Host 12G TX train
  0x2: Disk 12G TX train
  0x3: SAS PHY CTRL DFX 0
  0x4: SAS PHY CTRL DFX 1
  0x5: SAS PCS DFX
  other: linkrate negotiation
dump_mask: The masked hardware status bit will not be updated.
dump_mode: determines how to dump data after trigger signal is generated.
  0x0: dump forever
  0x1: dump 32 data after trigger signal is generated
  0x2: no more dump after trigger signal is generated
trigger_mode: determines the trigger mode, level or edge.
  0x0: dump when trigger signal changed
  0x1: dump when trigger signal's level equal to trigger_level
  0x2: dump when trigger signal's level different from trigger_level
trigger_level: determines the trigger level.
trigger_msk: mask trigger signal

The user can get 32-byte values from hardware by reading the rd_data.
These values consitute the status record of the hardware at different time
points.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Luo Jiaxing <[email protected]>
Signed-off-by: John Garry <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
TCM always fails SBC commands with residuals for 4Kn devices when the
command is processed by sbc_parse_cdb(). That prevents residual signalling
to the transport driver because residual kind and residual amount aren't
set. It also makes residual handling different from 512-byte formatted
devices - if there are residuals 512-byte LUN would proceed with command
execution while 4K-byte LUN would fail.

Link: https://lore.kernel.org/r/[email protected]
Based-on: https://patchwork.kernel.org/project/target-devel/patch/[email protected]/
Based-on-patch-by: Bart Van Assche <[email protected]>
Signed-off-by: Roman Bolshakov <[email protected]>
Signed-off-by: Konstantin Vinogradov <[email protected]>
Signed-off-by: Anastasia Kovaleva <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
According to RFC 7143 11.4.5.2.:

  If SPDTL > EDTL for a task, iSCSI Overflow MUST be signaled in the SCSI
  Response PDU as specified in Section 11.4.5.1.  The Residual Count MUST
  be set to the numerical value of (SPDTL - EDTL).

  If SPDTL < EDTL for a task, iSCSI Underflow MUST be signaled in the SCSI
  Response PDU as specified in Section 11.4.5.1.  The Residual Count MUST
  be set to the numerical value of (EDTL - SPDTL).

libiscsi has residual write tests that check residual kind and residual
amount and all of them (Write10Residuals, Write12Residuals,
Write16Residuals) currently fail.

One of the reasons why they fail is because target completes write commands
with INVALID FIELD IN CDB before setting the Overflow/Underflow bit and
residual amount.

Set the Overflow/Underflow bit and the residual amount before failing a
write to comply with RFC 7143.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anastasia Kovaleva <[email protected]>
Signed-off-by: Roman Bolshakov <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
According to FCP-4 (9.4.2):

  If the command requested that data beyond the length specified by the
  FCP_DL field be transferred, then the device server shall set the
  FCP_RESID_OVER bit (see 9.5.8) to one in the FCP_RSP IU and:

  a) process the command normally except that data beyond the FCP_DL count
  shall not be requested or transferred;

  b) transfer no data and return CHECK CONDITION status with the sense key
  set to ILLEGAL REQUEST and the additional sense code set to INVALID FIELD
  IN COMMAND INFORMATION UNIT; or

  c) may transfer data and return CHECK CONDITION status with the sense key
  set to ABORTED COMMAND and the additional sense code set to INVALID FIELD
  IN COMMAND INFORMATION UNIT.

TCM follows b) and transfers no data for residual writes but returns
INVALID FIELD IN CDB instead of INVALID FIELD IN COMMAND INFORMATION UNIT.

Change the ASCQ to INVALID FIELD IN COMMAND INFORMATION UNIT to meet the
standard.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Anastasia Kovaleva <[email protected]>
Signed-off-by: Roman Bolshakov <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Fix misspellings of "physical".

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bjorn Helgaas <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Fix the following coccicheck warnings:

./drivers/scsi/qla2xxx/qla_nvme.c:288:24-26: WARNING !A || A && B is
equivalent to !A || B.

Link: https://lore.kernel.org/r/[email protected]
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Jiapeng Zhong <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
allocted -> allocated

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: dingsenjie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
sreekanthbrcm and others added 30 commits February 8, 2021 21:53
Currently the driver allocates memory for ReplyPostFree queues in chunks of
16. In resource constrained environments--such as VM with 1 GB RAM and 2
CPUs--memory allocation for ReplyPostFree pools may fail because the driver
tries to allocate a memory for 16 ReplyPostFree queues even though the
actual number needed is 2.

Change the driver to allocate memory for only the actual number of queues
needed if the ReplyPostFree queue count is less than 16.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sreekanth Reddy <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
MPT Fusion adapters can steer completions to individual queues and we now
have support for shared host-wide tags in the I/O stack. The addition of
the host-wide tags allows us to enable multiqueue support for MPT Fusion
adapters. Once host-wise tags are enabled, the CPU hotplug feature is also
supported.

Allow use of host-wide tags to be disabled through the "host_tagset_enable"
module parameter. Once we do not have any major performance regressions
using host-wide tags, we will drop the hand-crafted interrupt affinity
settings.

Performance is meeting expectations. About 3.1M IOPS using 24 Drive SSD on
Aero controllers.

Link: https://lore.kernel.org/r/[email protected]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Sreekanth Reddy <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
When a host trace buffer is released, applications never know for what
reason the buffer is released. Add a new IOCTL MPT3ADDNLDIAGQUERY to
provide the trigger information due to which the diag buffer is released.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Suganath Prabu S <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Update driver version to 37.100.00.00.

Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Suganath Prabu S <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Eliminate the following coccicheck warning:

./drivers/target/sbp/sbp_target.c:1009:2-3: Unneeded semicolon

Link: https://lore.kernel.org/r/[email protected]
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Fix the following coccicheck warnings:

./drivers/scsi/qla2xxx/qla_target.c:984:12-14: WARNING !A || A && B is
equivalent to !A || B.

Link: https://lore.kernel.org/r/1612319190-111421-1-git-send-email-jiapeng.chong@linux.alibaba.com
Reported-by: Abaci Robot <[email protected]>
Reviewed-by: Himanshu Madhani <[email protected]>
Signed-off-by: Jiapeng Chong <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
Print event counter after dumping the event history.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Avri Altman <[email protected]>
Signed-off-by: DooHyun Hwang <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
This patch removes unneeded return variables. It fixes the following
warning detected by coccinelle:

./drivers/scsi/isci/request.c:1483:17-23: Unneeded variable: "status".
Return "SCI_SUCCESS" on line 1503
./drivers/scsi/isci/request.c:2157:17-23: Unneeded variable: "status".
Return "SCI_SUCCESS" on line 2177

Link: https://lore.kernel.org/r/[email protected]
Reported-by: Abaci Robot <[email protected]>
Signed-off-by: Yang Li <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
If iscsi_prep_scsi_cmd_pdu() fails we try to add it back to the cmdqueue,
but we leave it partially setup. We don't have functions that can undo the
pdu and init task setup. We only have cleanup_task which can clean up both
parts. So this has us just fail the cmd and go through the standard cleanup
routine and then have the SCSI midlayer retry it like is done when it fails
in the queuecommand path.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The purpose of the taskqueuelock was to handle the issue where a bad target
decides to send a R2T and before its data has been sent decides to send a
cmd response to complete the cmd. The following patches fix up the
frwd/back locks so they are taken from the queue/xmit (frwd) and completion
(back) paths again. To get there this patch removes the taskqueuelock which
for iSCSI xmit wq based drivers was taken in the queue, xmit and completion
paths.

Instead of the lock, we just make sure we have a ref to the task when we
queue a R2T, and then we always remove the task from the requeue list in
the xmit path or the forced cleanup paths.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The following bug was reported and debugged by [email protected]:

When testing kernel 4.18 version, NULL pointer dereference problem occurs
in iscsi_eh_cmd_timed_out() function.

I think this bug in the upstream is still exists.

The analysis reasons are as follows:

1) For some reason, I/O command did not complete within the timeout
   period. The block layer timer works, call scsi_times_out() to handle I/O
   timeout logic.  At the same time the command just completes.

2) scsi_times_out() call iscsi_eh_cmd_timed_out() to process timeout logic.
   Although there is an NULL judgment for the task, the task has not been
   released yet now.

3) iscsi_complete_task() calls __iscsi_put_task(). The task reference count
   reaches zero, the conditions for free task is met, then
   iscsi_free_task() frees the task, and sets sc->SCp.ptr = NULL. After
   iscsi_eh_cmd_timed_out() passes the task judgment check, there can still
   be NULL dereference scenarios.

   CPU0                                                CPU3

    |- scsi_times_out()                                 |-
iscsi_complete_task()
    |                                                   |
    |- iscsi_eh_cmd_timed_out()                         |-
__iscsi_put_task()
    |                                                   |
    |- task=sc->SCp.ptr, task is not NUL, check passed  |-
iscsi_free_task(task)
    |                                                   |
    |                                                   |-> sc->SCp.ptr
= NULL
    |                                                   |
    |- task is NULL now, NULL pointer dereference       |
    |                                                   |
   \|/                                                 \|/

Calltrace:
[380751.840862] BUG: unable to handle kernel NULL pointer dereference at
0000000000000138
[380751.843709] PGD 0 P4D 0
[380751.844770] Oops: 0000 [#1] SMP PTI
[380751.846283] CPU: 0 PID: 403 Comm: kworker/0:1H Kdump: loaded
Tainted: G
[380751.851467] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[380751.856521] Workqueue: kblockd blk_mq_timeout_work
[380751.858527] RIP: 0010:iscsi_eh_cmd_timed_out+0x15e/0x2e0 [libiscsi]
[380751.861129] Code: 83 ea 01 48 8d 74 d0 08 48 8b 10 48 8b 4a 50 48 85
c9 74 2c 48 39 d5 74
[380751.868811] RSP: 0018:ffffc1e280a5fd58 EFLAGS: 00010246
[380751.870978] RAX: ffff9fd1e84e15e0 RBX: ffff9fd1e84e6dd0 RCX:
0000000116acc580
[380751.873791] RDX: ffff9fd1f97a9400 RSI: ffff9fd1e84e1800 RDI:
ffff9fd1e4d6d420
[380751.876059] RBP: ffff9fd1e4d49000 R08: 0000000116acc580 R09:
0000000116acc580
[380751.878284] R10: 0000000000000000 R11: 0000000000000000 R12:
ffff9fd1e6e931e8
[380751.880500] R13: ffff9fd1e84e6ee0 R14: 0000000000000010 R15:
0000000000000003
[380751.882687] FS:  0000000000000000(0000) GS:ffff9fd1fac00000(0000)
knlGS:0000000000000000
[380751.885236] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[380751.887059] CR2: 0000000000000138 CR3: 000000011860a001 CR4:
00000000003606f0
[380751.889308] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[380751.891523] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[380751.893738] Call Trace:
[380751.894639]  scsi_times_out+0x60/0x1c0
[380751.895861]  blk_mq_check_expired+0x144/0x200
[380751.897302]  ? __switch_to_asm+0x35/0x70
[380751.898551]  blk_mq_queue_tag_busy_iter+0x195/0x2e0
[380751.900091]  ? __blk_mq_requeue_request+0x100/0x100
[380751.901611]  ? __switch_to_asm+0x41/0x70
[380751.902853]  ? __blk_mq_requeue_request+0x100/0x100
[380751.904398]  blk_mq_timeout_work+0x54/0x130
[380751.905740]  process_one_work+0x195/0x390
[380751.907228]  worker_thread+0x30/0x390
[380751.908713]  ? process_one_work+0x390/0x390
[380751.910350]  kthread+0x10d/0x130
[380751.911470]  ? kthread_flush_work_fn+0x10/0x10
[380751.913007]  ret_from_fork+0x35/0x40

crash> dis -l iscsi_eh_cmd_timed_out+0x15e
xxxxx/drivers/scsi/libiscsi.c: 2062

1970 enum blk_eh_timer_return iscsi_eh_cmd_timed_out(struct scsi_cmnd
*sc)
{
...
1984         spin_lock_bh(&session->frwd_lock);
1985         task = (struct iscsi_task *)sc->SCp.ptr;
1986         if (!task) {
1987                 /*
1988                  * Raced with completion. Blk layer has taken
ownership
1989                  * so let timeout code complete it now.
1990                  */
1991                 rc = BLK_EH_DONE;
1992                 goto done;
1993         }

...

2052         for (i = 0; i < conn->session->cmds_max; i++) {
2053                 running_task = conn->session->cmds[i];
2054                 if (!running_task->sc || running_task == task ||
2055                      running_task->state != ISCSI_TASK_RUNNING)
2056                         continue;
2057
2058                 /*
2059                  * Only check if cmds started before this one have
made
2060                  * progress, or this could never fail
2061                  */
2062                 if (time_after(running_task->sc->jiffies_at_alloc,
2063                                task->sc->jiffies_at_alloc))    <---
2064                         continue;
2065
...
}

carsh> struct scsi_cmnd ffff9fd1e6e931e8
struct scsi_cmnd {
  ...
  SCp = {
    ptr = 0x0,   <--- iscsi_task
    this_residual = 0,
    ...
  },
}

To prevent this, we take a ref to the cmd under the back (completion) lock
so if the completion side were to call iscsi_complete_task() on the task
while the timer/eh paths are not holding the back_lock it will not be freed
from under us.

Note that this requires the previous patch, "scsi: libiscsi: Drop
taskqueuelock" because bnx2i sleeps in its cleanup_task callout if the cmd
is aborted. If the EH/timer and completion path are racing we don't know
which path will do the last put. The previous patch moved the operations we
needed to do under the forward lock to cleanup_queued_task.  Once that has
run we can drop the forward lock for the cmd and bnx2i no longer has to
worry about if the EH, timer or completion path did the ast put and if the
forward lock is held or not since it won't be.

Link: https://lore.kernel.org/r/[email protected]
Reported-by: Wu Bo <[email protected]>
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
We allocate the iSCSI host workq in iscsi_host_alloc() so iscsi_host_free()
should do the destruction. Drivers can then do their error/goto handling
and call iscsi_host_free() to clean up what has been allocated in
iscsi_host_alloc().

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
This patch just breaks out the code that calculates the number of SCSI cmds
that will be used for a SCSI session. It also adds a check that we don't go
over the host's can_queue value.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
We are setting the shost's can_queue after we add the host which is too
late, because the SCSI midlayer will have allocated the tag set based on
the can_queue value at that time. This patch has us use the
iscsi_host_get_max_scsi_cmds() helper to figure out the number of SCSI
cmds.

It also fixes up the template can_queue so it reflects the max SCSI cmds we
can support like how other drivers work.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
If we lose the session then relogin, but the new cmdsn window has shrunk
(due to something like an admin changing a setting) we will have the old
exp/max_cmdsn values and will never be able to update them. For example,
max_cmdsn would be 64, but if on the target the user set the window to be
smaller then the target could try to return the max_cmdsn as 32. We will
see that new max_cmdsn in the rsp but because it's lower than the old
max_cmdsn when the window was larger we will not update it.

So this patch has us reset the window values during session cleanup so they
can be updated after a new login.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
__qla4xxx_is_chap_active() just wants to know if a session is online and
does not care about why it's not, so this has it use
iscsi_is_session_online().

This is not a bug now, but the next patch changes the behavior of
iscsi_session_chkready() so this patch just prepares the driver for that
change.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
The session lock in iscsi_session_chkready() is not needed because when we
transition from logged into to another state we will block and/or remove
the devices under the session, so no new I/O will be sent to the drivers
after the block/remove. I/O that races with the block/removal is cleaned up
by the drivers when it handles all outstanding I/O, so this just added an
extra lock in the main I/O path. This patch removes the lock like other
transport classes.

Link: https://lore.kernel.org/r/[email protected]
Reviewed-by: Lee Duncan <[email protected]>
Signed-off-by: Mike Christie <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>
This is a patch for the HPB initialization and adds HPB function calls to
UFS core driver.

NAND flash-based storage devices, including UFS, have mechanisms to
translate logical addresses of IO requests to the corresponding physical
addresses of the flash storage.
In UFS, Logical-address-to-Physical-address (L2P) map data, which is
required to identify the physical address for the requested IOs, can only
be partially stored in SRAM from NAND flash. Due to this partial loading,
accessing the flash address area where the L2P information for that address
is not loaded in the SRAM can result in serious performance degradation.

The basic concept of HPB is to cache L2P mapping entries in host system
memory so that both physical block address (PBA) and logical block address
(LBA) can be delivered in HPB read command.
The HPB READ command allows to read data faster than a read command in UFS
since it provides the physical address (HPB Entry) of the desired logical
block in addition to its logical address. The UFS device can access the
physical block in NAND directly without searching and uploading L2P mapping
table. This improves read performance because the NAND read operation for
uploading L2P mapping table is removed.

In HPB initialization, the host checks if the UFS device supports HPB
feature and retrieves related device capabilities. Then, some HPB
parameters are configured in the device.

We measured the total start-up time of popular applications and observed
the difference by enabling the HPB.
Popular applications are 12 game apps and 24 non-game apps. Each target
applications were launched in order. The cycle consists of running 36
applications in sequence. We repeated the cycle for observing performance
improvement by L2P mapping cache hit in HPB.

The Following is experiment environment:
 - kernel version: 4.4.0
 - RAM: 8GB
 - UFS 2.1 (64GB)

Result:
+-------+----------+----------+-------+
| cycle | baseline | with HPB | diff  |
+-------+----------+----------+-------+
| 1     | 272.4    | 264.9    | -7.5  |
| 2     | 250.4    | 248.2    | -2.2  |
| 3     | 226.2    | 215.6    | -10.6 |
| 4     | 230.6    | 214.8    | -15.8 |
| 5     | 232.0    | 218.1    | -13.9 |
| 6     | 231.9    | 212.6    | -19.3 |
+-------+----------+----------+-------+

We also measured HPB performance using iozone.
Here is my iozone script:
iozone -r 4k -+n -i2 -ecI -t 16 -l 16 -u 16
-s $IO_RANGE/16 -F mnt/tmp_1 mnt/tmp_2 mnt/tmp_3 mnt/tmp_4 mnt/tmp_5
mnt/tmp_6 mnt/tmp_7 mnt/tmp_8 mnt/tmp_9 mnt/tmp_10 mnt/tmp_11 mnt/tmp_12
mnt/tmp_13 mnt/tmp_14 mnt/tmp_15 mnt/tmp_16

Result:
+----------+--------+---------+
| IO range | HPB on | HPB off |
+----------+--------+---------+
|   1 GB   | 294.8  | 300.87  |
|   4 GB   | 293.51 | 179.35  |
|   8 GB   | 294.85 | 162.52  |
|  16 GB   | 293.45 | 156.26  |
|  32 GB   | 277.4  | 153.25  |
+----------+--------+---------+

Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Can Guo <[email protected]>
Acked-by: Avri Altman <[email protected]>
Tested-by: Bean Huo <[email protected]>
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Daejun Park <[email protected]>
This is a patch for managing L2P map in HPB module.

The HPB divides logical addresses into several regions. A region consists
of several sub-regions. The sub-region is a basic unit where L2P mapping is
managed. The driver loads L2P mapping data of each sub-region. The loaded
sub-region is called active-state. The HPB driver unloads L2P mapping data
as region unit. The unloaded region is called inactive-state.

Sub-region/region candidates to be loaded and unloaded are delivered from
the UFS device. The UFS device delivers the recommended active sub-region
and inactivate region to the driver using sensedata.
The HPB module performs L2P mapping management on the host through the
delivered information.

A pinned region is a pre-set regions on the UFS device that is always
activate-state.

The data structure for map data request and L2P map uses mempool API,
minimizing allocation overhead while avoiding static allocation.

The mininum size of the memory pool used in the HPB is implemented
as a module parameter, so that it can be configurable by the user.

To gurantee a minimum memory pool size of 4MB: ufshpb_host_map_kbytes=4096

The map_work manages active/inactive by 2 "to-do" lists.
Each hpb lun maintains 2 "to-do" lists:
  hpb->lh_inact_rgn - regions to be inactivated, and
  hpb->lh_act_srgn - subregions to be activated
Those lists are maintained on IO completion.

Reviewed-by: Bart Van Assche <[email protected]>
Reviewed-by: Can Guo <[email protected]>
Acked-by: Avri Altman <[email protected]>
Tested-by: Bean Huo <[email protected]>
Signed-off-by: Daejun Park <[email protected]>
This patch changes the read I/O to the HPB read I/O.

If the logical address of the read I/O belongs to active sub-region, the
HPB driver modifies the read I/O command to HPB read. It modifies the UPIU
command of UFS instead of modifying the existing SCSI command.

In the HPB version 1.0, the maximum read I/O size that can be converted to
HPB read is 4KB.

The dirty map of the active sub-region prevents an incorrect HPB read that
has stale physical page number which is updated by previous write I/O.

Reviewed-by: Can Guo <[email protected]>
Reviewed-by: Bart Van Assche <[email protected]>
Acked-by: Avri Altman <[email protected]>
Tested-by: Bean Huo <[email protected]>
Signed-off-by: Daejun Park <[email protected]>
This patch supports the HPB 2.0.

The HPB 2.0 supports read of varying sizes from 4KB to 512KB.
In the case of Read (<= 32KB) is supported as single HPB read.
In the case of Read (36KB ~ 512KB) is supported by as a combination of
write buffer command and HPB read command to deliver more PPN.
The write buffer commands may not be issued immediately due to busy tags.
To use HPB read more aggressively, the driver can requeue the write buffer
command. The requeue threshold is implemented as timeout and can be
modified with requeue_timeout_ms entry in sysfs.

Signed-off-by: Daejun Park <[email protected]>
We will use it later, when we'll need to differentiate between device
and host control modes.

Signed-off-by: Avri Altman <[email protected]>
In device control mode, the device may recommend the host to either
activate or inactivate a region, and the host should follow. Meaning
those are not actually recommendations, but more of instructions.

On the contrary, in host control mode, the recommendation protocol is
slightly changed:
a) The device may only recommend the host to update a subregion of an
   already-active region. And,
b) The device may *not* recommend to inactivate a region.

Furthermore, in host control mode, the host may choose not to follow any
of the device's recommendations. However, in case of a recommendation to
update an active and clean subregion, it is better to follow those
recommendation because otherwise the host has no other way to know that
some internal relocation took place.

Signed-off-by: Avri Altman <[email protected]>
In host control mode, reads are the major source of activation trials.
Keep track of those reads counters, for both active as well inactive
regions.

We reset the read counter upon write - we are only interested in "clean"
reads.  less intuitive however, is that we also reset it upon region's
deactivation.  Region deactivation is often due to the fact that
eviction took place: a region become active on the expense of another.
This is happening when the max-active-regions limit has crossed. If we
don’t reset the counter, we will trigger a lot of trashing of the HPB
database, since few reads (or even one) to the region that was
deactivated, will trigger a re-activation trial.

Keep those counters normalized, as we are using those reads as a
comparative score, to make various decisions.
If during consecutive normalizations an active region has exhaust its
reads - inactivate it.

Signed-off-by: Avri Altman <[email protected]>
In host mode, eviction is considered an extreme measure.
verify that the entering region has enough reads, and the exiting
region has much less reads.

Signed-off-by: Avri Altman <[email protected]>
I host mode, the host is expected to send HPB-WRITE-BUFFER with
buffer-id = 0x1 when it inactivates a region.

Use the map-requests pool as there is no point in assigning a
designated cache for umap-requests.

Signed-off-by: Avri Altman <[email protected]>
The spec does not define what is the host's recommended response when
the device send hpb dev reset response (oper 0x2).

We will update all active hpb regions: mark them and do that on the next
read.

Signed-off-by: Avri Altman <[email protected]>
In order not to hang on to “cold” regions, we shall inactivate a
region that has no READ access for a predefined amount of time -
READ_TO_MS. For that purpose we shall monitor the active regions list,
polling it on every POLLING_INTERVAL_MS. On timeout expiry we shall add
the region to the "to-be-inactivated" list, unless it is clean and did
not exhaust its READ_TO_EXPIRIES - another parameter.

All this does not apply to pinned regions.

Signed-off-by: Avri Altman <[email protected]>
Support devices that report they are using host control mode.

Signed-off-by: Avri Altman <[email protected]>
We can make use of this commit, to elaborate some more of the host
control mode logic, explaining what role play each and every variable.

While at it, allow those parameters to be configurable.

Signed-off-by: Avri Altman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.