Improve documentation of Stack Checking

2020-09-07 23:52:12 +02:00
parent a877ef5f28
commit 2f6590416d
8 changed files with 98 additions and 2 deletions


@@ -0,0 +1,112 @@
.. _backup_ram:
Safety Backup RAM
=================
Overview
--------
The STM controller's backup RAM is used to store different kinds of information that shall be preserved if the controller resets.
The hardware setup lacks a separate power supply for the controller's backup domain. Therefore, the backup RAM is cleared when the power is cut.
The backup RAM is used to store permanent error flags (see :ref:`safety_flags`). This ensures that the flags that trigger hard faults or the panic mode can still be identified, even though the watchdog resets the controller. The only way to clear them is by cutting the power.
Because cutting the power clears the backup RAM, no separate method for clearing the error entries in the backup RAM is defined.
The backup RAM contents are protected by a `CRC Checksum`_.
The backup RAM is initialized and checked after boot. If the controller starts from a powered-down state,
the backup RAM is empty. This is detected by an invalid `Header`_ at the beginning of the backup RAM. If this is the case, the safety controller
will create a valid backup RAM image with a `Header`_, empty `Boot Status Flag Entries`_, empty `Config Overrides`_, an empty `Error Memory`_, and a valid `CRC Checksum`_.
If the Header is valid during boot (verified by plausible values and correct magic numbers), the backup RAM is CRC checked and the error memory is
checked for valid entries.
In case of a CRC error or invalid entries in the error memory, the backup RAM is wiped and reinitialized. In addition, the error flag :ref:`safety_flags_safety_mem_corrupt` is set.
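The boot-time decision described above can be summarized in a small sketch. The enum and function names are hypothetical and only illustrate the order of the checks; they are not taken from the firmware.

.. code-block:: c

    #include <stdbool.h>

    /* Hypothetical result of the boot-time backup RAM check. */
    enum backup_ram_boot_action {
        BACKUP_RAM_KEEP,          /* image is valid, contents are kept */
        BACKUP_RAM_INIT_FRESH,    /* invalid header (cold start): create an empty image */
        BACKUP_RAM_WIPE_AND_FLAG  /* corrupt image: wipe, reinitialize, set the corrupt flag */
    };

    /* Maps the three check results to the action described in the text. */
    static enum backup_ram_boot_action
    backup_ram_boot_decision(bool header_valid, bool crc_ok, bool entries_valid)
    {
        if (!header_valid)
            return BACKUP_RAM_INIT_FRESH;
        if (!crc_ok || !entries_valid)
            return BACKUP_RAM_WIPE_AND_FLAG;
        return BACKUP_RAM_KEEP;
    }

Only the ``BACKUP_RAM_WIPE_AND_FLAG`` path corresponds to setting :ref:`safety_flags_safety_mem_corrupt`; a cold start with an empty backup RAM is not treated as an error.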
.. note:: It may be possible that future versions of the hardware include a backup RAM battery / Goldcap. In this case, a way to clear the error memory will be implemented,
because it will no longer be possible to clear the error memory by cutting the power.
On top of that, the backup memory will also contain the calibration data.
.. note:: The firmware will not use the ``NOP`` entries of the error memory by default, but they will be respected by the validity checker.
Partitioning and Entries
------------------------
The backup RAM consists of multiple sections. The memory sections are listed below.
Header
~~~~~~
The backup memory header is located at offset address:
.. doxygendefine:: SAFETY_MEMORY_HEADER_ADDRESS
The header is defined by the following structure:
.. doxygenstruct:: safety_memory_header
The header is considered valid if the magic and inverse magic fields contain the correct values, and if the offset address pointers
have values that are located inside the error memory and are neither ``0`` nor equal to each other.
The safety memory header magic is:
.. doxygendefine:: SAFETY_MEMORY_MAGIC
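A minimal sketch of such a header check is shown below. The structure layout, the field names and the magic value are placeholders chosen for illustration; the real definitions are ``safety_memory_header`` and ``SAFETY_MEMORY_MAGIC``. The upper limit for the offsets is passed in as a parameter, because the exact range used by the firmware is not stated here.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder value, NOT the real SAFETY_MEMORY_MAGIC. */
    #define EXAMPLE_MAGIC 0x5AFE5AFEUL

    /* Hypothetical header layout, for illustration only. */
    struct example_header {
        uint32_t magic;
        uint32_t boot_status_offset;
        uint32_t config_overrides_offset;
        uint32_t err_memory_offset;
        uint32_t magic_i; /* bitwise inverse of the magic */
    };

    /* An offset is plausible if it is not 0 and below the given limit. */
    static bool example_offset_plausible(uint32_t offset, uint32_t limit)
    {
        return offset != 0U && offset < limit;
    }

    static bool example_header_valid(const struct example_header *h, uint32_t limit)
    {
        if (h->magic != (uint32_t)EXAMPLE_MAGIC)
            return false;
        if (h->magic_i != (uint32_t)~EXAMPLE_MAGIC)
            return false;
        if (!example_offset_plausible(h->boot_status_offset, limit) ||
            !example_offset_plausible(h->config_overrides_offset, limit) ||
            !example_offset_plausible(h->err_memory_offset, limit))
            return false;
        /* The offsets must not point to the same location. */
        if (h->boot_status_offset == h->config_overrides_offset ||
            h->boot_status_offset == h->err_memory_offset ||
            h->config_overrides_offset == h->err_memory_offset)
            return false;
        return true;
    }

Storing the magic together with its inverse makes it unlikely that a uniform fill pattern, such as all zeroes or all ones, is mistaken for a valid header.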
.. _backup_ram_boot_flags:
Boot Status Flag Entries
~~~~~~~~~~~~~~~~~~~~~~~~
The boot status flag entries are used to store system states across resets.
The flags are stored in memory using the following structure:
.. doxygenstruct:: safety_memory_boot_status
A flag is evaluated as active if the corresponding word is not equal to ``0``.
Config Overrides
~~~~~~~~~~~~~~~~
Config overrides are used to override persistence and flag weights dynamically. The safety controller will parse the entries on
startup.
======================= ============ ================= ===================== =====================================
Entry                   Byte 1 (LSB) Byte 2            Byte 3                Byte 4 (MSB)
======================= ============ ================= ===================== =====================================
Weight override         ``0xA2``     ``Weight``        ``Flag Number``       reserved don't care (written as 0xAA)
Persistence override    ``0x8E``     ``Persistence``   ``Flag Number``       reserved don't care (written as 0xBB)
======================= ============ ================= ===================== =====================================
All words not matching the table above are ignored and do not cause an error. By default, the firmware fills this memory area with zeroes.
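To illustrate the layout in the table above, the following sketch decodes a single config override word. It assumes that byte 1 is the least significant byte of the 32-bit word, as the ``LSB`` column header suggests. All type and function names are hypothetical.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical decoded representation of one config override word. */
    enum example_override_type {
        EXAMPLE_OVERRIDE_NONE,        /* word does not match the table, ignored */
        EXAMPLE_OVERRIDE_WEIGHT,      /* byte 1 == 0xA2 */
        EXAMPLE_OVERRIDE_PERSISTENCE  /* byte 1 == 0x8E */
    };

    struct example_override {
        enum example_override_type type;
        uint8_t value;       /* byte 2: weight or persistence */
        uint8_t flag_number; /* byte 3 */
    };

    static struct example_override example_decode_override(uint32_t word)
    {
        struct example_override out = { EXAMPLE_OVERRIDE_NONE, 0U, 0U };
        uint8_t id = (uint8_t)(word & 0xFFU);

        if (id == 0xA2U)
            out.type = EXAMPLE_OVERRIDE_WEIGHT;
        else if (id == 0x8EU)
            out.type = EXAMPLE_OVERRIDE_PERSISTENCE;
        else
            return out; /* unknown words are ignored, not treated as errors */

        out.value = (uint8_t)((word >> 8) & 0xFFU);
        out.flag_number = (uint8_t)((word >> 16) & 0xFFU);
        /* Byte 4 is reserved and intentionally not evaluated. */
        return out;
    }

For example, the word ``0xAA0302A2`` would decode as a weight override with the value ``2`` for flag number ``3``, while the default fill value ``0`` decodes as ``EXAMPLE_OVERRIDE_NONE``.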
Error Memory
~~~~~~~~~~~~
The error memory contains error entries in the form of 32-bit words. The entries are coded as stated below.
``Error Flag`` entries are used to restore error flags after boot. In theory, all flags can be set using this entry type.
However, only persistent flags are stored in the error memory by the firmware.
``NOP`` entries have no meaning. They are used as fillers. When adding a new error memory entry, the error memory is scanned until the first ``NOP`` entry is found.
It is replaced with a valid entry. If the error memory contains a word that is not defined below, it is considered invalid and will trigger the RAM checker on boot.
``NOP`` entries can be used to preallocate the error memory in advance. If the end of the error memory is reached, it is expanded by 1 word to fit
the new error entry, until the backup RAM is full. After this, no further errors are stored.
If the same persistent error is triggered multiple times, the ``COUNTER`` in the error entry is incremented.
======================= ============ ================= ===================== =====================================
Entry                   Byte 1 (LSB) Byte 2            Byte 3                Byte 4 (MSB)
======================= ============ ================= ===================== =====================================
Error Flag              ``0x51``     ``Flag Number``   ``COUNTER 7:0``       ``COUNTER 15:8``
NOP Entry               ``0x22``     ``0x12``          ``0xAA``              ``0xC1``
======================= ============ ================= ===================== =====================================
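The entry coding and the insertion strategy can be sketched as follows. The word values follow from the table above, with byte 1 as the least significant byte. The function names, the initial counter value of ``1``, the saturating counter, and the choice to look for an existing entry before using a ``NOP`` slot are assumptions made for this example only.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define EXAMPLE_ERR_ENTRY_ID 0x51U
    /* Bytes 0x22, 0x12, 0xAA, 0xC1 with byte 1 as LSB. */
    #define EXAMPLE_NOP_WORD 0xC1AA1222UL

    static uint32_t example_error_entry(uint8_t flag_number, uint16_t counter)
    {
        return (uint32_t)EXAMPLE_ERR_ENTRY_ID |
               ((uint32_t)flag_number << 8) |
               ((uint32_t)counter << 16);
    }

    static bool example_is_error_entry(uint32_t word)
    {
        return (word & 0xFFU) == EXAMPLE_ERR_ENTRY_ID;
    }

    /* A word is valid if it is either an error flag entry or the NOP word. */
    static bool example_is_valid_entry(uint32_t word)
    {
        return example_is_error_entry(word) || word == (uint32_t)EXAMPLE_NOP_WORD;
    }

    /*
     * Stores an error for flag_number. Returns false if the memory is full.
     * *used is the current number of words, capacity the maximum.
     */
    static bool example_store_error(uint32_t *mem, size_t *used, size_t capacity,
                                    uint8_t flag_number)
    {
        size_t i;

        /* Same error already stored: increment its counter (saturating). */
        for (i = 0; i < *used; i++) {
            if (example_is_error_entry(mem[i]) &&
                ((mem[i] >> 8) & 0xFFU) == flag_number) {
                uint16_t counter = (uint16_t)(mem[i] >> 16);
                if (counter < 0xFFFFU)
                    counter++;
                mem[i] = example_error_entry(flag_number, counter);
                return true;
            }
        }
        /* Otherwise replace the first NOP entry. */
        for (i = 0; i < *used; i++) {
            if (mem[i] == (uint32_t)EXAMPLE_NOP_WORD) {
                mem[i] = example_error_entry(flag_number, 1U);
                return true;
            }
        }
        /* No NOP left: expand the error memory by one word if possible. */
        if (*used >= capacity)
            return false;
        mem[(*used)++] = example_error_entry(flag_number, 1U);
        return true;
    }

In the real firmware, every modification of the error memory also requires the `CRC Checksum`_ to be updated, which this sketch leaves out.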
CRC Checksum
~~~~~~~~~~~~
The CRC checksum is located after the error memory. The checksum is calculated by the internal peripheral module of the STM32F4 controller.
Therefore, the CRC calculation is fixed.
The polynomial is ``0x4C11DB7`` (Ethernet CRC32):
.. math:: P_{CRC}(x) = x^{32}+x^{26}+x^{23}+x^{22}+x^{16}+x^{12}+x^{11}+x^{10}+x^{8}+x^{7}+x^{5}+x^{4}+x^{2}+x+1
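For host-side tools that need to verify or generate a backup RAM image, the hardware calculation can be reproduced in software. The sketch below assumes the documented behavior of the STM32F4 CRC unit: an initial value of ``0xFFFFFFFF``, processing of whole 32-bit words with the most significant bit first, no bit reflection and no final XOR. The function name is hypothetical.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Bitwise software model of the CRC unit, operating on 32-bit words. */
    static uint32_t example_crc32_words(const uint32_t *words, size_t count)
    {
        uint32_t crc = 0xFFFFFFFFUL; /* reset value of the CRC data register */
        size_t i;
        int bit;

        for (i = 0; i < count; i++) {
            crc ^= words[i];
            for (bit = 0; bit < 32; bit++) {
                if (crc & 0x80000000UL)
                    crc = (crc << 1) ^ 0x04C11DB7UL;
                else
                    crc <<= 1;
            }
        }
        return crc;
    }

Note that this differs from the common byte-wise Ethernet/zlib CRC32, which uses the same polynomial but reflects the data and inverts the result. Checksums from standard ``crc32`` tools will therefore not match.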


@@ -0,0 +1,20 @@
.. _safety_handling:
Error Handling
==============
.. _safety_panic:
Panic Mode
----------
.. _safety_error_mem:
Error memory
------------
Permanent errors are stored in the backup RAM of the STM. This ensures that errors can be read even after a full system reset has occurred.
.. seealso:: :ref:`backup_ram`


@@ -0,0 +1,114 @@
.. _safety_flags:
Safety Flags
============
The safety flags are represented in software by the following enum:
.. doxygenenum:: safety_flag
The safety flags can be temporary or permanent. Some temporary flags are reset automatically once the error condition disappears. Others have to be cleared explicitly.
The safety weights (whether a flag stops the PID controller or triggers the panic mode) are configured by default as described below. However, it will be possible to override these weights by
setting config entries in the safety memory.
.. todo:: Change documentation of config entries in memory
----------------------------------------------------------------------------------------------------------------------------------
.. _safety_flags_adc_overflow:
ERR_FLAG_MEAS_ADC_OVERFLOW
--------------------------
``ERR_FLAG_MEAS_ADC_OVERFLOW`` is triggered in case of an overflow in the signal path of the measurement ADC. This should never happen unless there is a bug in the software.
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
yes        no            yes           no
========== ============= ============= ===========
----------------------------------------------------------------------------------------------------------------------------------
.. _safety_flags_adc_off:
ERR_FLAG_MEAS_ADC_OFF
---------------------
``ERR_FLAG_MEAS_ADC_OFF`` signals that the measurement ADC for the PT1000 sensor is deactivated. This flag is automatically cleared by the firmware
once the ADC is started.
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
no         yes           yes           no
========== ============= ============= ===========
----------------------------------------------------------------------------------------------------------------------------------
.. _safety_flags_adc_watchdog:
ERR_FLAG_MEAS_ADC_WATCHDOG
--------------------------
``ERR_FLAG_MEAS_ADC_WATCHDOG`` is used as a wire break detection mechanism. This flag is set when the PT1000 measurement ADC detects an invalid resistance measurement.
.. seealso:: :ref:`ADC Watchdog<firmware_meas_adc_watchdog>`
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
no         no            yes           no
========== ============= ============= ===========
----------------------------------------------------------------------------------------------------------------------------------
.. _safety_flags_adc_unstable:
ERR_FLAG_MEAS_ADC_UNSTABLE
--------------------------
``ERR_FLAG_MEAS_ADC_UNSTABLE`` is set after startup of the PT1000 measurement or after reconfiguring the filter settings.
.. seealso:: :ref:`firmware_meas_adc_filter`
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
no         yes           no            no
========== ============= ============= ===========
.. _safety_flags_safety_mem_corrupt:
ERR_FLAG_SAFETY_MEM_CORRUPT
---------------------------
``ERR_FLAG_SAFETY_MEM_CORRUPT`` is set during the initialization of the controller, in case a corrupted safety memory is encountered.
In this case, the error memory is reinitialized and the flag is set in the error memory. After a reboot, it will stay asserted until the
safety backup memory is cleared.
.. seealso:: :ref:`backup_ram`
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
yes        no            yes           no
========== ============= ============= ===========
.. _safety_flags_stack:
ERR_FLAG_STACK
--------------
``ERR_FLAG_STACK`` is set if the stack checking detects an error, i.e. if the remaining stack space is too low or the protection area between heap and stack has been corrupted.
This error is not recoverable and will trigger the panic mode.
.. seealso:: :ref:`safety_stack_checking`
========== ============= ============= ===========
persistent self-clearing Stops PID     Panic Mode
========== ============= ============= ===========
yes        no            yes           yes
========== ============= ============= ===========


@@ -0,0 +1,24 @@
.. _firmware_safety:
Safety Controller
=================
The safety controller is the software component that monitors the overall condition of the reflow controller,
and stops the output driver in case of an error.
Severe error flags, like a drifting reference voltage, stop the PID controller and force the output to zero.
The controller stays in a usable state. After the errors have been cleared, normal operation may continue.
On the other hand, fatal errors like an over-temperature error or a memory problem lead to the activation of the :ref:`safety_panic`,
which forces the output to zero, but does not allow any further interaction.
On top of this, a :ref:`backup_ram` is implemented. It stores permanent errors, which are restored after a restart. In addition, it stores the :ref:`backup_ram_boot_flags`,
which are used to retain boot information across resets, for example to communicate with the firmware updater. The RAM also contains entries that allow overrides of flag weights and persistence.
.. toctree::
:maxdepth: 3
flags
backup-ram
error-handling
stack-checking


@@ -0,0 +1,39 @@
.. _safety_stack_checking:
Safety Stack Checking
=====================
To ensure correct operation of the controller, the stack is continuously monitored. For this, the :ref:`firmware_safety` checks the stack in each run.
These checks include:
1. Checking the used stack space, i.e. the remaining distance to the end of the stack
2. Checking a protection area between heap and stack for memory corruption
Any detected error will set the :ref:`safety_flags_stack` error flag.
Stack Pointer Checking
----------------------
The stack pointer is checked using :c:func:`stack_check_get_free`. The returned value for the remaining stack space is checked against
.. doxygendefine:: SAFETY_MIN_STACK_FREE
.. doxygenfunction:: stack_check_get_free
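The idea behind this check can be sketched as follows, assuming a stack that grows downward as on Cortex-M controllers. The helper names and the threshold value are placeholders; the actual implementation is :c:func:`stack_check_get_free` and the actual limit is ``SAFETY_MIN_STACK_FREE``.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder value, NOT the real SAFETY_MIN_STACK_FREE. */
    #define EXAMPLE_MIN_STACK_FREE 0x100

    /*
     * Free stack space in bytes for a downward-growing stack.
     * A negative result means the stack pointer is already below the limit.
     */
    static intptr_t example_stack_free(uintptr_t stack_pointer, uintptr_t stack_lower_limit)
    {
        return (intptr_t)stack_pointer - (intptr_t)stack_lower_limit;
    }

    static bool example_stack_ok(uintptr_t stack_pointer, uintptr_t stack_lower_limit)
    {
        return example_stack_free(stack_pointer, stack_lower_limit) >=
               (intptr_t)EXAMPLE_MIN_STACK_FREE;
    }

Using a signed result allows an overflow that has already happened to be told apart from a stack that is merely close to its limit.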
Stack and Heap Corruption Checking
----------------------------------
A section of memory is located between the stack and the heap. It is defined inside the linker script. Its size is configured by the linker script parameter ``__stack_corruption_area_size``, which is set to ``128`` by default.
This section is filled at the initialization of the safety controller by a call to
.. doxygenfunction:: stack_check_init_corruption_detect_area
On each run of the safety controller's handling function (:c:func:`safety_controller_handle`) the following function is called:
.. doxygenfunction:: stack_check_corruption_detect_area
This function checks the memory area for write modifications on every call, and therefore detects if the stack or the heap has grown outside its boundaries.
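A guard area of this kind can be sketched as shown below. The sketch assumes a fixed fill pattern; the firmware may use a different scheme for filling and verifying the area. All names and the pattern value are hypothetical.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical fill pattern for the protection area. */
    #define EXAMPLE_GUARD_PATTERN 0xA5A5A5A5UL

    /* Fills the protection area once during initialization. */
    static void example_guard_init(volatile uint32_t *area, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++)
            area[i] = (uint32_t)EXAMPLE_GUARD_PATTERN;
    }

    /* Returns true if no word of the protection area has been overwritten. */
    static bool example_guard_intact(const volatile uint32_t *area, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++) {
            if (area[i] != (uint32_t)EXAMPLE_GUARD_PATTERN)
                return false;
        }
        return true;
    }

A check like this only detects writes that actually touch the area. A stack frame that jumps over the whole area without writing to it would go unnoticed, which is one reason to combine it with the stack pointer check.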