Test device config changes are propagated correctly to the watchdog.
* Keep polling until the config value becomes the expected one.
* Remove listeners at the end of each test - too many device config
listeners will delay the notification significantly (> 1 min)
according to my local experiments.
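
A minimal sketch of the "keep polling" idea, in plain Java. The class
ConfigPoller and its waitForValue() signature are hypothetical and are not
the helper used in PackageWatchdogTest; the supplier stands in for whatever
reads the config value.

```java
import java.util.Objects;
import java.util.function.Supplier;

/** Hypothetical test helper; the name and signature are illustrative only. */
final class ConfigPoller {
    private ConfigPoller() {}

    /**
     * Polls {@code actual} until it equals {@code expected} or the timeout
     * elapses. Returns true only if the expected value was observed.
     */
    static <T> boolean waitForValue(Supplier<T> actual, T expected,
            long timeoutMs, long pollIntervalMs) {
        long deadlineNs = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (Objects.equals(actual.get(), expected)) {
                return true;
            }
            if (System.nanoTime() - deadlineNs >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt and give up polling.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```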
Bug: 181820350
Test: atest PackageWatchdogTest
Change-Id: Ia5e89b69a8052f49e6400f6b22313249d523786c
Sometimes the property change callback is not called within the
sleep timeout. Let's call updateConfigs() to apply device config
changes immediately to eliminate the race condition.
Bug: 178675924
Test: atest PackageWatchdogTest
Change-Id: I2b3ce79eac36cfc5ef98a62750142bb6d936e043
Bug: 174932174
Test: I solemnly swear I tested this conflict resolution.
Exempt-From-Owner-Approval: refactoring with team leads buy-in
Change-Id: I9262a08ffc1ccede8e519d0eed90ed2bfcf0232c
As general background, OWNERS files expedite code reviews by helping
code authors quickly find relevant reviewers, and they also ensure
that stakeholders are involved in code changes in their areas.
Some teams under frameworks/base/ have been using OWNERS files
successfully for many years, and we're ready to expand them to cover
more areas. Here are the historical coverage statistics for the last
two years of changes before these new OWNERS changes land:
-- 56% of changes are fully covered by OWNERS
-- 17% of changes are partially covered by OWNERS
-- 25% of changes have no OWNERS coverage
Working closely with team leads, we've now identified clear OWNERS on
a per-package basis, and we're using "include" directives whenever
possible to simplify future maintenance. With this extensive
effort, we've now improved our coverage as follows:
-- 98% of changes are fully covered by OWNERS
-- 1% of changes are partially covered by OWNERS
-- 1% of changes have no OWNERS coverage
This specific change is automatically generated by a script that
identifies relevant "include" directives.
Bug: 174932174
Test: manual
Exempt-From-Owner-Approval: refactoring with team leads buy-in
Merged-In: I3480ddf2fe7ba3dfb922b459d4da01fa17a2c813
Change-Id: I3480ddf2fe7ba3dfb922b459d4da01fa17a2c813
As part of the effort to better support failure mitigation across
reboots, track the mitigation calls across reboots by storing them
along with other salient parts of MonitoredPackage. These values
are relative to the current uptime of the system, so that they will
be accurate when the system uptime is reset at the next boot.
Also refactored code to allow for testing the reading and writing
of MonitoredPackage objects, and added tests for this.
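
One possible reading of "relative to the current uptime", sketched in plain
Java: timestamps are written as ages ("ms ago") and rebuilt against the
uptime clock when read back. MitigationCallTimes and both method names are
hypothetical; the on-disk format actually used is not described here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not the persistence code in PackageWatchdog. */
final class MitigationCallTimes {
    private MitigationCallTimes() {}

    /** Converts absolute uptime timestamps into ages for storage. */
    static List<Long> toAges(List<Long> uptimeStampsMs, long nowUptimeMs) {
        List<Long> ages = new ArrayList<>();
        for (long stamp : uptimeStampsMs) {
            ages.add(nowUptimeMs - stamp);
        }
        return ages;
    }

    /**
     * Rebuilds timestamps from stored ages. After a reboot the uptime clock
     * restarts, so the results may be negative; the differences between
     * "now" and each call are preserved, which is what a window check needs.
     */
    static List<Long> fromAges(List<Long> agesMs, long nowUptimeMs) {
        List<Long> stamps = new ArrayList<>();
        for (long age : agesMs) {
            stamps.add(nowUptimeMs - age);
        }
        return stamps;
    }
}
```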
Test: atest PackageWatchdogTest
Bug: 171951174
Change-Id: Ia96cf3892886d8d77193ffc278fa1eb584fecdd3
Similar to work that has already been done for package failures,
pass a mitigation count to an observer interested in boot loops.
Since this logic is handled using sysprops, the mitigation count
will be reset after the default escalation window, rather than
decreasing with a sliding window.
Migrated elapsedRealTime to uptimeMillis in all instances in
PackageWatchdog. This should not have any impact, since the only
difference is that elapsedRealTime counted time in deep sleep,
which will not be an important factor for boot loop detection.
Test: atest PackageWatchdogTest
Test: atest RescuePartyTest
Bug: 172206136
Change-Id: I7e59f3fa32544bd410d8508e6529c77049a70df0
Track the number of times a call has been made to an
observer to mitigate each MonitoredPackage object. This
"mitigation count" will be used as a proxy for determining
what rescue level to perform in RescueParty. A sliding
window is used so that this mitigation count may
de-escalate. The default value of this sliding window
is one hour.
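
A rough sketch of a sliding-window mitigation count, assuming call times
are kept in a queue and expired entries are dropped on read. The class
MitigationCounter and its methods are hypothetical names, not the
MonitoredPackage API.

```java
import java.util.ArrayDeque;

/** Hypothetical sketch of a mitigation count that can de-escalate. */
final class MitigationCounter {
    /** One hour, the default window stated in this commit message. */
    static final long DEFAULT_WINDOW_MS = 60L * 60L * 1000L;

    private final ArrayDeque<Long> mCallTimesMs = new ArrayDeque<>();
    private final long mWindowMs;

    MitigationCounter(long windowMs) {
        mWindowMs = windowMs;
    }

    /** Records one call to an observer to mitigate the package. */
    void noteMitigationCall(long nowMs) {
        mCallTimesMs.addLast(nowMs);
    }

    /** Returns the number of calls still inside the window ending at nowMs. */
    int getCount(long nowMs) {
        while (!mCallTimesMs.isEmpty()
                && nowMs - mCallTimesMs.peekFirst() > mWindowMs) {
            mCallTimesMs.removeFirst();
        }
        return mCallTimesMs.size();
    }
}
```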
A follow-up CL will integrate RescueParty's rescue level
mechanism with this logic.
Test: atest PackageWatchdogTest
Bug: 172206136
Change-Id: Idb97901ad1c8acbee15417ea35d29e67e9d4562e
Ensure that calls to sync requests with the explicit health
check controller are always sent if the list of packages
pending health checks is empty, so that the controller can
unbind. This will allow extservices to be killed by lmkd
on low memory devices.
Test: atest PackageWatchdogTest
Test: atest NetworkStagedRollbackTest
Test: check logcat to see that the service is unbound
Bug: 156323728
Change-Id: If615a337760b2057b962284bde8565b593d82a50
This reverts commit 553c94bcabce0fb449a97da502969b20d1cffd16.
Reason for revert: Breaks NetworkStagedRollbackTest. Will re-submit change once I have a fix for that issue
Bug: 157662759
Change-Id: If81c01c597f37ff01924c8b038cbd38f77e7fa06
Explicitly call into the health check controller if there are
no more packages to check. This is due to the fact that
the ExplicitHealthCheckController will unbind itself in this
case. If this call is not made, the controller will continue
running in the foreground and will not be killed by lmkd.
Test: atest PackageWatchdogTest
Test: check logcat to see that the service is unbound
Bug: 156323728
Change-Id: I0044d0832178ee90043d5e64e406df07ee2c36a2
Instead of always creating a new MonitoredPackage every time
PackageWatchdog#startObservingHealth is called, just update
the duration of an existing MonitoredPackage if one exists. This
means that the failure history will be preserved.
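
A simplified sketch of the update-in-place behavior, using a plain map
keyed by package name. ObservedPackages and Entry are hypothetical
stand-ins for the real bookkeeping around MonitoredPackage.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the watchdog's per-package bookkeeping. */
final class ObservedPackages {
    /** Minimal stand-in for MonitoredPackage. */
    static final class Entry {
        long durationMs;
        int failureCount;
    }

    private final Map<String, Entry> mPackages = new HashMap<>();

    /**
     * Reuses the existing entry when the package is already observed, so
     * only the duration changes and the failure history is kept.
     */
    Entry startObserving(String packageName, long durationMs) {
        Entry entry = mPackages.get(packageName);
        if (entry == null) {
            entry = new Entry();
            mPackages.put(packageName, entry);
        }
        entry.durationMs = durationMs;
        return entry;
    }
}
```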
Test: atest PackageWatchdogTest
Bug: 150114865
Change-Id: I6d6e3e0e893a603fda50df833bc5b6ce1757b6ec
Instead of periodically syncing requests with the same information,
only call into the ExplicitHealthCheckController when the set
of packages with pending health checks has changed, or a new observer
has been registered. Add tests to verify that duplicate calls are not made.
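
A sketch of the de-duplication logic, assuming the controller can be
modeled as a callback that takes the set of pending packages.
HealthCheckSyncer, syncIfChanged() and onObserverRegistered() are
hypothetical names, not the ExplicitHealthCheckController API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical sketch: only forward a sync when something changed. */
final class HealthCheckSyncer {
    private final Consumer<Set<String>> mController;
    // null means "nothing synced yet" or "force the next sync".
    private Set<String> mLastSynced = null;

    HealthCheckSyncer(Consumer<Set<String>> controller) {
        mController = controller;
    }

    /** A new observer invalidates the cached set so the next sync is sent. */
    void onObserverRegistered() {
        mLastSynced = null;
    }

    /** Calls the controller only if the pending set differs from the last one. */
    void syncIfChanged(Set<String> pendingPackages) {
        if (pendingPackages.equals(mLastSynced)) {
            return;
        }
        mLastSynced = new HashSet<>(pendingPackages);
        mController.accept(mLastSynced);
    }
}
```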
Test: atest PackageWatchdogTest#testSyncHealthCheckRequests
Test: atest NetworkStagedRollbackTest
Bug: 150114865
Bug: 146767850
Change-Id: I2926e9c7689e0ac9c4a142263ffd50a4747d016f
It is possible for null to be returned by
ProcessRecord.getPackageListWithVersionCode on package failure. This
can cause a NPE in Package Watchdog. Ensure that the list of failing
packages is not null.
Test: atest PackageWatchdogTest
Bug: 151113966
Change-Id: Iab23cd6b4b8ae6b787df5f0b831b51e0ac8b3d31
Test the notifyHealthCheckPassed method to ensure that the expected
information is sent when an explicit health check passes.
Bug: 150638807
Test: atest ExplicitHealthCheckServiceTest
Change-Id: I98c1c3bf018a82ea769846b4212c295518814a18
Make Package Watchdog the component that receives calls
about boot events, and decides on whether or not to
perform mitigation action for a perceived boot loop.
The logic for selecting an observer to handle boot loops
is similar to how package failure is handled. The threshold
logic is the same as it was in Rescue Party (5 system server
boots in 10 minutes). Rescue Party maintains its own rescue
levels internally, which map to user impact levels.
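
For illustration, a threshold of 5 boots in 10 minutes could be tracked
with a fixed window as sketched below. This is an assumption about the
shape of the logic, not the actual implementation; BootThresholdSketch and
noteBoot() are hypothetical names.

```java
/** Hypothetical fixed-window boot counter; not the real threshold class. */
final class BootThresholdSketch {
    static final int BOOT_TRIGGER_COUNT = 5;
    static final long BOOT_TRIGGER_WINDOW_MS = 10L * 60L * 1000L;

    private int mCount;
    private long mWindowStartMs;

    /** Records a boot and returns true when the threshold is reached. */
    boolean noteBoot(long nowMs) {
        if (mCount == 0 || nowMs - mWindowStartMs > BOOT_TRIGGER_WINDOW_MS) {
            // Start a new window with this boot as its first event.
            mWindowStartMs = nowMs;
            mCount = 1;
            return false;
        }
        mCount++;
        if (mCount >= BOOT_TRIGGER_COUNT) {
            mCount = 0; // reset so the next boot starts a fresh window
            return true;
        }
        return false;
    }
}
```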
Add optional onBootLoop() and executeBootLoopMitigation() methods
to PackageHealthObserver.
Add tests to handle the new cases handled by Package Watchdog.
Test: atest RescuePartyTest
Test: atest PackageWatchdogTest
Bug: 136135457
Change-Id: Ic435e60318e369509975c19a9888741e047803de
Integrate Rescue Party as an observer for Package
Watchdog, for managing package failures. Rescue Party
will be a persistent observer, meaning it may receive
failure calls for packages it has not explicitly asked
to observe.
Remove app failure calls and thresholding logic from
Rescue Party. Remove obsolete Rescue Party tests
and add persistent observer tests to
PackageWatchdogTest.
Test: atest PackageWatchdogTest
Test: atest RescuePartyTest
Test: atest StagedRollbackTest
Bug: 136135457
Change-Id: I55ec0de48acd5434255811feba758d38c9304478
For the sake of consolidating various error detection mechanisms,
move native crash detection to Package Watchdog. Add a method
to allow the traditional threshold logic to be bypassed in this
case. This method will be used in the future for prioritizing
explicit health check failures.
Test: atest StagedRollbackTest#testNativeWatchdogTriggersRollback
Bug: 145584672
Change-Id: I98eb9f45a6f4a6d15001650e31ba9c596905663a
This is a prerequisite for adding additional logging of
the Watchdog-triggered rollback reason. Add flags which
indicate the failure observed (native, crash, ANR, explicit
health check). These will be used in the future by
RollbackPackageHealthObserver to map the failure type to the
(new) set of available logging metrics.
Test: atest PackageWatchdogTest
Bug: 138782888
Change-Id: I7e7c5e5399011e2761dada2b989a95c2013307e9
Use factory method to create MonitoredPackage which will return null
when version code can't be resolved.
Bug: 141155222
Test: atest PackageWatchdogTest
Change-Id: I6c983872cbdfd02940d76f7307aa4a6a1062d438
The code doesn't work as intended. What we should do is:
1. set up so that health check duration is shorter than observation duration
2. move time forward so we fail the health check
3. check observer.mMitigatedPackages contains only APP_A
4. move time forward again to expire the observation duration
5. check APP_A is not notified again as a failed package
Also add a similar test where the observation duration is shorter than
the health check duration.
Bug: 141518951
Test: atest PackageWatchdogTest
Change-Id: Iba1cdc4fab8608982b416cdb463ed4b38d355c9f
Since startObservingHealth is called during boot, it is less desirable
to cause boot loops by an uncaught exception. We will fall back to
DEFAULT_OBSERVING_DURATION_MS when invalid durationMs is passed.
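
A small sketch of the fallback, assuming "invalid" means a non-positive
duration. The helper name sanitize() is hypothetical, and the constant's
value below is a placeholder chosen for the example.

```java
/** Hypothetical helper showing the fall-back-instead-of-throw behavior. */
final class ObservingDuration {
    /** Placeholder value for this sketch only. */
    static final long DEFAULT_OBSERVING_DURATION_MS = 2L * 24 * 60 * 60 * 1000;

    private ObservingDuration() {}

    /** Returns durationMs if it is positive, otherwise the default. */
    static long sanitize(long durationMs) {
        // Assumption: zero and negative values are the invalid inputs.
        return durationMs > 0 ? durationMs : DEFAULT_OBSERVING_DURATION_MS;
    }
}
```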
See b/140780361 for more details about the design decision.
Bug: 140780361
Test: atest PackageWatchdogTest
Change-Id: I2bcbecb2dc4c2448ef697001dd93aea5f50f9dbf
Use the sliding window algorithm to detect if there exists a window
containing failures equal to or above the trigger threshold.
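
A two-pointer sketch of the sliding window check over sorted failure
timestamps. FailureWindow and hasTriggerWindow() are hypothetical names,
and treating the window bound as inclusive is an assumption.

```java
/** Hypothetical sketch of sliding-window failure detection. */
final class FailureWindow {
    private FailureWindow() {}

    /**
     * Returns true if some window of length windowMs contains at least
     * {@code threshold} failures. Timestamps must be sorted ascending.
     */
    static boolean hasTriggerWindow(long[] sortedFailureTimesMs,
            long windowMs, int threshold) {
        int start = 0;
        for (int end = 0; end < sortedFailureTimesMs.length; end++) {
            // Shrink from the left until the window fits within windowMs.
            while (sortedFailureTimesMs[end] - sortedFailureTimesMs[start] > windowMs) {
                start++;
            }
            if (end - start + 1 >= threshold) {
                return true;
            }
        }
        return false;
    }
}
```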
Bug: 140841942
Test: atest PackageWatchdogTest
Change-Id: I34a20e4d3b98a093dffa05fc7c7c026905834b53
Since calls to raiseFatalFailure are always followed by
TestLooper#dispatchAll, we can combine them to reduce boilerplate code.
Bug: 140691154
Test: atest PackageWatchdogTest
Change-Id: I0ea23dc132f2ad26ced1119bc5278bc5d876949c
Following go/unit-test-practices, we split testRegistration into smaller
ones so each test focuses on one behavior at a time.
Note we will remove testRegistration in a later CL.
Bug: 140472424
Test: atest PackageWatchdogTest
Change-Id: I88e00a8fc43b953d575ee047979b7fe1d5fbd3ba
TestObserver#mHealthCheckFailedPackages is added to collect packages
when TestObserver#onHealthCheckFailed is called. It will be used to test
if registration/unregistration is done successfully.
TestController#mFailedPackages is also renamed to be distinguished from
mHealthCheckFailedPackages.
Bug: 140472424
Test: atest PackageWatchdogTest
Change-Id: I791e0a1b8e5d59ae766502b54a0782d509b209b5
TestLooper.moveTimeForward() changes the target delivery time of the messages
in the queue to simulate elapsed time. This allows tests to run faster in a
more deterministic way without incurring the indeterminism caused by Thread.sleep()
which is usually a source of flakiness and should be avoided when possible.
Bug: 140208026
Test: atest PackageWatchdogTest
Change-Id: I3365093838ec9fa2de5742359f6947379add7703
This bug is motivated by bug 140208026 where we want to replace
Thread.sleep() with TestLooper.moveTimeForward() in PackageWatchdogTest.java.
However, it turns out that PackageWatchdog uses SystemClock.uptimeMillis()
internally. The tests will fail if we don't forward PackageWatchdog's internal
clock accordingly.
We add a wrapper around SystemClock.uptimeMillis() so it is customizable
by the test case.
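
A plain-Java sketch of such a clock wrapper. The Clock interface,
setClock() and the surrounding class are hypothetical names, and
System.nanoTime() stands in for SystemClock.uptimeMillis() so the example
runs outside Android.

```java
/** Hypothetical sketch of a watchdog with an injectable clock. */
final class ClockedWatchdogSketch {
    /** Wrapper interface that a test can replace with a fake. */
    interface Clock {
        long uptimeMillis();
    }

    // Stand-in for SystemClock.uptimeMillis() in this non-Android sketch.
    private Clock mClock = () -> System.nanoTime() / 1_000_000L;
    private long mStartMs;
    private long mDurationMs;

    /** Lets a test control the watchdog's notion of time. */
    void setClock(Clock clock) {
        mClock = clock;
    }

    void startObserving(long durationMs) {
        mStartMs = mClock.uptimeMillis();
        mDurationMs = durationMs;
    }

    boolean isExpired() {
        return mClock.uptimeMillis() - mStartMs >= mDurationMs;
    }
}
```

A test would advance its fake clock by the same amount it passes to
TestLooper.moveTimeForward() so that both views of time stay in step.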
Bug: 140358475
Test: atest PackageWatchdogTest
Change-Id: Id26325a93dc4050c6468502347b0e7852ed1263f
Refactor NetworkStackClient class to move the module service binding &
network stack process death monitoring to a separate class. This class
will only be instantiated in the SystemServer process.
The new class |SystemServerToNetworkStackConnector| will be used from
the client classes corresponding to each module running on the network
stack process (NetworkStackClient, WifiStackClient, etc)
This has 2 main advantages:
a) Reduces code duplication (Otherwise the various Client classes need
to replicate the service binding & process death monitoring).
b) Central crash recovery for the network stack process (Otherwise the
various Client classes will trigger multiple recovery for a single
network stack process crash).
Bug: 135679762
Test: Device boots up & connects to wifi networks.
Change-Id: I673581b0067b9a3f72dd68a3ab622c18183ebd2e
Merged-In: I673581b0067b9a3f72dd68a3ab622c18183ebd2e
The test adds a dependency on mockito extended to be able to mock the
Context, PackageManager etc.
Test: atest PackageWatchdogTest#testNetworkStackFailure (+rest of class)
Bug: 133725814
Change-Id: Iba8a47f5e94b5dba49d6d395085e77285305ee7c
In addition to the NetworkStack app monitoring, have PackageWatchdog
register an observer to NetworkStackClient to receive severe failure
notifications, and attempt a rollback if available.
The callback is registered in onPackagesReady(), which is called in the
boot sequence just before starting the NetworkStack.
Test: installed new networkstack, killed it twice, observe rollback
Test: unit test in change on top
Bug: 133725814
Change-Id: I2cb4200b78c2482cacc4bfe2ace1581b869be512
Make PackageWatchdogTest compatible with the changes that added
DeviceConfig flags to PackageWatchdog. This includes:
* Make PackageWatchdog#setExplicitHealthCheckEnabled private and
use DeviceConfig mechanism for changing that value instead
* Disable TestLooper#startAutoDispatch
* Other minor refinements that solve compatibility issues
Bug: 129335707
Test: atest com.android.server.PackageWatchdogTest
Merged-In: I7323dc65ec2957aeab128224864441bdf63c6f81
Change-Id: I7323dc65ec2957aeab128224864441bdf63c6f81
1. Receiving List<PackageInfo>:
Since I29e2d619a5296716c29893ab3aa2f35f69bfb4d7, we now receive a
List of PackageInfo instead of Strings for packages supporting
explicit health checks. Now, we parse this List<PackageInfo> from
ExtServices instead of trying to parse List<String> and we use the
health check timeout in the PackageInfo as the health check expiry
deadline instead of using the total package expiry time.
2. Updating health check durations onSupportedPackages:
Before, we always updated the health check duration for a
package if the package is supported and the health check state is
not PASSED. This caused the health check duration for a package to
never reduce as long as we kept getting onSupportedPackages. Now, we
improved the readability of the state transitions onSupportedPackages.
We now correctly only update the health check duration for supported
packages in the INACTIVE state.
3. FAILED state:
Before we only had INACTIVE, ACTIVE and PASSED states. When a package
has failed the health check we could notify the observer multiple
times in quick succession and get into a bad internal state with
negative health check durations. Now we added a check to ensure we
don't try to schedule with a Handler with a negative duration and we
defined a negative health check duration to be a new FAILED state if the
health check is not passed. This clearly defines the state transitions
as seen below:
+----------+     +---------+    +------+
|          |     |         |    |      |
| INACTIVE +---->+ ACTIVE  +--->+PASSED|
|          |     |         |    |      |
+-----+----+     +----+----+    +------+
      |               |
      |               |
      |               |
      |               |
      |          +----v----+
      |          |         |
      +----------> FAILED  |
                 |         |
                 +---------+
4. Uptime state:
Every time we pruned observers, we scheduled the next prune and stored
the current SystemClock#uptimeMillis. This allowed us to determine how
much time had elapsed for the next prune. The uptime was not correctly
updated when starting to observe already observed packages. With the
following sequence of events:
-monitor package A for 1hr
-30mins elapsed
-monitor package A again for 1hr
A would expire 30mins from the last event instead of 1hr.
This was because the second time around, we
saved the new state to disk but did not reschedule so did not update
the uptime at last schedule, so 1hr from the first event, we would
prune packages with the original uptime and incorrectly expire A
earlier. Now we update all internal state, fixed this and added a test
for this case.
5. Readability
Improved method variable names, logging and comments.
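
The state transitions described in item 3 can be sketched as a pure
function of "has passed" and the remaining health check duration. The
names below are hypothetical, and using Long.MAX_VALUE as the "not yet
started" sentinel for INACTIVE is an assumption made for this example.

```java
/** Hypothetical sketch of deriving the health check state. */
final class HealthCheckStateSketch {
    enum State { INACTIVE, ACTIVE, PASSED, FAILED }

    /** Assumed sentinel meaning the health check has not been requested. */
    static final long NOT_STARTED = Long.MAX_VALUE;

    private HealthCheckStateSketch() {}

    static State of(boolean hasPassed, long remainingDurationMs) {
        if (hasPassed) {
            return State.PASSED;
        }
        if (remainingDurationMs == NOT_STARTED) {
            return State.INACTIVE;
        }
        // A non-positive remaining duration without a pass means FAILED.
        if (remainingDurationMs <= 0) {
            return State.FAILED;
        }
        return State.ACTIVE;
    }

    /** Guard against scheduling a Handler with a negative delay. */
    static boolean shouldSchedule(State state) {
        return state == State.ACTIVE;
    }
}
```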
Bug: 120598832
Test: Manual testing && atest PackageWatchdogTest
Change-Id: I1512d5938848ad26b668636405fe9b0db50d3a2e
We have always evaluated the explicit health check results on package
expiry. Since I29e2d619a5296716c29893ab3aa2f35f69bfb4d7 we now receive
explicit health check timeouts from ExtServices. This cl doesn't yet
use the timeout but it treats explicit health check timeouts as
different events from package expiry. This is in preparation to use
the timeouts from the cl mentioned above.
Improved readability: Logging, comments, variable and function names
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I8030dae1fef5b8fee42095c1eaf16861cc33ac59
Improvements:
1. Queuing PackageWatchdog requests to startObserving packages:
When observing packages with the watchdog, we needed to get
the packages supporting explicit health checks so we can decide if a
package should be passing or not. This prevents us from receiving
requests to monitor packages during early boot, before third party
packages are ready. In this change we don't depend on ExtServices to
be up to startObserving, we initially treat all packages as failing a
health check and lazily syncRequests to request or cancel explicit
health checks based on the currently observed packages. When we receive
onSupportedPackages, we mark the packages that don't support health
checks as passing.
2. Lazy binding to the explicit health check service:
We were always bound to the explicit health check
service regardless of whether we are expecting requests or not. We need
to be able to bind and unbind dynamically to improve device resource
usage. In this change, we bind as soon as we make a request and are
expecting results; we unbind otherwise.
3. Fixed Races:
There were a couple of potential races that could lead to exceptions
that could bring the system server down, e.g when the service is
transitioning between disconnected and connected state (maybe it
crashed) or when ExtServices is being updated and is down or early
boot requests when third party apps are not ready. This change fixes these races.
4. Logging:
We improved the logging wording and order and made it more consistent
Bug: 120598832
Test: Manual tests. Stress tested behavior by killing extservices and
making requests simultaneously
function killproc {
    while true
    do
        local pid=$(adb shell pidof $1)
        if [[ ! -z $pid ]]
        then
            echo $pid
            adb shell kill $pid
        fi
    done;
}
adb install-multi-package -i com.android.shell --enable-rollback \
NetworkStack.apk ModuleMetadataGoogle.apk
Also switched between enabled and disabled states to verify packages
are handled correctly. Will automate these tests in later cl
atest PackageWatchdogTest
Change-Id: Iafaef553e95d107f700109f9a8328950a5e2bf71
PackageWatchdog now uses the ExplicitHealthCheckController introduced
in Ia030671c99699bd8d8273f32a97a1d3b7b015d3b when observing packages.
Bug: 120598832
Test: Manually tested that after an APEX update, the network stack
does not pass the explicit health check until WiFi is connected
successfully. If Wi-Fi is never connected and the network stack
monitoring duration is exceeded, the update is rolled back.
Change-Id: I75d3cc909cabb4a4eb34df1d5022d1afc629dac3
As part of extending PackageWatchdog with explicit health check support
in Ib4322c327bcb00ca9a3fbdc83579e7b5f2fd633b, trigger the observer's #execute
method if a package never passed the explicit health check on expiry.
Bug: 120598832
Test: atest PackageWatchdogTest
Change-Id: I8e916a6ca115d3883fe29f66456da36cd0ed09fb