The Nature of Mind: Conditional Prediction

Date: March 21, 2000

Summary

The author, Rich Sutton, proposes that the core of mental activity can be viewed as the making of conditional predictions. He generalizes this idea from the state-action predictions of reinforcement learning to a broader framework conditioned on a state, a policy, and an "outcome" that triggers the prediction. Conditioning predictions on outcomes not only gives knowledge a verifiable grounding, but also makes it amenable to automated learning and planning, and allows complex commonsense knowledge to be expressed compactly.

Roger’s Takeaway

In this piece, Rich Sutton tries to treat human behavior as predictable through a reward-and-punishment mechanism, which is interesting; this may be the very theory he argues against today (?). When a human is merely a machine reacting to rewards and punishments, their behavior becomes highly predictable.

Whether or not one argues that a human is a machine, we can sense that the further something moves beyond the individual human, the more predictable it becomes: enterprises, states, organizations. These "legal persons" are driven at every moment by their own organizational design, their compensation design, and the self-interest of the individuals inside them, so the behavior they produce tends to be relatively rational and predictable.

As Charlie Munger said, "incentives" determine a great deal.

Highlights

1.

One thing seems clear to me about mental activity---that the purpose of much of it can be considered to be the making of predictions. By this I mean a fairly general notion of prediction, including conditional predictions and predictions of reward. And I mean this in a sufficiently strong and specific sense to make it non-vacuous.

2.

Assume the world is a Markov Decision Process (MDP), that is, that we have discrete time and clear actions, sensations, and reward on each time step. Then, obviously, among the interesting predictions to make are those of immediate rewards and state transitions, as in "If I am in this state and do this action, then what will the next state and reward be?" The notion of value function is also a prediction, as in "If I am in this state and follow this policy, what will my cumulative discounted future reward be?" Of course one could make many value-function predictions, one for each of many different policies.

3.

They are hypothetical predictions. One is hypothetical in that it is dependent on a single action, and the other is hypothetical in that it is dependent on a whole policy, a whole way of behaving. Action-conditional predictions are of course useful for actually selecting actions, as in many reinforcement learning methods in which the action with the highest estimated value is preferentially chosen. More generally, much of our knowledge consists of beliefs about what would happen IF we chose to behave in certain ways, which is commonsensical. The knowledge about how long it takes to drive to work, for example, is knowledge about the world in interaction with a hypothetical, purposive way in which we could behave.

4.

We need a new idea, a new way of conditioning predictions that I call "conditioning on outcomes." Here we wait until one of some clearly designated set of outcomes occurs and ask (or try to predict) which one it is.

It is a little like placing a bet: you establish some clear conditions, and when they are met the bet is over and it is immediately clear who has won and who has lost.

A general conditional prediction, then, is conditional on three things: 1) the state in which it is made, 2) the policy for behaving, and 3) the outcome that triggers the time at which the predicted event is to occur.

5.

Let us return now to the claim with which I started, that much if not most mental activity is focused on such conditional predictions, on learning and computing them, on planning and reasoning with them. I would go so far as to propose that much if not most of our knowledge is represented in the form of such predictions, and that they are what philosophers refer to as "concepts."

6.

Foremost among these is just that predictions are "grounded," in the sense of having a clear, mechanically determinable meaning. The accuracy of any prediction can be determined just by running its policy from its state until an outcome occurs, then checking the prediction against the outcome. No human intervention is required to interpret the representation and establish the truth or falsity of any statement. The ability to compare predictions to actual events also makes them suitable for being learned automatically.


Mind Is About Conditional Predictions

Rich Sutton

March 21, 2000

Simplifying and generalizing, one thing seems clear to me about mental activity---that the purpose of much of it can be considered to be the making of predictions. By this I mean a fairly general notion of prediction, including conditional predictions and predictions of reward. And I mean this in a sufficiently strong and specific sense to make it non-vacuous.

For concreteness, assume the world is a Markov Decision Process (MDP), that is, that we have discrete time and clear actions, sensations, and reward on each time step. Then, obviously, among the interesting predictions to make are those of immediate rewards and state transitions, as in "If I am in this state and do this action, then what will the next state and reward be?" The notion of value function is also a prediction, as in "If I am in this state and follow this policy, what will my cumulative discounted future reward be?" Of course one could make many value-function predictions, one for each of many different policies.
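Both kinds of prediction can be sketched concretely. The toy MDP below is invented for illustration (the states, transition probabilities, and rewards are mine, not the article's); only the Bellman-style value-function update reflects the predictions described above:

```python
import numpy as np

# A tiny three-state, two-action MDP. The transition and reward numbers
# are invented for illustration.
n_states, n_actions = 3, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = Pr(s' | s, a)
R = np.zeros((n_states, n_actions))            # R[s, a] = expected immediate reward
P[0, 0] = [0.0, 1.0, 0.0]; R[0, 0] = 1.0
P[0, 1] = [0.0, 0.0, 1.0]; R[0, 1] = 0.0
P[1, 0] = [0.5, 0.5, 0.0]; R[1, 0] = 2.0
P[1, 1] = [0.0, 0.0, 1.0]; R[1, 1] = 0.0
P[2, 0] = [1.0, 0.0, 0.0]; R[2, 0] = 0.0
P[2, 1] = [0.0, 1.0, 0.0]; R[2, 1] = 3.0

def value_of_policy(pi, gamma=0.9, iters=1000):
    """Predict cumulative discounted future reward from each state under
    policy pi (pi[s, a] = probability of action a in state s), by iterating
    the Bellman equation for V^pi to its fixed point."""
    V = np.zeros(n_states)
    for _ in range(iters):
        # V(s) <- sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) V(s'))
        V = np.einsum('sa,sa->s', pi, R + gamma * (P @ V))
    return V

uniform = np.full((n_states, n_actions), 0.5)  # one value prediction per policy
print(value_of_policy(uniform))
```

One-step reward and transition predictions are just lookups into `R` and `P`; the value function is the policy-conditional prediction built from them, and a different `pi` gives a different prediction, as the text notes.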

Note that both kinds of prediction mentioned above are conditional, not just on the state, but on action selections. They are hypothetical predictions. One is hypothetical in that it is dependent on a single action, and the other is hypothetical in that it is dependent on a whole policy, a whole way of behaving. Action conditional predictions are of course useful for actually selecting actions, as in many reinforcement learning methods in which the action with the highest estimated value is preferentially chosen. More generally, it is commonsensical that much of our knowledge is beliefs about what would happen IF we chose to behave in certain ways. The knowledge about how long it takes to drive to work, for example, is knowledge about the world in interaction with a hypothetical purposive way in which we could behave.

Now for the key step, which is simply to generalize the above two clear kinds of conditional predictions to cover much more of what we normally think of as knowledge. For this we need a new idea, a new way of conditioning predictions that I call conditioning on outcomes. Here we wait until one of some clearly designated set of outcomes occurs and ask (or try to predict) something about which one it is. For example, we might try to predict how old we will be when we finish graduate school, or how much we will weigh at the end of the summer, or how long it will take to drive to work, or how much you will have learned by the time you reach the end of this article. What will the dice show when they have stopped tumbling? What will the stock price be when I sell it? In all these cases the prediction is about what the state will be when some clearly identified event occurs. It is a little like when you make a bet and establish some clear conditions at which time the bet will be over and it will be clear who has won.
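The dice and stock questions can be mimicked in simulation: behave until one of a clearly designated set of outcome states occurs, then see which one it was. The symmetric random-walk world below is my own illustration, not an example from the text:

```python
import random

def walk_until_outcome(state, outcomes):
    """Behave (here, a fixed symmetric random-walk 'policy') until one of
    a clearly designated set of outcome states occurs; report which one."""
    while state not in outcomes:
        state += random.choice((-1, +1))
    return state

# Outcome-conditioned question: starting at 3 on a walk absorbed at 0 or
# 10, which end do we reach? For a symmetric walk the exact answer is
# P(reach 10 before 0) = 3/10, which the simulation estimates.
random.seed(0)
trials = 20000
hits = sum(walk_until_outcome(3, {0, 10}) == 10 for _ in range(trials))
print(hits / trials)  # ≈ 0.3
```

The prediction here is not tied to any fixed time step; it is about the state when the designated event occurs, however long that takes.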

A general conditional prediction, then, is conditional on three things: 1) the state in which it is made, 2) the policy for behaving, and 3) the outcome that triggers the time at which the predicted event is to occur. Of course the policy need only be followed from the time the prediction is made until the outcome triggering event. Actions taken after the trigger are irrelevant. [This notion of conditional prediction has been previously explored as the models of temporally extended actions, also known as "options" (Sutton, Precup, and Singh, 1999; Precup, thesis in preparation).]
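The three conditions can be packaged directly. The container below is only a hypothetical rendering of that triple plus the predicted value; the field names are mine, not the article's:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ConditionalPrediction:
    """The three things a general conditional prediction is conditional on,
    plus the predicted value itself. Field names are illustrative."""
    state: Any                          # 1) the state in which it is made
    policy: Callable[[Any], Any]        # 2) the policy for behaving
    is_outcome: Callable[[Any], bool]   # 3) the outcome triggering the predicted event
    predicted: Any                      # the claim about the state at the outcome

# Hypothetical example: "starting from 5 and counting down, the state when
# the outcome (reaching 0 or below) occurs will be 0."
p = ConditionalPrediction(state=5, policy=lambda s: None,
                          is_outcome=lambda s: s <= 0, predicted=0)
```

Nothing about behavior after `is_outcome` first fires appears in the record, matching the point that actions taken after the trigger are irrelevant.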

Let us return now to the claim with which I started, that much if not most mental activity is focused on such conditional predictions, on learning and computing them, on planning and reasoning with them. I would go so far as to propose that much if not most of our knowledge is represented in the form of such predictions, and that they are what philosophers refer to as "concepts". To properly argue these points would of course be a lengthy undertaking. For now let us just cover some high points, starting with some of the obvious advantages of conditional predictions for knowledge representation.

Foremost among these is just that predictions are grounded in the sense of having a clear, mechanically determinable meaning. The accuracy of any prediction can be determined just by running its policy from its state until an outcome occurs, then checking the prediction against the outcome. No human intervention is required to interpret the representation and establish the truth or falsity of any statement. The ability to compare predictions to actual events also makes them suitable for being learned automatically. The semantics of predictions also make it clear how they are to be used in automatic planning methods such as are commonly used with MDPs and SMDPs. In fact, the conditional predictions we have discussed here are of exactly the form needed for use in the Bellman equations at the heart of these methods.
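The grounding procedure described here, run the policy from the state until the outcome and then compare, is mechanical enough to sketch directly. The countdown world and all names below are invented for illustration:

```python
def verify_prediction(state, policy, step, is_outcome, predicted):
    """Determine a prediction's accuracy mechanically: run its policy from
    its state until an outcome occurs, then check the prediction against
    the outcome. No human interpretation is involved."""
    while not is_outcome(state):
        state = step(state, policy(state))
    return state == predicted

# Toy deterministic world: a countdown whose dynamics ignore the action.
print(verify_prediction(state=5,
                        policy=lambda s: None,        # trivial policy
                        step=lambda s, a: s - 1,      # world dynamics
                        is_outcome=lambda s: s <= 0,  # designated outcome set
                        predicted=0))                 # prints True
```

Because the check is a pure comparison between a stated prediction and an observed outcome, the same loop can serve as the error signal for automatic learning.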

A less obvious but just as important advantage of outcome-conditional predictions is that they can compactly express much that would otherwise be difficult and expensive to represent. This happens very often in commonsense knowledge; here we give a simple example. The knowledge we want to represent is that you can go to the street corner and a bus will come to take you home within an hour. What this means of course is that if it is now 12:00 then the bus might come at 12:10 and it might come at 12:20, etc., but it will definitely come by 1:00. Using outcome conditioning, the idea is easy to express: we either make the outcome reaching 1:00 and predict that the bus will have come by then, or we make the outcome the arrival of the bus and predict that at that time it will be 1:00 or earlier.

A natural but naive alternative way to try to represent this knowledge would be as a probability of the bus arriving in each time slot. Perhaps it has a one-sixth chance of arriving in each 10-minute interval. This approach is unsatisfactory not just because it forces us to say more than we may know, but because it does not capture the important fact that the bus will come eventually. Formally, the problem here is that the events of the bus coming at different times are not independent. It may have only a one-sixth chance of coming exactly at 1:00, but if it is already 12:55 then it is in fact certain to come by 1:00. The naive representation does not capture this fact, which is absolutely important to using this knowledge. A more complicated representation could capture all these dependencies but would be just that -- more complicated. The outcome-conditional form represents the fact simply and represents just what is needed to reason with the knowledge this way. Of course, other circumstances may require the more detailed knowledge, and this is not precluded by the outcome-conditional form. This form just permits greater flexibility, in particular, the ability to omit these details while still being of an appropriate form for planning and learning.
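The dependency that the naive slot-probability representation misses can be checked numerically. Below, arrival times are drawn from the hypothetical one-sixth-per-slot model: the marginal chance of a 1:00 arrival is about one-sixth, yet conditioned on no arrival by 12:55 it is certain, exactly as argued above:

```python
import random

random.seed(1)
# Hypothetical model from the naive reading: the bus arrives at the end of
# one of six 10-minute slots after 12:00, uniformly (one-sixth each).
arrivals = [10 * random.randint(1, 6) for _ in range(60000)]  # minutes past 12:00

# Outcome-conditioned fact: at the outcome "reaching 1:00", the bus has come.
assert all(t <= 60 for t in arrivals)

# Marginally, a 1:00 arrival has about a one-sixth chance ...
p_last = sum(t == 60 for t in arrivals) / len(arrivals)

# ... but conditioned on no arrival by 12:55, it is certain.
late = [t for t in arrivals if t > 55]
p_last_given_1255 = sum(t == 60 for t in late) / len(late)
print(p_last, p_last_given_1255)  # ≈ 0.167 and exactly 1.0
```

The single outcome-conditioned statement ("by 1:00 the bus has come") carries the part of this structure needed for planning, without committing to the per-slot probabilities.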


