There are two basic ways perception might work to let us experience the world in behaviourally relevant ways. Direct perception is the idea that perception requires only two components: the environment and the organism. Indirect perception is the idea that perception requires at least three components: the environment, the organism, and at least one other component that mediates between the organism and the environment. Over the last few posts, I've been working through the specifics of the ecological approach to making direct perception plausible, because this is a question I often get (usually in the form of 'I don't see how this could work in this case'). Regardless of whether or not it's correct, we can show that we have all the pieces needed to make direct perception work in principle, and the empirical programme is about seeing if it works in practice. What about indirect perception?
I asked this question on Twitter, and one interesting thing I noticed was just how little sense the question seems to make to people these days. Responses fell into roughly two categories: 'I don't see how we can do without it in this case', and 'brains do stuff, so...', neither of which answers the question. Even if some form of indirect perception is required in those cases (which is, of course, still up for grabs) we're still owed an account of how this might work, at least in principle and then later in practice.
People used to know this. The most recent indirect perception hypothesis is that the key mediator is a mental representation, understood as a computational, information-processing system implementing some form of inference that combines sensory data and information stored in memory to create a model of the world that represents the system's best guess about what is out there and how to behave successfully with respect to what's out there. This hypothesis didn't come out of nowhere; the development of the computer, and of the theory of information that allows computers to work, turned out to provide the pieces required to create a formal account of representations that stood a chance of living up to the challenge of explaining perception. Cognitive scientists therefore leaned heavily into the details of these pieces as they worked very hard, from the late 1950s on, to make indirect perception implemented this way plausible.
The exact details of the process have, of course, changed and evolved with empirical data and developments in computational theory. For example, while all the accounts have to do inference that combines sources of information into a best guess, there are a variety of ways of doing inference, some better than others. Probably the best way to do inference is via Bayesian methods, and so most modern theories propose that indirect perception combines sources of information this way so as to be optimal.
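To make 'combining sources of information optimally' concrete, here is a minimal sketch of the simplest textbook case: two independent Gaussian sources fused by weighting each by its reliability. The Gaussian and independence assumptions, the function name `combine_gaussian_sources`, and all the numbers are mine, chosen purely for illustration; this is not a claim about how any particular theory implements inference.

```python
def combine_gaussian_sources(mean_a, var_a, mean_b, var_b):
    """Fuse two independent Gaussian estimates of the same quantity.

    Each source is weighted by its precision (1 / variance), so the
    more reliable source pulls the combined estimate towards itself.
    """
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    combined_var = 1.0 / (precision_a + precision_b)
    combined_mean = combined_var * (precision_a * mean_a + precision_b * mean_b)
    return combined_mean, combined_var


# Hypothetical numbers: a noisy sensory estimate and a sharper stored one.
sensory_mean, sensory_var = 10.0, 4.0
stored_mean, stored_var = 14.0, 1.0

estimate, uncertainty = combine_gaussian_sources(
    sensory_mean, sensory_var, stored_mean, stored_var
)
print(f"best guess: {estimate:.2f}, variance: {uncertainty:.2f}")
```

The combined estimate lands closer to the more reliable source, and its variance is smaller than that of either source alone. Note that the function is handed its two sources, and their variances, ready-made.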
Before these inferential methods can even be brought into play, however, there remain two big, related, and unanswered challenges that need to be addressed. The first is the grounding problem: how do representations get the content they need so as to combine sources of information in a way that works? It's all very well describing the inferential process of the fully formed system, but how do you build one in the first place? The second is the 'which representation?' problem: of all the different sources of information the system has to combine, how does it know which information to bring together for a given task? These reflect a circular problem indirect theories create for themselves. If perception is not good enough to be direct, and thus requires representational support, where do those representations come from? In order for a theory of indirect perception to be plausible, these must be addressed (analogous to how in order for a theory of direct perception to be plausible, questions like 'can the physical world present itself in behaviourally relevant ways?' had to be addressed).
I am not going to address these challenges to indirect theories, because it isn't my job. But they are legitimate questions that people have mostly stopped asking. Debates about the form and content of representations were prominent and explicit right up until the end of the 1990s, and then it all just seemed to stop. Interface theory, for all its problems, at least got back into the fight and tackled the grounding problem (unsuccessfully, I've argued, but it was a solid swing and at least Hoffman recognised he owed us an account). Mark Bickhard's work is probably the only currently active research programme explicitly working out the details, but I don't know many scientists who even know who he is, and a lot of his work is about mapping out the rules of living up to the challenge, versus actually solving the problem.
Until these foundational issues are addressed and answered, whether indirect perception is plausible remains unclear, and no matter how sophisticated your inferential machinery is (looking at you, free energy principle) it can't help until you explain how it came to be organised that way in the first place. Even if the ecological theory of direct perception doesn't hold up, representational theories of indirect perception are not viable options if they cannot be shown to be plausible.