PHP Sharp - Prototype
Updated 16 Aug 2004 -
Note:
I'm not working on this anymore. The code is on SourceForge; I have
used a lot of the ideas and am looking more seriously at using it to
write a Parrot compiler.
Updated 16 Dec 2002 (following
comments from php-dev & mono mailing lists)
As you may realise, I've got far too much free time at present, so the
challenge of a PHP to CIL compiler looked like an interesting way to
fill my unemployed time.
I've been pondering the possibility of implementing a PHP to .NET
compiler for some time. There have been some interesting articles
on implementing .NET compilers for Python, and on some of the
issues surrounding them (primarily mapping an untyped language onto
a typed bytecode system). But the documents themselves were very thin
on the methodology or code involved in actually developing the
compiler, probably because the organizations funding the research may
have been interested in commercializing it.
Anyway, having seen Mono grow from a crazy idea into a huge reality, I
decided it was time to take a closer look and see if it would provide
any clues on how to build a PHP# compiler. (The name PHP# indicates
that the eventual language may not be that close to what you know today
as PHP, but rather a derived language, which will probably have to
include a number of features that suit the .NET bytecode.)
Why do it?
I thought I'd add a brief explanation of the advantages and why I'm
even considering this. To really understand the reasoning, you have to
get an idea of why C#, and more precisely the .NET framework, was
created. From my reading, Microsoft were running into increasing
problems with their existing toolset (VB, C++, Java etc.), in that
widgets and components developed in one language were taking
considerable time to become available to other languages (e.g. a nice
C++ library to access stock quotes only worked partially in VB for
quite a while). In the open source world, a prime example would be the
Excel read/write classes in Perl, which had to be manually ported to
PHP for PEAR. To solve this, a common bytecode format (like Java's
bytecode) was really the only solution. In the way that Java was "write
once, run anywhere", .NET started off as "write bits in any language,
run on MS". However, the Mono project is turning this into "write bits
in any language, run anywhere".
So the goal, as mentioned below, is that you can have a development
team working on specific areas of an application, each choosing the
language best suited to its area of the task, and combine the results
into one coherent whole without having to write huge bridging layers.
In effect, something like Evolution could be built using multiple
languages: PHP for the top level interface, C# for the complex widgets
and C for the low level graphics code. This would save considerable
development time on all aspects, and produce a result that could run
on anything from an office desktop to a PDA or set top box.
Getting down to basics
PHP is a pretty classic language design. It takes your original source
code (the PHP language) and parses it into tokens: things like quotes,
brackets, plus, minus and equals, along with keyword tokens like 'for'
and 'while'. This stage is the tokenizer. In most languages it is built
using flex. In Mono, it appears to be done using a simple loop that
checks each character, which, surprisingly for a hand coded tokenizer,
appears to be pretty fast, although probably not as fast as a C based
flex tokenizer.
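To make the "loop and check each character" style concrete, here is a
minimal sketch in Python (used purely for illustration, not PHP or C#).
The token names, the keyword subset and the set of symbols are all
assumptions made up for this example; this is not Mono's actual
tokenizer.

```python
def tokenize(source):
    """Hand-coded tokenizer: walk the input one character at a time."""
    keywords = {"for", "while", "echo"}  # hypothetical keyword subset
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():
            i += 1  # skip whitespace between tokens
        elif ch.isalpha() or ch == "_":
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            tokens.append(("KEYWORD" if word in keywords else "IDENT", word))
        elif ch.isdigit():
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("NUMBER", source[start:i]))
        elif ch == "$":
            start = i
            i += 1  # consume the $ prefix, then the variable name
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("VARIABLE", source[start:i]))
        elif ch in "+-=;(){}<":
            tokens.append(("SYMBOL", ch))
            i += 1
        else:
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
    return tokens
```

Calling tokenize("while ($a < 10) $a = $a + 1;") yields a flat list of
(kind, text) pairs, which is all the next stage needs.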
The next step in the process is the parser, which is all about reading
the syntax or grammar of the language: really the way the words are
arranged together to make up sentences (rather than the existence of
single words, which is what the tokenizer is about). PHP again uses the
classic Bison parser; as the Zend engine is all in C, this
combination is fast and suitable. In Mono, a tool called Jay is used.
This is a C program that can generate Java or C# code (obviously in
Mono it's C#).
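As a rough illustration of what "reading the grammar" means, the sketch
below parses a single made-up rule, echo followed by operands joined
with the . operator. It is a hand-written recursive descent parser in
Python, which is a different technique from the table-driven parsers
that Bison and Jay generate; the token kinds (ECHO, STRING, VARIABLE,
DOT, SEMI) and the tuple tree shape are assumptions for this example
only.

```python
def parse_echo(tokens):
    """Parse `echo <operand> ( . <operand> )* ;` into a nested tuple tree."""
    pos = 0

    def expect(kind):
        # Consume one token of the given kind or report a syntax error.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        value = tokens[pos][1]
        pos += 1
        return value

    def operand():
        if pos < len(tokens) and tokens[pos][0] in ("STRING", "VARIABLE"):
            kind = tokens[pos][0]
            return (kind.lower(), expect(kind))
        raise SyntaxError(f"expected operand at token {pos}")

    expect("ECHO")
    tree = operand()
    while pos < len(tokens) and tokens[pos][0] == "DOT":
        expect("DOT")
        tree = ("concat", tree, operand())  # left-associative concatenation
    expect("SEMI")
    return ("echo", tree)
```

The tokenizer only knows that 'hi', . and $name are valid words; it is
the parser that decides they form a valid echo statement and in what
order the concatenation applies.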
Plan A - Do it in PHP?
After looking at the problem, my first sense was that this could be
done quite easily in PHP. The first part of this meant building the two
'core' components of a compiler in PHP: the tokenizer and the parser. I
wrote a crude tokenizer that read lex files and produced a big array of
PHP preg_match calls and substring matches to emulate the lex part, as
I wasn't able to find any relatively simple lexers to port. On the
grammar side, I just took the existing Jay parser (which is written in
C) and modified it to output PHP. This meant I could take the original
lex and yacc code from the Zend engine, and just modify the function
calls to call PHP rather than C.
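The "big array of regular expression matches" idea can be sketched as
follows. This is Python rather than PHP, using re.match in place of
preg_match, and the rule table is a hypothetical handful of patterns,
not the real Zend lex file. Note one simplification: real lex picks the
longest match across all rules, whereas this sketch takes the first
rule that matches, so rule order matters.

```python
import re

# Hypothetical rule table: (token name, pattern), tried in order.
RULES = [
    ("WHITESPACE", r"\s+"),
    ("VARIABLE",   r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"[0-9]+"),
    ("STRING",     r"'[^']*'"),
    ("ECHO",       r"echo\b"),
    ("IDENT",      r"[A-Za-z_][A-Za-z0-9_]*"),
    ("SYMBOL",     r"[-+=.;(){}]"),
]
COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def lex(source):
    """Emulate a lex-style scanner by trying each pattern at the cursor."""
    pos = 0
    tokens = []
    while pos < len(source):
        for name, regex in COMPILED:
            m = regex.match(source, pos)
            if m:
                if name != "WHITESPACE":  # whitespace is matched but dropped
                    tokens.append((name, m.group(0)))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens
```

The appeal of this layout is the same as that of a lex file: the table
at the top says plainly which pattern produces which token.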
So at this point, I had a working grammar parser and tokenizer
(although it would have been possible to use PHP's existing
tokenizer here!). This in itself would be very useful for writing
phpdocu type tools.
However, as I delved deeper into the other side of the compiler, life
became considerably more complex. Quite apart from the huge size of the
codebase to convert, two problems became very apparent as I went down
the road of converting the syntax:
- PHP's use of aliasing for copying objects ($xObject = &$yObject;)
was very troublesome when it came to building trees easily in PHP.
Great care had to be taken that this reference copying was used only
where absolutely necessary, and that real copies were made at other
times. (This should be fixed in PHP5, and is probably something worth
avoiding in PHP#.)
- Although the differences in the end user grammar are very small,
the backend grammar (Jay code) was quite different. PHP is designed to
compile to bytecode and then run the bytecode, with no intermediate
checking of the bytecode, whereas the Mono compiler builds a large data
tree, then goes through this tree doing type checking and variable
resolution. The result is that the grammar is partly based around
building this tree.
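The "build a tree first, then walk it" approach described for the Mono
compiler can be sketched like this. The node classes, the type names
and the rule that mismatched operands become "unknown" are all
assumptions for illustration, written in Python; they are not taken
from the Mono sources.

```python
from dataclasses import dataclass


@dataclass
class Literal:
    value: object


@dataclass
class Variable:
    name: str


@dataclass
class Add:
    left: object
    right: object


def infer_type(node, symbols):
    """Second pass: walk the finished tree and work out a type per node."""
    if isinstance(node, Literal):
        return type(node.value).__name__
    if isinstance(node, Variable):
        # Variable resolution: unresolved names fall back to a catch-all type.
        return symbols.get(node.name, "unknown")
    if isinstance(node, Add):
        left = infer_type(node.left, symbols)
        right = infer_type(node.right, symbols)
        # Only claim a concrete type when both sides agree on one.
        return left if left == right and left != "unknown" else "unknown"
    raise TypeError(f"unhandled node {node!r}")
```

For example, infer_type(Add(Literal(1), Variable("a")), {"a": "int"})
resolves to "int", while the same tree with an empty symbol table
resolves to "unknown". The grammar actions only have to build the
nodes; all checking happens afterwards.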
Plan B - Do it in C#?
Well, the main reasons I was reluctant to write the compiler in
C# originally were a) the self compiling goal, and b) not being that
impressed with C#'s syntax (although the strong typing was a bit of an
improvement). So, swallowing a bit of pride, I started looking at what
would be involved in implementing it in C#. The biggest barrier (after
learning C#) was that the tokenizer was hard coded, which made
understanding the tokens rather complex. This is one of the key
benefits of lex based tokenizers: they make it very clear, using
regular expressions, what token is produced by what combination of
letters or symbols.
I ended up using this lexer
http://www.cybercom.net/~zbrad/DotNet/Lex/
by Brad Merrill. The later of the two versions compiled with Mono, but
had to be modified slightly to cope with integer return values from
clex() calls, and I changed it to use a switch/case combination rather
than an array of objects in an attempt to speed it up (which didn't
have much effect).
What are the differences - and how to translate them?
- Statically typed vs dynamically typed: this is one of the first
obstacles that I saw when reading the other reports on creating
compilers for Python. The approach I see as most likely is to use a mix
of 'guessed' types, a generic (PHP?) object, and optional strict
typing.
Unknown PHP variable           | mono.PHP.Variable
PHP String, integer/Real etc.  | (standard .net strings etc.)
PHP Array                      | mono.PHP.Variable.Array
PHP Reference                  | mono.PHP.Variable.Reference
PHP Object                     | mono.PHP.Variable.Object
Thanks to Miguel de Icaza:
- Microsoft's JavaScript compiler also handles a dynamically typed
language (although source code is not available, examining the bytecode
may provide some clues on how it was done there).
- Microsoft's VB.NET compiler also does tricks like that; for
example, if the compiler can infer the types of variables, it will use
the most efficient mechanism possible.
For example, `int + int' can be encoded directly using CIL
instructions:
    load-integer a
    load-integer b
    add
but when the compiler cannot decide what to do, it has to call a
support routine in the runtime:
    load-thing a
    load-thing b
    call Add_two_unknown_objects
Or even:
    load-integer a
    load-thing b
    call Add_Integer_And_Unknown
- The preference would be to avoid unknown types generally (by trying
to do type guessing at compile time and adding casting operators).
- The Microsoft JSC compiler is interesting because it's implemented
in C# and is available as a runtime class, which allows `eval' to
actually compile code freshly.
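The choice between the three instruction sequences above can be
sketched as a small decision function. This is Python used as
pseudocode for a code generator: the instruction strings are the
informal ones from Miguel's example, not real CIL opcodes, and the
fallback branch assumes that load-thing can load any value as a
generic object.

```python
def emit_add(left, right, types):
    """Return pseudo-instructions for `left + right`.

    `types` maps variable names to "int"; names missing from it are
    treated as unknown at compile time.
    """
    left_type = types.get(left, "unknown")
    right_type = types.get(right, "unknown")
    if left_type == "int" and right_type == "int":
        # Both types inferred: encode the addition directly.
        return [f"load-integer {left}", f"load-integer {right}", "add"]
    if left_type == "int":
        # Only the left side is known: use the specialised helper.
        return [f"load-integer {left}", f"load-thing {right}",
                "call Add_Integer_And_Unknown"]
    # Otherwise fall back to the fully generic runtime routine.
    return [f"load-thing {left}", f"load-thing {right}",
            "call Add_two_unknown_objects"]
```

The more types the compiler can guess, the more often the first branch
is taken and the less work is left to the runtime support library.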
- C#'s get/set syntax for variables implements a concept very similar
to PHP's __getXXX(), __setXXX(); however, the syntax is very closely
tied to the variable declaration. Since PHP's overloading with the
__get syntax is not really mainstream, inheriting the C# method is
probably the way to go for the time being.
- C# has no variable identifiers such as PHP's $ prefix (e.g.
$variables), which in PHP makes using a method and a variable with the
same name both feasible and common (personally I find it also makes
the code clearer to read). I've yet to explore exactly how, and if,
this is handled in C#.
- Static method calls and object method calls are not clearly
distinguished in C#: of Tokenizer.getToken() and lexer.getToken(), one
is a static method call and the other is an object method call. In
PHP, Tokenizer::getToken() or $object->getToken() makes this clear.
- PHP uses a base function library for what would be the corelib in
C#. This is one of the more interesting parts of the planned
application: either this would have to be mapped at parser/grammar
time into corelib methods, or a separate PHP library could do that at
runtime. I think a mix of these two is probably what will be done
(e.g. the mysql stuff becomes mono.PHP.mysql).
- There was some discussion about how to implement the huge base of
existing PHP extensions. As far as I can see, two key methodologies
exist here:
  - Write function emulation in C#, so that mysql_xx calls are really
  implemented as C# calls but look like classic PHP functions to the
  programmer. This looks the easiest, as it shouldn't involve too much
  clever typecasting.
  - Write C# to C calls into the existing dl libraries. Obviously
  these may be useful for performance intensive operations.
- Since the first test application will be hello world, echo is going
to be one of the first PHP functions to deal with. As mentioned in the
base library point above, I will probably do this with a
System.Console.Write() mapped at compile time, and then work on the
string concatenation operators.
- String concatenation and the + and . operators: C# has operator
overloading, so + has a different effect when adding strings than it
does when adding numbers. This is impossible to do when working in PHP
(and is counterintuitive for the language). This will involve quite a
few changes to the language grammar, and a few tricks to ensure
casting is done correctly.
- Variable and method scope: one of PHP's distinct characteristics
has been very tight variable scoping. Variables and methods outside
the current method (including class variables) have to be explicitly
selected, whereas C# allows access to any method or variable without
being explicit as to whether it is local (inside the method), part of
the class, or part of its parents.
- Global variables, which are not possible in general C#, are very
common in PHP. The presumption here is to use a mono.PHP.globals
class to store all globals in a hashtable.
- Arrays in C# are handled in a number of ways, depending on their
content. The most common appears to be the generic hashtable, which
needs to be used whenever the PHP syntax $var[] (as append),
$var = array() or array(a,b) is used. PHP also uses 'a' => 'b' for
associative hashtables.
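To show why a plain hashtable needs a little extra bookkeeping to
behave like a PHP array, here is a minimal Python sketch of an ordered
hashtable with PHP-style append. The class name PhpArray and its
methods are hypothetical; it only covers the array(a,b), $var[] and
'a' => 'b' cases mentioned above, not full PHP semantics.

```python
class PhpArray:
    """Ordered hashtable with PHP-style append; a sketch, not a full emulation."""

    def __init__(self, *values, **pairs):
        self._items = {}       # dicts keep insertion order in Python 3.7+
        self._next_index = 0   # next integer key used by append
        for value in values:   # PHP: array(a, b)
            self.append(value)
        for key, value in pairs.items():  # PHP: array('a' => 'b')
            self[key] = value

    def append(self, value):
        # PHP: $var[] = value
        self[self._next_index] = value

    def __setitem__(self, key, value):
        # PHP: $var[key] = value
        self._items[key] = value
        if isinstance(key, int) and key >= self._next_index:
            self._next_index = key + 1

    def __getitem__(self, key):
        return self._items[key]

    def keys(self):
        return list(self._items)
```

For instance, after creating PhpArray("x", "y"), setting a["name"] and
a[10], and then appending once more, the keys come out in insertion
order with the appended value stored under 11, which is the behaviour
a mono.PHP.Variable.Array class would have to reproduce on top of a
.NET hashtable.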
Progress....
At present the C# tokenizer is working, so the changes to make mphp
happen are underway.
Targets
a) Hello World class
- Add basic tokens to lexer
- Add simple grammar parsing to Jay
- implement echo
- $variable support
- array support
b) PEAR class (as mono.PHP.PEAR)
- $GLOBALS
- array creation syntax
- array add syntax
After that Ideas and targets are open :)