Here is the bot code:

#!/usr/bin/perl

=head1 Notes

whitehouse.gov fetcher.
 
Version 1.01
Author Brian Gannon (bgannon please spam me hard at gmail.com)

Notes: Usually I don't stick everything in one file... This is just easier for this deal.
	You're going to need a lot of Perl modules. Just run it and Perl will name each
	missing one in turn. pdftohtml (0.40+) is optional.

	If you start this program and it fails somewhere, set
	READ_QUEUE_FROM_FILES to 1. That will start where it left off.
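
The restart behaviour described above comes down to writing the pending URLs to a file after every fetch and reading them back on startup. Here is a minimal sketch of that idea, assuming one URL per line. The names save_queue and load_queue and the checkpoint path are hypothetical and are not routines in this script.

```perl
use strict;
use warnings;

# Hypothetical helpers (not in the bot itself): checkpoint a URL queue to
# disk, one URL per line, and read it back after a crash.
sub save_queue {
    my ( $file, @urls ) = @_;
    open( my $fh, '>', $file ) or die "Cannot write $file: $!";
    print {$fh} "$_\n" for @urls;
    close($fh) or die "Cannot close $file: $!";
    return scalar @urls;    # number of URLs written
}

sub load_queue {
    my ($file) = @_;
    return () unless -e $file;    # no checkpoint yet, so nothing to resume
    open( my $fh, '<', $file ) or die "Cannot read $file: $!";
    chomp( my @urls = <$fh> );
    close($fh);
    return @urls;
}

# Example: checkpoint two paths, then resume from the file.
my $checkpoint = "/tmp/queue_sketch_$$.txt";    # hypothetical path
save_queue( $checkpoint, '/', '/blog/' );
my @pending = load_queue($checkpoint);
print "Resuming with ", scalar(@pending), " URLs\n";
unlink $checkpoint;
```

Rewriting the whole file on every pass is wasteful for a large queue, but it keeps the checkpoint trivially consistent with what is in memory.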

	If you change anything that changes the output.. it's going to update EVERY page on the wikia.
	MD5 is just for the content.. not for the MD5/original URL box at the bottom.

	I usually write cleaner code.. but this was a rush job.. Feel free to update this script..
	Multiple people can run this at the same time without problems.. just make sure you're using
	the same version of the bot... because the output is different on each bot... you know the drill

	Thanks to Silona who works her ass off.. and thanks to my 6 month old that went to bed
	at 7:30pm every night so that I had 2 hrs of free time a night.
	
	Under the Creative Commons license.
	You are free to Share/Make derivative works... but.. give props to me, it's gotta be
	non-commercial, and you need to release any other work under this same license.
	(although I'd remove the password for your wikia account before you do)
	The full license below.

	P.S. Obama, please don't stick me on the terrorist watch list for writing this.

 Bugs: Some UNICODE to 8 bit output bugs.... 
	sometimes pdftohtml locks up
	Sometimes wikia decides to hang up on me.
	Too many....



=head1 COPYRIGHT

License

THE WORK (AS DEFINED BELOW) IS PROVIDED UNDER THE TERMS OF THIS CREATIVE COMMONS PUBLIC LICENSE ("CCPL" OR "LICENSE"). THE WORK IS PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW. ANY USE OF THE WORK OTHER THAN AS AUTHORIZED UNDER THIS LICENSE OR COPYRIGHT LAW IS PROHIBITED.

BY EXERCISING ANY RIGHTS TO THE WORK PROVIDED HERE, YOU ACCEPT AND AGREE TO BE BOUND BY THE TERMS OF THIS LICENSE. TO THE EXTENT THIS LICENSE MAY BE CONSIDERED TO BE A CONTRACT, THE LICENSOR GRANTS YOU THE RIGHTS CONTAINED HERE IN CONSIDERATION OF YOUR ACCEPTANCE OF SUCH TERMS AND CONDITIONS.

1. Definitions

"Adaptation" means a work based upon the Work, or upon the Work and other pre-existing works, such as a translation, adaptation, derivative work, arrangement of music or other alterations of a literary or artistic work, or phonogram or performance and includes cinematographic adaptations or any other form in which the Work may be recast, transformed, or adapted including in any form recognizably derived from the original, except that a work that constitutes a Collection will not be considered an Adaptation for the purpose of this License. For the avoidance of doubt, where the Work is a musical work, performance or phonogram, the synchronization of the Work in timed-relation with a moving image ("synching") will be considered an Adaptation for the purpose of this License.
"Collection" means a collection of literary or artistic works, such as encyclopedias and anthologies, or performances, phonograms or broadcasts, or other works or subject matter other than works listed in Section 1(f) below, which, by reason of the selection and arrangement of their contents, constitute intellectual creations, in which the Work is included in its entirety in unmodified form along with one or more other contributions, each constituting separate and independent works in themselves, which together are assembled into a collective whole. A work that constitutes a Collection will not be considered an Adaptation (as defined above) for the purposes of this License.
"Distribute" means to make available to the public the original and copies of the Work or Adaptation, as appropriate, through sale or other transfer of ownership.
"Licensor" means the individual, individuals, entity or entities that offer(s) the Work under the terms of this License.
"Original Author" means, in the case of a literary or artistic work, the individual, individuals, entity or entities who created the Work or if no individual or entity can be identified, the publisher; and in addition (i) in the case of a performance the actors, singers, musicians, dancers, and other persons who act, sing, deliver, declaim, play in, interpret or otherwise perform literary or artistic works or expressions of folklore; (ii) in the case of a phonogram the producer being the person or legal entity who first fixes the sounds of a performance or other sounds; and, (iii) in the case of broadcasts, the organization that transmits the broadcast.
"Work" means the literary and/or artistic work offered under the terms of this License including without limitation any production in the literary, scientific and artistic domain, whatever may be the mode or form of its expression including digital form, such as a book, pamphlet and other writing; a lecture, address, sermon or other work of the same nature; a dramatic or dramatico-musical work; a choreographic work or entertainment in dumb show; a musical composition with or without words; a cinematographic work to which are assimilated works expressed by a process analogous to cinematography; a work of drawing, painting, architecture, sculpture, engraving or lithography; a photographic work to which are assimilated works expressed by a process analogous to photography; a work of applied art; an illustration, map, plan, sketch or three-dimensional work relative to geography, topography, architecture or science; a performance; a broadcast; a phonogram; a compilation of data to the extent it is protected as a copyrightable work; or a work performed by a variety or circus performer to the extent it is not otherwise considered a literary or artistic work.
"You" means an individual or entity exercising rights under this License who has not previously violated the terms of this License with respect to the Work, or who has received express permission from the Licensor to exercise rights under this License despite a previous violation.
"Publicly Perform" means to perform public recitations of the Work and to communicate to the public those public recitations, by any means or process, including by wire or wireless means or public digital performances; to make available to the public Works in such a way that members of the public may access these Works from a place and at a place individually chosen by them; to perform the Work to the public by any means or process and the communication to the public of the performances of the Work, including by public digital performance; to broadcast and rebroadcast the Work by any means including signs, sounds or images.
"Reproduce" means to make copies of the Work by any means including without limitation by sound or visual recordings and the right of fixation and reproducing fixations of the Work, including storage of a protected performance or phonogram in digital form or other electronic medium.
2. Fair Dealing Rights. Nothing in this License is intended to reduce, limit, or restrict any uses free from copyright or rights arising from limitations or exceptions that are provided for in connection with the copyright protection under copyright law or other applicable laws.

3. License Grant. Subject to the terms and conditions of this License, Licensor hereby grants You a worldwide, royalty-free, non-exclusive, perpetual (for the duration of the applicable copyright) license to exercise the rights in the Work as stated below:

to Reproduce the Work, to incorporate the Work into one or more Collections, and to Reproduce the Work as incorporated in the Collections;
to create and Reproduce Adaptations provided that any such Adaptation, including any translation in any medium, takes reasonable steps to clearly label, demarcate or otherwise identify that changes were made to the original Work. For example, a translation could be marked "The original work was translated from English to Spanish," or a modification could indicate "The original work has been modified.";
to Distribute and Publicly Perform the Work including as incorporated in Collections; and,
to Distribute and Publicly Perform Adaptations.
The above rights may be exercised in all media and formats whether now known or hereafter devised. The above rights include the right to make such modifications as are technically necessary to exercise the rights in other media and formats. Subject to Section 8(f), all rights not expressly granted by Licensor are hereby reserved, including but not limited to the rights set forth in Section 4(d).

4. Restrictions. The license granted in Section 3 above is expressly made subject to and limited by the following restrictions:

You may Distribute or Publicly Perform the Work only under the terms of this License. You must include a copy of, or the Uniform Resource Identifier (URI) for, this License with every copy of the Work You Distribute or Publicly Perform. You may not offer or impose any terms on the Work that restrict the terms of this License or the ability of the recipient of the Work to exercise the rights granted to that recipient under the terms of the License. You may not sublicense the Work. You must keep intact all notices that refer to this License and to the disclaimer of warranties with every copy of the Work You Distribute or Publicly Perform. When You Distribute or Publicly Perform the Work, You may not impose any effective technological measures on the Work that restrict the ability of a recipient of the Work from You to exercise the rights granted to that recipient under the terms of the License. This Section 4(a) applies to the Work as incorporated in a Collection, but this does not require the Collection apart from the Work itself to be made subject to the terms of this License. If You create a Collection, upon notice from any Licensor You must, to the extent practicable, remove from the Collection any credit as required by Section 4(c), as requested. If You create an Adaptation, upon notice from any Licensor You must, to the extent practicable, remove from the Adaptation any credit as required by Section 4(c), as requested.
You may not exercise any of the rights granted to You in Section 3 above in any manner that is primarily intended for or directed toward commercial advantage or private monetary compensation. The exchange of the Work for other copyrighted works by means of digital file-sharing or otherwise shall not be considered to be intended for or directed toward commercial advantage or private monetary compensation, provided there is no payment of any monetary compensation in connection with the exchange of copyrighted works.
If You Distribute, or Publicly Perform the Work or any Adaptations or Collections, You must, unless a request has been made pursuant to Section 4(a), keep intact all copyright notices for the Work and provide, reasonable to the medium or means You are utilizing: (i) the name of the Original Author (or pseudonym, if applicable) if supplied, and/or if the Original Author and/or Licensor designate another party or parties (e.g., a sponsor institute, publishing entity, journal) for attribution ("Attribution Parties") in Licensor's copyright notice, terms of service or by other reasonable means, the name of such party or parties; (ii) the title of the Work if supplied; (iii) to the extent reasonably practicable, the URI, if any, that Licensor specifies to be associated with the Work, unless such URI does not refer to the copyright notice or licensing information for the Work; and, (iv) consistent with Section 3(b), in the case of an Adaptation, a credit identifying the use of the Work in the Adaptation (e.g., "French translation of the Work by Original Author," or "Screenplay based on original Work by Original Author"). The credit required by this Section 4(c) may be implemented in any reasonable manner; provided, however, that in the case of a Adaptation or Collection, at a minimum such credit will appear, if a credit for all contributing authors of the Adaptation or Collection appears, then as part of these credits and in a manner at least as prominent as the credits for the other contributing authors. 
For the avoidance of doubt, You may only use the credit required by this Section for the purpose of attribution in the manner set out above and, by exercising Your rights under this License, You may not implicitly or explicitly assert or imply any connection with, sponsorship or endorsement by the Original Author, Licensor and/or Attribution Parties, as appropriate, of You or Your use of the Work, without the separate, express prior written permission of the Original Author, Licensor and/or Attribution Parties.
For the avoidance of doubt:

Non-waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme cannot be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License;
Waivable Compulsory License Schemes. In those jurisdictions in which the right to collect royalties through any statutory or compulsory licensing scheme can be waived, the Licensor reserves the exclusive right to collect such royalties for any exercise by You of the rights granted under this License if Your exercise of such rights is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(b) and otherwise waives the right to collect royalties through any statutory or compulsory licensing scheme; and,
Voluntary License Schemes. The Licensor reserves the right to collect royalties, whether individually or, in the event that the Licensor is a member of a collecting society that administers voluntary licensing schemes, via that society, from any exercise by You of the rights granted under this License that is for a purpose or use which is otherwise than noncommercial as permitted under Section 4(c).
Except as otherwise agreed in writing by the Licensor or as may be otherwise permitted by applicable law, if You Reproduce, Distribute or Publicly Perform the Work either by itself or as part of any Adaptations or Collections, You must not distort, mutilate, modify or take other derogatory action in relation to the Work which would be prejudicial to the Original Author's honor or reputation. Licensor agrees that in those jurisdictions (e.g. Japan), in which any exercise of the right granted in Section 3(b) of this License (the right to make Adaptations) would be deemed to be a distortion, mutilation, modification or other derogatory action prejudicial to the Original Author's honor and reputation, the Licensor will waive or not assert, as appropriate, this Section, to the fullest extent permitted by the applicable national law, to enable You to reasonably exercise Your right under Section 3(b) of this License (right to make Adaptations) but not otherwise.
5. Representations, Warranties and Disclaimer

UNLESS OTHERWISE MUTUALLY AGREED TO BY THE PARTIES IN WRITING, LICENSOR OFFERS THE WORK AS-IS AND MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND CONCERNING THE WORK, EXPRESS, IMPLIED, STATUTORY OR OTHERWISE, INCLUDING, WITHOUT LIMITATION, WARRANTIES OF TITLE, MERCHANTIBILITY, FITNESS FOR A PARTICULAR PURPOSE, NONINFRINGEMENT, OR THE ABSENCE OF LATENT OR OTHER DEFECTS, ACCURACY, OR THE PRESENCE OF ABSENCE OF ERRORS, WHETHER OR NOT DISCOVERABLE. SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OF IMPLIED WARRANTIES, SO SUCH EXCLUSION MAY NOT APPLY TO YOU.

6. Limitation on Liability. EXCEPT TO THE EXTENT REQUIRED BY APPLICABLE LAW, IN NO EVENT WILL LICENSOR BE LIABLE TO YOU ON ANY LEGAL THEORY FOR ANY SPECIAL, INCIDENTAL, CONSEQUENTIAL, PUNITIVE OR EXEMPLARY DAMAGES ARISING OUT OF THIS LICENSE OR THE USE OF THE WORK, EVEN IF LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

7. Termination

This License and the rights granted hereunder will terminate automatically upon any breach by You of the terms of this License. Individuals or entities who have received Adaptations or Collections from You under this License, however, will not have their licenses terminated provided such individuals or entities remain in full compliance with those licenses. Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.
Subject to the above terms and conditions, the license granted here is perpetual (for the duration of the applicable copyright in the Work). Notwithstanding the above, Licensor reserves the right to release the Work under different license terms or to stop distributing the Work at any time; provided, however that any such election will not serve to withdraw this License (or any other license that has been, or is required to be, granted under the terms of this License), and this License will continue in full force and effect unless terminated as stated above.
8. Miscellaneous

Each time You Distribute or Publicly Perform the Work or a Collection, the Licensor offers to the recipient a license to the Work on the same terms and conditions as the license granted to You under this License.
Each time You Distribute or Publicly Perform an Adaptation, Licensor offers to the recipient a license to the original Work on the same terms and conditions as the license granted to You under this License.
If any provision of this License is invalid or unenforceable under applicable law, it shall not affect the validity or enforceability of the remainder of the terms of this License, and without further action by the parties to this agreement, such provision shall be reformed to the minimum extent necessary to make such provision valid and enforceable.
No term or provision of this License shall be deemed waived and no breach consented to unless such waiver or consent shall be in writing and signed by the party to be charged with such waiver or consent.
This License constitutes the entire agreement between the parties with respect to the Work licensed here. There are no understandings, agreements or representations with respect to the Work not specified here. Licensor shall not be bound by any additional provisions that may appear in any communication from You. This License may not be modified without the mutual written agreement of the Licensor and You.
The rights granted under, and the subject matter referenced, in this License were drafted utilizing the terminology of the Berne Convention for the Protection of Literary and Artistic Works (as amended on September 28, 1979), the Rome Convention of 1961, the WIPO Copyright Treaty of 1996, the WIPO Performances and Phonograms Treaty of 1996 and the Universal Copyright Convention (as revised on July 24, 1971). These rights and subject matter take effect in the relevant jurisdiction in which the License terms are sought to be enforced according to the corresponding provisions of the implementation of those treaty provisions in the applicable national law. If the standard suite of rights granted under applicable copyright law includes additional rights not granted under this License, such additional rights are deemed to be included in the License; this License is not intended to restrict the license of any rights under applicable law.

=cut


use strict;
use LWP::UserAgent;
use HTML::SimpleLinkExtor;
use Data::Dumper;
use Encode qw(encode_utf8);
use LWP::Simple();
use HTML::Entities;
use Digest::MD5 qw(md5_hex md5);
use WWW::Mechanize;
use HTML::WikiConverter;
use utf8;
use Text::Unidecode;


# variables that can be changed easily
use constant USERNAME => '';
use constant PASSWORD => '';
use constant USE_PROXY => 0; # use the non-open source proxy (or write your own)
use constant DELAY => 5; # how many seconds do i delay
use constant MAX_REDIRECT => 0; # how many redirects do i follow
use constant READ_QUEUE_FROM_FILES => 1; # 1 = resume from the queue files left by an earlier run; set to 0 for a fresh crawl starting at /
use constant USER_AGENT => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0';
use constant QUEUEFILE => '/tmp/w_queue.txt';
use constant QUEUETODOFILE => '/tmp/w_queue_f.txt';
use constant PDFCONVERT => 0; # you need at LEAST version .40 (pdf 1.5 format is fun)
use constant PDFTOHTML => '/usr/local/bin/pdftohtml';
use constant GATHER_LINKS => 1; # mostly a debug thing for me... SHOULD ALWAYS BE 1

  

#########
# Unless you're a perl coder... i wouldn't do anything below this... or you could
# 


$| = 1; # if tailing to a log... let's not buffer it... gaa.

print "Logging into Wikia!\n";
# let's login to WIKI #.. not checking to see if the login/password works above.

my $wc = new HTML::WikiConverter( dialect => 'MediaWiki' );
my $mech = WWW::Mechanize->new();
$mech->get('http://whitehouse.wikia.com/index.php?title=Special:UserLogin');

$mech->form_number(2);
$mech->field( 'wpName'     => USERNAME );
$mech->field( 'wpPassword' => PASSWORD );
$mech->submit('wpLoginattempt');



my $proxy;
if (USE_PROXY == 1) {
	eval {
		require Yxes::Proxy;
		1;
	};
	# to be sneaky.
	$proxy = Yxes::Proxy->new( proxy_file => '/usr/share/proxies/current.txt' );
}

# to be not sneaky
my $ua = new LWP::UserAgent;    #user agent
$ua->max_redirect(MAX_REDIRECT); # let's not redirect... it's 
$ua->agent(USER_AGENT);

my @visit_array;
my $to_be_visited;
my $visited;

if (READ_QUEUE_FROM_FILES == 0) {
	@visit_array = ('/');        # populate initial URL
} else {
	print "Populating initial URL's from file\n";
	open(QF, QUEUEFILE) || die $!;
	while(my $visit = <QF>) {
		chomp $visit;
		push(@visit_array, $visit);
		$to_be_visited->{$visit} = 1;
	}
	close(QF);
	# don't revisit what we have already visited.
	open(QF, QUEUETODOFILE) || die $!;
	while(my $visit = <QF>) {
		chomp $visit;
		$visited->{$visit} = 1;
	}
	close(QF);
		
}

open(QF, "> " . QUEUETODOFILE) || die $!;
close(QF);

while (1) {
    my $how_many_left = @visit_array;    # how many are left..
    last if ( $how_many_left == 0 );     # done

    my $visit = pop @visit_array;        # get a url
    print "Queue ($how_many_left): Grabbing $visit\n";
    grab_links( $ua, $visit, \@visit_array );    # grab it

    # i want to see what's going on .. plus i can restart from here
    open(QUEUE, "> " . QUEUEFILE) || die $!;
    foreach my $visit_me (@visit_array) {
	print QUEUE $visit_me . "\n";
    }
    close(QUEUE);

}

sub grab_links {
    my ( $ua, $url, $visit_array ) = @_;

    my @links;
    print " --- Fetching http://whitehouse.gov" . $url . "\n";

    my $content;
    if (USE_PROXY == 1) {
	# XXX DOES NOT HANDLE PDF's YET IF USING THE PROXY
	$content = $proxy->fetch( 'http://whitehouse.gov/' . $url );
    } else {
	sleep DELAY; # wait DELAY seconds between page fetches
	if ((PDFCONVERT == 1) && ($url =~ m/pdf$/)) {
		$ua->max_redirect(2);
	}
	my $req = new HTTP::Request GET => 'http://whitehouse.gov' . $url;
	my $response = $ua->request($req);    # get a url

	if ( $response->is_success ) {
		$content = $response->decoded_content;    # decode gzip'd content
		if ((PDFCONVERT == 1) && ($url =~ m/pdf$/))  {
			print " --- --- It's a PDF!\n";

			# write the pdf out
			open(PDF, "> /tmp/$$.pdf") || die $!;
			binmode(PDF);
			print PDF $content;
			close(PDF);

			processpdf("/tmp/$$.pdf",$url);
			$ua->max_redirect(MAX_REDIRECT); # set that back
			return; # PDFs are handled by processpdf; "next" here would exit the sub via the caller's loop
		}
	} else {
		print "ERROR " . $response->status_line . "\n";    # no idea what happened.. deal with this later.
		$ua->max_redirect(MAX_REDIRECT); # undo the PDF redirect bump, if any
		return;
	}
    }
    print " --- Done Fetching\n";

####

    $visited->{$url} = 1;    # it's visited... yay.

    
    print " --- Parsing Page\n";
    parse( \$content, $url );                    # parse it

    print " --- Finding Hrefs\n";
    my $extor = HTML::SimpleLinkExtor->new();    # extor extracts things
    $extor->parse($content);                     # get all the a href
    @links = $extor->a;                          # get a ref links.

    print " --- Going through the links\n";
    foreach my $link (@links) {                  # for each link
        $link =~ s|^http://whitehouse.gov||;            # strip off the beginning url
	$link =~ s/\#.*$//; 			# dang anchors
	next if ( GATHER_LINKS == 0); 		# this is turned off
	next if ( $link eq "" );		# blank?
	next if ( $link =~ m/^javascript/ );
	next if ( $link =~ m/slideshow/);
	next if ( $link =~ m/mailto:/ );
	next if ( $link =~ m/\.mp4/ );
	next if ( $link =~ m/\.mov/ );
	next if ( $link =~ m/\.jpg/ );
	next if ( $link =~ m/\.png/ );
	next if ( $link =~ m/\.gif/ );
	next if (( $link =~ m/\.pdf/ ) && (PDFCONVERT == 0));
        next if ( $link =~ m/^http/ );           # offsite
        next if ( exists $visited->{$link} );          # we already visited
        next if ( exists $to_be_visited->{$link} );    # it's in the queue
        $to_be_visited->{$link} = 1;                   # in the queue
        push( @$visit_array, $link );                  # put in the queue
    }
    open(QF, ">> " . QUEUETODOFILE) || die $!;
    print QF $url . "\n";
    close(QF);
    print " --- DONE ---\n";

}
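
=pod

The link loop in grab_links mixes normalising a link with deciding whether to queue it. As a sketch only, the same rules can be pulled into one function that is easy to test on its own. The name crawlable_path is hypothetical, and the sketch leaves out the PDFCONVERT check and the visited/queued bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the site-relative path if a link is worth
# crawling, or undef (in scalar context) if it should be skipped.
sub crawlable_path {
    my ($link) = @_;
    return unless defined $link;
    $link =~ s{^http://whitehouse\.gov}{};    # make on-site links relative
    $link =~ s/#.*$//;                        # drop in-page anchors
    return if $link eq '';
    return if $link =~ m/^javascript/;
    return if $link =~ m/mailto:/;
    return if $link =~ m/slideshow/;
    return if $link =~ m/\.(?:mp4|mov|jpg|png|gif)/;    # media files
    return if $link =~ m/^http/;                        # anything still absolute is off-site
    return $link;
}

# Example: classify a few links the way the crawl loop would.
my @found = ( 'http://whitehouse.gov/blog/#comments', '/img/seal.png', 'http://example.com/' );
for my $link (@found) {
    my $path = crawlable_path($link);
    print( defined($path) ? "queue $path\n" : "skip  $link\n" );
}
```

Keeping the filter free of side effects means the skip rules can be changed without touching the queue handling.

=cut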

sub parse {
    my ( $page, $url ) = @_;
    if ( my ($real_content) = ($$page =~ /\<\!\-\- start breadcrumb \d+\-\-\>(.*)\<div class=\"grdspan4 rightrail-blue\"\>/s) )
    {
        print " --- --- This has content I can read\n";
        $real_content = decode_entities($real_content);

        # change <a href="/cheese/burger.html">Cheese Burger</a> to [[cheese_burger.html]]Cheese Burger



        $real_content = $wc->html2wiki(html => encode_utf8(unidecode($real_content)));
        $real_content =~ s/style="background: .*?"//;

#	$real_content = unidecode($real_content);




        my $fix = sub { my ($link) = @_;
                        $link =~ s|^http://whitehouse.gov/||;
                        if ($link =~ m/^http:/) {
                                return '[' . $link . ']';
                        }
                        my ($first,@rest) = split(/\s+/,$link);
                        $first =~ s/^\///;
                        $first =~ s/\/$//;
                        $first =~ s/\//_/g;    # replace / with _
                        $first =~ s/\#.*$//;   # replace # sign anchors
                        my $outside;
                        foreach my $word (@rest) {
                                $outside .= $word . " ";
                        }
                        return '[[' . $first . ']] ' . $outside;
                        };
        $real_content =~ s/\[(.*?)\]/$fix->($1)/ges;

	print " --- --- Wikizing\n";

	
        my $md5         = md5_hex( $real_content );
        my $wikized_url = $url;
        $wikized_url =~ s/^\///;     # remove beginning slash
	$wikized_url =~ s/\/$//;     # ending /
        $wikized_url =~ s/\//_/g;    # replace / with _
	$wikized_url =~ s/\#.*$//;   # replace # sign anchors

	print " --- --- Fetching current wiki page\n";
        my $status = wikia_fetch( $wikized_url, $md5 );    # 0 = DNE, 1 = CHANGED, 2 = SAME

        if ( $status == 0 ) {
            new_page( \$real_content, $wikized_url, $md5, $url );
        }
        elsif ( $status == 1 ) {
            change_page( \$real_content, $wikized_url, $md5, $url );
        } else {
	    print " --- Page did not change\n";
        }
    } else {
        print " !!! No Content... need to write a decent parser for this odd page\n";
    }
}
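
=pod

The URL-to-page-title conversion appears in several places in this script with slightly different steps. A sketch of a single routine for it follows; wiki_title is a hypothetical name. Unlike the code above, it strips the anchor first, so a trailing slash that sits before a # is also removed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a site-relative URL into a flat wiki page title.
sub wiki_title {
    my ($url) = @_;
    my $title = $url;
    $title =~ s/#.*$//;    # drop any anchor
    $title =~ s{^/}{};     # drop the leading slash
    $title =~ s{/$}{};     # drop the trailing slash
    $title =~ s{/}{_}g;    # remaining slashes become underscores
    return $title;
}

# Example: show the title for a nested path.
print wiki_title('/the_press_office/briefing/'), "\n";
```

Note that the site root, '/', maps to an empty title, so a caller would need to pick a page name for it.

=cut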

sub wikia_fetch {
    my ( $wikized_url, $md5 ) = @_;

    my $req = new HTTP::Request GET => 'http://whitehouse.wikia.com/wiki/' . $wikized_url;
    my $response = $ua->request($req);    # get a url

    if ( $response->is_success ) {
        my $content = $response->decoded_content;    # decode gzip'd content
        if ( $content =~ m/MD5: (\w+)/ ) {
            my $old_md5 = $1;
            if ( $old_md5 ne $md5 ) {
		print " --- THE PAGE CHANGED!\n";
                return 1;                            # changed page
            } else {
		print " --- SAME PAGE!\n";
                return 2;                            # same page
            }
        } else {
	    print " --- NEW PAGE!\n";
            return 0;                                # new page
        }
    } else {
        print "ERROR " . $response->status_line . "\n";    # no idea what happened.. deal with this later.
    }
}
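
=pod

The change detection in wikia_fetch rests on one comparison: the MD5 stored in the page footer against the MD5 of the freshly converted content. Here is a sketch of that decision on its own, with no network access. The constant names and page_status are hypothetical; the numeric values match the 0/1/2 convention used by the callers.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical names for the three statuses used in this script.
use constant { PAGE_NEW => 0, PAGE_CHANGED => 1, PAGE_SAME => 2 };

# Decide the status from the wiki page's HTML and the new content's MD5.
sub page_status {
    my ( $wiki_html, $new_md5 ) = @_;
    my ($old_md5) = defined $wiki_html ? $wiki_html =~ m/MD5: (\w+)/ : ();
    return PAGE_NEW unless defined $old_md5;    # no footer found
    return $old_md5 eq $new_md5 ? PAGE_SAME : PAGE_CHANGED;
}

# Example: a page whose stored hash matches the new content.
my $md5 = md5_hex('some converted wiki text');
print page_status( "body text\n MD5: $md5\n", $md5 ), "\n";
```

Because only the content is hashed, any change to the conversion step alters every hash and marks every page as changed.

=cut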

sub new_page {
    my ( $content, $wikized_url, $md5, $original_url ) = @_;


	
    print " --- Posting new page to $wikized_url\n";
    $mech->get( 'http://whitehouse.wikia.com/index.php?title=' . $wikized_url . '&action=edit' );
    $mech->form_number(1);
    $mech->field( 'wpTextbox1',
            encode_utf8($$content) . "\n\n" . ' MD5: ' . $md5 . "\n"
          . ' Original URL: [http://whitehouse.gov' . $original_url . ' http://whitehouse.gov' . $original_url . ']' );

    $mech->field( 'wpSummary', 'Initial Upload' );
    $mech->submit('wpSave');

    print " --- LOCKING the page\n";
    # protect the cheerleader^H^H^H^H^H^H^H^H^H^H^H^Hpage
    $mech->get( 'http://whitehouse.wikia.com/index.php?title=' . $wikized_url . '&action=protect' );
    $mech->form_number(1);
    $mech->field( 'mwProtect-level-edit', 'sysop' );
    $mech->field( 'mwProtect-level-move', 'sysop' );
    $mech->submit();
}

sub change_page {
    my ( $content, $wikized_url, $md5, $original_url ) = @_;

	# don't ask me why.. .mech breaks on wikia.. so i need to login each time.. *shrugs*


    print " --- Updating page $wikized_url\n";
    $mech->get( 'http://whitehouse.wikia.com/index.php?title=' . $wikized_url . '&action=edit' );
    $mech->form_number(1);
    $mech->field( 'wpTextbox1',
            $$content . "\n\n" . ' MD5: ' . $md5 . "\n"
          . ' Original URL: [http://whitehouse.gov' . $original_url . ' http://whitehouse.gov' . $original_url . ']' );

    $mech->field( 'wpSummary', 'Page Changed' );
    $mech->submit('wpSave');

}
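
=pod

new_page and change_page both build the same page body: the content, then a footer carrying the MD5 and the original URL. As a sketch, that text could come from one place so the two stay in step. The name page_text is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: content plus the MD5 / Original URL footer.
sub page_text {
    my ( $content, $md5, $original_url ) = @_;
    my $source = 'http://whitehouse.gov' . $original_url;
    return $content . "\n\n"
         . ' MD5: ' . $md5 . "\n"
         . ' Original URL: [' . $source . ' ' . $source . ']';
}

# Example: build the text for a small page.
print page_text( 'Some wiki text', 'abc123', '/blog/' ), "\n";
```

The footer lines keep their leading space, as in the code above.

=cut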



sub processpdf {
	my ($tempfile,$url) = @_;

	my $original_url = $url;

	my $htmlfile = $tempfile;
	$htmlfile =~ s/\.pdf$/\.html/;
	
	# quick fix of url... i use this soo many times i probably should just write a routine.
	$url =~ s/\%20/_/g;
	$url =~ s/\//_/g;
	
	print " --- --- Running pdftohtml\n";
	system(PDFTOHTML . ' -q -noframes ' . $tempfile);
	
	if (-e $htmlfile) {
		print " --- --- --- Awesome... it created an HTML file\n";

		open(FILE, $htmlfile) || die $!;
		my $real_content;
		while(<FILE>) {
			$real_content .= $_;
		}
		close(FILE);

		$real_content = $wc->html2wiki(html => encode_utf8($real_content));
		my $md5 = md5_hex( encode_utf8($real_content) );
			
		my $status = wikia_fetch($url,$md5);
	
		if ($status == 0) {
			new_page( \$real_content, $url, $md5, $original_url );
		} elsif ( $status == 1 ) {
			change_page( \$real_content, $url, $md5, $original_url );
		} else {
			print " --- Page did not change\n";
		}

		# cleanup
		unlink($htmlfile);
		unlink($tempfile);

	} else {
		print " !!! !!! hmm... pdftohtml did not generate anything\n";
	}
}

