CyberD.org

Week 13 - Intense & Nomentionable

Blog life has been pretty calm, just this plug (check it out!) since last week. Regular life? I ran through five letters on my expanding NG AP collection. At least a thousand artists per day. Tens of thousands of tracks. Long hours. Drive space is slowly corroding away (which I take as a sign of accomplishment!), but progress is slow.

Some mentionable things did happen though: I signed a contract for future lyrical collaboration (woot!), I worked, I cycled, I hoarded sunshine in preparation for rainy days (like today, all day a haul of rainfall and painful hail bailing out the shady gray - on plated panes they play away) and... that's about it. The media's been buzzing about the recent plane crash all week, making me a bit doubtful as to transportation form of choice up North and down. Who knows who's sitting behind that bolted blastdoor at the top of the plane anyway?! But on the other hand, who knows who's sitting in the driver's seat in each and every car that blazes by you in the opposite lane. O_o

Oh, there's been some game development as well, some games, some same(routine tasks as alway)s. And yesterday marked the start of summer time, so suddenly I'm one hour more tired than usual. Until next week, keep fighting the good fight! And Gouda Knight - a cheesy warrior.

Comments

  1. S3C
    Monday Mar/30/2015

    Congrats!! Details on contract?? Details on NGAP collection?? Two members of flight personnel in cockpit at all times as flight law mandates in the U.S. (and now Germany iirc) = problem solved

  2. Cyber
    Monday Mar/30/2015

    Thanks! Details on contract are confidential I think. :) Though, I suppose I can reveal it's with a buddy, for an EP; online distribution only. And one verse. No HUGE thing, but I am pretty excited about it!

    As for the NG collection, I'll be going through as much of K as I can today (ca 1,700 artists on that letter), aiming for another five letters this week, but I think LMNO might be a bit more popular. I've been averaging around 1,000 artists each day, and so far I'm up at around 20,000 artists total and over 200,000 tracks (at least 1/3 of the AP, excluding potential dupes). I'm only going for scouted artists btw, via a list of artists from the end of 2013, so it's not an actual complete collection. Also turns out you can't download tracks from unscouted artists, so it's a good thing I don't have all of those on my list (can't remember if it's always been this way, as long as we've had scouting?). Also, this is expanding on my older collection of AP tracks, which had been slowly building up over the years; was closing in on 100k last year. Thought I'd finally get this done once and for all, before too many old masterpieces disappear! Don't want to post too much about it before I do so though, if I persevere through the ladder of letters... might become like all those new year; excessive resolutions that never get done once I announce them.

    Ah yeah, two airlines have voluntarily started with those same regulations here, hope the rest jump aboard soon. Oh, and I will get to that track some time soon I hope! My priorities are getting a bit sidetracked with this file collecting thing.

  3. S3C
    Tuesday Mar/31/2015

    Sounds good, bro. I think I know what track you're mentioning??

    I hope you're not downloading all those tracks manually lol!! But using some kind of script. Btw I think you should be able to download unscouted artists, just open up the source of the page, find the .mp3 and grab it. Before the redesign you could just swap the "listen" in the track url with "download" (or "reviews" if you wanted to see the reviews) to automatically download any track, even the ones that the mods/admins removed if the track violated the rules of the AP (but never got officially wiped from the site).

    Just lyrics, or vocals too?? Also what kind of style of music

  4. Cyber
    Tuesday Mar/31/2015

    I think you do?? Thinking of the latest one, though it's really turning into a bunch of tracks by now, I need/want/should focus more on music! Well, making my own, not downloading others haha.

    It's semi-automatic. ;) I'm using a bunch of scripts to make it easier, but haven't found any that'd actually let me scrape the entire AP automatically. First, I use scripts to generate links, lists of names and folders, then open batches of links via linkrr.com, and download files artist by artist, with a userscript to transform /listen/ to /download/ in all URLs (that way I can grab all tracks directly from any user's main audio page) and a download manager/accelerator to DownThemAll! (TM) I do click the links manually if there's just one or two though, they usually download faster than the plugin loads the downloads then. Coincidentally, the most common amount of tracks/user seems to be just the 1. I'm guessing a lot of them are leftover casualties from that time when NG was expanding fast and superpopular and couldn't keep up with demands and users were waiting months to get approved for the portal so they could submit more. Those who didn't persevere through those waiting times!

    Anyway, on with the download process: files are saved to desktop, from which I then quickly drag and drop them to corresponding usernamed folders (using a nifty little freeware utility named FileNexus instead of the built-in explorer - speeds up transfers and folder opening a little), switch folder, drag desktop files to trash, download new tracks, drag and drop, rinse and repeat! I do try to manage other tasks amidst all these file operations, like writing stuff, but switching between steering a train of thought and moving files breaks my focus way too easily. Right now I'm mostly just watching very choppy episodes of Hawaii Five-0 between the moves.

    This is definitely much faster than how I used to download tracks (manually lol), but if you do have any ideas on how I could further automate the process I'm all ears!

    Yeah, I've done that with a bunch of tracks that didn't allow downloads, but it's pretty time-consuming. The URL is also embedded in JS with a bunch of slashes the wrong way (a \ before every /), so you have to get rid of those too. I've been looking around for a good file sniffer plugin that fetches such links for you, but apparently almost all media sniffers broke with the latest FF update, and the one I did find modifies the filename upon download. So... maybe later. Really is a shame you can't just swap the /listen/ in those links, that'd be so much easier. Oh, didn't know you could download even deleted submissions once upon a day! Should've (ab)used that feature a bit whilst it existed!
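A minimal Python sketch of that clean-up step: find `.mp3` links in a page's source and strip the backslash in front of each slash. The sample path layout on audio.ngfiles.com is made up for illustration; only the "\ before every /" escaping comes from the description above.

```python
import re

# An http(s) URL ending in .mp3, where each "/" may be preceded by a "\".
MP3_URL = re.compile(r'https?:(?:\\?/){2}[^"\'\s]+?\.mp3')


def extract_mp3_urls(page_source: str) -> list[str]:
    """Return unescaped .mp3 URLs from page_source, in order, without duplicates."""
    urls = []
    for match in MP3_URL.finditer(page_source):
        url = match.group(0).replace("\\/", "/")  # turn "\/" back into "/"
        if url not in urls:
            urls.append(url)
    return urls
```

Feeding it the saved source of a track page would give a list of plain links that a download manager can take as-is.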

    One guest verse, both lyrics and vocals. :) Not sure about musical style overall, but this one track will probably be hiphop/punk... where I assume I'll be contributing that hiphop element. It's still a ways away though!

  5. S3C
    Wednesday Apr/1/2015

    What language is used in scripts anyway?? Javascript?? Writing a program in BASIC would be fairly easy for me, but I have no idea what functions would be used in a web-accessible language to download files, view and search the source code.

    FOR i = 1 TO 616821 ' 616821 = the current latest track on the AP
        DOWNLOAD "http://www.newgrounds.com/audio/download/", i ' DOWNLOAD is a pseudofunction, not a real BASIC statement
    NEXT i

    Probably not what your script looks like if you are downloading tracks one artist at a time, but this one is very simple and should allow you to scrape the AP in its entirety. But since you are using an external source to download, you would replace the DOWNLOAD pseudofunction with an output write to a text file (the complete one would be massive, more than halfway to a million submitted songs on the AP!!). You could also write in other functions: if you get a 404, the script could parse the page source for the reason, which is either that the track has downloads disabled (because the artist is unscouted, or because of the more recent option for artists to disable direct downloads) or that the submission has been deleted altogether (in which case you can look it up on archive.org to see if the submission and/or submission page is archived, unlikely as that is; unlike .swf files, archive.org sometimes saves audio!!). Then you could compile statistics on the actual number of tracks on the AP, and how many of them are downloadable.
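The "output write to a text file" variant can be sketched in Python like this. The base URL is the one quoted in the loop above; the function names and the output file name are placeholders chosen for this example.

```python
from pathlib import Path

BASE_URL = "http://www.newgrounds.com/audio/download/"  # pattern quoted in the thread


def download_urls(first_id: int, last_id: int):
    """Yield one download URL per track ID, first_id through last_id inclusive."""
    for track_id in range(first_id, last_id + 1):
        yield f"{BASE_URL}{track_id}"


def write_url_list(path, first_id: int, last_id: int) -> int:
    """Write the URLs to a text file, one per line, and return how many were written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for url in download_urls(first_id, last_id):
            handle.write(url + "\n")
            count += 1
    return count
```

Calling `write_url_list("ap_links.txt", 1, 616821)` would produce the full list for a download manager to work through; nothing is downloaded by the script itself.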

    I would surmise you're taking unnecessary steps in downloading per artist, and all the manual file transferring. Rather it would be more efficient to write a batch file that reads the downloaded files, creates a folder for the respective artist, and cuts and pastes them respectively. If the file information isn't provided in the .mp3 then you could just parse the submission page and gather the details and write it to the .mp3 file (or simply just rename it) before you download. The process is a little bit hairy, but the logic and programming behind it should be fairly simple to implement.
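As a rough Python sketch of that sorting step (standing in for the batch file), assuming two things that are not confirmed in the thread: every downloaded file name starts with the numeric track ID followed by an underscore, and a lookup table from track ID to artist name has already been built from the submission pages.

```python
import shutil
from pathlib import Path


def sort_into_artist_folders(download_dir, library_dir, artist_by_id):
    """Move each "<track id>_*.mp3" file into library_dir/<artist>/.

    artist_by_id maps track ID (int) to artist name. Files without a numeric
    prefix, or whose ID is not in the mapping, are left where they are.
    Returns the list of new paths.
    """
    moved = []
    for mp3 in sorted(Path(download_dir).glob("*.mp3")):
        prefix = mp3.name.split("_", 1)[0]  # assumed naming: "<id>_<title>.mp3"
        if not prefix.isdigit() or int(prefix) not in artist_by_id:
            continue
        target_dir = Path(library_dir) / artist_by_id[int(prefix)]
        target_dir.mkdir(parents=True, exist_ok=True)
        moved.append(Path(shutil.move(str(mp3), str(target_dir / mp3.name))))
    return moved
```

Files it cannot place stay in the download folder, so nothing gets filed under the wrong name by guesswork.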

    No, once a track is deleted it's gone forever. But before the redesign tracks that infringed upon the TOS were not always deleted, just had their submission page (the one with the "listen" url) removed. You could still view the reviews and download the track by simply replacing the "listen" term in the url.

  6. Cyber
    Wednesday Apr/1/2015

    Yupp, userscripts use basically regular JS with a few special tags. Can't have them activated all the time btw since they do occasionally conflict with regular site functionality, like the formatting buttons on the BBS. Been wondering if that'd be considered a bug with the site itself or just the way it is with unofficial 'hacks' hmm.

    Woah, that seems... astoundingly simple! Would that code work? How to use it (does it need to be compiled or anything)? If nothing else, it'd definitely be useful to grab certain arrays of newer files, though sorting those files by username would be a problem. If folders aren't created at the time of download, it's often impossible to know who made what without looking up the tracks online (assuming online reference exists at that point - not all audio pages seem to be popular enough to get cached by the Wayback Machine in case they disappear). It used to be that uploaded tracks were tagged automatically with artist name (along with track name, ID, etc), but seems this finesse was removed somewhere along the 400-500k submission border, now users need to take care of tags themselves. Useful for those who do, but for those who don't you can't fetch artist info via tags any longer. I thought about having a collection that worked like that a while, with all tracks indexed and instantly sortable by title, name, length, etcetc, but alas, no longer!

    Well, I'm using the browser directly, with one userscript for link transformation and one plugin for ease of download. The links I have are to artist audio pages (user.newgrounds.com/audio/), not to the individual files. I'd have to scrape those pages first (which should be easy enough with HTTrack) and then somehow extract only the download links. Which there should be a program for, I suppose, and if files download as they do via browser, the ones that 'fail' would generate an .htm file with file ID rather than the .mp3 (at least they do via download manager), so those would be easy to find even without any form of log. Though, we come back to the issue of folders again.

    Ah yeah, stats like that could be interesting! Least once this is done we'll know around how many tracks remain out of all those originally submitted... err, well no, scratch that, we'd just know how many submissions were attainable at the time of download. Would it be easy to modify the script above so it simply checks if the file exists for each download, to garner such useless but somewhat intriguing info? Good tip on the Wayback's potential audio storage too! Feels like I've known that before, deja vu. Gotta check for some of my older missing submissions some time.

    Hmm if I knew such programming! Would that be easy/doable with BASIC? I'd probably still save time if I spent downloading time learning how to program such a program... but a task not instantly achievable and potentially unachievable somehow doesn't seem that tempting. In that regard, clicking links and buttons offer immediate compensation... and mild mindless monotony.

    Ah, good knowing!

  7. S3C
    Thursday Apr/2/2015

    Hell yeah it works...(code only needs to be compiled if you want to run a standalone program). In fact I just ran that code and generated a file of all NG tracks up to 618,000 (leaving some space for future submissions, the current submission is at 617090 (and the user deleted it before I even finished this message. don't even remember what it was called), that's 269 submissions since my last post yesterday! The AP is not as dead as I thought!). The file is only 22.2 MB and was generated in like 10 seconds, smaller than I expected. I was thinking I would need to use MATLAB or VisualBASIC but QBASIC did the trick! Open the text file in an editor other than Notepad though, as Notepad isn't intended to process larger files.

    You could also try inputting the download link through an online javascript fusker to create an entire compiled list of all the download links, but those tools tend to crash when they have to run through thousands of iterations in a single input (I waited a minute before electing to restart my frozen browser).

    How do you get the usernames though? Because that wouldn't be a systematic way of downloading the entire AP unless there is a list of all the AP artists somewhere? Here's one possible one, but it's restricted access: http://www.newgrounds.com/audio/member

    Well like I said above, a batch script (or just QBASIC code like above) could simply create folders for the individual artists and move the tracks appropriately. Although, with older versions of DOS output file names could not be longer than 8 characters (this may be different in newer versions, I haven't checked) so you would have abbreviated folder names for artists that have names more than 8 characters. Folder creation and file transferring is easily doable in QBASIC, except for the part that it can only open up files as large as 32.767 KB LOL. Just a mere fraction of an audio file. But the program is over 25 years old so...There may be ways to work around this. I don't think batch files ran via a DOS command prompt have the same limitations.

    Would it be easy to modify the script above so it simply checks if the file exists for each download?? Not in QBASIC with any ease as it cannot directly interact with the internet, but it should be fairly simple using JavaScript (which I don't know). I did something like that once a long time ago, I forget exactly what for. I compiled a list of links like above, which was read by a program that would take a .jpg screenshot of each page. 404s would be smaller in file size than non-404s, which were more colorful. Not foolproof or good programming by any means, but I just used that logic as a threshold: if the image was over a pre-set filesize, the link was valid.

    Well I checked the web archive and 150,277 pages are archived under (https://web.archive.org/web/*/http://newgrounds.com/audio/listen/*) - your browser will probably freeze if you leave the page open for more than a few seconds. The download link is protected by robots.txt and therefore not archived. HOWEVER, the actual .mp3 is not protected from web-crawlers!

    Search for .mp3s here: https://web.archive.org/web/*/http://audio.ngfiles.com/*

    There's a total of 22,265 mp3s archived!! Not all of the links archive the mp3 unfortunately (i've only tested a few) but that is way more than I expected, and 22,265 more than google cache has! (iirc google doesn't publicly cache mp3s and the cache pages disappear after a while). nonetheless this calls for a .gif meme celebration: http://i.imgur.com/Kf2Us.gif

    I also noticed the flash works on web.archive on the newer (post redesign maybe?) pages but not the old ones. I wonder why that is? NewGrounds visual media doesn't even seem 100% flash based anymore. The players seem like they are hybrid with some other type?

    Also, there seems to be a glitch in the audio system- check this page out: http://www.newgrounds.com/audio/browse/sort/views/page/7777

    It seems these are tracks the users deleted (gone from NG permanently), but the entries remain cataloged! Up to about page 7771 or so, perhaps they were so unpopular that they mysteriously exited the AP lol. Also, seems like the audio portal browsing pages are a more convenient starting point to systematically "scrape" the AP for links and rename as appropriate, as the download links, artist name, track name, and genre are all neatly organized in the source.

  8. Cyber
    Thursday Apr/2/2015

    Here I thought the script would actually download the files! A generated list hmm, could use this to generate the numbers: http://textmechanic.com/Generate-List-of-Numbers.html

    Then use this: http://textmechanic.com/Add-Prefix-Suffix-to-Text.html

    ...to append the generic download link before all numbers. Then use download manager of choice to download 'em all! I am kinda attached to a folder-based structure though. Btw, that BASIC script, how DO you run it? Add to a .txt file, change extension...? Run via CMD? Another program? I kinda... start suspecting you may be April Fooling me a bit, but the timestamp does read Apr/2 hmm. If you do run via .BAT, as the comment lower down (well, higher up now) seems to suggest, where is the info saved? How do you specify output file?

    I got the usernames via this: http://www.newgrounds.com/bbs/topic/1358510

    Scraped the source code for the username list and via textmechanic somehow got rid of everything that wasn't a username. Then created another list with appended URL parts surrounding names (http://*name*.newgrounds.com/audio/), used one list to generate folders, the other to open pages, and here we are now!
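The URL-list half of that step could look like this in Python. The URL pattern is the one given above; the function name is made up for this example, and it assumes the username list has already been cleaned down to one name per line.

```python
def audio_page_urls(usernames):
    """Turn username lines into artist audio page URLs, skipping blank lines."""
    urls = []
    for line in usernames:
        name = line.strip()
        if name:
            urls.append(f"http://{name}.newgrounds.com/audio/")
    return urls
```

With the names in a text file, `audio_page_urls(open("file.txt", encoding="utf-8"))` would return the links in the same order as the names, which keeps the link list and the folder list lined up.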

    Ah, how'd you find http://www.newgrounds.com/audio/member ?

    "don’t think batch files ran via a DOS command prompt have the same limitations"... so basically, Basic runs via .BAT files? Or maybe .BAT is simply an extension used to run a multitude of programming forms? I really know nothing about this stuff, just stumbled upon a .BAT file somewhere to print directory content and since then I've been searching up useful snippets of .BAT code to perform certain tasks, like this one I use to generate folders via lines in a specified textfile:

    @echo off
    for /f "tokens=*" %%a in (file.txt) do (
    mkdir "%%a"
    )

    Is that 'Basic' I wonder? There's no restriction on 8 characters per folder though. lol, speaking of filesize restrictions, I downloaded a series of podcasts @ 150MB each a while ago, with the recent limit increase they really jumped a level. I remember when everything was WAV and usually loops and usually just a few hundred KB.

    Mmm, using download links the program would need to check only which file format is served, before breaking connection with said file. I wonder if that can be done. IOW if the file format can be checked without the file being downloaded if you don't have a direct link to said file. Well, all this is way above my potential making anyway. Was that page-via-image-checking thing in Basic as well? Interesting method.
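One way this is commonly done is an HTTP HEAD request, which asks the server for the response headers only, so the Content-Type can be read without fetching the file. A Python sketch, with the caveat that whether the download links answer HEAD the same way they answer a normal browser download is an assumption nobody here has tested; the labels returned are invented for this example.

```python
import urllib.error
import urllib.request


def classify(status: int, content_type: str) -> str:
    """Label a response as "audio", "page" (HTML instead of a file) or "unavailable"."""
    if status >= 400:
        return "unavailable"
    kind = content_type.split(";", 1)[0].strip().lower()
    if kind.startswith("audio/") or kind == "application/octet-stream":
        return "audio"
    return "page"


def probe(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request (headers only, no file body) and classify the reply."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as error:
        # 4xx/5xx replies arrive as exceptions; only the status code matters here.
        return classify(error.code, "")
```

Running `probe` over a list of download links and counting the labels would give the "how many are actually attainable" statistic without storing a single track.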

    Nice research man! The best case scenario 22,265 is a small fraction of the bigger picture, but maybe some of those are files I do not yet have! When I've completed this collection I'm thinking I should be able to find some nifty little program to show all files, regardless of (sub)folder structure, sort those files alphabetically (via ID), and skim through them looking for missing numbers, then see if some of those exist via the Wayback! Will definitely give my own files a search before that, though. If I can access the individual page for each track, I'll be able to check for the embedded file, and get the actual filename, then check that via that other link. My old audio page has a pretty good overview of what I'm missing, which is fortunately not that much... some of the oldest ones, though. Nostalgic stuff.
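The "skim for missing numbers" step can be sketched in Python as well, assuming (hypothetically) that every file name in the collection begins with the numeric track ID. Sorting is not needed: the IDs are collected into a set and compared against the full range.

```python
import re
from pathlib import Path

LEADING_ID = re.compile(r"^(\d+)")  # assumed naming: file name starts with the track ID


def collected_ids(root) -> set[int]:
    """Collect the leading numeric ID of every .mp3 under root, at any folder depth."""
    ids = set()
    for mp3 in Path(root).rglob("*.mp3"):
        match = LEADING_ID.match(mp3.name)
        if match:
            ids.add(int(match.group(1)))
    return ids


def missing_ids(have: set[int], first_id: int, last_id: int) -> list[int]:
    """Return every ID in first_id..last_id (inclusive) that is not in have."""
    return [i for i in range(first_id, last_id + 1) if i not in have]
```

The resulting list is what would then be checked against the Wayback Machine.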

    Hmm, NG could be using a better way of embedding the files, or maybe the machine just recently started indexing them? It should be HTML5/JS if it's supposed to work via all devices, but if I right click a media player I still get the Flash menu... which seems strange considering the media player was a step away from Flash. So not sure what's happenin' there.

    Interesting... could use that bug to scrape up a list of missing audio too, and I spot some pretty popular names in that last page! All those great tracks missing. Fo shame. As for scraping btw, I'm thinking if I use HTTrack to save all those pages (7777 of them), then search and replace /listen/ with /download/ in them, then extract all links, then sort links alphabetically, then cut out the /download/ portion of it all... well uh... at this point of my idea it dawned on me I'd still have no usernames or anything linked to said files so... disregard that. Would have to be a program with said purpose. Question is: would it be worth the time to make? Or the money to hire someone to make it? Do you happen to know anyone who'd have an interest in making something like this at no cost? Just sent a PM to the guy who made that web player I linked to btw, maybe he's interested.

    Speaking of page bugs, I found a bug on my art page a few days back, it suddenly displayed 1065 pages. Was about to report it, but checked again and they'd fixed it already... so maybe these page things are a side-effect of some ongoing update. I did get a picture: http://cyberd.org/img/3/NG-B-Buggin.png

  9. S3C
    Friday Apr/3/2015

    The text mechanic tool essentially does the same thing as what I was generating in QBASIC, yes. More convenient too. An offline tool would still be my go-to choice for creating massive lists, though.

    Lol no April Foolery, and it was April 1st when I wrote my last post, as I'm 9 hours behind you. To clarify: BASIC is just a simple programming language that started out in DOS and eventually became QBASIC and then later progressed to a more sophisticated VisualBASIC with several other BASIC languages being developed. They all have the same general principles with minor changes in syntax, GUI, and whatnot. The language can be used to make decent programs but no serious programmer really uses the language to develop stuff- it's more so just a language to teach programming logic to starters.

    Batch (.bat) files are like QBASIC, but don't operate under the same commands. The example code you posted for .bat would be different than the code in QBASIC. They are essentially command prompt codes to be executed without having to type each line in to the prompt one at a time manually. Command prompt essentially executes batch files which can be used to run other programs (DOS programs such as QBASIC or Windows) or any other DOS or Windows process.

    You would use the specific BASIC program to run the BASIC script (in my case QBASIC runs the QBASIC files). The QBASIC files are .bas files, but you can write the code in just plain text and open that in QBASIC as well. You can run the QBASIC code through CMD (command prompt) but you would have to open QBASIC through command prompt first or use some sort of command to sequentially open and run the program in QBASIC. At that point it would probably just be more convenient to compile the QBASIC code as an .exe and directly run that through the CMD.

    A .bat file would trigger the QBASIC code which could specify the naming of the output file and folders. Anyways scrap QBASIC for now, as it was just a means of generating a list. I first mentioned using a .bat as a means of reading all the files you downloaded, using the name of the track to create a specific folder, and moving the tracks to that specific folder. I don't remember the specific commands used in a batch file to do all that, I just know that it's easily doable.

    The restricted NG member page is just a parent folder of the unscouted artists list: http://www.newgrounds.com/audio/member/unscouted_list

    In regards to the screenshot script dealy: I used BASIC to generate the lists, and batch file to read the links and execute the subsequent link into the program that takes the screenshot...I think.

    Would it be worth it to make?? I think so, it's good to have large archives of content just for the sake of documentation if not nostalgia, and to observe trends of the past that may get erased. Nothing to spend TOO much time on, though.

    Sounds like a solid method.
    -You could use HTTrack to create all the links, which have the artist name, track name, genre, and ID associated with them.
    -Assuming all the tracks in your collection have an artist name assigned to them, and you've downloaded each track from the artist, you could make a .bat file or QBASIC script to scan all the subfolders containing the NG audio to create a master text file of all the tracks you have downloaded already.
    -Another set of code (not QBASIC, as the files would be too large for it to handle; perhaps MySQL?) to use the artist name in the master text file, search for the artist name in your newly made list that links all the tracks via HTTrack, and delete any duplicates so you're not downloading tracks that you already have in your collection.
    -Use the download manager to automatically download files from your HTTrack list.
    -The files you downloaded may not have the artist or track names, but they all should have the track ID. Create another set of code that cross-references the track ID with the ID stored in the master list, and then renames the file with the track name and artist.
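The de-duplication and renaming steps in that list can be sketched in Python. Everything here is a stand-in: `catalog` represents the scraped master list as a mapping from track ID to (artist, title), `owned_ids` the IDs already in the collection, and the "Artist - Title (ID).mp3" naming scheme is just one possible choice.

```python
import re

UNSAFE = re.compile(r'[\\/:*?"<>|]')  # characters Windows does not allow in file names


def ids_to_download(catalog, owned_ids):
    """Return catalog IDs that are not in the collection yet, in ascending order."""
    return sorted(set(catalog) - set(owned_ids))


def descriptive_name(track_id, catalog):
    """Build "<artist> - <title> (<id>).mp3" from the catalog entry for track_id."""
    artist, title = catalog[track_id]
    clean = UNSAFE.sub("_", f"{artist} - {title}")  # swap out forbidden characters
    return f"{clean} ({track_id}).mp3"
```

Keeping the ID in the new file name means a renamed file can still be matched back to the master list later.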

    Hehe...that actually wasn't an error. NewGrounds became sentient, used your headphones to transmit a bug that scanned your brain for all the art you've made over the years. Just one more step in NewGrounds transition to AIGrounds. But once you realized all the 1000s of images that were brain-loaded to the portal, NewGrounds made the art that you didn't publish private. Just don't toggle the adult option, as a man creates a sexual image in his head 6 times a minute.

  10. Cyber
    Friday Apr/3/2015

    The browser's the limit!

    Aha, a concise summary of the Basics! Got it. Compiled as an .exe though, wouldn't you be able to run it directly?

    Ah, one folder for each file. Doesn't sound fully as functional that one.

    Aha, that's where. 'member' might be just a placeholder for the one section though, not an active part of the URL.

    Aha.

    Mmm (my affirmative responses contain way too many 'Ah's I realize), I'd like to get something like that going for scraping Flash submissions or art as well, as those are all embedded it's something doing manually would just not be worth doing. Which reminds me... submission #666,666 is closing in pretty fast for both Flash and Audio...

    Sounds like that'd work for renaming all existing files to include author and track title, but that main problem of how to assign new files to corresponding artist folders remains, and though the artist; submission name is associated with the link in the source files I scrape, how would I be able to tell HTTrack what to do with any information not part of the actual link, how could such information be applied to actively alter the download destination of each file? Seems complicated.

    lmao man, the 1st of April is no longer :) *toggles the Adult option...*

  11. S3C
    Friday Jun/19/2015

    Finally dug up this post...holy crap it's been pushed all the way to page 20.

    Yeah, you can just run the compiled BASIC code (.exe) directly...not sure why I said to use command prompt. Maybe run the .bat through a command prompt to run the BASIC .exe but then you could always just convert the .bat to .exe and run the whole thing as one standalone.

    I've been using HTTrack recently for some "proprietary" work and it works like a charm to grab websites! Yesterday I just retrieved some 400 html files (just the code, not any of the media or related web elements) out of 15000 links in about two hours. A bit slow though if you're going to use it to grab media files. Anyways, I think I may have devised a solution to scrape the AP...stay tuned...

  12. Cyber
    Friday Jun/19/2015

    You sure took your time! :P

    Good knowing!

    Yeah, I've been using HTTrack for a few pages too, most recently I downloaded an archive of all my BBS posts (good thing they're all browsable separately as they are) fearing I'll lose some really nostalgic thought-provoking worth-saving ones that I'll probably never actually look back at... but it feels good to have them saved anyway. So much stuff being removed all the time. Ah, nice! Definitely something I'd like to try! I'll be waiting right here then!

  13. S3C
    Wednesday Jul/8/2015

    I hereby dub this comment space web storage projects /general

    nice, I suppose you set the spider to bypass the robots.txt file, which I believe the NG forums have enabled (hence why they are not searchable on archive.org). But my newest HTTrack project: cataloging all the Bump music played on Adult Swim that is archived over at bumpworthy.com. Hope to publish it soon, but it would be much more useful to actually configure HTTrack to scrape youtube for the actual full tracks if available. Better yet, feed the links into a youtube mp3 converter. Though I've yet to find a (free) tool that does this en masse.

  14. Cyber
    Monday Jul/13/2015

    Good dub. :P

    Hmm, the BBS should still be indexed though, it's not blocked via robots.txt: http://www.newgrounds.com/robots.txt

    The 'pages' that are blocked are mostly not pages, just functions masked via URL, like when you post, search, etc. Though they did block the rankings page. :/ Which is a shame since I was using another service to monitor the latest deposits page earlier to scrape in new usernames for the hexlist. Fetching new names now is a bit more difficult... with a program you should be able to manage that, but all the online services have to respect such files and abide by such laws I suppose... and I can't have a program running 24/7 without a server (which I do have, though I've no idea how I'd do something like that with linux). Wonder why they've blocked the news and playlist directories too btw, maybe people were somehow mass-submitting to, scraping; abusing those...

    I'll have to give bumpworthy a visit soon as I get back! How much music would that be? Yeah I haven't found any YT conversion service that really makes things easy for you either, though via plugins and/or programs... you might be able to speed that up a bit!

  15. S3C
    Friday Jul/17/2015

    so do they even have Cartoon Network/Adult Swim in Sweden?

    hmm let me check...so there is a total of 7240 bumps (and growing, the real number of bumps must be much larger than that, as the program has been going on since September 2001 and features 1-4 unique(?) bumps between commercial breaks (2 per half hour, for 10 hours of programming 7 days a week O_O)) on the site, but 1941 unique identifiable songs (much of the music is still unknown). it's mostly independent stuff too. wonder if there is a chance to get NG music on there...

    Well google returns plenty of methods/tools to mass download youtube content, but I'm quite skeptical of how well they work. Speed is not so much an issue; thoroughness and minimizing tediousness are the most important aspects to me...

    Wasn't there a page on NG that listed the most recent users at one point?

    Huh, wonder why archive.org says the forums are protected then...the ultimate question here though being: is this planet itself protected by a robots.txt file in certain regions, so Google Maps/Earth cannot display certain sensitive areas (Area 51, Illuminati Hideout, secret NG HQ, Asian Breeding Grounds, etc.), and how easy would it be to bypass using non-government regulated satellites...

  16. Cyber
    Thursday Aug/27/2015

    With the right set of subscriptions, you do get Cartoon Network here if you do desire! Though I didn't realize at first what the topic had to do with the convo at hand, bumpworthy seems... interesting.

    Mmm the programs ought to offer better functionality than the online services do for all parts, though I've no specific recommendations. When the time comes (been planning to batch download all the music videos I've ever posted here, though it's not really a priority yet) I'll let you know what works how/well!

    There still is: http://www.newgrounds.com/rankings/signups

    Could scrape that page to get a complete list of all users too. :P Which... come to think of it might be very easy to use for malicious PM bombing purposes, etc. At least it would've been when it wasn't blocked via robots.

    Hah yeah, no doubt there's some censored areas... at least you can find Google's gigantic secret mountain compound. :P It's pretty cool.

  17. S3C
    Sunday Jul/31/2016

    yo, I got the NG AP catalogued in a large 25 MB spreadsheet just for general purpose. 412014/695022 tracks, not including unscouted. You're right, a lot of users with just 1 audio submission: 15,468 users WTF! Looks like my top ten one hit AP wonders list is a no go now lmao

  18. S3C
    Monday Aug/1/2016

    speaking of robots.txt, did you notice that NewGrounds disallows the archiving of userpages? But if you have submitted content, or made a newspost, the robots.txt command that blocks crawlers is ignored.

    Ironically, if you delete your account, there will be no robots.txt page and your userpage will continue to be archived. If a blank account is created, the robots.txt file will be regenerated and, once crawled by the Wayback Machine, the website will retroactively remove any previous pages that it saved.

    wasn't always this way...for the case of NewGrounds, it's understandable, to give users a certain privacy.

    in general though, the policy is complete BS...as a hacker can simply inject a robots.txt file onto a website they hijacked, deleting its past existence...if someone on NewGrounds was hacked, and had all their content deleted, there would be no means of recovering any of it.

    and there's somewhat of an outrage over at archive.org:
    https://archive.org/post/406632/why-does-the-wayback-machine-pay-attention-to-robotstxt

    there are other public archival services that ignore robots.txt, but they aren't nearly as robust as archive.org

  19. S3C
    Monday Aug/1/2016

    well I did some reading and it seems that once the robots.txt file is removed, the previously archived pages reappear on the Wayback Machine. Good to know they aren't permanently deleted.

    I made a top 100 audio submitters list on the W/HT forum...you're sitting at a ten-way tie at #205 with 148 submissions, despite the hijacker deleting most of your audio submissions in 2009. I'm at spot #463, a twelve-way tie with 100 submissions. didn't realize I had that many...

  20. Cyber
    Thursday Sep/1/2016

    Just a few days after I leave and you've got all this interesting stuff going on already! :O I'd love to have a copy of that file, if possible... though my archival project is at a standstill until whenever they remove that pesky 3 downloads/minute limit, or I can override it, or more realistically: I actually have the time to continue.

    I didn't notice that about robots.txt, interesting. And a relief that stuff isn't permanently deleted!! Ideal would be to have a privacy option available for each user, so they can choose if they want their content archived or not. I know I do...

    On mah way to check out that list now! :D Surprising how many users there are there that I didn't know about. Thought Chronamut would be a lot higher up, and I had TheComet and BowserThedestructive pegged for top spots. PERVOK hmm, namechange maybe? If I didn't have that limit this'd be the perfect place to start archiving too. Most tracks with least required amount of userpages to get to. Aaand now that there's a list I'd better get back on it again! :P Though I did decide a while back to stop submitting stuff just for quantity hmm, it's tempting...

  21. S3C
    Saturday Sep/3/2016

    PERVOK is formerly Zenon.

    ever thought of using a remote or proxy downloader?

    alright, I uploaded the list to the Hip Hop Folder.

  22. Cyber
    Saturday Sep/3/2016

    Ah, there's a name I recognize. I think I knew about that namechange too, bad memory...

    Hmm the idea sounds intriguing, though I'm not sure how that'd work in practice. It's not an automated process as is - I need to visit each page manually to grab the links for all audio files, and use a userscript to do so. Ideas? Services you know of? I tried the VPN approach, but that didn't make a difference, so I'm skeptical a regular proxy would either.

    Ooh, en route.

  23. S3C
    Monday Jan/27/2020

    so I'm picking this up again, and as an easy start I scanned the first 95 missing IDs at newgrounds.com/audio/listen. My goal here is to just get the name of the submission and author. Only 1 page is on Wayback (submission 25) and that's just the page mind you, didn't check if the audio itself was archived. About 15 or so have the NewGrounds 404 archived on Wayback, and the others don't exist within the archive at all.

    As noted above, this is just a quick and dirty way of doing it, and is not expected to retrieve many results. There's also scanning caches of the actual AP homepage, userpages, favorites, the forums, etc.

    Hey, do you remember the link to the audio pages before the 2007 redesign? All that I can find on Wayback is the dates after that.

    Also, Wayback is much improved. Something like this https://web.archive.org/web/*/http://www.newgrounds.com/audio/listen/* with 100,000+ pages doesn't crash your browser, and is searchable. I also know of an (inefficient) way to crawl it, and scrape all the links...

  24. S3C
    Tuesday Jan/28/2020

    hmm, I'm trying to crawl this https://web.archive.org/web/2003*/newgrounds.com/audio and it's harder than expected because the page is dynamic and I know little about webdesign. I'm trying to scrape the links to each snapshot, but the elements are nested and the client I'm using only seems to collect the 'outermost' elements. IOW, it only retrieves what you would get by viewing source, and not what you would retrieve by inspecting page elements.

    There's an element called '' that seems to encapsulate all the dynamic content. I'm trying to get the nested class 'calendar-layout' but my client cannot access it.

    I believe HTTrack will automatically crawl and get these links- but I'm trying to be more elegant about it, by using a Python script on Google's cloud, for speed.
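
One alternative to scraping the JavaScript-rendered calendar, sketched below, is to ask the Wayback Machine's CDX server for the list of captures directly; it can return plain JSON, so no rendered page is needed. This swaps in a different technique from the one described above rather than fixing it. The function names are made up for the example, and the sample response row is an invented placeholder, not real archive data.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, year):
    """Build a CDX query URL listing the captures of `url` within one year."""
    params = {
        "url": url,
        "output": "json",            # JSON rows instead of space-separated text
        "from": str(year),           # partial timestamps are accepted
        "to": str(year),
        "fl": "timestamp,original",  # only return the fields we need
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_links(cdx_json_text):
    """Turn a CDX JSON response (first row = field names) into snapshot URLs."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


query = build_cdx_query("newgrounds.com/audio", 2003)

# Made-up sample response for illustration; a real one would come from
# fetching `query`, e.g. with urllib.request.urlopen(query).read().
sample = '[["timestamp","original"],["20030228000000","http://newgrounds.com/audio"]]'
links = snapshot_links(sample)
print(query)
print(links)
```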

  25. Cyber
    Wednesday Jan/29/2020

    Oh nice! Hope you make even smoother progress this time. I actually thought about hiring a freelancer for my total collection thing, but not sure it's really an investment I'd feel proud of... stalling. Just would be nice to actually get it all done.

    As for the user audio pages before the redesign, if that's what you mean - they didn't really have individual pages, it was just the dropdown, and profile URLs by ID, for example https://web.archive.org/web/20050307062650/http://newgrounds.com/gold/profile/template.php3?id=1

    That is pretty cool! Bookmarking. Would that inefficient crawl method involve HTTrack too?

  26. Cyber
    Wednesday Jan/29/2020

    Regarding the element called ''...? Suppose there might be supposed to be an HTML tag there that the comment system cut out?

    There's a lot of nesting there hmm. First div id react-wayback-search, then by class: calendar-layout > calendar-grid > month > month-body > month-week > month-day-container > /web/20030228/newgrounds.com/audio

    ...it's the relative link too, if that's any trouble with HTTrack.

    Interesting. Can't really advise on programming there but hope you figure it out.

  27. Cyber
    Wednesday Jan/29/2020

    Oh actually: maybe you could have a script that generates or goes through the URLs day by day, without crawling for them first? The format seems simple enough, just /web/YEARMODA/newgrounds.com/audio

    ...would then also need a way to quickly discard crawls on the ones that show no results.

    Maybe the normal way's the fastest though.

  28. S3C
    Thursday Jan/30/2020

    you certainly could brute force day by day, though it would be slower and you'd miss additional hits that occurred on the same day needless to say. For days without a capture, Wayback actually redirects you to the most recent cached date.

  29. S3C
    Thursday Jan/30/2020

    So I think I'll try that brute force method^

    gonna start with the 2007 redesign AP first. I'll also collect the weekly top 5's and top 30s so the list has more purpose than just being an archive of entries. The weekly top 5s should be archived in the audio forum somewhere, but AFAIK the weekly top 30 isn't documented anywhere. Done with the top5 script...

  30. Cyber
    Thursday Jan/30/2020

    Ah yes, additional captures would have a different URL format I suppose, and accounting for those way more crawl time...

    Might work fast enough if you could skip duplicate entries or redirects somehow, but sure, first way seems like a better one if doable.

  31. Cyber
    Thursday Jan/30/2020

    Ah alright, just posted that and I see you settled on it already. XD Good luck.

    Didn't realize there was an alternative P-Bot-like Top 5 thread in the AP forum either hmm, bout to check that out now...

  32. S3C
    Thursday Jan/30/2020

    the full URL format is web/YYYYMMDDhhmmss/newgrounds...but I think incrementing by 1 day will suffice, as on average it looks like the Wayback archives the AP on two different days a week so there won't be that many duplicates due to redirecting back to a previous date.

    yeah, it was manually archived by users though unlike for flash portal submissions.
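
The day-by-day approach discussed above could look roughly like the sketch below: generate one URL per day in the /web/YYYYMMDD/ format, then drop requests that were redirected to a capture already seen. It assumes the final (post-redirect) URL carries the full 14-digit YYYYMMDDhhmmss timestamp; the function names are invented for the example, and the actual fetching is left out.

```python
import re
from datetime import date, timedelta


def daily_urls(start, end, target="newgrounds.com/audio"):
    """Yield one Wayback URL per calendar day from `start` to `end` inclusive."""
    day = start
    while day <= end:
        yield f"https://web.archive.org/web/{day:%Y%m%d}/{target}"
        day += timedelta(days=1)


# Full capture timestamps are 14 digits: YYYYMMDDhhmmss.
TIMESTAMP_RE = re.compile(r"/web/(\d{14})")


def capture_timestamp(final_url):
    """Extract the capture timestamp from a post-redirect URL, or None."""
    match = TIMESTAMP_RE.search(final_url)
    return match.group(1) if match else None


def unique_captures(final_urls):
    """Keep only the first URL seen for each distinct capture timestamp."""
    seen = set()
    unique = []
    for url in final_urls:
        stamp = capture_timestamp(url)
        if stamp is None or stamp in seen:
            continue  # no capture found, or a redirect to one already handled
        seen.add(stamp)
        unique.append(url)
    return unique


week = list(daily_urls(date(2007, 7, 1), date(2007, 7, 7)))
print(len(week), week[0])
```

Each generated URL would be requested in turn and the resulting final URL passed through `unique_captures`, so a stretch of days that all redirect to the same capture only gets scraped once.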

  33. Cyber
    Thursday Jan/30/2020

    Sounds good.

    Similar to threads on backgrounds and banners hmm, wish they'd all been kept alive through the years too. Big commitments though. Anyway, adding in those audio forum links here too for posterity:

    https://www.newgrounds.com/bbs/topic/1128764/1
    https://www.newgrounds.com/bbs/topic/998501/1

    Thanks again!

  34. S3C
    Sunday Feb/2/2020

    so I got 2007 scraped...surprisingly, took only five minutes, even with the redirects

  35. Cyber
    Sunday Feb/2/2020

    Ooh. :O Including files? Indexes only? That's pretty fast.

  36. S3C
    Tuesday Feb/4/2020

    just indexes of course...and thus, not really informative if all there is just the track ID and name. And if they made the top 30/top 5. Anyway scraped what I could from 2007-2011, recovered a total of 5189 entries. The next step: see which of these had submission pages archived by Wayback and lastly, see if the .mp3 was archived.
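
For the "was this submission page archived" step, one option is the Wayback Machine's availability endpoint, which returns a small JSON document describing the closest capture of a URL. A rough sketch follows; the helper names are made up, and the sample response is an invented placeholder in the shape that endpoint uses, not real data.

```python
import json
from urllib.parse import urlencode


def availability_url(page_url, timestamp=None):
    """Build an availability query, optionally asking for a capture near `timestamp`."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_text):
    """Return the closest snapshot URL from a response, or None if not archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


check = availability_url("http://www.newgrounds.com/audio/listen/25", "2008")

# Invented placeholder responses for illustration only.
archived = json.dumps({"archived_snapshots": {"closest": {
    "available": True, "status": "200", "timestamp": "20080101000000",
    "url": "http://web.archive.org/web/20080101000000/http://www.newgrounds.com/audio/listen/25"}}})
missing = json.dumps({"archived_snapshots": {}})
print(check)
print(closest_snapshot(archived), closest_snapshot(missing))
```

The same check could then be repeated against the audio file URL itself, once that is known, to see whether the .mp3 was captured as well.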

  37. S3C
    Tuesday Feb/4/2020

    so I was skimming the list and already found a good one, by kelwynshade (sometimes the authors put their name in the title), we have him as mutual favorites

    https://web.archive.org/web/20080313173804/http://www.newgrounds.com/audio/listen/106339

    wonder why he deleted it...eh hopefully he doesn't mind, but he doesn't come here, or NewGrounds either for that matter

    EDIT: well he did post on the Audio forum in the summer.

    EDIT x2: guess there's no point in specifying EDIT...unless you get notifications automatically and read within the short edit time frame...

    EDIT x3: can i post this when edit abilities expire in 5...4

  38. Cyber
    Tuesday Feb/4/2020

    Hmm that's a lot of scraping for a small amount of tracks, but if but a portion of those happen to be ones no longer online it seems worth it after all...

    Oh nice! Thanks for that. Shall check if it's in my downloaded archive already when I dig up the drive, DL in the meantime. That's kewl. I've tried to intentionally download repertoires of artists who've intentionally had their stuff deleted before too, maybe they have their reasons but it still just seems wrong to me. Gotta archive what's possible. Which reminds me it's been some time now since the last 'delete my account' topic popped up in the Wi/ht... wonder if those are just routed through PM or support email now or if less people are actually leaving, which would be cool. If we're just not getting those threads it's a bummer though, gives a little archiving time before they go...

    ...3, 2, 1...? Did you try going even further? I don't get notifications, but usually check in on comments via the admin panel every once in a while. Especially when you're active. :P

  39. S3C
    Monday Jan/4/2021

    wow I've learnt a lot about coding in the last five years.

  40. Cyber
    Monday Jan/4/2021

    As in: now so much more than these earlier posts show or: that these earlier posts in particular now show? :) Those five years sure moved fast...

    Any NG related archival progress since last?

  41. S3C
    Tuesday Jan/26/2021

    EDIT: web archive wildcard search/listing

    "Also, Wayback is much improved. Something like this https://web.archive.org/web/*/http://www.newgrounds.com/audio/listen/* with 100,000+ pages"

    Actually, the wildcard search results cap out at exactly 100,000, even if there are more pages on the archive. For example, audio ID 63000 is archived, but won't appear in search with the above URL. (seems like the top 100k results in alphabetical order are what the engine pulls).

    The workaround is just to conduct another listing using the leading digits of what you want to search for. For example, to get 63000, just search this page: https://web.archive.org/web/*/http://www.newgrounds.com/audio/listen/63*
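
The leading-digits workaround can be automated by generating one listing URL per prefix, as in the sketch below. Whether every prefix stays under the 100,000 cap is an assumption; a busy prefix might need to be split further by using a longer `prefix_len`. The function name is made up for the example.

```python
BASE = "https://web.archive.org/web/*/http://www.newgrounds.com/audio/listen/"


def prefix_listing_urls(prefix_len=2):
    """Return one wildcard listing URL per numeric prefix of the given length.

    Prefixes never start with 0, so prefix_len=2 covers 10* through 99*.
    IDs shorter than the prefix (e.g. single-digit IDs when prefix_len=2)
    are not matched and would need a separate, shorter-prefix pass.
    """
    first = 10 ** (prefix_len - 1)
    return [f"{BASE}{prefix}*" for prefix in range(first, 10 ** prefix_len)]


listings = prefix_listing_urls(2)
print(len(listings), listings[0], listings[-1])
```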

  42. S3C
    Tuesday Jan/26/2021

    Looks like links to unpublished or deleted audios on the forums have been replaced with 'Unpublished Submission' and 'deleted', leaving no trace. Sad!

  43. Cyber
    Tuesday Jan/26/2021

    Mmm, or simply navigate to said ID automatically then if you already know what ID you're looking for... but good note on the limitations/structure thereof. Seems like a lot of people might be browsing the archives lately too. Was down for a fair amount of time earlier today, when I was about to check out that audio-which-I-reviewed-now-removed link.

  44. Cyber
    Tuesday Jan/26/2021

    Dang. :/ It makes sense, but yeah, for ones that aren't archived there won't even be a trace as to what each submission was then, and all the more difficult to go digging for particular things elsewhere...

    I noticed that with regular submissions a while back. Occasionally you get the unpublished submission in the middle of a P-Bot post now. No longer any broken links or thumbs there for missing ones, just 'unpublished submission'.




© CyberD.org 2021
Keeping the world since 2004.