Posts Tagged ‘elmcity’

Elmcity Meeting Notes

October 7, 2009

Nikita, Jory, and I had a quick meeting today, Wednesday, at 6:30. That looks like a good time for everyone but Diane; we’ll have to figure out a better way to include her.


Nikita and Jory are working on exposing the site as a RESTful service.

Diane is researching natural language processing as applied to generalizing the scraper.

I’ll be working on making what we have a little more user-friendly. That starts with documentation, but I’ll be blogging about other ideas for making the service accessible to less technical users in the near future. It’s a worthwhile conversation to have early, so throw out some ideas on FriendFeed if you’ve got ’em.

We’re debating whether to entirely separate the background application and database from the web front end – currently Django, though we’d presumably move to something thinner if we split them. Nikita will post a proposal on ReviewBoard to kick off the conversation.
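
For concreteness, a “thinner” front end could be little more than a single WSGI callable. This is a hypothetical sketch only – the pluginFinder interface isn’t settled, so the feed body here is a hard-coded stand-in:

```python
# Minimal WSGI frontend (sketch). The pluginFinder call is faked with a
# stub, since its real interface hasn't been decided yet.
from wsgiref.simple_server import make_server

def application(environ, start_response):
    # the requested source page would arrive as the query string,
    # e.g. /?http://example.com/events
    source_url = environ.get("QUERY_STRING", "")
    # stand-in for something like: body = pluginFinder.scrape(source_url)
    body = b"BEGIN:VCALENDAR\r\nVERSION:2.0\r\nEND:VCALENDAR\r\n"
    start_response("200 OK", [("Content-Type", "text/calendar")])
    return [body]

# to serve it locally (blocks until interrupted):
#   make_server("", 8000, application).serve_forever()
```

Whether we’d actually want something this bare is exactly the question for the ReviewBoard discussion.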

[6:35:23 PM] *** Jory Graham added nikita.pchelin, sarah-strong ***
[6:36:07 PM] Jory Graham: Hey both of you.
[6:36:50 PM] Nikita Pchelin: allô!
[6:37:05 PM] sarah-strong: hey
[6:37:32 PM] Nikita Pchelin: and we are missing two other people =( but then again, they kinda did not respond to the email
[6:37:37 PM] sarah-strong: i’m not at home and having mic trouble, but i can listen in or just chat by text
[6:37:55 PM] Jory Graham: So far I think I’ve only used skype for text.
[6:38:05 PM] sarah-strong: heh, alright
[6:38:20 PM] Nikita Pchelin: text is fine
[6:38:29 PM] Jory Graham: Also, this should be an okay time for everyone, since it is basically the time we decided on the Sunday when Diane wasn’t there.
[6:39:22 PM] sarah-strong: so, i’ve been putting this project on the back burner but i’ve been keeping up with friendfeed traffic. you guys are both concentrating on db modifications and programmatic access friendliness?
[6:39:33 PM] Nikita Pchelin: yes
[6:39:45 PM] Jory Graham: Yep. Basically allowing it to be used as a service.
[6:39:50 PM] Nikita Pchelin: actually Jory creates an abstract layer for the db and changes the db itself
[6:39:58 PM] Nikita Pchelin: i am concentrating on the actual communication part
[6:40:09 PM] Nikita Pchelin: diane was working on the AI aspect
[6:40:16 PM] sarah-strong: cool.
[6:40:37 PM] Nikita Pchelin: and that’s all I know, I am not sure who is doing what apart from the abovementioned
[6:41:16 PM] Jory Graham: Meghan had a post about writing unit tests, but it wasn’t aggregated into the dev room.
[6:41:29 PM] sarah-strong: ok. still need: interim/incremental/alternative solutions for the parsing problem, but that could wait until diane gets back with her feasibility results, maybe?
[6:41:38 PM] Nikita Pchelin: oh right, I vaguely remember that
[6:41:51 PM] Nikita Pchelin: yes and no
[6:41:54 PM] sarah-strong: oh? i’ll add her blog to my feed
[6:42:09 PM] Jory Graham: Parsing is definitely a longer term piece.
[6:42:14 PM] Nikita Pchelin: to be quite honest I am not sure how much FuseCal was AI-fied
[6:42:14 PM] Jory Graham: But trying to find the general patterns is a good step in the right direction.
[6:43:06 PM] sarah-strong: totally. and to find general patterns, having a few more site-specific parsers could help.
[6:43:42 PM] Nikita Pchelin: alright then
[6:43:47 PM] Nikita Pchelin: the other big topic is documentation
[6:44:24 PM] Nikita Pchelin: how are we going to maintain that? do we want a little wiki page on the site?
[6:44:34 PM] sarah-strong: i’d like to do a bit of research on the users semantically marking up a page and translating that into a plugin problem, if only because four developers i’ve mentioned it to claim they’ve seen something like it, but can’t remember what exactly it was
[6:45:08 PM] Jory Graham: Did you check out Autopager?
[6:46:35 PM] sarah-strong: ah! you were one of them, i didn’t remember who mentioned that. no, i didn’t have the proper name for it. i’ll look into it and post about it by friday
[6:47:03 PM] Jory Graham: Cool. It’s definitely related in concept, if not in execution.
[6:47:57 PM] sarah-strong: i’d actually like to start in on making what we have understandable to people outside our team this week. i’ll try to put together a blog post with basically what we would need so that someone with a myspace band could stumble across our site and use it without further explanation
[6:48:26 PM] sarah-strong: start with super general documentation (wtf is ics, how do i use this file)
[6:48:54 PM] sarah-strong: then a lighter bit of documentation on how to use the current version
[6:49:16 PM] Nikita Pchelin: right, my suggestion would be, that you would also want to add to the documentation page of the website (or maybe just in docs folder in src, so it’s all in one place)?
[6:49:42 PM] Jory Graham: Agreed.
[6:50:01 PM] Jory Graham: Also, maybe eventually it could be destined for a wiki, but for now we’ll be the only people writing it, so flat files are probably fine.
[6:50:04 PM] sarah-strong: then look into feasibility of extra stuff like fuzzy url matching or site-specific search bars to do myspace band page –> all dates, for instance
[6:50:36 PM] Nikita Pchelin: okey, flat files it is then
[6:50:54 PM] Nikita Pchelin: I am just writing Michael so that we can get a mirror page running
[6:51:03 PM] Nikita Pchelin: the one will be for the developing
[6:51:13 PM] Nikita Pchelin: the other one will be for testing )
[6:51:40 PM] Nikita Pchelin: I talked to Jory about it the other day, it seems to be a good idea, because adding new features often breaks the front end
[6:52:02 PM] sarah-strong: stable/dev branches? it’s a very good thought, we’ll have to get in much better communication with the group when we implement it
[6:52:11 PM] Jory Graham: Indeed. I think he’s out of the office for most of this week, but he should be able to get something up fairly soon.
[6:53:13 PM] sarah-strong: jory, since you’re more comfortable with hg, could you maybe decide on how that’ll work?
[6:53:29 PM] Nikita Pchelin: oh yes that’s a good question
[6:53:36 PM] Nikita Pchelin: I meant to talk to you about it
[6:53:39 PM] Jory Graham: I don’t think it’s necessarily a matter of branches, so much as deciding which releases count as stable.
[6:53:52 PM] Nikita Pchelin: I had an idea
[6:54:03 PM] Nikita Pchelin: of separating frontend into a separate directory
[6:54:13 PM] Nikita Pchelin: sorry  separate repo*
[6:54:21 PM] sarah-strong: cool, that’s the simplest way i could think of, but again, i’m not so versed in dvc
[6:54:25 PM] Nikita Pchelin: because what happens now on the server is
[6:54:40 PM] Nikita Pchelin: we have a frontend folder, which is apache’s rooy direcotry
[6:54:59 PM] Nikita Pchelin: and then within this folder we have ./src directory which has all code
[6:55:14 PM] Nikita Pchelin: including the frontend (yes, it’s almost recursive!)
[6:55:22 PM] Nikita Pchelin: which is kinda ugly
[6:55:55 PM] sarah-strong: separate repos is way more ugly imo. Is there a utility hit to keeping it as is?
[6:56:42 PM] sarah-strong: since that’d mean keeping two repos synced and making sure to update each at once, which should be simple but since we haven’t settled on structure would probably get mucked up fast
[6:56:55 PM] Nikita Pchelin: no well
[6:57:18 PM] Nikita Pchelin: you dont have to keep them synced
[6:57:40 PM] Nikita Pchelin: two repos won’t have any shared files
[6:57:41 PM] Nikita Pchelin: we are just separating frontend (django) from the rest of the system
[6:58:03 PM] Jory Graham: But there’s shared knowledge between them. Django related stuff still needs to know which methods need to be called, etc.
[6:58:15 PM] sarah-strong: synced in that undergraduate coders will likely introduce unfortunate dependencies between the two repos as we radically change plans as we go
[6:59:12 PM] sarah-strong: on the other hand, if we have great confidence in our ability to do this well, separating them would encourage good encapsulation
[6:59:24 PM] sarah-strong: i.. don’t 😛
[6:59:40 PM] Nikita Pchelin: lol 😀
[7:00:10 PM] Jory Graham: Regardless of which web technology we implement our service on, the fact that there’s a web frontend is fairly core to the project.
[7:00:10 PM] Nikita Pchelin: well I am just throwing it out there, let’s think about it
[7:00:28 PM] Nikita Pchelin: that’s true
[7:00:38 PM] Nikita Pchelin: but we ideally
[7:00:47 PM] Nikita Pchelin: the service should work as a service
[7:00:53 PM] Nikita Pchelin: once we have a protocol
[7:01:22 PM] Nikita Pchelin: if let’s say we strip the project of its frontend, we can still add a 10-line wsgi script to serve out pluginFinder through the web
[7:01:58 PM] Jory Graham: I’d still call that a web frontend.
[7:02:08 PM] Jory Graham: Not necessarily a pretty one, but I think it still fits the bill.
[7:02:28 PM] Nikita Pchelin: true
[7:02:48 PM] Nikita Pchelin: but the question is if we want to be dependent on django as our frontend
[7:03:55 PM] Nikita Pchelin: for example, I did not use django database models in my python code, because doing that assumed that any setup of the project has to use django as its frontend, as opposed to using a mysql database with any sort of frontend technology, like wsgi in the simplest case
[7:03:56 PM] sarah-strong: ok, i’ve got a thought. This seems like a big, integral to the project debate, and one that reviewboard is designed to mediate well
[7:04:13 PM] Jory Graham: True. Plus we haven’t used it at all yet.
[7:04:35 PM] Nikita Pchelin: alright, then we can talk about it there, I’ll post
[7:04:52 PM] sarah-strong: maybe we should try this discussion on there? it’s apparently good for UMLish diagrams with modifications and comments
[7:05:05 PM] Nikita Pchelin: Is there anything else we want to discuss, or can we adjourn the meeting?
[7:05:09 PM] sarah-strong: if i understood meghan correctly
[7:05:34 PM] Nikita Pchelin: yep we’ll talk there
[7:05:39 PM] sarah-strong: i’m fine with adjournment
[7:05:56 PM] Jory Graham: Cool. Well it sounds like we all know what we should do for the rest of the week.
[7:06:03 PM] sarah-strong: nikita: could you post to friendfeed when you’ve posted to reviewboard (unless it has internal syndication?)
[7:06:03 PM] Nikita Pchelin: awesome
[7:06:20 PM] Nikita Pchelin: talk to you later; in an asynchronous way! 😀
[7:06:30 PM] Jory Graham: Even if it has syndication, I doubt it’s been added to the room yet.
[7:06:35 PM] sarah-strong: oh, i can post the transcript and a quick synopsis if that’s ok with you two
[7:06:43 PM] Jory Graham: But of course.
[7:06:46 PM] Nikita Pchelin: yep
[7:06:50 PM] Jory Graham: One last thing, a light note.
[7:07:07 PM] Jory Graham: There was a joke on The Big Bang Theory on Monday night about iCal syndication.
[7:07:20 PM] sarah-strong: we’ve hit the big time, guys
[7:07:22 PM] Jory Graham: And if your technology has made it to a prime-time show, that’s always a good sign.
[7:07:23 PM] sarah-strong: group hug!
[7:07:33 PM] Nikita Pchelin: aah I have to watch it, I missed a couple of episodes
[7:08:25 PM] Jory Graham: Alright, we’ll talk to you both later.

Academic Link for Elmcity

September 29, 2009

The elmcity team had our three day code sprint and face-to-face meet this weekend, and we’ve got a little demo up.

It’s a framework for users to input URLs pointing to websites that contain calendar data, and get back .ics files ready to import into their calendar app. The meat of the program will be a generalized web-scraper if possible, and if not, a large set of site-specific scrapers. The site-specific scrapers are small plugins with rules for gleaning meaningful calendar data from a website with particular formatting.

We hope to offer elmcity plugin generation as a ready-made, real-life project for first- or second-year programming courses.

Each plugin is just about the right size for a small early-university project. I had a similar assignment in CSC207 at UofT last year, one intended to exercise regular-expression skills and demonstrate the idea of markup.
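
To make the scale concrete, here’s a sketch of what one such plugin might look like. The page format, class names, and field names are invented for illustration – they aren’t taken from any real site:

```python
import re

# One hypothetical site-specific plugin: a regular expression that knows
# this (invented) site's event markup, plus a function returning event dicts.
EVENT_RE = re.compile(
    r'<div class="event">\s*'
    r'<h3>(?P<name>[^<]+)</h3>\s*'
    r'<time>(?P<date>\d{4}-\d{2}-\d{2})</time>\s*'
    r'<p class="venue">(?P<location>[^<]+)</p>'
)

def scrape(html):
    """Return a list of {name, date, location} dicts found in the page."""
    return [m.groupdict() for m in EVENT_RE.finditer(html)]

page = """
<div class="event"><h3>Open mic</h3><time>2009-10-09</time>
<p class="venue">The Cavern</p></div>
<div class="event"><h3>Book club</h3><time>2009-10-16</time>
<p class="venue">Main Library</p></div>
"""
events = scrape(page)  # two events, each with name/date/location fields
```

A student assignment would essentially be: write `EVENT_RE` and `scrape` for the site you’ve been assigned, and make the test suite pass.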

What students get

  • A highly motivating real-life project
  • Specific success criteria and a useful end-product (the feed, maybe one of a sports team they’re on, for instance)
  • A website where prospective employers can see their plugin in action

What professors get

  • A ready-made, motivating assignment without existing solutions out on the web
  • One with very specific success criteria, including automated testing of plugins
  • A flexible assignment: you could specify that regular expressions must be used or allow students to try using a page parser. You could require a variety of coding styles. We could potentially call and accept input from plugins in any language, as well, by setting string standards for communicating calendar data.

What elmcity gets

  • Real-life users give us specific motivation to code this project well:
    • Make sure that testing the code and running the server locally is a one-liner, so students can easily try their work
    • Include clear, concise documentation with illuminating examples
    • Make a clean, easy-to-understand plugin interface
    • Make a professional-looking, easy-to-browse project website
  • Longer legs for the project by ensuring there’s motivation to continually improve the set of plugins available, and by introducing the project to a new group of potential developers each semester.
  • It also means that the UCOSP group would be partially freed up from repetitive plugin writing, so we can concentrate on the core features and the standards mentioned above.

Other possible sources of web scrapers:

  • Users who can code a bit. This would be a lot like the academic source, but it would require a more formal set of requirements for human review, because the server would be executing code from out in the wild without TAs in the middle to ward off malicious bits. We’d also have to inspire enough user loyalty to convince people to lend a hand.
  • If we have a lot of extra time and expertise on our hands, we could investigate the possibility of a browser plugin that would allow non-technical users to mark up a page to generate a scraper. I’m envisioning turning it on and highlighting or circling the time/date, name, and location of two events on the page, and marking them as such. The plugin would find the corresponding text in the source and generate rules on where it’s located in the document tree. The user would get back immediate results for the rest of the events they can check. I haven’t thought this through and I have no idea whether it’s at all feasible, but it would be cool.
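
The mark-up-a-page idea is vague, but its core mechanism – turn one highlighted example into a document-tree rule, then apply the rule to sibling events – can at least be prototyped. This toy version assumes well-formed markup; the sample page structure and class names are invented:

```python
import xml.etree.ElementTree as ET

# Toy markup standing in for a real events page (invented structure).
SAMPLE = """<div>
  <ul class="events">
    <li><span class="date">Oct 9</span> <span class="name">Open mic</span></li>
    <li><span class="date">Oct 16</span> <span class="name">Book club</span></li>
  </ul>
</div>"""

def label(elem):
    """One path step: tag name plus class, e.g. 'span.date'."""
    cls = elem.get("class")
    return elem.tag + ("." + cls if cls else "")

def path_to_text(elem, text, trail=()):
    """Depth-first search for the element holding `text`; return its path."""
    trail = trail + (label(elem),)
    if (elem.text or "").strip() == text:
        return list(trail)
    for child in elem:
        found = path_to_text(child, text, trail)
        if found:
            return found
    return None

def apply_rule(elem, rule, trail=()):
    """Collect the text of every element whose path ends with `rule`."""
    trail = trail + (label(elem),)
    hits = [(elem.text or "").strip()] if list(trail[-len(rule):]) == rule else []
    for child in elem:
        hits += apply_rule(child, rule, trail)
    return hits

root = ET.fromstring(SAMPLE)
rule = path_to_text(root, "Oct 9")[-2:]   # generalize: keep the last two steps
dates = apply_rule(root, rule)            # the rule now matches every date cell
```

The hard parts a real tool would face – messy real-world HTML, deciding how much of the path to generalize, and handling pages where events don’t share structure – are exactly what this sketch glosses over.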

Like it? Hate it? Need a whole lot of questions answered before you have an opinion? Please leave comments if you have any thoughts about the idea. Thanks!


September 25, 2009

Diane Tam points out a great, simple Mercurial tutorial. Jory noted that the instructions here neglected to update the local repository after pulling changes. Thanks to their hard work, we’re up and running.

Thanks, guys!

/broken instructions removed so as not to be a trap

Elmcity: Reviewing the first meeting

September 14, 2009

I was the only new member of the Elmcity team who couldn’t make the first meeting, so I thought I’d catch up by posting my thoughts on the transcript.

They discussed two projects for civic-minded student programmers: a scraper for calendar-like web pages, and a system for finding implicit recurring events in online plain text (like “meets every Thursday…”).

We’ll all be tackling the scraper problem first.

Event scraper specs:

These are very preliminary and informal.

Users can flag a MySpace site and get back an iCalendar feed of events, optionally keyword-filtered.

Flagging means bookmarking the page to a Delicious account with tags monitored by elmcity.

The .ics feed will be found on the elmcity site (for now, a simple local dummy version).
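
As a sketch of what even the dummy feed needs to produce, here’s a minimal hand-rolled iCalendar serializer. The event field names are my own assumptions, and a real feed would also need text escaping, UIDs, and time zones:

```python
# Build a minimal .ics document from scraped events (sketch only).
def to_ics(events):
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:-//elmcity sketch//EN"]
    for ev in events:
        lines += [
            "BEGIN:VEVENT",
            "SUMMARY:" + ev["name"],
            "DTSTART;VALUE=DATE:" + ev["date"],   # all-day event, e.g. 20091009
            "LOCATION:" + ev["location"],
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines) + "\r\n"            # iCalendar uses CRLF line endings

feed = to_ics([{"name": "Open mic", "date": "20091009", "location": "The Cavern"}])
```

Anything that produces this shape of text will import cleanly into most calendar apps, which is what makes the scraper and the feed side so separable.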

The plan now

The plan is to start off by implementing scrapers for MySpace and LibraryThing. They’ll both use the same service component (referred to above as “elmcity”). We’ll all be working as a team on this test run, but since there was no discussion of a shared repository or another collaboration location, I believe we’ll each start our own version of this in close communication with each other.

Other things discussed

The implicit recurring event finder would need significant user intervention to verify generated events. Jon mentioned Amazon’s Mechanical Turk service as a model.

The best solution would be to obviate the need for our scraper service by convincing content curators to release feeds themselves. Jon tried, without success, to convince the MySpace folks to do so.

My thoughts

Having a small project to work on individually to get acquainted with the problem seems like a good idea. I’d be a little happier if I had some dated milestones, particularly about when we’ll try to merge our code to start on the group project in earnest, but that’ll come.

What are the barriers to releasing feeds for calendar-like pages?

  • An iCalendar feed logo could raise awareness. Maybe the RSS logo with a date in the corner – something very obvious and visual. People use podcast and RSS links; we should perhaps aim for a Firefox plugin that makes the scraper fit into that established workflow of “go to content page -> click logo -> autorun/copy-paste feed into program”
  • Presumably for-profit content curators don’t want to lose pageviews by releasing feeds. If we figure out a place for both the content source link and site name+slogan, we might be able to convince curators like the free weeklies that having their tiny ad show up on everyone’s calendar several times a day cements their place as the canonical source for all events in their niche.

I should ask how free our work will be. The scraper idea is based on a now-defunct company by the name of FuseCal. It would be frustrating if our end product were another venture that could die off and need to be recreated because its source is unavailable.


September 10, 2009

I’m Sarah Strong, a third year undergraduate student at the University of Toronto, and I’ll be working on the Elmcity project this semester with UCOSP.

Free astronaut shirt!

I spent the summer of 2008 working at The Centre for Global E-Health Innovation on their remote patient monitoring system. It’s a Rails app that listens for home-recorded medical readings and sends out alerts to both doctor and patient if something’s wrong. It was a fantastic crash course in agile methods and the practical side of software development.

The next summer, I worked on TracSNAP, the Trac Social Network Analysis Plugin. It’s written in Python and Flare, and it’s designed to help developers on large, disorganized code bases connect with colleagues. I learned about program design and the challenges of making a novel app useful and easy to use.

In the past, I’ve worked as an anti-homophobia workshop facilitator, a conversational English teacher in China, a corporate ghostwriter, a copy editor, and a screen printer, and I’ve run my own business helping people find positive-expectation bonuses at online casinos.