Web Engineering Lecture One
- On Web Engineering
- Software Engineering vs Web Engineering
- Web technologies: hypertext, hypermedia, client/server, etc.
- Search engines: searching, indexing, crawlers, etc.
- Search Engine Optimization
- Web metrics and quality

Web engineering
- A systematic, scientific, engineering and management approach to develop, deploy and maintain high-quality Web applications.
- Focuses on sound methodologies, techniques, and tools for developing Web applications.
- Web engineering is defined as "...the use of scientific, engineering, and management principles and systematic approaches with the aim of successfully developing, deploying and maintaining high quality Web-based systems and applications..."
- Web development also has an important artistic side.
- How do Web apps differ from traditional software development / information systems / computer application development?

Characteristics of Web apps
- Web apps constantly evolve. Unlike conventional software, which goes through planned and discrete revisions at specific times in its lifecycle, Web applications continuously evolve in terms of their requirements and functionality (instability of requirements). Managing the change and evolution of a Web application is a major technical, organizational and management challenge, much more demanding than traditional software development.
- Web apps are inherently different from conventional software. The content, which may include text, graphics, images, audio, and/or video, is integrated with procedural processing. The way in which the content is presented and organized also has implications for the performance and response time of the system.
- Web applications are meant to be used by a vast, variable user community: a large number of anonymous users with varying requirements, expectations, and skill sets. The user interface and usability features therefore have to meet the needs of a diverse, anonymous user community to whom we cannot offer training sessions, which complicates human-Web interaction (HWI), user interface design, and information presentation.
- In general, many Web-based systems demand a good look and feel, favoring visual creativity and the incorporation of multimedia in presentation and interface.
- Technology instability: new tools, technologies, languages and standards to cope with. Web app development uses cutting-edge, diverse technologies and standards and integrates numerous varied components, including traditional and non-traditional software, interpreted scripting languages, HTML files, databases, images, and other multimedia components such as video and audio, as well as complex user interfaces.
- Delivery medium is different from that of traditional software.
- Security and privacy needs of Web-based systems are more demanding than those of traditional software.

Web Apps vs Conventional Software
- Compared with respect to their development process, technologies, quality factors, and measures.

Web Hypermedia, Web Software, or Web Application?
- Hypermedia is an extension of hypertext; the Web is the best known example of a hypermedia system.
- The Web has been used as the delivery platform for three types of applications: Web hypermedia applications, Web software applications, and Web applications.
- Web hypermedia application: a non-conventional application characterized by the authoring of information using nodes (chunks of information), links (relations between nodes), anchors, and access structures (for navigation), and by delivery over the Web. Technologies: HTML, XML, JavaScript, and multimedia.
- Web software application: a conventional software application that relies on the Web or uses the Web's infrastructure for execution. Typical applications include legacy information systems such as databases, booking systems, e-commerce apps, etc. They employ development technologies (e.g. DCOM, ActiveX), database systems, and development solutions (e.g. J2EE).
- Web application: an application delivered over the Web that combines characteristics of both Web hypermedia and Web software applications.

Web Development vs. Software Development
- Areas of difference for Web development and maintenance: the people involved, the intrinsic characteristics of Web apps, and the audience.
- Differences between Web and software development can be divided into 12 areas:
  - application characteristics
  - primary technologies used
  - approach to quality delivered
  - development process drivers
  - availability of the application
  - customers (users/stakeholders)
  - update rate/maintenance cycles
  - people involved in development
  - architecture and network
  - disciplines involved
  - legal, ethical and social issues
  - information structuring and design

Application Characteristics

Primary Technologies Used
- Web apps use technologies such as Java solutions (JavaBeans, JSP, etc.), HTML, XML, JavaScript, and databases.
- Software development uses technologies such as object-oriented or procedural languages, databases, generators, and CASE tools.

Approach to Quality Delivered
- Web apps are expected to be of high quality so that customers return to do repeat business.
- Usability, accessibility and graphic design become very important.
- Competition for users on the Web is high, so popularity is important.

Development Process Drivers
- The dominant development process drivers for Web companies are three quality criteria: reliability, usability, and security.
- For conventional software development, the development process driver is time to market, not quality criteria.

Disciplines Involved
- A wide range of skills and expertise is required for Web apps: distinct disciplines such as software engineering (development methodologies, project management, tools), hypermedia engineering (linking, navigation), requirements engineering, usability engineering, information engineering, graphic design, and network management (performance measurement and tuning).
- For conventional software, a smaller set of disciplines is required, such as software engineering, requirements engineering, and usability engineering.

Information Structuring and Design
- Web applications present structured and unstructured content, which may be distributed over multiple sites and use different systems (e.g. database systems, file systems, multimedia storage devices).
- The design of a Web application, unlike that of conventional software applications, includes the organisation of content into navigational structures by means of hyperlinks.
- Suitable navigational structures need to be designed.
Technologies for Web Apps
- The choice of appropriate technologies is an important success factor in the development of Web applications.
- Markup, hypertext, hypermedia, client/server communication, sockets.
- Define the WHAT of a system: define the requirements of the Web app, identify the architecture, develop a design, etc.
- Define the HOW (implementation phase): the choice of appropriate technologies.
- Separation of content and presentation is a central requirement for using technologies appropriately.
- The specifics of implementation technologies for Web applications versus conventional software systems stem from the use of Web standards.
- This concerns in particular the implementation within three views: request (client), response (server), and the rules for the communication between these two (protocol).
- Protocols: HTTP, SMTP, FTP
- Client technologies: HTML, plug-ins, Java applets, ActiveX controls
- Server technologies

Markup
- Instructions for document formatting. For example, in a simple markup convention we could write *Hello* to output Hello in bold, or /Hello/ to output Hello in italics.
- Markup is text inserted in a document to add information as to how characters and contents should be represented in the document.
- SGML
- HTML/XML

Hypertext and Hypermedia
- Hypertext is understood as the organization of the interconnection of single information units.
- Relationships between these units can be expressed by links.
- Hypermedia is commonly seen as a way to extend the hypertext principle to arbitrary multimedia objects, e.g., images or video.

Client/Server Communication on the Web
- The client/server paradigm underlying all Web applications forms the backbone between a user (client or user agent) and the actual application (server).
- 2-layer architecture
- Protocols: SMTP, RTSP, HTTP

SMTP
- Simple Mail Transfer Protocol
- SMTP combined with POP3 and IMAP allows us to send and receive e-mails.
- In addition, SMTP is increasingly used as a transport protocol for asynchronous message exchange based on SOAP.
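To make the SMTP bullets concrete, here is a minimal sketch of handing a message to an SMTP server from Java. It assumes the JavaMail (javax.mail) library is available on the classpath; the host name and addresses are placeholders for illustration, not anything prescribed by the lecture.

```java
import java.util.Properties;
import javax.mail.Message;
import javax.mail.Session;
import javax.mail.Transport;
import javax.mail.internet.InternetAddress;
import javax.mail.internet.MimeMessage;

public class SmtpSendSketch {
    public static void main(String[] args) throws Exception {
        // SMTP server to relay the message through (placeholder host and port).
        Properties props = new Properties();
        props.put("mail.smtp.host", "smtp.example.org");
        props.put("mail.smtp.port", "25");
        Session session = Session.getInstance(props);

        // Compose a simple text message.
        MimeMessage msg = new MimeMessage(session);
        msg.setFrom(new InternetAddress("alice@example.org"));
        msg.setRecipient(Message.RecipientType.TO, new InternetAddress("bob@example.org"));
        msg.setSubject("Hello over SMTP");
        msg.setText("Sent by the SMTP sketch from the lecture notes.");

        // Hand the message to the SMTP server; retrieving it on the other side
        // would go through POP3 or IMAP, as noted above.
        Transport.send(msg);
    }
}
```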
RTSP
- Real Time Streaming Protocol
- A standard designed to support the delivery of multimedia data under real-time conditions.
- In contrast to HTTP, RTSP allows the transmission of resources to the client in a timely context rather than delivering them in their entirety (at once).
- This transmission form is commonly called streaming.
- Streaming allows us to manually shift the audiovisual time window by requesting the stream at a specific time, i.e., it lets us control the playback of continuous media.
- From Wikipedia:
  - The transmission of streaming data itself is not a task of the RTSP protocol.
  - Most RTSP servers use the Real-time Transport Protocol (RTP) for media stream delivery.
  - While similar in some ways to HTTP, RTSP defines control sequences useful in controlling multimedia playback.
HTTP
- HyperText Transfer Protocol
- A text-based, stateless protocol controlling how resources, e.g., HTML documents or images, are accessed.

Session Tracking
- Interactive Web applications must be able to distinguish requests by multiple simultaneous users and identify related requests coming from the same user.
- A session defines a sequence of related HTTP requests between a specific user and server within a specific time window.
- Since HTTP is a stateless protocol, the Web server cannot automatically allocate incoming requests to a session.
- Two principal methods can be distinguished to allow a Web server to allocate an incoming request to a session:
  - In each of its requests to a server, the client identifies itself with a unique identification; all data sent to the server are then allocated to the respective session.
  - All data exchanged between a client and a server are included in each request a client sends to the server, so that the server logic can be developed even though the communication is stateless.
- Session tracking is normally implemented by URL rewriting or cookies, as in the sketch below.
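A minimal sketch of session tracking on the server side, assuming the standard Java Servlet API (javax.servlet; newer containers use jakarta.servlet) rather than anything specific to the lecture: the servlet container issues a session cookie (typically JSESSIONID) or rewrites URLs, so that later requests from the same user can be allocated to the same session even though HTTP itself is stateless.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

// Counts requests per user session; the container matches follow-up requests
// to the session via a cookie or URL rewriting.
public class VisitCounterServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        HttpSession session = request.getSession(true);   // create a session if none exists yet

        Integer visits = (Integer) session.getAttribute("visits");
        visits = (visits == null) ? 1 : visits + 1;
        session.setAttribute("visits", visits);           // state kept on the server, not in HTTP

        response.setContentType("text/plain");
        PrintWriter out = response.getWriter();
        out.println("Session ID: " + session.getId());
        out.println("Requests in this session: " + visits);
    }
}
```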
Client Technologies
- Helpers and plug-ins, e.g., Adobe Reader, WinZip
- Java Applets
- ActiveX Controls
- Document-specific technologies: HTML, XML, XSL/XSLT, SVG, SMIL
- SVG (Scalable Vector Graphics)
  - Allows describing two-dimensional graphics in XML.
  - SVG recognizes three types of graphics objects: vector graphics consisting of straight lines and curves, images, and text.
  - Supports event-based interaction, e.g., responses to buttons or mouse movements.
  - This format is suitable for all types of interactive and animated vector graphics.
  - Application examples include the representation of CAD drawings, maps, and routes.
- SMIL (Synchronized Multimedia Integration Language)
  - Used to represent synchronized multimedia presentations.

Server-Side Technologies
- URI handlers to process HTTP requests
- Server Side Includes (SSI)
- CGI
- Server-side scripting
- Servlets
- JSP
- ASP.NET
- Web Services
- Middleware technologies
- Application servers
- Messaging systems/brokers
Web Application Architectures
- The quality of a Web application is considerably influenced by its underlying architecture.
- Components of a generic Web application architecture: components based on the request-response paradigm.

Components

Client
- Browser or user agent.

Firewall
- A piece of software regulating the communication between insecure networks (e.g., the Internet) and secure networks (e.g., corporate LANs). This communication is filtered by access rules.

Proxy
- A proxy is typically used to temporarily store Web pages in a cache.
- However, proxies can also assume other functionalities, e.g., adapting the contents for users (customization), or user tracking.
- A proxy is used as an intermediate server to forward client requests for URLs to the (actual) server.
- Proxies are also used to adapt and format links and contents for users.

Web Server
- A Web server is a piece of software that supports various Web protocols, e.g., HTTP and HTTPS, to process client requests.

Database Server
- This server normally supplies an organization's production data in structured form, e.g., in tables.

Media Server
- This component is primarily used for content streaming of non-structured bulk data (e.g., audio or video).

Content Management Server
- Similar to a database server, a content management server holds contents to serve an application. These contents are normally available in the form of semi-structured data, e.g., XML documents.

Application Server
- An application server holds the functionality required by several applications, e.g., workflow or customization.

Legacy Application
- A legacy application is an older system that should be integrated as an internal or external component.

Data Aspect Architectures
- Data can be grouped into one of three architectural categories: (1) structured data of the kind held in databases; (2) documents of the kind used in document management systems; and (3) multimedia data of the kind held in media servers.

Architectures for Multimedia Data
- The ability to handle large data volumes plays a decisive role when designing systems that use multimedia contents.
- Basically, multimedia data, i.e., audio and video, can be transmitted over standard Internet protocols like HTTP or FTP, just like any other data used in Web applications.
- This approach is used by a large number of current Web applications, because it has the major benefit that no additional components are needed on the server.
- Its downside, however, is often felt by users in that media downloads are very slow.
- We can use streaming technologies to minimize these waiting times for multimedia contents.
- Streaming in this context means that a client can begin playout of the audio and/or video a few seconds after it begins receiving the file from a server.
- This technique avoids having to download the entire file (incurring a potentially long delay) before beginning playout.
- Two protocols are generally used for the streaming of multimedia contents: one protocol handles the transmission of multimedia data on the network level, and the other controls the presentation flow (e.g., starting and stopping a video) and the transmission of metadata.
- Examples: RTP (Real-time Transport Protocol) as the network protocol, RTSP (Real Time Streaming Protocol) as the control protocol, and MMS (Microsoft Media Server).
Fig 2: Streaming media architecture using point-to-point connections.
Search Engines
- Originally, the term search engine referred to some kind of search index, a huge database containing information from individual Web sites.
- Search engines help people find information on the Internet and on other sites.
- Large search-index companies own thousands of computers that use software known as spiders or robots (or just plain bots) to grab Web pages and read the information stored in them.
- These systems don't always grab all the information on each page or all the pages in a Web site, but they grab a significant amount of information and use complex algorithms (calculations based on complicated formulae) to index that information.
- General operations of search engines: crawling, indexing, searching.
  - Search/crawl the Internet.
  - Keep an index of the words they find and where they find them: words occurring in the title, subtitle, meta tags, and other relevant positions.
  - Allow users to look for words or combinations of words found in that index.

Search/Crawl the Internet
- A search engine employs special software robots, called spiders, to build lists of the words found on Web sites.
- The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider for the domain name server (DNS) that translates a server's name into an address, Google had its own DNS, in order to keep delays to a minimum.
- When a spider is building its lists, the process is called Web crawling.
- How does a spider start its travels over the Web? The usual starting points are lists of heavily used servers and very popular pages. The spider will begin with a popular site, indexing the words on its pages and following every link found within the site (see the toy spider sketched below).
- The Google spider was built to index every significant word on a page, leaving out the articles "a", "an" and "the". Other spiders take different approaches.
- Robot exclusion protocol: used when a site's owner doesn't wish a spider to crawl its pages or links.
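A toy spider to make the description above concrete: starting from one seed URL, it fetches pages with Java's built-in HTTP client, indexes the words it finds (skipping the articles "a", "an", "the" as in the Google example), and follows the links it can extract. Everything here is a deliberate simplification for illustration; a real spider would also honour the robot exclusion protocol (robots.txt), parse HTML properly, and crawl politely.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ToySpider {
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "the");
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.org/")); // seed URL (placeholder)
        Set<String> visited = new HashSet<>();
        Map<String, Set<String>> index = new HashMap<>();   // word -> URLs it occurs on

        while (!frontier.isEmpty() && visited.size() < 10) { // tiny crawl budget for the example
            String url = frontier.poll();
            if (!visited.add(url)) continue;                 // already crawled

            String html;
            try {
                html = client.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                        HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception fetchFailure) {
                continue;                                    // skip pages we cannot fetch
            }

            // Index every significant word (crude tokenization, articles skipped).
            for (String word : html.replaceAll("<[^>]*>", " ").toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    index.computeIfAbsent(word, w -> new HashSet<>()).add(url);
                }
            }

            // Follow every absolute link found within the page.
            Matcher m = LINK.matcher(html);
            while (m.find()) frontier.add(m.group(1));
        }
        System.out.println("Indexed " + index.size() + " distinct words from " + visited.size() + " pages");
    }
}
```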
Search Directory
- A search directory is a categorized collection of information about Web sites, rather than a collection of information from Web pages.
- The most significant search directories are owned by Yahoo! (dir.yahoo.com) and the Open Directory Project (www.dmoz.org).
- Directory companies don't use spiders or bots to download and index pages on the Web sites in the directory; rather, for each Web site, the directory contains information, such as a title and description, submitted by the site owner.
- Directories are human-edited: people check your Web site, people index your Web site, etc.
- Google also has a directory, but the information comes from somebody else: the Open Directory Project.
Building the Index
- Once the spiders have completed the task of finding information on Web pages, the search engine must store it in a way that makes it useful.
- There are two key components involved in making the gathered data accessible to users:
  - the information stored with the data
  - the method by which the information is indexed.
- In the simplest case, a search engine could just store the word and the URL where it was found.
Page Rank / Ranking Organic and Paid Search Results
- Search engines store more information than simple word/URL combinations.
- An engine might store the number of times that the word appears on a page.
- The engine might assign a weight to each entry, with increasing values assigned to words as they appear near the top of the document, in sub-headings, in links, in the meta tags or in the title of the page.
- The ranking list tries to present the most useful pages at the top.
- A search engine's organic ranking algorithm is one of the trickiest parts of designing a search engine, so let's start by examining the simplest kind of ranking algorithm.
- Ranking is just another word for sorting, the act of collating results into a certain order. Shopping search engines typically use simple ranking algorithms that the searcher can choose: when the searcher is looking for a product to buy, the shopping search engine might start by ordering the results by price (lowest to highest), but the searcher can decide to sort the list by other columns, such as availability (in stock, within one week, and so on), or any other feature of the product.
- Typical signals: term frequency, term placement, link popularity (link analysis). A small weighting sketch follows below.
- Regardless of the precise combination of additional pieces of information stored by a search engine, the data will be encoded to save storage space.
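A minimal sketch of such a weighting scheme, with weights invented purely for illustration: each occurrence of the query term in the page body counts one point, an occurrence in the title adds a fixed boost, and ranking is then just sorting pages by that score.

```java
import java.util.Comparator;
import java.util.Map;

public class SimpleRanker {
    // Invented weights: each body occurrence counts 1, a title occurrence adds 10.
    static int score(String term, String title, String body) {
        int occurrences = body.toLowerCase().split("\\b" + term.toLowerCase() + "\\b", -1).length - 1;
        int titleBoost = title.toLowerCase().contains(term.toLowerCase()) ? 10 : 0;
        return occurrences + titleBoost;
    }

    public static void main(String[] args) {
        String term = "hypermedia";
        // Tiny made-up corpus: URL -> {title, body}.
        Map<String, String[]> pages = Map.of(
                "https://example.org/a", new String[]{"Intro to Hypermedia", "hypermedia extends hypertext with media"},
                "https://example.org/b", new String[]{"Databases", "hypermedia is mentioned once here"},
                "https://example.org/c", new String[]{"Networking", "no relevant terms at all"});

        // Ranking is just sorting: order pages by descending score (negated for descending order).
        pages.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<String, String[]> e) -> -score(term, e.getValue()[0], e.getValue()[1])))
                .forEach(e -> System.out.println(
                        score(term, e.getValue()[0], e.getValue()[1]) + "  " + e.getKey()));
    }
}
```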
After the information is compacted, it is ready for indexing. An index has a single purpose: it allows information to be found as quickly as possible. There are quite a few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute the entries across a predetermined number of divisions. This numerical distribution is different from the distribution of words across the alphabet, and that is the key to a hash table's effectiveness. The hash table contains the hashed number along with a pointer to the actual data, which can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient indexing and effective storage makes it possible to get results quickly, even when the user creates a complicated search.
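A small sketch of the idea, with Java's built-in HashMap standing in for a hand-built hash table: hashing the word locates its entry in (on average) constant time, and the entry points at the postings data, here a map from URL to occurrence count.

```java
import java.util.HashMap;
import java.util.Map;

// Inverted index backed by a hash table: hashing the word locates its entry,
// and the entry points at the postings (URL -> frequency).
public class HashIndex {
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    public void addOccurrence(String word, String url) {
        index.computeIfAbsent(word.toLowerCase(), w -> new HashMap<>())
             .merge(url, 1, Integer::sum);      // bump the per-page count
    }

    public Map<String, Integer> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Map.of());
    }

    public static void main(String[] args) {
        HashIndex idx = new HashIndex();
        idx.addOccurrence("hypertext", "https://example.org/a");
        idx.addOccurrence("hypertext", "https://example.org/a");
        idx.addOccurrence("hypertext", "https://example.org/b");
        // Fast lookup regardless of how the postings themselves are stored,
        // e.g. {https://example.org/a=2, https://example.org/b=1} (order not guaranteed).
        System.out.println(idx.lookup("hypertext"));
    }
}
```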
Search and Display Results
- Searching through an index involves a user building a query and submitting it through the search engine.
- Displaying the results is a lot simpler than some other parts of the process.
- The display can contain organic or paid results.
- Organic results all use the title of the page followed by a snippet: a summary of the text from that page that contains the search terms (a snippet-building sketch follows below).
- Paid results also use similar methods to display the pages.
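A small sketch of snippet building as described above; the window size and the ellipsis handling are arbitrary choices for illustration.

```java
public class SnippetBuilder {
    // Return roughly `radius` characters of context around the first hit of `term`.
    static String snippet(String pageText, String term, int radius) {
        int hit = pageText.toLowerCase().indexOf(term.toLowerCase());
        if (hit < 0) {
            return pageText.substring(0, Math.min(radius, pageText.length())) + "...";
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(pageText.length(), hit + term.length() + radius);
        return (start > 0 ? "..." : "") + pageText.substring(start, end)
                + (end < pageText.length() ? "..." : "");
    }

    public static void main(String[] args) {
        String text = "Web engineering is the use of scientific, engineering, and management "
                + "principles to develop and maintain high quality Web-based systems.";
        // Organic result: title line followed by the generated snippet.
        System.out.println("Web Engineering Lecture One");
        System.out.println(snippet(text, "management", 40));
    }
}
```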
Search Relationships
- Search engines compete with each other, but they also collaborate.
- Many search engines use technology from their competitors to present results.
- Understanding how each engine delivers its results helps you target the most effective search marketing efforts.
"Spiders" take a Web page's content and create key search words that enable online users to find pages they're looking for.
Search Engine Optimization
- SEO is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results.
- Search engine marketing, by contrast, works through paid listings.
- In general, the earlier (or higher on the page) and the more frequently a site appears in the search results list, the more visitors it will receive from the search engine.
- SEO is the act of altering a web site so that it does well in the organic, crawler-based listings of search engines: the process of editing a web site's content and code in order to improve visibility within one or more search engines.
White hat vs Black hat SEO
- SEO techniques are classified by some into two broad categories: techniques that search engines recommend as part of good design, and techniques that search engines do not approve of and attempt to minimize the effect of, referred to as spamdexing.
- White hats are website designers who play nice and try to follow all of the search engine guidelines to optimize their site.
- An SEO tactic, technique or method is considered white hat if it conforms to the search engines' guidelines and involves no deception.
- White hat SEO is not just about following guidelines; it is about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see.
- White hat advice is generally summed up as creating content for users, not for search engines, and then making that content easily accessible to the spiders, rather than attempting to game the algorithm.
- Black hats are website designers who use backdoors, cloaking/hiding, and other tricks to optimize sites (e.g. keyword stuffing; hidden, invisible or unrelated text; meta tag stuffing).
- Black hat SEO attempts to improve rankings in ways that are disapproved of by the search engines, or that involve deception.
- One black hat technique uses text that is hidden, either as text colored similar to the background, in an invisible div, or positioned off screen.
- Search engines may penalize sites they discover using black hat methods, either by reducing their rankings or eliminating their listings from their databases altogether.